Principles and Practice of Clinical Trials

Steven Piantadosi • Curtis L. Meinert
Editors
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
In memory of
Lulu, Champ, and Dudley
A Foreword to the Principles and Practice of
Clinical Trials
Trying to identify the effects of treatments is not new. The Book of Daniel (chapter 1,
verses 12–15) describes a test of the effects of King Nebuchadnezzar’s meat:
Prove thy servants, I beseech thee, ten days; and let them give us pulse to eat, and water to
drink. Then let our countenances be looked upon before thee, and the countenance of the
children that eat of the portion of the King’s meat: and as thou seest, deal with thy servants.
So he consented to them in this matter, and proved them ten days. And at the end of ten days
their countenances appeared fairer and fatter in flesh than all the children which did eat the
portion of the King’s meat.
Centuries later, the Persian physician al-Razi described deliberately withholding a
treatment, bloodletting, from one group of patients so that outcomes could be compared:

When the dullness (thiqal) and the pain in the head and neck continue for three and four and
five days or more, and the vision shuns light, and watering of the eyes is abundant, yawning
and stretching are great, insomnia is severe, and extreme exhaustion occurs, then the patient
after that will progress to meningitis (sirsâm). . . If the dullness in the head is greater than the
pain, and there is no insomnia, but rather sleep, then the fever will abate, but the throbbing
will be immense but not frequent and he will progress into a stupor (lîthûrghas). So when you
see these symptoms, then proceed with bloodletting. For I once saved one group [of patients]
by it, while I intentionally neglected [to bleed] another group. By doing that, I wished to
reach a conclusion (ra’y). And so all of these [latter] contracted meningitis. (Tibi 2006)
But it was not until the beginning of the eighteenth century that the importance
of treatment comparisons was broadly acknowledged, for example, in comparing the
chances of contracting smallpox among people inoculated with smallpox lymph versus
those who caught smallpox naturally (Bird 2018).
By the middle of the eighteenth century there were examples of tests with
comparison groups, for example, as described by James Lind in relation to his
scurvy experiment on board the HMS Salisbury at sea:
On the 20th of May 1747, I took twelve patients in the scurvy, on board the Salisbury at sea.
Their cases were as similar as I could have them. They all in general had putrid gums, the
spots and lassitude, with weakness of their knees. They lay together in one place, being a
proper apartment for the sick in the fore-hold; and had one diet common to all, viz.,
watergruel sweetened with sugar in the morning; fresh mutton-broth often times for dinner;
at other times puddings, boiled biscuit with sugar, etc; and for supper, barley and raisins,
rice and currants, sago and wine, or the like. Two of these were ordered each a quart of
cyder a-day. Two others took twenty-five gutts of elixir vitriol three times a day, upon an
empty stomach; using a gargle strongly acidulated with it for their mouths. Two others took
two spoonfuls of vinegar three times a day, upon an empty stomach; having their gruels and
their other food well acidulated with it, as also the gargle for their mouth. Two of the worst
patients, with the tendons in the ham rigid, (a symptom none of the rest had), were put under
a course of seawater. Of this they drank half a pint every day, and sometimes more or less as
it operated, by way of gentle physic. Two others had each two oranges and one lemon given
them every day. These they eat with greediness, at different times, upon an empty stomach.
They continued but six days under this course, having consumed the quantity that could be
spared. The two remaining patients, took the bigness of a nutmeg three times a-day, of an
electuary recommended by an hospital surgeon, made of garlic, mustard-seed, rad raphan,
balsam of Peru, and gum myrrh; using for common drink, barley-water well acidulated with
tamarinds; by a decoction of which, with the addition of cremor tartar, they were gently
purged three or four times during the course.
***
The consequence was, that the most sudden and visible good effects were perceived from
the use of the oranges and lemons; one of those who had taken them, being at the end of six
days fit for duty. (Lind 1753)
Lind did not make clear how his 12 sailors were assigned to the treatments in his
experiment. During the late nineteenth and early twentieth centuries, alternation (and
sometimes randomization) came to be used to create study comparison groups that
differed only by chance (Chalmers et al. 2011).
Treatment assignment was discussed in Hill’s 1937 book, Principles of Medical Statistics,
in which he emphasized the importance of strictly observing the allocation schedule.
Implementation of this principle was reflected in the concealment of allocation sched-
ules in two important clinical trials designed for the UK Medical Research Council
in the 1940s (Medical Research Council 1944, 1948). Sir Austin Bradford Hill’s
1937 book went into 12 editions, and his other writings, such as Statistical Methods
in Clinical and Preventive Medicine, helped propel methodological progress in
clinical research.
In 1962 the United States Congress passed the Kefauver-Harris Amendments to the
Food, Drug, and Cosmetic Act of 1938. The amendments revolutionized
drug development by requiring drug manufacturers to prove that a drug was safe and
effective. A feature of the amendments was language spelling out the nature of
scientific evidence required for a drug to be approved:
The term “substantial evidence” means evidence consisting of adequate and well-controlled
investigations, including clinical investigations, by experts qualified by scientific training
and experience to evaluate the effectiveness of the drug involved, on the basis of which it
could fairly and responsibly be concluded by such experts that the drug will have the effect it
purports or is represented to have under the conditions of its use prescribed, recommended,
or suggested in the labeling or proposed labeling thereof. (United States Congress 1962)
Postscript
When we started this effort, there was no COVID-19. Now we are living through a
pandemic caused by the virus, leading us to proclaim in regard to trials, as Charles
Dickens did in A Tale of Two Cities in a different context, “the best of times, the
worst of times.”
“The best of times” because never before has there been more interest and
attention directed to trials, even from the President. Everybody wants to know
when there will be a vaccine to protect us from COVID-19.
“The worst of times” because of the chaos caused by the pandemic in mounting
and doing trials and the impact of “social distancing” on the way trials are done now.
It is a given that the pandemic will change how we do trials, but whatever those
changes will be, trials will remain humankind’s best and most enduring answer to
addressing the conditions and maladies that affect us.
Acknowledgment
We are indebted to Sir Iain Chalmers for his critical review of and input on this
piece. Dr. Chalmers is founder of the Cochrane Collaboration and first coordinator
of the James Lind Library.
References
Amberson JB Jr, McMahon BT, Pinner M (1931) A clinical trial of sanocrysin in
pulmonary tuberculosis. Am Rev Tuberc 24:401–435
Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, Pitkin R, Rennie D,
Schulz KF, Simel D, Stroup DF (1996) Improving the quality of reporting of
randomized controlled trials. The CONSORT statement. JAMA 276(8):637–639
Bird A (2018) James Jurin and the avoidance of bias in collecting and assessing
evidence on the effects of variolation. JLL Bulletin: Commentaries on the history
of treatment evaluation. https://www.jameslindlibrary.org/articles/james-jurin-and-the-avoidance-of-bias-in-collecting-and-assessing-evidence-on-the-effects-of-variolation/
Chalmers I, Dukan E, Podolsky SH, Davey Smith G (2011) The advent of fair
treatment allocation schedules in clinical trials during the 19th and early 20th
centuries. JLL Bulletin: Commentaries on the history of treatment evaluation.
https://www.jameslindlibrary.org/articles/the-advent-of-fair-treatment-allocation-schedules-in-clinical-trials-during-the-19th-and-early-20th-centuries/
Chan AW, Tetzlaff JM, Altman DG, Laupacis A, Gøtzsche PC, Krleža-Jeric K,
Hróbjartsson A, Mann H, Dickersin K, Berlin JA, Doré CJ, Parulekar WR,
Summerskill WSM, Groves T, Schulz KF, Sox HC, Rockhold FW,
Drummond R, Moher D (2013) SPIRIT 2013 statement: defining standard pro-
tocol items for clinical trials. Ann Intern Med 158(3):200–207
Coronary Drug Project Research Group (1973) The Coronary Drug Project: design,
methods, and baseline results. Circulation 47(Suppl I):I-1-I-50
Curran WJ, Shapiro ED (1970) Law, medicine, and forensic science, 2nd edn. Little,
Brown, Boston
D’Agostino R, Sullivan LM, Massaro J (eds) (2007) Wiley encyclopedia of clinical
trials, 4 vols. Wiley, New York
DeAngelis CD, Drazen JM, Frizelle FA, Haug C, Hoey J, Horton R, Kotzin S,
Laine C, Marusic A, Overbeke AJPM, Schroeder TV, Sox HC, Van Der Weyden
MB (2004) Clinical Trial Registration: A statement from the International Com-
mittee of Medical Journal Editors. JAMA 292:1363–1364
Fisher RA, MacKenzie WA (1923) Studies in crop variation: II. The manurial
response of different potato varieties. J Agric Sci 13:311–320
Friedman LM, Furberg CD, DeMets DR (1981) Fundamentals of clinical trials.
Springer, New York (5th edn, 2015)
Haggard HW (1932) The Lame, the Halt, and the Blind: the vital role of medicine in
the history of civilization. Harper and Brothers, New York
Hill AB (1937) Principles of medical statistics. The Lancet, London
Hill AB (1962) Statistical methods in clinical and preventive medicine. Oxford
University Press, New York
Levine RJ (1988) Ethics and regulation of clinical research, 2nd edn. Yale University
Press, New Haven
Lind J (1753) A treatise of the scurvy (reprinted in Lind’s treatise on scurvy, edited
by CP Stewart, D Guthrie, Edinburgh University Press, Edinburgh, 1953). Sands,
Murray, Cochran, Edinburgh
Medical Research Council (1931) Clinical trials of new remedies (annotations).
Lancet 2:304
Medical Research Council (1944) Clinical trial of patulin in the common cold.
Lancet 2:373–375
Medical Research Council (1948) Streptomycin treatment of pulmonary tuberculo-
sis: a Medical Research Council investigation. Br Med J 2:769–782
Meinert CL, Tonascia S (1986) Clinical trials: design, conduct, and analysis. Oxford
University Press, New York (2nd edn, 2012)
Meinert CL, Tonascia S (1998) Controlled clinical trials. In: Encyclopedia of
biostatistics, vol 1. Wiley, New York, pp 929–931
National Institutes of Health (1979) Clinical trials activity (NIH Clinical Trials
Committee; RS Gordon Jr, Chair). NIH Guide Grants Contracts 8 (# 8):29
National Institutes of Health (1981) NIH Almanac. Publ no 81-5. Division of Public
Information, Bethesda
National Institutes of Health (2003) NIH data sharing policy and implementation
guidance. http://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm
Office for Protection from Research Risks (1979) The Belmont Report. Ethical
principles and guidelines for the protection of human subjects of research,
18 April 1979
Patulin Clinical Trials Committee (of the Medical Research Council) (1944) Clinical
trial of Patulin in the common cold. Lancet 2:373–375
Piantadosi S (1997) Clinical trials: a methodologic perspective. Wiley, Hoboken (3rd
edn, 2017)
Pocock SJ (1983) Clinical trials: a practical approach. Wiley, New York
Stewart WH (1966) Surgeon general’s directives on human experimentation.
https://history.nih.gov/research/downloads/surgeongeneraldirective1966.pdf
Sutton HG (1865) Cases of rheumatic fever. Guy’s Hosp Rep 11:392–428
Taichman DB, Sahni P, Pinborg A, Peiperl L, Laine C, James A, Hong ST,
Haileamlak A, Gollogly L, Godlee F, Frizelle FA, Florenzano F, Drazen JM,
Bauchner H, Baethge C, Backus J (2017) Data sharing statements for clinical
trials: a requirement of the International Committee of Medical Journal Editors.
Ann Intern Med 167(1):63–65
Tibi S (2006) Al-Razi and Islamic medicine in the 9th century. J R Soc Med
99(4):206–207
United States Congress (103rd; 1st session): NIH Revitalization Act of 1993,
42 USC § 131 (1993); Clinical research equity regarding women and minorities;
part I: women and minorities as subjects in clinical research, 1993
United States Congress (87th): Drug Amendments of 1962, Public Law 87-781, S
1522. Washington, Oct 10, 1962
Vozeh S (1995) The International Conference on Harmonisation. Eur J Clin
Pharmacol 48:173–175
Waterhouse B (1800) A prospect of exterminating the small pox. Cambridge Press,
Cambridge
Waterhouse B (1802) A prospect of exterminating the small pox (part II). University
Press, Cambridge
Zarin DA, Ide NC, Tse T, Harlan WR, West JC, Lindberg DAB (2007) Issues in the
registration of clinical trials. JAMA 297:2112–2120
Preface
The two of us have spent our professional lives doing trials: writing textbooks on
how to do them, teaching about them, and sitting on advisory groups responsible for
trials. We are pleased to say that over our lifetimes trials have moved up the scale of
importance, to the point where people now feel cheated if denied enrollment.
Clinical trials are admixtures of disciplines: medicine, behavioral sciences, bio-
statistics, epidemiology, ethics, quality control, and regulatory sciences, to name the
principal ones, making it difficult to cover the field in any single textbook on the subject.
This reality is the reason we campaigned (principally SP) for a collective work
designed to cover the waterfront of trials. We are pleased to have been able to do this
in conjunction with Springer Nature, both as print and e-books.
There has long been a need for a comprehensive clinical trials text written at a
level accessible to both technical and nontechnical readers. The perspective is the
same as that in many other fields where the scope of a “principles and practice”
textbook has been defining and instructive to those learning the discipline. Accord-
ingly, the intent of Principles and Practice of Clinical Trials has been to cover,
define, and explicate the field in ways that are approachable to trialists of all types.
The work is intended to be comprehensive, but not encyclopedic.
Acknowledgments
A special thanks to Gillian Gresham for her production of the appendices and her
efforts as Senior Associate Editor.
Volume 1
Volume 2
Volume 3
104 CONSORT and Its Extensions for Reporting Clinical Trials . . . . 2073
Sally Hopewell, Isabelle Boutron, and David Moher
Appendix 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2475
Appendix 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2477
Appendix 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2481
Appendix 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2489
Appendix 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2493
Appendix 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2499
Appendix 7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2503
Appendix 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2509
Appendix 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2513
Appendix 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2515
Appendix 11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2523
Appendix 12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2525
Appendix 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2529
Appendix 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2535
Appendix 15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2557
Appendix 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2563
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2565
About the Editors
Curtis L. Meinert
Department of Epidemiology
School of Public Health
Johns Hopkins University
Baltimore, MD, USA
Gillian Gresham
Department of Medicine
Cedars-Sinai Medical Center
Los Angeles, CA, USA
Steven N. Goodman
Stanford University School of Medicine
Stanford, CA, USA
Eleanor McFadden
Frontier Science (Scotland) Ltd.
Kincraig, Scotland
O. Dale Williams
University of North Carolina at Chapel Hill
Chapel Hill, NC, USA
University of Alabama at Birmingham
Birmingham, AL, USA
Babak Choodari-Oskooei
MRC Clinical Trials Unit at UCL
Institute of Clinical Trials and Methodology, UCL
London, UK
Stephen L. George
Department of Biostatistics and Bioinformatics
Duke University School of Medicine
Durham, NC, USA
Tianjing Li
Department of Ophthalmology
School of Medicine
University of Colorado Anschutz Medical Campus
Colorado School of Public Health
Aurora, CO, USA
Karen A. Robinson
Johns Hopkins University
Baltimore, MD, USA
Nancy L. Geller
Office of Biostatistics Research
NHLBI
Bethesda, MD, USA
Winifred Werther
Amgen Inc.
South San Francisco, CA, USA
Christopher S. Coffey
University of Iowa
Iowa City, IA, USA
Mahesh K. B. Parmar
University College London
London, UK
Lawrence Friedman
Rockville, MD, USA
Contributors
Patrick Royston MRC Clinical Trials Unit at UCL, Institute of Clinical Trials and
Methodology, London, UK
Estelle Russek-Cohen Office of Biostatistics, Center for Drug Evaluation and
Research, U.S. Food and Drug Administration, Silver Spring, MD, USA
Laurie Ryan National Institutes of Health, National Institute on Aging, Bethesda,
MD, USA
Anna Sadura Canadian Cancer Trials Group, Queen’s University, Kingston, ON,
Canada
Ian J. Saldanha Department of Health Services, Policy, and Practice and Depart-
ment of Epidemiology, Brown University School of Public Health, Providence, RI,
USA
Amber Salter Division of Biostatistics, Washington University School of Medi-
cine in St. Louis, St. Louis, MO, USA
Marc D. Samsky Duke Clinical Research Institute, Durham, NC, USA
Division of Cardiology, Duke University School of Medicine, Durham, NC, USA
Frank J. Sasinowski University of Rochester School of Medicine, Department of
Neurology, Rochester, NY, USA
Willi Sauerbrei Institute of Medical Biometry and Statistics, Faculty of Medicine
and Medical Center - University of Freiburg, Freiburg, Germany
Roberta W. Scherer Department of Epidemiology, Johns Hopkins Bloomberg
School of Public Health, Baltimore, MD, USA
Pamela E. Scott Office of the Commissioner, U.S. Food and Drug Administration,
Silver Spring, MD, USA
Nicholas J. Seewald University of Michigan, Ann Arbor, MI, USA
Praharsh Shah University of Pennsylvania, Philadelphia, PA, USA
Linda Sharples London School of Hygiene and Tropical Medicine, London, UK
Pamela A. Shaw University of Pennsylvania Perelman School of Medicine, Phil-
adelphia, PA, USA
Dikla Shmueli-Blumberg The Emmes Company, LLC, Rockville, MD, USA
Ellen Sigal Friends of Cancer Research, Washington, DC, USA
Ida Sim Division of General Internal Medicine, University of California San
Francisco, San Francisco, CA, USA
Richard Simon R Simon Consulting, Potomac, MD, USA
Jennifer Smith Sunesis Pharmaceuticals Inc, San Francisco, CA, USA
Lilly Q. Yue Center for Devices and Radiological Health, U.S. Food and Drug
Administration, Silver Spring, MD, USA
Rachel Zahigian Vertex Pharmaceuticals, Boston, MA, USA
Lijuan Zeng Statistics Collaborative, Inc., Washington, DC, USA
Xiaobo Zhong Icahn School of Medicine at Mount Sinai, New York, NY, USA
Tracy Ziolek University of Pennsylvania, Philadelphia, PA, USA
Part I
Perspectives on Clinical Trials
1 Social and Scientific History of Randomized Controlled Trials
Laura E. Bothwell, Wen-Hua Kuo, David S. Jones, and
Scott H. Podolsky
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Early History of Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Refining Trial Methods in the Early Twentieth Century . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
The Role of Governments in the Institutionalization of Randomized Controlled Trials . . . . . . . 8
Historical Trial Ethics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
RCTs and Evidence-Based Medicine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
The Globalization of RCTs and the Challenges of Similarities and Differences in Global
Populations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Social and Scientific Challenges in Randomized Controlled Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
L. E. Bothwell (*)
Worcester State University, Worcester, MA, USA
e-mail: [email protected]
W.-H. Kuo
National Yang-Ming University, Taipei City, Taiwan
e-mail: [email protected]
D. S. Jones
Harvard University, Cambridge, MA, USA
e-mail: [email protected]
S. H. Podolsky
Harvard Medical School, Boston, MA, USA
e-mail: [email protected]
Abstract
The practice and conceptual foundations of randomized controlled trials have
been changed both by societal forces and by generations of investigators com-
mitted to applying rigorous research methods to therapeutic evaluation. This
chapter briefly discusses the emergence of key trial elements such as control
groups, alternate allocation, blinding, placebos, and finally randomization. We
then explore how shifting intellectual, social, political, economic, regulatory,
ethical, and technological forces have shaped the ways that RCTs have taken
form, the types of therapies explored, the ethical standards that have been
prioritized, and the populations included in studies. This history has not been a
simple, linear march of progress. We also highlight key challenges in the histor-
ical use of RCTs and the more recent expansion of concerns regarding competing
commercial interests that can influence trial design. As investigators continue to
advance the rigor of controlled trials amid these challenges, exploring the influ-
ence of historical contexts on clinical trial development can help us to understand
the forces that may impact trials today.
Keywords
History · Randomized controlled trial · Control groups · Fair allocation · Policy ·
Regulations · Ethics · Globalization · Ethnicity · Clinical trial
Introduction
Since the mid-twentieth century, clinical researchers have increasingly deployed ran-
domized controlled trials (RCTs) in efforts to improve the reliability and objectivity of
medical knowledge. RCTs have come to serve as authoritative standards of evidence for
the evaluation of experimental drugs and therapies, the remuneration for medical
interventions by insurance companies and governmental payers, and the evaluation of
an increasingly diverse range of social and policy interventions, from educational
programs to injury prevention campaigns. Yet, as researchers have increasingly relied
on the RCT as an evidentiary “gold standard,” critics have also identified myriad
challenges. This chapter highlights and adds to historical explorations of how RCTs
have come to serve such prominent roles in modern scientific (particularly clinical)
knowledge, considering the intellectual, social, political, economic, regulatory, ethical,
and technological contexts of this history. We also examine the history of the enduring
social and scientific challenges in the conduct, interpretation, and application of RCTs.
Early History of Clinical Trials

Trials comparing intervention and control groups are as old as the historical record
itself, appearing in the Hebrew Bible and in texts from various societies around the
world, albeit sporadically, for centuries (Lilienfeld 1982). The tenth-century
Persian physician Abu Bakr Muhammad ibn Zakariyya al-Razi has been cele-
brated for conducting empirical experiments with control and intervention groups
testing contemporaneous medical practices such as bloodletting as a prevention for
meningitis (Tibi 2006). In the eighteenth century, Scottish surgeon James Lind
demonstrated the efficacy of citrus fruits over five alternative treatments for scurvy
among groups of sailors by following their responses to the ingested substances
under controlled conditions (Milne 2012). Loosely controlled trials, often
conducted by skeptics, increasingly appeared in the eighteenth and nineteenth
centuries to test therapies ranging from mesmerism to homeopathy to venesection
(Tröhler 2000).
These trials remained relatively scattered and dwarfed in the literature by case
reports that doctors published of their experiences with individual patients. Early
controlled trials had little apparent impact on therapeutic practice. Indeed, medical
epistemology through the nineteenth century tended to privilege the belief that
patients should be treated on an individual basis and that disease experiences were
not easily comparable among different patients (Warner 1986).
However, major shifts in the social and scientific structure of medicine in the late
nineteenth and early twentieth centuries created new opportunities and demands for
more rigorous clinical research methods. Hospitals expanded, providing settings for
more clinical researchers to compare treatment effects among numerous patients
simultaneously. Germ theory and developments in physiology and chemistry pro-
vided the stimulus for researchers to produce new vaccines and drugs that had never
been tested in patients. Charlatans also sought to capitalize on this wave of
discovery and innovation by marketing a host of poorly tested proprietary drugs of
dubious effectiveness. All these factors motivated scrupulous or skeptical clinical
investigators to pursue more sophisticated approaches to evaluate experimental
therapies. Simultaneously, public health researchers expanded their use of statistics,
bolstering empiricism in health research overall (Bothwell and Podolsky 2016;
Bothwell et al. 2016).
Among those interested in empirically testing the efficacy of remedies, the
question of controlling for the bias or enthusiasm of the individual arose, along
with related concerns about basing scientific knowledge on clinicians’ reports of
experiences with individual patients. In response, by the end of the nineteenth
century, several medical societies launched “collective investigations” that amal-
gamated numerous practitioners’ experiences using remedies among different
patients. The method was employed, for example, by the American Pediatric Society
in its 1896 evaluation of diphtheria antiserum, which incorporated input from 613
clinicians in 114 cities and towns and 3384 cases. The study demonstrated a 13%
mortality rate among treated patients (4.9% when treated on the first day of symp-
toms), far below the expected mortality baseline. This contributed to the uptake of
the remedy (Marks 2006). Still, some within the medical profession critiqued
“collective investigation” as an insufficiently standardized research method, while
numerous practicing clinicians complained that the method was a potentially elitist
infringement upon their patient care prerogatives. This dynamic would prove to be
an enduring tension between the clinical art of individualized patient care and an
aspiration to a generalizable medical science (Warner 1991).
Refining Trial Methods in the Early Twentieth Century

As medical research overall continued to become more empirical and scientific in the
early twentieth century, some researchers began to test remedies in humans much as
they would in the laboratory. They began to employ “alternate allocation” studies,
treating every other patient with a novel remedy, withholding it from the others, and
comparing outcomes. Dozens of alternate allocation studies appeared in the medical
literature in the early twentieth century (Chalmers et al. 2012; Podolsky 2015).
Reflecting the major threat of infectious diseases during this era, the majority of
alternate allocation trials assessed anti-infective therapies. For example, Waldemar
Haffkine, Nasarwanji Hormusji Choksy, and their colleagues conducted investigations
of plague remedies in India in the 1900s, German Adolf Bingel performed a double-
blinded study of anti-diphtheria antiserum in the 1910s, and a series of American
researchers investigated anti-pneumococcal antiserum in the 1920s. The researchers
who conducted these alternate allocation trials also introduced varying degrees of
statistical sophistication in their assessments of outcomes, ranging from simple quan-
titative comparisons and impressionistic judgments to the far rarer use of complex
biometric evaluations and tests of statistical significance (Podolsky 2006, 2009).
Many researchers espoused ethical hesitations toward designating control groups
in trials, as they often had more faith in the experimental treatment than the control
treatment, and therefore felt that it was unethical to allocate patients to a control
group. This was exemplified by Rufus Cole and his colleagues at the Hospital of the
Rockefeller Institute during the development of anti-pneumococcal antiserum. After
convincing themselves of the utility of antiserum based on early case series, the
researchers “did not feel justified as physicians in withholding a remedy that in our
opinion definitely increased the patient’s chances of recovery” (Cole, as cited in
Podolsky 2006). Amid a culture in medicine that tended to give clinical experimen-
tation less emphasis than physicians’ beliefs, values, and individual experiences
regarding treatment efficacy, clinical trials remained overshadowed in the pre-World
War II era by research based on a priori mechanistic justifications and case series, as
well as laboratory and animal studies.
Despite the minimal uptake of clinical trials with control groups, however,
those who saw trials as the optimal means of adjudicating therapeutic efficacy
became more sophisticated in their attempts to minimize biased assessments and
ensure fair allocation of patients to active versus control groups. Regarding bias,
patient suggestibility had long been acknowledged, with numerous researchers
employing sham treatments in the assessments of eighteenth- and nineteenth-
century unorthodox interventions like mesmerism (in France) and homeopathy
(in America, as well as in Europe) (Kaptchuk 1998; Podolsky et al. 2016). By the
early decades of the twentieth century, investigators began to increasingly use
sham control groups in their assessments of conventional pharmaceuticals, with
Cornell’s Harry Gold and colleagues using the existing clinical term “placebo” to
describe such sham control remedies in their assessment of xanthines for the
chest pain characteristic of angina pectoris in the 1930s (Gabriel 2014; Podolsky
et al. 2016; Shapiro and Shapiro 1997).
The Role of Governments in the Institutionalization of Randomized Controlled Trials

In the 1950s and 1960s, the British and US governments alike spearheaded heavy
investments in academic medical research institutions. The British Medical Research
Council (MRC), for instance, played a key role in the institutionalization of the controlled clinical trial
(Lewontin 2008; Timmermann 2008). Academic clinical trials expanded substan-
tially in these countries in part through this support and a political culture of
investment in scientific research and institution-building in medicine (Bothwell
2014). Jonas Salk’s polio vaccine trial drew broad scientific interest, as did the
National Cancer Institute’s clinical trial expansion (Meldrum 1998; Keating and
Cambrosio 2012). As the 1950s also witnessed strong growth in industrial drug
research and development, some companies collaborated with public sector
researchers in devising clinical trials (Gaudilliere and Lowy 1998; Marks 1997).
Some surgeons also adopted the technique, initiating a series of randomized con-
trolled trials in the 1950s (Bothwell and Jones 2019).
Still, without a regulatory mandate to conduct rigorous trials, seemingly “well-
controlled” clinical studies remained a small proportion of clinical investigations in
the 1950s. For example, a 1951 study by Otho Ross of 100 articles entailing
therapeutic assessment in 5 leading American medical journals found that only
27% were “well controlled,” with 45% employing no controls at all (Ross 1951).
Within two decades of Ross’ evaluation, however, the US Food and Drug
Administration (FDA) established regulations that would dramatically shape the
subsequent history of RCTs (Carpenter 2010). The US federal government had
been gradually building and clarifying the FDA’s power to regulate drug safety
and efficacy since requiring accurate drug labeling with the Pure Food and
Drug Act of 1906. As the pharmaceutical industry burgeoned in the 1950s, the
scientific community and regulators observed with troubling frequency the use
of unproven, ineffective, and sometimes dangerous drugs that had not been
adequately tested before companies promoted their benefits. Yet the FDA
lacked the necessary statutory authority to strengthen testing requirements for
drug efficacy and safety. This changed following an international drug safety
crisis in 1961 in which the inadequately vetted sedative thalidomide was found
to cause stillbirths or devastating limb malformations among infants of women
who had taken the drug for morning sickness during pregnancy (Carpenter
2010). This took place just as Senator Estes Kefauver was in the midst of
extensive hearings and legislative negotiating regarding the excesses of phar-
maceutical marketing and the inability of the FDA to formally adjudicate drug
efficacy. Broad public concern galvanized the political support necessary in
1962 for the passage of the Kefauver-Harris amendments to the Federal Food,
Drug, and Cosmetic Act. These established a legal mandate for the FDA to
require drug producers to evaluate their products in “adequate and well-con-
trolled investigations, including clinical investigations, by experts qualified by
scientific training and experience to evaluate the effectiveness of the drug
involved” (FDA 1963).
By 1970, after prevailing in a legal battle with Upjohn Pharmaceuticals over the
methodological requirements of drug safety and efficacy studies, the FDA
established that RCTs (ideally, double-blinded, placebo-controlled) should be car-
ried out to fulfill the mandate of “adequate and well-controlled” drug studies. With
this decision, the FDA formally placed RCTs at the regulatory and conceptual center
of drug evaluation in America (Carpenter 2010; Podolsky 2015). While the FDA
seems to have spearheaded this regulatory specification for RCTs in part as a result of
a litigious culture in the American pharmaceutical industry, the global scientific and
regulatory community had come to a general consensus on the public health benefits
of high standards for drug trials. Regulators in Japan and the European Union soon
established similar trial requirements. As it worked to comply with these regulations,
the pharmaceutical industry, which had grown substantially since World War II,
became a major international sponsor of RCTs (Bothwell et al. 2016). By the 1990s,
industry replaced national governments as the leading funder of RCTs: governments
continued to fund substantial numbers of RCTs, but the sheer volume of pharma-
ceutical studies led to a larger proportion of overall published RCTs reporting drug
company funding than any other source. Pharmaceutical research grew more rigor-
ous in this process, but critics also raised concerns about conflicts of interest and the
shaping of biomedical knowledge through industry-sponsored trials, a problem that
has persisted in different manifestations in ensuing decades (as described later in this
chapter) (Bothwell 2014).
Historical Trial Ethics

By the 1960s, critics of prevailing research practices had begun to argue that many
trial participants faced constraints that impeded their ability to either fully understand or freely and inde-
pendently elect to participate in trials. They contended that external review of
research was thus crucial to ensure that study designs were fair and informed consent
would be meaningfully achieved (Bothwell 2014). Clinical research directors and
investigators themselves also increasingly recognized the legal and ethical need for
expanded policies on peer review of study protocol ethics (Stark 2011).
All of these concerns escalated in the early 1970s as more research scandals came
to public light. Scientists, ethicists, and the public reeled when news broke of the 40-
year Tuskegee study of untreated syphilis among African American men. Investiga-
tors deceived study participants and withheld treatment long after antibiotics had
become available to cure the disease. In response to outcry over this tragedy, the US
Department of Health, Education, and Welfare issued Title 45 Code of Federal Regu-
lations, Part 46, in 1974, clarifying new ethical guidelines, formalizing institutional
review boards, and expanding their use in clinical trials and other human subjects
research. Since the United States was a leading global sponsor of RCTs at this time,
these ethical requirements had a sizable impact on the conduct of RCTs overall
(Bothwell 2014).
As RCT use expanded, ethicists began to clarify core challenges specifically
related to randomized allocation of patients to treatments. Critics continued to
raise concerns that withholding a promising treatment from patients simply in the
name of methodological rigor prioritized scientific advancement over patient care.
They argued that RCTs were not necessarily in the best short-term interests of
patients, since patient allocation to control and intervention arms could prevent
clinicians from fulfilling their obligations to administer what they believed to be
promising experimental therapies to all patients (Bothwell and Podolsky 2016).
Proponents of RCTs countered that randomized allocation to experimental and
control groups was essential to determine whether promising experimental treat-
ments would live up to the hopes of their proponents, or whether they would prove
less effective or be accompanied by unacceptable adverse events (Bradford Hill
1963). Growing numbers of researchers favored the latter stance, and in subsequent
decades, the notion was formalized as the principle of equipoise. This principle
stipulated that in situations of genuine uncertainty over whether a new treatment is
superior to the existing treatment, it is ethically acceptable for physicians to ran-
domly assign patients to either control or intervention arms (Freedman 1987).
Critics of the principle of equipoise noted that investigators often did not possess
a state of genuine uncertainty regarding whether an experimental treatment was
preferable to an existing treatment. Rather, researchers often had a sense that the
experimental treatment was favorable based on early case series or pilot studies.
Responding to this ethical confusion, Benjamin Freedman proposed the more
specific principle of “clinical equipoise” in 1987, stipulating that investigators may
continue to randomly allocate patients to different arms in trials only when there is
genuine uncertainty or honest professional disagreement among a community of
expert practitioners as to which treatment in a trial is preferable. According to this
interpretation, carefully conducted RCTs often would be necessary to determine the
actual efficacy and adverse events associated with medical interventions (Freedman
1987). Most researchers now accept the rationale of clinical equipoise. Yet, among
each new generation of investigators there are those with ethical hesitations about
allocating trial participants to arms thought to be inferior.
RCTs and Evidence-Based Medicine

RCTs continued to expand in the closing decades of the twentieth century as part of
broader trends toward more quantitative and empirical research methods in medi-
cine. New technologies continuously broadened the evidentiary foundation from
which medical knowledge could be developed. The introduction of computers into
medical investigations in the 1960s and 1970s increased researchers’ efficiency in
collecting and processing large quantities of data from multiple study sites, facili-
tating the conduct of RCTs and the dissemination of trial results. Alongside the
growing availability of data, critics increasingly questioned medical epistemology
for relying too heavily on theory and expert opinion without evidence from con-
trolled clinical experiments on adequately large numbers of patients. In all
fields of medicine, critics deployed RCTs to assess both new, experimental treat-
ments and existing therapies that had become widespread despite never having been
rigorously tested. Numerous RCTs revealed that popular medical interventions were
ineffective or even harmful, leading to their discontinuation (Cochrane 1972).
By the early 1980s, scientists widely considered RCTs the “gold standard” of
clinical research (Jones and Podolsky 2015). Large-scale multi-site RCTs grew
exponentially in the published literature and were highly influential in medical
knowledge and clinical research methodologies (Bothwell 2014). Academic pro-
grams also developed to explore and critique empirical research methods. In the
1990s, Canadian medical researchers Gordon Guyatt and David Sackett coined the
term “evidence-based medicine” to refer to the application of current best evidence
to decisions about individual patient care. Advocates of evidence-based medicine
developed a pyramid illustrating a general hierarchy of research design quality, with
expert opinion and case reports at the lowest level, various observational designs at
intermediate levels, and RCTs at the pinnacle as the optimal study design. RCTs, in
turn, could be incorporated into meta-analyses and systematic reviews. In 1993, Iain
Chalmers led the creation of the Cochrane Collaboration (now called Cochrane),
an international organization designed to conduct systematic reviews by synthesiz-
ing large quantities of medical research evidence in order to inform clinical decision-
making (Daly 2005). Internet expansion also facilitated wider access to information
on evidence-based medicine. In 2000, the NIH established ClinicalTrials.gov, a
publicly accessible online registry of clinical trials, with concomitant requirements,
now legally mandated, to register trials before initiation. Recent legislation also requires
the reporting of results from all registered trials on the site, regardless of outcome, so
that physicians, scientists, and patients can access more complete data from
unpublished trials. This has provided a counterpoint to publication bias while also
allowing valuable comparisons between predefined and published trial endpoints.
The database has been a powerful tool to shift the power imbalance between those
who conduct RCTs and those who use their results.
The Globalization of RCTs and the Challenges of Similarities and Differences in Global Populations

The recent history of RCTs and evidence-based medicine has been characterized in
part by expanded trial globalization. This has often reflected commercial interests
and has raised new ethical, political, and regulatory questions. Pharmaceutical
companies want to complete increasingly demanding and complicated RCTs as
quickly as possible. They have realized that they can do so by recruiting sufficient
numbers of subjects on an international scale (Kuo 2008). Since the late 1970s,
contract research organizations (CROs), now a multibillion-dollar annual industry,
have grown to serve this demand. As for-profit entities, CROs have been broadly
critiqued for questionable practices such as offshoring growing numbers of trials to
middle-income countries, oftentimes studying fairly homogeneous demographic
groups, a move which has raised skepticism regarding their ability to measure
treatment effects among diverse patients. CROs have also targeted research settings
with looser regulatory oversight and weaker systems of institutional ethical review,
raising concerns about the rights of research subjects. Access to a tested treatment
after a trial ends has also been a critical ethical concern (Petryna 2009).
At the same time, health policymakers hope to create a harmonized regulatory
platform to reduce redundant clinical trials and broaden accessibility to the latest
medications for the people who need them. Policy initiatives arose in the early 1980s
to address this from different perspectives and on different levels. The World Health
Organization’s (WHO) International Conference of Drug Regulatory Authorities
(ICDRA) was the first attempt to establish common regulations to help drug regu-
latory authorities of WHO member states strengthen collaboration and exchange
information. Other regional and bilateral harmonization efforts were motivated by
commercial concerns. For example, pharmaceuticals were selected as a topic for
trade negotiation at the first US-Japan Market-Oriented Sector-Selective talks in 1986
because of the potential for higher sales of pharmaceuticals in Japan, then the second-largest
national market in the world. It was followed by expert meetings on the technical
requirements for drug approval, including RCT designs (Kuo 2005).
The International Conference on Harmonisation of Technical Requirements for
Registration of Pharmaceuticals for Human Use (ICH) was created in 1990. Initially
designed to incorporate Japan into global pharmaceutical markets with fewer regu-
latory hurdles, the ICH is a communication platform to accelerate the harmonization
of pharmaceutical regulations. It started with only members from the United States,
the European Union (EU), and Japan and carefully limited its working scope to
technical issues. The outcomes were guidelines for safety, efficacy, quality, and drug
labeling integrated into the regulations of each participating country/region after the
ICH reached consensus. ICH guidelines thus further established global recognition
for RCTs and helped to standardize approaches for the generation of medical
evidence (Kuo 2005).
Following Japan, other East Asian states quickly recognized the importance of
the ICH and began following ICH guidelines. Korea and Taiwan aggressively
established sizable regulatory agencies and national centers for clinical trials. Even
Japan made infrastructural changes to reform clinical trial protocols and use common
technical documents (Chikenkokusaikakenkyukai [study group on the globalization
of clinical trials] 2013). In 1999, the ICH founded the Global Cooperation Group
(GCG) to serve as a liaison to other countries affected by these guidelines, but it did
not permit policy contributions from non-ICH member regions. It was not until 2010
that the ICH opened technical working groups to active participation from experts in
non-ICH member regions and countries of the ICH GCG. In addition to the founding
members of Japan, the United States, and the EU, the ICH invited five additional
regulatory members – Brazil, China, Korea, Singapore, and Taiwan.
The process of incorporating East Asian RCTs into ICH standards raised issues
concerning the generalizability of research findings. Researchers and policymakers
hoped to establish clinical trial designs and good clinical practices (GCP) that would
clarify the influence of ethnic factors on any physiological and behavioral differ-
ences in trial test populations in East Asian RCTs. In the end, a technically vague
concept of “bridging” was created to make sense of how to extend the applicability
of clinical data to a different ethnic population by conducting additional trials with
fewer subjects than originally required by local authorities (Kuo 2009, 2012). It is
important to note that several ICH guidelines deal with additional differences among
subjects (such as age), but the extent of these guidelines varies. For example, the ICH
sets no independent guideline regarding inclusion or measurement of gender in
clinical trials.
Social and Scientific Challenges in Randomized Controlled Trials

While RCTs have continued to evolve and grow more standardized and globally
inclusive, social, economic, political, and internal scientific challenges have contin-
ued to complicate both the application of RCTs and the construction of evidence-
based medicine (Bothwell et al. 2016; Timmermans and Berg 2003). Additionally,
critics have identified the growing impact of commercial interests on the overall
ecology of medical evidence. As the pharmaceutical industry has sponsored growing
numbers of RCTs since the late 1960s, it has tended to sponsor trials of drugs with
substantial potential for use among wealthier populations rather than prioritizing
treatments that can transform global public health, such as antibiotics or vaccines for
infectious diseases endemic to low-income regions (Bothwell et al. 2016). Pharma-
ceutical sponsors also have expanded markets by strategically deploying RCTs to
establish new drug indications for existing products through trials that claim slightly
new therapeutic niches, rather than developing innovative original therapies
(Matheson 2017). Researchers conducting industry-sponsored trials have been cri-
tiqued for being more susceptible to bias, as comparative analyses have revealed that
industry-sponsored trials are more likely to reveal outcomes favoring the product
under investigation than publicly funded trials (Bourgeois et al. 2010). Critics have
noted that some industry-funded researchers have designed trials in ways that are
more likely to reveal treatment effects by selecting narrow patient populations likely
to demonstrate favorable results, rather than patients who represent a drug’s ultimate
target population (Petryna 2009).
Growing interest in assessing treatments in applied clinical settings has also given
rise to variations on RCTs. Pragmatic trials, which have been widely discussed and
debated, have been proposed more recently as tools to examine medical interven-
tions in the context of their application in clinical practice. Proponents have
suggested that pragmatic trials would be most useful during the implementation
stage of an intervention or in the post-marketing phase of drug evaluation, once
phase 3 trials have been completed (Ford and Norrie 2016). Similarly, as the quantity
of therapeutics has expanded over time, researchers also have increasingly
conducted randomized comparative effectiveness trials using existing treatments
rather than placebos in control arms of trials. The expansion of comparative effec-
tiveness RCTs has responded to a clinical demand for more detailed information not
just validating individual therapies but comparing different treatments in current use
to guide clinical decision-making (Fiore and Lavori 2016).
New challenges have also emerged in relation to establishing trial endpoints. For
example, trial sponsors have pursued surrogate endpoints – intermediate markers
anticipated to correlate with clinical outcomes – to achieve statistically significant
trial results more quickly. Such trials, however, do not generate comprehensive data
on the clinical outcomes experienced by patients. These approaches have had value,
such as expediting the evaluation of initial data on the effects of HIV treatments so
that more patients could access promising experimental therapies more quickly
(Epstein 1996). However, critics have also warned of the shortcomings of the partial
data that surrogate endpoints can yield (Bothwell et al. 2016), and there have been
important examples of drugs that “improved” the status of a biomarker while leading
to worsened clinical outcomes (e.g., torcetrapib raised HDL levels but also increased
the risk of mortality and morbidity of patients via unknown mechanisms) (Barter et
al. 2007). Some advocates of trial efficiency have also promoted adaptive methods
that alter trial design based on interim trial data. The US FDA has examined
methodological issues in adaptive designs with their Guidance for Industry on
adaptive trials, describing certain methodological challenges in adaptive designs
that remain unresolved (US FDA 2018).
While academic critics have identified limitations of RCTs, drug and device
industries have used these critiques for their own purposes. In recent years, some
industry representatives have embraced criticism of RCTs and evidence-based
medicine, seemingly with a goal of undermining major twentieth century attempts
to demand clinical trial rigor in the assessment of new therapies. Decriers of
regulation have contended that standards for clinical trials are unnecessarily narrow
and exacting, increasing research costs and slowing the delivery of new therapies to
the market. They make this argument even as other critics argue that regulators such
as the FDA have approved some experimental products too rapidly and without
sufficient evidence, resulting in poorer patient health outcomes (Kesselheim and
Avorn 2017; Ostroff 2015). It is likely that debates over trial design and evidence
standards will persist: competing interests continue to have stakes in how medical
therapies are regulated and tested. Additionally, academic researchers have compet-
ing interests to publish trials with significant results when such publications are
criteria for professional advancement (Calabrese and Roberts 2004).
Finally, researchers have continued to face challenges in conducting RCTs of
treatments that are less amenable to controlled experimentation. While investigators
have long conducted pharmaceutical RCTs comparing active pills and placebos, it
has been more challenging to conduct RCTs in certain other areas of medicine, such
as surgery. Surgeons, who had long espoused the goal of establishing a rational basis
for surgical practice, had recognized the value of control groups in the eighteenth
century and had implemented alternate allocation, and then randomization, in the
twentieth century. In 1953, for instance, surgeons in New York began a study of
surgical and medical management of upper GI bleeding. They began with alternate
allocation but switched to randomization for the majority of the study (1955–1963)
“to achieve statistically sound conclusions” (Enquist et al., as cited in Jones 2018).
By the late 1950s surgeons had randomized patients to tests of many different
surgical procedures (Bothwell and Jones 2019). However, RCTs have not become
as influential in surgery as they have in pharmaceutical research. Part of this has been
a matter of regulation: the FDA does not require RCTs before surgeons start using a
new procedure (unless the new procedure relies on a new device for which the FDA
deems an RCT appropriate). Surgical epistemology and methodology also pose
challenges for RCTs. Since the success of an operation can often seem self-evident,
surgeons have been reluctant to randomize patients between radically different
modes of therapy (e.g., to a medical vs. surgical treatment for a particular problem;
see Jones 2000). There are also few procedures for which surgeons can perform a
meaningful sham operation, forcing many surgical trials to be done without blinding.
Additionally, variations in practitioner skill can confound trial results: a surgical
RCT is not simply a test of the procedure per se, but a test of the procedure as done
by a specific group of surgeons whose skills and techniques might or might not
reflect those of other surgeons. These challenges have limited the use of RCTs within
surgery. When surgical RCTs have been performed historically, it often has not been
to validate and introduce a new operation, but rather to test an existing operation
which surgeons or nonsurgical physicians have begun to doubt. While RCTs remain
an important part of knowledge production in surgery, surgeons have continued to
rely extensively on other modes of knowledge production, including case series and
registry studies. These problems are not unique to surgery. Challenges have also
emerged for RCTs in other medical fields, such as psychotherapy, in which practi-
tioners may have significant degrees of variation in treatment approaches (Bothwell
et al. 2016; Jones 2018).
The foundations of modern RCTs run deep through centuries of thinkers, physicians,
scientists, and medical reformers committed to accurately measuring the effects of
medical interventions. Clinical trials have taken different forms in different historical
social contexts, growing from isolated, small controlled experiments to massive
multinational trials. The shifting burden of disease and the interests of trial sponsors
have influenced the types of questions investigated in trials – from infectious diseases
in the early twentieth century to chronic diseases, particularly those affecting wealthier
populations, in contemporary society. Shifts in trial funding and regulatory and ethical
policy landscapes have dramatically shaped the historical trajectory of RCTs such that
trial design, study location, ethical safeguards for research subjects, investigator
accountability, and even the likelihood of favorable trial results have all been
influenced by political and economic pressures and contexts. This has not been a
linear story of progress. Advances in trial rigor, ethics, and inclusiveness have occurred
alongside the emergence of new challenges related to the commercialization of
research and pressures to lower regulatory standards for evidence. Many RCTs today
have grown so complex and institutionalized that persistent challenges may seem
ingrained. However, the history of clinical trials offers numerous examples of how
science has been dramatically transformed through the work of individuals committed
to rigorous investigations over other competing interests.
Key Facts
1. The historical foundations of RCTs run deep – across time, different societies, and
different contexts, investigators have endeavored to create controlled experiments
of interventions to improve human health.
2. Social contexts of research – from physical trial settings to funding schemes and
regulatory requirements – have significantly impacted the design and scale of
trials, the types of questions asked, trial ethics, research subject demographics,
and the objectives of trial investigators.
3. The history of RCTs has involved both advances and setbacks: it has not been a
linear story of progress. Recent history has revealed persistent challenges for
RCTs as well as expanding concerns such as commercial interests in trials that
will need to be carefully considered moving forward.
Cross-References
Permission: Segments of this chapter are also published in Bothwell, L., and Podolsky, S.
“Controlled Clinical Trials and Evidence-Based Medicine,” in Oxford Handbook of American
Medical History, ed. J. Schafer, R. Mizelle, and H. Valier. Oxford: Oxford University Press,
forthcoming. With kind permission of Oxford University Press, date TBA. All Rights Reserved.
References
Barter PJ et al (2007) Effects of torcetrapib in patients at high risk for coronary events. N Engl J
Med 357:2109–2122
Bothwell LE (2014) The emergence of the randomized controlled trial: origins to 1980. Disserta-
tion, Columbia University
Bothwell LE, Jones DS (2019) Innovation and tribulation in the history of randomized controlled
trials in surgery. Ann Surg. https://doi.org/10.1097/SLA.0000000000003631
Bothwell LE, Podolsky SH (2016) The emergence of the randomized, controlled trial. N Engl J Med
375:501–504
Bothwell LE, Greene JA, Podolsky SH, Jones DS (2016) Assessing the gold standard – lessons
from the history of RCTs. N Engl J Med 374(22):2175–2181
Bourgeois FT, Murthy S, Mandl KD (2010) Outcome reporting among drug trials registered in
ClinicalTrials.gov. Ann Intern Med 153:158–166
Calabrese RL, Roberts B (2004) Self-interest and scholarly publication: the dilemma of researchers,
reviewers, and editors. Int J Educ Manag 18:335–341
Carpenter D (2010) Reputation and power. Princeton University Press, Princeton
Chalmers I (2005) Statistical theory was not the reason that randomisation was used in the
British Medical Research Council’s clinical trial of streptomycin for pulmonary tuberculo-
sis. In: Jorland G et al (eds) Body counts. McGill-Queen’s University Press, Montreal,
pp 309–334
Chalmers I, Dukan E, Podolsky S, Smith GD (2012) The advent of fair treatment allocation schedules in clinical trials during the 19th and early 20th centuries. J R Soc Med 105(5). See also JLL Bulletin: Commentaries on the history of treatment evaluation. http://www.jameslindlibrary.org/articles/the-advent-of-fair-treatment-allocation-schedules-in-clinical-trials-during-the-19th-and-early-20th-centuries/. Accessed 17 Mar 2019
Chikenkokusaikakenkyukai [Study group on the globalization of clinical trials] (2013) ICH-GCP nabigeita: kokusaiteki shiten kara nihon no chiken wo kangaeru (ICH-GCP navigator: considerations of clinical trials in Japan from an international perspective). Jiho, Tokyo
Cochrane AL (1972) Effectiveness and efficiency: random reflections on the health services.
Nuffield Provincial Hospitals Trust, London
Daly J (2005) Evidence-based medicine and the search for a science of clinical care. University of
California Press, Berkeley
Epstein S (1996) Impure science: AIDS, activism, and the politics of knowledge. University of
California Press, Berkeley
Fiore L, Lavori P (2016) Integrating randomized comparative effectiveness research with patient
care. N Engl J Med 374:2152–2158
Ford I, Norrie J (2016) Pragmatic trials. N Engl J Med 375:454–463
Freedman B (1987) Equipoise and the ethics of clinical research. N Engl J Med 317:141–145
Gabriel JM (2014) The testing of Sanocrysin: science, profit, and innovation in clinical trial design,
1926–1931. J Hist Med Allied Sci 69:604–632
Gaudilliere JP, Lowy I (1998) The invisible industrialist: manufactures and the production of
scientific knowledge. Macmillan, London
Hill AB (1963) Medical ethics and controlled trials. Br Med J 1(5337):1043–1049
Jones DS (2000) Visions of a cure: visualization, clinical trials, and controversies in cardiac
therapeutics, 1968–1998. Isis 91:504–541
Jones DS (2018) Surgery and clinical trials: the history and controversies of surgical evidence. In:
Schlich T (ed) The Palgrave handbook of the history of the surgery. Palgrave Macmillan,
London, pp 479–501
Jones DS, Podolsky SH (2015) The history and fate of the gold standard. Lancet 385(9977):1502–1503
Jones DS, Grady C, Lederer SE (2016) ‘Ethics and clinical research’ – the 50th anniversary of
Beecher’s bombshell. N Engl J Med 374:2393–2398
Kaptchuk TJ (1998) Intentional ignorance: a history of blind assessment and placebo controls in
medicine. Bull Hist Med 72:389–433
Keating P, Cambrosio A (2012) Cancer on trial. University of Chicago Press, Chicago
Kesselheim AS, Avorn J (2017) New ‘21st century cures’ legislation: speed and ease vs science. J
Am Med Assoc 317:581–582
Kuo W-H (2005) Japan and Taiwan in the wake of bio-globalization: drugs, race and standards.
Dissertation, MIT
Kuo W-H (2008) Understanding race at the frontier of pharmaceutical regulation: an analysis of the
racial difference debate at the ICH. J Law Med Ethics 36:498–505
Kuo W-H (2009) The voice on the bridge: Taiwan’s regulatory engagement with global pharma-
ceuticals. East Asian Science, Technology and Society: an International Journal 3:51–72
Kuo W-H (2012) Transforming states in the era of global pharmaceuticals: visioning clinical research in Japan, Taiwan, and Singapore. In: Rajan KS (ed) Lively capital: biotechnologies, ethics, and governance in global markets. Duke University Press, Durham, pp 279–305
Lewontin RC (2008) The socialization of research and the transformation of the academy. In:
Hannaway C (ed) Biomedicine in the twentieth century: practices, policies, and politics. IOS
Press, Amsterdam, pp 19–25
Lilienfeld AM (1982) The Fielding H. Garrison lecture: ceteris paribus: the evolution of the clinical
trial. Bull Hist Med 56:1–18
Marks HM (1997) The progress of experiment: science and therapeutic reform in the United States,
1900–1990. Cambridge University Press, Cambridge
Marks HM (2000) Trust and mistrust in the marketplace: statistics and clinical research, 1945–1960.
Hist Sci 38:343–355
Marks HM (2006) ‘Until the sun of science . . . the true Apollo of medicine has risen’: collective
investigation in Britain and America, 1880–1910. Med Hist 50:147–166
Matheson A (2017) Marketing trials, marketing tricks – how to spot them and how to stop them.
Trials 18:105
Meldrum ML (1998) A calculated risk: the Salk polio vaccine field trials of 1954. Br Med J
317(7167):1233–1236
Milne I (2012) Who was James Lind, and what exactly did he achieve? J R Soc Med 105:503–508. See also JLL Bulletin: Commentaries on the history of treatment evaluation, (2011). http://www.jameslindlibrary.org/articles/who-was-james-lind-and-what-exactly-did-he-achieve/. Accessed 30 Jan 2019
Ostroff SM (2015) ‘Responding to changing regulatory needs with care and due diligence’ –
remarks to the regulatory affairs professional society. United States Food and Drug Adminis-
tration, Baltimore
Petryna AP (2009) When experiments travel: clinical trials and the global search for human
subjects. Princeton University Press, Princeton
Podolsky SH (2006) Pneumonia before antibiotics: therapeutic evolution and evaluation in twentieth-century America. Johns Hopkins University Press, Baltimore
Podolsky SH (2009) Jesse Bullowa, specific treatment for pneumonia, and the development of the controlled clinical trial. J R Soc Med 102:203–207. See also JLL Bulletin: Commentaries on the history of treatment evaluation, (2008). http://www.jameslindlibrary.org/articles/jesse-bullowa-specific-treatment-for-pneumonia-and-the-development-of-the-controlled-clinical-trial/. Accessed 17 Mar 2019
Podolsky SH (2015) The antibiotic era: reform, resistance, and the pursuit of a rational therapeutics.
Johns Hopkins University Press, Baltimore
Podolsky SH, Jones DS, Kaptchuk TJ (2016) From trials to trials: blinding, medicine, and honest
adjudication. In: Robertson CT, Kesselheim AS (eds) Blinding as a solution to bias: strength-
ening biomedical science, forensic science, and law. Academic Press, London, pp 45–58
Porter TM (1996) Trust in numbers: the pursuit of objectivity in science and public life. Princeton
University Press, Princeton
Ross OB (1951) Use of controls in medical research. J Am Med Assoc 145:72–75
Shapiro AK, Shapiro E (1997) The powerful placebo: from ancient priest to modern physician.
Johns Hopkins University Press, Baltimore
Stark L (2011) Behind closed doors: IRBs and the making of ethical research. University of Chicago
Press, Chicago
Tibi S (2006) Al-Razi and Islamic medicine in the 9th century. J R Soc Med 99:206–207. See also JLL Bulletin: Commentaries on the history of treatment evaluation, (2005). http://www.jameslindlibrary.org/articles/al-razi-and-islamic-medicine-in-the-9th-century/. Accessed 17 Mar 2019
Timmermann C (2008) Clinical research in post-war Britain: the role of the Medical Research
Council. In: Hannaway C (ed) Biomedicine in the twentieth century: practices, policies, and
politics. IOS Press, Amsterdam, pp 231–254
Timmermans S, Berg M (2003) The gold standard: the challenges of evidence-based medicine and
standardization in health care. Temple University Press, Philadelphia
Tröhler U (2000) To improve the evidence of medicine: the 18th century British origins of a critical
approach. Royal College of Physicians of Edinburgh, Edinburgh
United States Food and Drug Administration (1963) Proceedings of the FDA conference on the
Kefauver-Harris drug amendments and proposed regulations. United States Department of
Health, Education, and Welfare, Washington, DC
United States Food and Drug Administration, Center for Drug Evaluation and Research, Center for
Biologics Evaluation and Research (2018) Adaptive design clinical trials of drugs and biologics:
guidance for industry (draft guidance). United States Department of Health and Human
Services, Rockville
Warner JH (1986) The therapeutic perspective: medical practice, knowledge, and identity in
America, 1820–1885. Harvard University Press, Cambridge, MA
Warner JH (1991) Ideals of science and their discontents in late nineteenth-century American
medicine. Isis 82:454–478
Evolution of Clinical Trials Science
2
Steven Piantadosi
Contents
Introduction
The Scientific Method
Some Key Evolutionary Developments
Ethics
Governance Models
Computerization
Statistical Advances
A Likely Future
Final Comments
Key Facts
Cross-References
References
Abstract
The art of medicine took two millennia to establish the necessary groundwork for
clinical trials, which embody the scientific method for making fair comparisons of
treatments. This resulted from a synthesis of opposing approaches to the acquisition
of knowledge. Clinical trials were established in their basic form in the last half of
the twentieth century and continue to be augmented by advances in disparate fields
such as research ethics, computerization, research administration and governance,
and statistics.
Keywords
Design · Design evolution
S. Piantadosi (*)
Department of Surgery, Division of Surgical Oncology, Brigham and Women’s Hospital, Harvard
Medical School, Boston, MA, USA
e-mail: [email protected]
Introduction
Clinical trials have been with us for about 80 years, taking the famous sanocrysin
tuberculosis trial as the dawn of the modern era (Amberson et al. 1931). Trials have
evolved in various applications and in response to pressures from regulation, ethics,
economics, technology, and the changing needs of therapeutic development. Clinical
trials are dynamic elements of scientific medicine and have never really been broken,
though nowadays everyone seems to know how to fix them.
Neither the science nor the art of trials is static. Perhaps they will eventually
be replaced by therapeutic inferences based on transactional records from the
point of care, as some people expect. However, most of us who do trials envision
only their relentless application. Evolution of clinical trials manifests in compo-
nents such as organization, technology, statistical methods, medical care, and
science (Table 1). Any of these topics is probably worthy of its own evolutionary
history.
Every trialist would have their own list of the most important developments that
have aided the scope and validity of modern clinical trials. In this discussion I take
for granted and omit three mature experiment design principles covered elsewhere in
this book: control of random variation using replication, bias control using random-
ization and masking, and control of extraneous effects using methods such as
stratification.
The Scientific Method
After Hippocrates, two rival schools of Greek medicine arose, both finding justifi-
cation in his writings and teachings. One was the Dogmatist (later the rationalist)
school, with philosophical perspectives strengthened by the teachings of Plato and
Aristotle. Medical doctrines of Diocles, Praxagoras, and Mnesitheus helped to form
it (Neuburger 1910). The Dogmatist view of diagnosis and therapeutics was based
on pathology and anatomy and sought causes for illness.
The empiric school of medicine arose between 270 and 220 B.C. largely as a
reaction to the rigid components of Dogmatist teachings, with underpinnings
founded in Skeptic philosophers. Empiric medical doctrines followed the teachings
of Philinus of Cos, Serapion of Alexandria, and Glaucias of Nicomedia (Neuburger
1910). Empirics used a “tripod” in their approach to treatment: 1) their own
experience (autopsia), 2) knowledge obtained from the experience of others (his-
tory), and 3) similarity with other conditions (analogy). A fourth leg was later added
to the tripod: 4) inference of previous conditions from present symptoms (epilogism)
(Neuburger 1910; Robinson 1931).
Empirics taught that the physician should reject theory, speculation, abstract
reasoning, and the search for causes. Physiology and pathology of the time were
held in low esteem, and books were written opposing rationalist anatomical doc-
trines. Thus, empirics were guided almost entirely by experience (King 1982;
Kutumbiah 1971). Regarding the search for causes, Celsus (25 B.C.–A.D. 50) stated
clearly the empiricist objections:
Those who are called “empirici” because they have experience, do indeed accept evident
causes as necessary; but they contend that inquiry about obscure causes and natural
actions is superfluous, because nature is not to be comprehended . . . Even in its
beginnings, they add, the art of medicine was not deduced from such questionings, but
from experience. . . . It was afterwards, . . . when the remedies had already been discov-
ered, that men began to discuss the reasons for them: the art of medicine was not a
discovery following upon reasoning, but after the discovery of the remedy, the reason for
it was sought out. (Celsus 1809)
The rationalist position appears centuries later in Avicenna’s Canon of Medicine, here
in Gruner’s translation:
But truly every science has both a speculative and a practical side. So has medicine. . . .
When, in regard to medicine, we say that practice proceeds from theory, we do not mean that
there is one division of medicine by which we know, and another, distinct therefrom, by
which we act. We mean that these two aspects belong together - one deals with the basic
principles of knowledge; the other with the mode of operation of these principles. The
former is theory; the latter is applied knowledge. (Gruner 1930)
Maimonides was similarly unsparing toward empirics who lacked scientific training:
The mere empiricists who do not think scientifically are greatly in error. . . . He who puts his
life in the hands of a physician skilled in his art but lacking scientific training is not unlike the
mariner who puts his trust in good luck, relying on the sea winds which know no science to
steer by. Sometimes they blow in the direction the seafarer wants them to blow, and then his
luck shines upon him; another time they may spell his doom. (Muntner 1963)
Rationalist ideas were adopted more broadly in science and the mariner metaphor
was a popular one. For example, Leonardo da Vinci (1452–1519) defended the value
of theory in scientific thinking by saying:
Those who are enamored of practice without science are like a pilot who goes into a ship
without rudder or compass and never has any certainty where he is going. Practice should
always be based on a sound knowledge of theory. (da Vinci 1510)
Empiricist hostility to theory nonetheless persisted, as one later critic of rationalism
put it:
It is evident that theory is absurd and fallacious, always useless and often in the highest
degree pernicious. The annals of medicine afford the most striking proof, that it hath in all
ages been the bane and disgrace of the healing art.
A defender of rationalism countered:
And by thus treading occasionally in unbeaten tracks [the rationalist] enlarges the boundaries
of science in general and adds new discoveries to the art of medicine. In a word, the
rationalist has every advantage which the empiric can boast, from reading, observation
and practice, accompanied with superior knowledge, understanding, and judgment.
By the twentieth century, the scientific method had embraced rationalism and its
response to new knowledge in the physical and biological sciences. The develop-
ment of basic biological science both contributed to and was supported by the
rationalist tradition. Clinical medicine remained somewhat more empirical, but, as
Flexner argued:
The fact that disease is only in part accurately known does not invalidate the scientific method
in practice. In the twilight region probabilities are substituted for certainties. There the
physician may indeed only surmise, but, most important of all, he knows that he surmises.
His procedure is tentative, observant, heedful, responsive. Meanwhile the logic of the process
has not changed. The scientific physician still keeps his advantage over the empiric. He studies
the actual situation with keener attention; he is freer of prejudiced prepossession; he is more
conscious of liability to error. Whatever the patient may have to endure from a baffling disease,
he is not further handicapped by reckless medication. In the end the scientist alone draws the
line accurately between the known, the partly known, and the unknown. The empiricist fares
forth with an indiscriminate confidence which sharp lines do not disturb. Investigation and
practice are thus one in spirit, method, and object. (Flexner 1910)
and
Modern medicine deals, then, like empiricism, not only with certainties, but also with
probabilities, surmises, theories. It differs from empiricism, however, in actually knowing
at the moment the logical quality of the material which it handles. . . . The empiric and the
scientist both theorize, but logically to very different ends. The theories of the empiric set up
some unverifiable existence back of and independent of facts . . . the scientific theory is in the
facts, summing them up economically and suggesting practical measures by whose outcome
it stands or falls. (Flexner 1910)
This last quote may seem somewhat puzzling because it states that empirics do, in
fact, theorize. However, as the earlier quote from Celsus suggested, the theories of
the empiric are not useful devices for acquiring new knowledge.
There is no sharp demarcation when rationalist and empiricist viewpoints became
cooperative and balanced. The optimal mixture of these philosophies continues to
elude some applications even in clinical trials. R.A. Fisher’s work on experimental
design (Fisher 1925) might be taken as the beginning of the modern synthesis because
it placed statistics on a comparable footing with the maturing biological sciences,
providing for the first time the tools needed for the interoperability of theory and data.
But the modern form of clinical trials would take another 25 years to evolve.
Biological and inferential sciences have synergized and co-evolved since the
middle of the twentieth century. The modern synthesis has yielded great understand-
ing of disease and effective treatments, in parallel with appropriate methods of
evaluation. Modern understanding of disease and treatment are flexible enough to
accommodate such diverse contexts as molecular biology, chronic disease, psycho-
social components of illness, infectious organisms, quality of life, and acupuncture.
Modern inferential science, a.k.a. statistics, is applied universally in science. The
scientific method can reject theory based on empirical data but can also reject data
based on evidence of poor quality, bias, or inaccuracy. An interesting exception to
this implied order is the elaborate justification for homeopathy based on empiricism,
for example, by Coulter (1973, 1975, 1977). It illustrates the ways in which the
residual proponents of purely empirical practices justify them and why biological
theory is a problem for such practices.
Some Key Evolutionary Developments
In the remainder of this chapter, we look beyond the historical trends that converged
to allow scientific medicine to evolve. Clinical trials have both stimulated and
benefitted from key developments in the recent 80 years. These include ethics,
governance models, computerization, multicenter collaborations, and statistical
advances.
Ethics
Key landmarks in the history of ethics behind biomedical research are well known to
clinical trialists because the subject is a required part of their research training.
Historical mistakes have led to greater awareness and to mechanisms for the
protection of research subjects.
With respect to clinical trials specifically, we might take the evolutionary steps in
ethics to be Nuremberg, Belmont, and data and safety monitoring boards (DSMB).
Some might take a more granular view of ethics landmarks, but this short list has
proved to be beneficial to clinical trials. The foundational importance of Nuremberg
and Belmont needs no further elaboration here. DSMBs are important because they
operationalize some of the responsibilities of institutional review boards (IRBs),
which otherwise could be overwhelmed by the work of detailed
interim oversight.
Ethics principles can be in tension with one another. Resolution of those tensions
is part of the evolution of ethics in clinical trials. A good example in recent years has
been the debate over content and wording of the Helsinki Declaration regarding the
proper use of placebo control groups (Skierka and Michels 2018). Another evolu-
tionary step might be visible in the growing use of “central” IRBs. They are strongly
motivated by efficiency but essentially discard the “institutional” local spirit of IRBs
as originally chartered.
The modern platform for clinical trials would not exist without public trust
founded on ethics principles and review, and the risk-benefit protections these afford
participants. Ethics is therefore as essential as the underlying biomedical knowledge
and clinical trials science.
Governance Models
Multicenter clinical trial collaborations are common today. Their complexity has
helped improve the management of all trials. They were not as feasible in the early
history of clinical trials because they depend on technologies like data systems, rapid
communications, and travel that have improved greatly in the last 80 years. Aside
from technologies, administrative improvements have also made them feasible. The
multicenter model of trial management is not monolithic. Some such collaborations
are relatively stable such as the NCI National Clinical Trials Network (NCTN) which
has existed in similar form for decades, even following its “reorganization” in the
last decade. Other collaborations are constituted de novo with each major research
question. That model is used often by the NHLBI and many commercial entities.
Infrastructure costs are high, and a multicenter collaboration must have an
extensive portfolio to make the ongoing investment worthwhile. This has been
true of cancer clinical trials for many years. Multicenter collaborations overcome
the main shortcoming of single-institution trials which is low accrual. They add
broad investigator expertise at the same time. The increased costs associated with
them are in governance, infrastructure, and management.
The governance model for multicenter projects relies on committees rather than
individuals, aside from a Principal Investigator (PI) or Co-PIs. For example, there
may be Executive (small) and Steering (larger) Committees. Other efforts scale
similarly such as data management, pathology or other laboratory review, publica-
tion, and biostatistics. See Meinert (2013) for a concise listing of committee respon-
sibilities. Multicenter collaborations seem to function well even when the various
components are geographically separate largely due to advances in computerization.
Computerization
Several waves of computer technology have washed over clinical research in the last
50 years. The first might be described as the mainframe era, during which data
systems and powerful, accessible statistical analysis methods emerged, both of
which yielded great benefit to clinical trials. The idea that much could be measured
and stored in databases suggested to some in the 1960s and 1970s that designed
experiments might be replaced by recorded experience. In fact, data system tech-
nology probably did more to advance clinical trials than was appreciated at the time.
Even so, the seemingly huge volume and speed of data storage in the mainframe era
was trivial by today’s standards.
Microcomputers and their associated software comprised the next wave. They
created decentralized computing, put unprecedented power in individual hands, and
led to the Internet. These technologies allowed breakthroughs in data systems and
communication which also greatly facilitated clinical trials. In this period, many
commercial sponsors realized the expense of maintaining clinical trial support
infrastructure in-house. Those support services could be outsourced more econom-
ically to specialist contract research organizations (CROs). This model appears to
accomplish the dual aims of maintaining skilled support but paying only for what is
needed at the appropriate time.
A third era is occurring presently and might be described as big data or big
computation. Increasing speed, storage, computing power, and miniaturization
parallel a therapeutic focus on the individual patient. Rapid data capture and
transfer of images, video, and genomic data is the rule. Miniaturization is leading
to wearable or even ingestible sensors. A major problem now is how to store,
summarize, and analyze the huge amount of data available. These developments
are changing the course of clinical trials again. Designs can be flexible and their
performance tested by simulation. Outcomes can be measured directly rather than
reported or inferred. Trials can incorporate individual markers before, during, and
after treatment.
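To make the claim about simulation concrete, here is a minimal sketch (in Python, with arbitrary illustrative parameters, not a prescription from this chapter) of estimating the power of a hypothetical two-arm design by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical design parameters -- illustrative assumptions only.
n_per_arm = 64   # planned enrollment per treatment group
effect = 0.5     # assumed true mean difference between groups
sd = 1.0         # assumed common standard deviation
z_crit = 1.96    # two-sided critical value at alpha = 0.05

# Simulate many replicates of the trial and count how often a
# two-sample z-test detects the assumed effect.
n_sim = 20_000
hits = 0
for _ in range(n_sim):
    a = rng.normal(effect, sd, n_per_arm)
    b = rng.normal(0.0, sd, n_per_arm)
    se = np.sqrt(a.var(ddof=1) / n_per_arm + b.var(ddof=1) / n_per_arm)
    if abs((a.mean() - b.mean()) / se) > z_crit:
        hits += 1

print(f"estimated power: {hits / n_sim:.2f}")  # roughly 0.80 for these inputs
```

The same simulate-and-count pattern extends to far more elaborate designs for which no closed-form power formula exists.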
Some are expecting a fourth wave of computerization, which might be called true
interoperability of data systems. Lack of interoperability by design can be seen in the
ubiquitous need for human curation of data sources to meet research needs. As good
as they can be, case report forms (CRFs) for clinical trials illustrate the problem.
Unstructured data sources must be curated in CRFs to render them computable. Even
when data sources are in electronic form, most are not interoperable. This potential
wave of computerization will be described below in a brief discussion of the future.
Statistical Advances
Statistics, like all other fields of science, has made great progress in the last 80 years in
both theory and application. Statistics is not a collection of tricks and techniques but
is the science of making reliable inferences in the presence of uncertainty. It is our
only tool to do so. For that reason, it has found application in every discipline. One
might reasonably claim today that there is no science without the integration of
statistical methods. The history of statistics has been fleshed out, for example, by
Stigler (1980, 1986, 1999), Porter (1986), and Marks (1997).
Statistics is not universally viewed as a branch of mathematics, but probability,
upon which statistics is based, is. However, statistics uses the same methods of
deductive reasoning and proof as mathematics. Statistics and mathematics live
together in a single academic department in many universities, for example.
Clinical trials and experimental design broadly have stimulated applied statistics
and are a major application area. Especially in the domain of data analyses, trials
have derived enormous benefit from advances in both statistical theory and methods.
The modern wave of “data scientists” may not realize that fundamental tools such as
censored data analysis and related nonparametric tests, proportional hazards and
similar nonlinear regression methods, pharmacokinetic modeling, bootstrapping,
missing data methods, feasible Bayesian methods, fully sequential and group
sequential methods, meta-analysis, and dozens of other major advances have come
during the era of the modern clinical trial.
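As one small illustration of the methods named above, a nonparametric bootstrap confidence interval for a treatment difference can be sketched in a few lines; the data here are simulated purely for illustration and stand in for real trial outcomes:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Simulated arm-level outcomes -- illustration only, not real trial data.
treated = rng.normal(1.0, 2.0, 120)
control = rng.normal(0.0, 2.0, 120)
observed_diff = treated.mean() - control.mean()

# Bootstrap: resample each arm with replacement and recompute the
# mean difference many times.
B = 5000
boot_diffs = np.empty(B)
for b in range(B):
    t = rng.choice(treated, size=treated.size, replace=True)
    c = rng.choice(control, size=control.size, replace=True)
    boot_diffs[b] = t.mean() - c.mean()

# Percentile 95% confidence interval for the treatment difference.
low, high = np.percentile(boot_diffs, [2.5, 97.5])
print(f"difference = {observed_diff:.2f}, 95% CI ({low:.2f}, {high:.2f})")
```

Tools like this became routine only with the software and hardware advances the chapter turns to next.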
Aside from the solid theoretical foundations for these and other methods, com-
puter software and hardware advances have further supported their universal imple-
mentation. Procedure-oriented languages evolved in this interval and have moved
from mainframes to personal computers while making the newest statistical methods
routine components of the language. Importantly, all those languages have integrated
methods for data transformation in addition to methods for data analysis. Clinical
trials could not have their present form without these technological advances. For
example, consider a single interim analysis on a large randomized trial as reviewed
by the DSMB. The data summaries of treatment effects, multiple outcomes, and
formal statistical analyses could not take place in the narrow time windows required
without the power of the methods and technologies listed.
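As a hedged sketch of the group sequential idea invoked here, the following Monte Carlo check assumes O'Brien-Fleming-type boundaries for two equally spaced looks at two-sided alpha = 0.05 (the values 2.797 and 1.977 are commonly tabulated; a real monitoring plan would derive its boundaries with dedicated group sequential software):

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Assumed O'Brien-Fleming-type boundaries for two equally spaced
# analyses at overall two-sided alpha = 0.05.
interim_bound, final_bound = 2.797, 1.977

# Under the null, the interim z-statistic is standard normal; with half
# the information accrued at the interim look, the final z-statistic has
# correlation sqrt(1/2) with it.
n_trials = 500_000
z1 = rng.standard_normal(n_trials)
z2 = np.sqrt(0.5) * z1 + np.sqrt(0.5) * rng.standard_normal(n_trials)

reject = (np.abs(z1) >= interim_bound) | (np.abs(z2) >= final_bound)
print(f"empirical overall type I error: {reject.mean():.4f}")  # close to 0.05
```

The point of such boundaries is exactly what the text describes: the DSMB can look early without inflating the overall error rate.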
A byproduct of both computerization and analysis methods is data integration.
This means the ability to assemble disparate sources of data, connect them to
facilitate analysis, and show the results of those analyses immediately. Although
impressive, clinical trials, like many other research applications, have minimal need
for up-to-the-minute data analysis. Snapshots that are days or weeks old are typically
acceptable for research purposes. Live surveys of trial data could be useful in review
of data quality or side effects of treatment.
A Likely Future
This collective work is intended to assist the science of clinical trials by defining its
scope and content. Much progress necessary in the field remains beyond what
scholars can write about. For example, clinical trials still do not have a universal
academic home. One must be found. The biostatistics elements of the field have been
found in Public Health for 100 years. Biostatisticians have contributed to Public
Health so well that most academic institutions function as though that discipline
should be present only in Public Health. Often the view is reciprocated by bio-
statisticians inside Public Health who think they must own the discipline every-
where. The result is that there is little biostatistics formally in therapeutic
environments despite the huge need there.
Aside from this, biostatistics is not the only component of clinical trials. Training,
management, infrastructure, and various clinical and allied sciences are also essen-
tial. Where does trial science belong in the academic setting? The answer is not
perfectly clear even after 80 years of clinical trials. Options might include depart-
ments with various names such as clinical investigation, quantitative science, or
medical statistics. In any case, clinical trial science cannot be kept at a distance
organizationally if its collaborations are expected to function effectively.
The immediate future of medicine emphasizes economic value. There can be no
conversation regarding value unless we also understand efficacy, which is the
domain of clinical trials. Organizations that either conduct or consume efficacy
evidence to provide high value medical care need internal expertise in clinical trials.
Without it, they cannot participate actively in understanding value. It seems likely
that this need will be as important in the future as any particular clinical or basic
science.
Despite historical progress in computerization, we lack true interoperability of
medical records. Electronic health records (EHR) seem to be optimistically misun-
derstood by the public and politicians alike, who view them as solutions to many
problems that realistically have not been solved yet. EHRs are essentially electronic
paper created to assist billing, and they have only parchment-like interoperability.
EHRs, like paper, can be sent and read by new caregivers, but this constitutes the
lowest form of interoperability. Much of the data contained is unstructured and
useless for research without extensive and expensive human curation – hence the
need for CRFs. We also know that skilled humans curate EHRs imperfectly, so we
can’t expect augmented intelligence or natural language processing to fix this
problem until EHRs improve. Lack of data model standardization is a major hurdle
on that evolutionary path.
Despite the current research-crushing limitations of EHRs, many people have
begun to talk about “real-world” data, evidence, or trials which are derived from
those sources. This faddish term has a foothold but is bad for several reasons. It
implies that 80 years of clinical trials have somehow not reflected practical findings
or benefits, which is contrary to the evidence. It also implies that use of EHRs that
contain happenstance data, i.e., not designed for purpose, will suffice for therapeutic
inferences. This has not been true in the past and remains unproven for current
ambitions. In any case, a less catchy but more accurate term is “point of care,”
indicating data recorded in source documents when and where care is delivered
compared to secondary or derived documents like CRFs.
When EHRs evolve to hold essential structured data, data models become
standardized, and point of care data become available in adequate volume and
quality to address research queries across entire healthcare systems, some questions
badly addressed by current clinical trials can be asked. These include what happens
in subgroups of the population not represented well in traditional clinical trials, how
frequent are rare events not seen in relatively small experimental cohorts, and what
outcomes are most likely in lengthy longitudinal disease and complex treatment
histories that require multiple changes in therapy. We cannot yet know if we will get
accurate answers to these and similar questions using point of care data simply because
the inferences will be based on uncontrolled or weakly controlled comparisons.
Final Comments
Clinical trials are quintessential science. They hardwire the scientific method in their
design and execution and demand cooperation between theory and data. Trials
evolved as science inside science only after rationalism and empiricism began to
cooperate and following necessary advances in both statistical theory and biological
understanding of disease.
A populist view of science and medicine is that investigative directions are
substantially determined by whoever is doing the research. This makes it subject
to personal or cultural biases and justifies guidance by political process. While some
science is investigator initiated (which does not escape either peer review, sponsor
oversight, or accountability), the truth is that science and medicine are obviously
guided mostly by economic concerns. Government sponsorship places funding in
topic areas that are pursued by scientists. Much research is sponsored by commercial
pharmaceutical and device companies, themselves guided nowadays almost solely by
market considerations.
Medicine is destined to escape empiricism little by little, and it will escape in the same way
as all the other sciences, by the experimental method. (Bernard 1865)
Key Facts
Cross-References
▶ Leveraging “Big Data” for the Design and Execution of Clinical Trials
▶ Multicenter and Network Trials
▶ Responsibilities and Management of the Clinical Coordinating Center
▶ Social and Scientific History of Randomized Controlled Trials
References
Amberson JB, McMahon BT, Pinner M (1931) A clinical trial of sanocrysin in pulmonary
tuberculosis. Am Rev Tuberc 24:401–435
Anon. Editorial (1965) Thomas Percival (1740–1804) codifier of medical ethics. JAMA 194(12):1319–1320
Bernard C (1865) Introduction a l’Etude de la Medicine Experimentale. J. B. Bailliere et Fils, Paris
Terminology: Conventions and Recommendations
3
Curtis L. Meinert
Contents
Introduction
Clinical Trial
Trial Versus Study
Pilot Study Versus Feasibility Study
Name of Trial
Name of the Experimental Variable: Treatment Versus Intervention
Name for Groups Represented by Experimental Variable: Study Group, Treatment Group, or Arm
Persons Studied: Subject, Patient, or Participant
Trial Protocol Versus Manual of Operations
Blocking Versus Stratification and Quotafication
Open
Controlled
Placebo
Consent
Randomization Versus Randomized
Registration Versus Enrollment
Single Center Trial Versus Multicenter Trial
Multicenter Versus Cooperative Versus Collaborative
Principal Investigator (PI) Versus Study Chair
Clinical Investigator Versus Investigator
Steering Committee Versus Executive Committee
Data Monitoring Versus Data Monitoring Committee
Random Versus Haphazard
Primary Versus Secondary Outcomes
Outcome Versus Endpoint
Treatment Failure Versus Treatment Cessation
C. L. Meinert (*)
Department of Epidemiology, School of Public Health, Johns Hopkins University, Baltimore,
MD, USA
e-mail: [email protected]
Abstract
There are dozens of types of trials with specialized vocabularies, but the feature
common to all is that they are comparative and focused on differences. If the trial
involves just one treatment group, then the focus is on change from enrollment. If
the trial is controlled, then the focus is on differences between the treatment
groups in outcomes during and at the end of the trial.
Keywords
Language · Usage conventions · Clinical trials · Randomized trials
Introduction
Clinical Trial
The term “clinical trial” can mean any of the following:
1. The first use of a treatment in human beings.
2. An uncontrolled trial involving treatment of people followed over time.
3. An experiment involving persons for the purpose of assessing the safety and/or efficacy of a treatment, especially such an experiment involving a clinical event as an outcome measure, done in a clinical setting, and involving persons having a specific disease or health condition.
4. An experiment involving the administration of different study treatments in a parallel treatment design to a defined set of study subjects done to evaluate the efficacy and safety of a treatment in ameliorating or curing a disease or health condition; any such trial, including those involving healthy persons, undertaken to assess safety and/or efficacy of a treatment or health care procedure (Meinert 2012).
The term is also a publication type in the National Library of Medicine indexing system, defined as: “Pre-planned clinical study of the safety, efficacy, or optimum dosage schedule of one or more diagnostic, therapeutic, or prophylactic drugs, devices, or techniques in humans selected according to predetermined criteria of eligibility and observed for predefined evidence of favorable and unfavorable effects” (National Library of Medicine 1998).
“Clinical” as an adjective means related to the sickbed or to care given in a clinic.
The use of the term should be limited to trials involving people with medical
conditions, and even then it usually can be dropped except where necessary to avoid
confusion with other kinds of trials, like in vitro trials or trials involving animals.
“Trial,” when the undertaking is done to test or assess treatments, should be used
rather than the less informative term “study.” Study can mean all kinds of things.
Trial conveys the essence of what is being done.
To qualify as a trial there should be a plan – protocol. Trials may be referred to as
studies and studies as trials. For example, Ambroise Paré’s (Packard 1921) experi-
ence on the battlefield in 1537 in regard to use of a digestive medicament for
treatment of gunshot victims has been referred to as a clinical trial, but that is a
misuse of the term because Paré resorted to the medicament when his supply of
boiling oil ran out. No protocol.
Ironically, ClinicalTrials.gov (aka CT.gov), a registration site created specifically
for registration of trials, does not use the term, opting instead for “interventional study.”
Name of Trial
The most important words in any publication of results from a trial are the few
represented in the title of the manuscript. If it is your trial the words are your choice.
Choose wisely. The name you choose will be used hundreds of times. Steer clear of
names with special characters or symbols.
Avoid use of unnecessary or redundant terms like “controlled” in “randomized
controlled trial”; “randomized” is sufficient to convey “control.”
Include the term “trial.” Avoid using surrogate terms instead of “trial,” like
“study” or “project.”
Include currency terms like “randomized” and “masked” when appropriate.
Include terms to indicate the disease or condition being treated and the treatment
being used, for example, as in Alzheimer’s Disease Anti-inflammatory Prevention
Trial (ADAPT Research Group 2009).
If you are looking for publications of results from trials, do not expect to find
them by screening titles. A sizeable fraction of results publications do not have
“trial” in the title.
Name of the Experimental Variable: Treatment Versus Intervention
The most important variable in trials is the regimen or course of procedures applied
to persons to produce an effect. If you have to choose one name, what will it be?
Treatment or intervention?
“Treatment” as a noun (Merriam-Webster online dictionary) means: 1a: the act or
manner or an instance of treating someone or something; b: the techniques or actions
customarily applied in a specified situation; 2a: a substance or technique used in
treating; b: an experimental condition.
“Intervene” as a verb (Merriam-Webster online dictionary) means: 1: to occur,
fall, or come between points of time or events; 2: to enter or appear as an irrelevant or
extraneous feature or circumstance; 3a: to come in or between by way of hindrance
or modification; b: to interfere with the outcome or course especially of a condition
or process (as to prevent harm or improve functioning); 4: to occur or lie between
two things; 5a: to become a third party to a legal proceeding begun by others for the
protection of an alleged interest; b: to interfere usually by force or threat of force in
another nation’s internal affairs especially to compel or prevent an action.
There is no perfect choice, but “treatment” comes closer to what one wants to
communicate than “intervention.”
Name for Groups Represented by Experimental Variable: Study Group, Treatment Group, or Arm
Any of the three works and is used; “study group” or “treatment group” is preferred,
though “arm” is often the term of choice in cancer trials.
Persons Studied: Subject, Patient, or Participant
A frequently used label for persons studied is “research subject,” “study subject,”
or simply “subject.” The advantage of the labels lies in their generic nature, but the
characterization lacks “warmth” as conveyed in a usage note for “subject” as taken
from Meinert: The primary difficulty with the term for persons being studied in the
setting of trials has to do with the implication that the persons are research
objects. The term carries the connotation of subjugation and, thus, is at odds
with the voluntary nature of the participation and requirements of consent. In
addition, it carries the connotation of use without benefit; a misleading connota-
tion in many trials and, assuredly, in treatment trials. Even if such a connotation is
correct, the term suggests a passive relationship with study investigators when, in
fact, the relationship is more akin to a partnership involving active cooperation.
Avoid by using more humanistic terms, such as person, patient, or participant
(Meinert 2012).
Patient versus subject? The terms imply different relationships and ethics under-
lying interactions. “Patient” implies a therapeutic doctor-patient relationship. “Sub-
ject” is devoid of that connotation.
Limit “patient” or “study patient” to settings involving persons with an illness or
disease and a doctor-patient relationship. Avoid in settings involving well people or
when there is a need to avoid connotations of illness or of medical care by using a
medically neutral term, such as study participant.
Blocking Versus Stratification and Quotafication
into strata after enrollment for a subgroup analysis. Avoid confusion when both
forms of stratification are used in a trial by referring to this form of stratification as
post-stratification.
The purpose of stratification is to ensure that treatment assignments are balanced
across strata. To be useful, the stratification variable has to be related to the outcome
of interest. If it is not, there is no statistical gain from stratification.
Stratification and blocking treatment assignments serve different purposes.
Blocking is done to ensure that the assignment ratio for the trial is satisfied at points
in time over the course of enrollment; stratification is done to ensure the compara-
bility of the treatment groups with regard to the stratification variable(s).
Likewise stratification and quotafication are different. Stratification merely
ensures the mix of people with regard to the stratification variable is the same across
treatment groups. The trialist may carry out treatment comparisons by the stratifica-
tion variable but is not under any obligation to do so.
quotafication n – The act or process of imposing a quota requirement on the mix
of persons enrolled in a trial. Not to be confused with stratification. The purpose of
stratification is to ensure that the different treatment groups in a trial have the same
proportionate mix of people with regard to the stratification variable(s).
Quotafication is done to ensure that the study population has a specified mix with
regard to the variables used for quotafication.
For example, quotafication for gender would involve enrolling a specified number
of males and females and randomizing by gender, that is, with gender also as a
stratification variable. The mix of persons enrolled in a trial is determined by the mix
of persons seen and ultimately judged eligible for enrollment. Hence, the numbers
ultimately represented in the various strata will be variables having values known
only after completion of enrollment. The imposition of a sample size requirement for
one or more of the strata by imposition of quota requirements will extend the time
required for recruitment and should not be imposed unless there are valid scientific
or practical reasons for doing so.
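As a minimal sketch of the distinction drawn above, the following generates permuted blocks of size four separately within each stratum, so the assignment ratio is satisfied at the end of every block and each stratum gets its own schedule; the stratum labels, block size, and arm names are illustrative assumptions, not recommendations:

```python
import random

def permuted_block_assignments(n, block_size=4, arms=("A", "B"), seed=None):
    """Generate treatment assignments in randomly permuted blocks so the
    assignment ratio holds at the end of every block."""
    rng = random.Random(seed)
    per_arm = block_size // len(arms)
    assignments = []
    while len(assignments) < n:
        block = list(arms) * per_arm  # e.g., ["A", "B", "A", "B"]
        rng.shuffle(block)            # randomly permute within the block
        assignments.extend(block)
    return assignments[:n]

# Stratification: an independent blocked schedule in each stratum keeps
# the treatment groups comparable on the stratification variable(s).
strata = ["clinic-1/female", "clinic-1/male", "clinic-2/female", "clinic-2/male"]
schedules = {s: permuted_block_assignments(20, seed=i) for i, s in enumerate(strata)}

for stratum, schedule in schedules.items():
    print(stratum, "".join(schedule),
          "A:", schedule.count("A"), "B:", schedule.count("B"))
```

Imposing a quota, by contrast, would cap or fix the number of persons enrolled per stratum rather than merely balancing assignments within it.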
Open
Open has various meanings in the context of trials as seen below, including one
being a euphemism for unrandomized trials.
open trial n – 1. A trial in which the treating physician, some other person in a
clinic, or the study participant selects the treatment to be administered. 2. A trial in
which treatment assignments are known in advance to clinic personnel or patients,
e.g., schemes where the schedule of assignments is posted in the clinic or as in
systematic schemes, such as odd-even methods of treatment assignment, where the
scheme is known. 3. A trial in which treatments are not masked; nonmasked trial. 4.
A trial still enrolling. 5. A trial involving an open sequential design. Usage note:
Avoid by use of appropriate descriptors to make meaning clear. Use nonmasked in
the sense of defn 3. If used in the sense of defns 4 or 5 make certain the term is not
taken to denote conditions described in defns 1, 2, or 3.
open label adj – [trials] Of or relating to a trial in which study treatments are
administered in unmasked fashion. Usage note: Avoid; use unmasked.
Controlled
Placebo
placebo n – [ME, fr L, I shall please, fr placēre to please; the first word of the first
antiphon of the service for the dead, I shall please the Lord in the land of the living, fr
Roman Catholic vespers] 1. A pharmacologically inactive substance given as a
substitute for an active substance, especially when the person taking or receiving it
is not informed whether it is an active or inactive substance. 2. Placebo treatment. 3.
A sugar-coated pill made of lactose or some other pharmacologically inert substance.
4. Any medication considered to be useless, especially one administered in pill form.
5. Nil treatment. 6. An ineffective treatment. Usage note: Subject to varying use.
Avoid in the sense of defns 4, 5, and 6; not to be used interchangeably with sham.
The use of a placebo should not be construed to imply the absence of treatment.
Virtually all trials involve care, and investigators conducting them are obligated to
meet standards of care, regardless of treatment assignment and whether masked or
not. As a result, a control treatment involving use of a placebo is best thought of as a
care regimen with placebo substituting for one element of the care regimen. Labels
such as “placebo patient” or “placebo group” create the impression that patients
assigned to receive placebos are left untreated. The labels (in addition to being
wrong in the literal sense of usage) are misleading when placebo treatment is given in addition to other treatments, as is usually the case.
placebo control n – 1. Placebo-control treatment. 2. A treatment involving the use
of a placebo.
placebo effect n – 1. The effect produced by a placebo; assessed or measured
against the effect expected or observed in the absence of any treatment. 2. The effect
produced by an inactive control treatment. 3. The effect produced by a control
treatment considered to be nil. 4. An effect attributable to a placebo. rt: sham effect
Usage note: Limit usage to settings involving the actual use of a placebo. Avoid in
the sense of defns 2 and 3 when the control treatment does not involve a placebo.
placebo group n – 1. Placebo-assigned group. 2. Placebo-treated group. 3. A
group not receiving any treatment (avoid).
Consent
Usually the modifier “informed” is more an expression of hope than of fact. Its use is
best reserved for settings in which there are steps built into the consent process to
ensure an informed decision based on evidence of comprehension of what is
involved, or for settings in which the decision can be demonstrated to have been
informed; otherwise use consent.
In trials, enrollment is that point when treatment assignment is revealed to clinic personnel. Not to be confused with registration.
A trial is single center if all activities involved in conducting the trial are housed
within the same institution. A trial is multicenter if it has two or more enrollment
sites.
single-center trial n – 1. A trial performed at or from a single site: (a) Such a trial,
even if performed in association with a coalition of clinics in which each clinic
performs its own trial, but in which all trials focus on the same disease or condition
(e.g., such a coalition formed to provide preliminary information on a series of
different approaches to the treatment of hypertension by control or reduction); (b) A
trial not having any clinical centers and a single resource center, e.g., the Physicians’
Health Study (Hennekens and Eberlein 1985; Physicians' Health Study Research Group Steering Committee 1988). 2. A trial involving a single clinic; with or without
satellite clinics or resource centers. 3. A trial involving a single clinic and a center to
receive and process data. 4. A trial involving a single clinic and one or more resource
centers.
multicenter trial n – 1. A trial involving two or more clinical centers, a common
study protocol, and a data center, data coordinating center, or coordinating center to
receive, process, and analyze study data. 2. A trial involving at least one clinical
center or data collection site and one or more resource centers. 3. A trial involving
two or more clinics or data collection sites.
The usual line of demarcation between single and multicenter is determined by
whether or not there is more than one treatment or data collection site. Hence, a trial
having multiple centers may still be classified as a single-center trial if it has only one
treatment or data collection site.
for example, often as in some multicenter trials. In general, the term is best avoided in favor of "study chair."
Investigator is the generic name applied to anyone in a research setting who has a
key role in conducting the research or some aspect of the research.
In the context of trials, clinical investigator refers to persons with responsibilities for enrolling and caring for persons enrolled in the trial. Avoid as a designation when it is used to the exclusion of others having investigator status, for example, in settings also involving nonclinical investigators, such as those in data coordinating centers.
be hard put to know if the outcome focused on in the analysis is "primary" as represented in the definitions above.
blind, blinded adj – Being unaware or not informed of treatment assignment; being
unaware or not informed of course of treatment.
mask, masked adj – Of, relating to, or being a procedure in which persons (e.g.,
patients, treaters, or readers in a trial) are not informed of certain items of informa-
tion, e.g., the treatment represented by a treatment assignment in a clinical trial.
Preferred to blind.
mask n – A condition imposed on an individual (or group of individuals) for the
purpose of keeping that individual (or group of individuals) from knowing or
learning of some condition, fact, or observation, such as treatment assignment, as
in single-masked or double-masked trials.
The term “blind,” as an adjective descriptor in relation to treatment administra-
tion, is more widely used in trials than its counterpart descriptor of “mask” or
“masked.” The shortcoming of “blind” as a descriptor in relation to treatment administration has to do with unfortunate connotations (e.g., as in “blind stupidity”)
and the fact that the characterization can be confusing to study participants (e.g., in
vision trials where loss of vision or blindness is an outcome measure). For these
reasons, mask is preferred to blind.
Lost to Followup
Dropout
Most long-term trials will have provisions for reinstating persons classified as dropouts
if and when they return to a study clinic for required data collection. Avoid in the
sense of defn 3 in relation to a single visit or contact in the absence of other
reasons for regarding someone as a dropout. Use other language, such as missed
visit or missed procedure, to avoid the connotation of dropout. The term should
not be confused with lost to followup, noncompliant, withdrawal, or endpoint. A
dropout need not be lost to followup if one can determine outcome without
seeing or contacting the person (as in some forms of followup for survival) but
will be lost to followup if the outcome measure depends on data collected from
examinations of the person. Similarly, the act of dropping out need not affect
treatment compliance. A person will become noncompliant upon dropping out in
settings where doing so results in discontinuation of an active treatment process.
However, there may be no effect on treatment compliance in settings where the
assigned treatment is administered only once on enrollment and where that
treatment is not routinely available outside the trial. Similarly, the term should not be confused with or used as a synonym for withdrawal, since the meaning of withdrawal is different from that of dropout.
Withdrawals
The design variable is the variable used for determining sample size in planning a
trial.
It is usually the same as the primary outcome measure but not always. For
example, the design variable could be difference in blood pressure after a specified
period of treatment but the outcome of primary interest could be cardiovascular
deaths.
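To make the design variable's role concrete, here is a minimal sketch of the usual normal-approximation sample size calculation for a difference in means; the effect size and standard deviation are hypothetical, and real planning would involve more than this.

```python
from scipy.stats import norm

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Two-sample sample size for a difference in means
    (normal approximation, equal allocation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * (sd * (z_alpha + z_beta) / delta) ** 2

# Design variable: difference in systolic blood pressure, e.g., detect
# a 5 mmHg difference assuming a 12 mmHg standard deviation.
print(n_per_group(delta=5, sd=12))   # roughly 90 per group
```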
Followup for persons enrolled in a trial may end at the same time regardless of when
they were enrolled (common closing date design) or may end on a per person basis
after a specified period of time after enrollment (anniversary closing date design).
End of trial is when all enrollment and data collection activities cease.
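The two closing-date designs differ only in how each person's end of followup is computed; a small sketch with hypothetical dates:

```python
from datetime import date, timedelta

COMMON_CLOSE = date(2025, 12, 31)    # common closing date design
FOLLOWUP_YEARS = 3                   # anniversary closing date design

def end_of_followup(enrolled_on, design):
    if design == "common":
        return COMMON_CLOSE          # same date for everyone
    if design == "anniversary":      # fixed period per person
        return enrolled_on + timedelta(days=365 * FOLLOWUP_YEARS)
    raise ValueError(design)

print(end_of_followup(date(2023, 6, 1), "common"))       # 2025-12-31
print(end_of_followup(date(2023, 6, 1), "anniversary"))  # 2026-05-31
```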
Bias
an estimate of a statistic from its true value. Usage note: Distinguish between uses in
which bias (defns 1 or 2) is being proposed in a speculative sense as opposed to an
actual instance of bias. Usages in the latter sense should be supported with evidence
or arguments to substantiate the claim. Usages in the former sense should be
preceded or followed by appropriate modifiers or statements to make clear that the
user is speculating. Similarly, since most undifferentiated uses (in the sense of defns
1 or 2) are in the speculative sense, prudent readers will treat all uses as being in that
sense, unless accompanied by data, evidence, or arguments to establish bias as a fact.
Not to be confused with systematic error. Systematic error can be removed from
finished data; bias is more elusive and not easily quantified.
selection bias n – 1. A systematic inclination or tendency for elements or units
selected for study (persons in trials) to differ from those not selected. 2. Treatment-related selection bias. Usage note: The bias defined by defn 1 is unavoidable in most
trials because of selective factors introduced as a result of eligibility requirements for
enrollment and because of the fact that individuals may decline enrollment. The
existence of the bias does not affect the validity of treatment comparisons so long as
it is the same across treatment groups, for example, as when treatment assignments
are randomized.
treatment-related selection bias n – Broadly, bias related to treatment assign-
ment introduced during the selection and enrollment of persons into a trial; often due
to knowing treatment assignments in advance of issue and using that information in
the selection process. The risk of the bias is greatest in unmasked trials involving
systematic assignment schemes (e.g., one in which assignments are based on order
or day of arrival of persons at a clinic). It is nil in trials involving simple
(unrestricted) randomization, but can arise in relation to blocked randomization if
the blocking scheme is known or deduced. For example, one would be able to
correctly predict one-half of the assignments before use in an unmasked trial of two
study treatments arranged in blocks of size two, if the blocking is known or deduced.
The chance of the bias operating, even if the blocking scheme is simple, is minimal
in double-masked trials.
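The blocks-of-size-two claim is easy to verify by simulation. In the sketch below (a hypothetical unmasked two-arm trial with the blocking known), the second assignment in every block is the complement of the first, so one-half of all assignments are predictable.

```python
import random

def blocked_sequence(n_blocks):
    seq = []
    for _ in range(n_blocks):
        block = ["A", "B"]           # block of size two, one of each
        random.shuffle(block)
        seq.extend(block)
    return seq

seq = blocked_sequence(10_000)
# Every second-in-block assignment is forced once the first is seen.
predictable = sum(seq[i] != seq[i - 1] for i in range(1, len(seq), 2))
print(predictable / len(seq))        # 0.5: one-half of all assignments
```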
A nominal stop occurs when all treatment and data collection procedures in the trial cease or are stopped, usually because the trial has been completed as planned. Nominal stops can also be the result of loss of funding or of orders to stop from the funding agency or a regulatory agency.
Early stops may pertain to all persons enrolled in the trial as in nominal stops or
only to a subset. The stop may be due to a clinical hold issued by a regulatory agency
or may be due to evidence that a treatment is harmful or ineffective. Typically, early
stops are the results of actions taken by investigators based on interim looks at
accumulating data over the course of the trial.
Summary
References
ADAPT Research Group (2009) Alzheimer’s disease anti-inflammatory prevention trial: design,
methods, and baseline results. Alzheimers Dement 5:93–104
Day S (1999) Dictionary for clinical trials. Wiley, Chichester, 217pp
Hennekens CH, Eberlein K (1985) For the Physicians' Health Study Research Group: a randomized trial of aspirin and β-carotene among U.S. physicians. Prev Med 14:165–168
Meinert CL (2012) Clinical trials dictionary: terminology and usage recommendations, 2nd edn.
Wiley, Hoboken
National Library of Medicine (1998) Medical subject headings – annotated alphabetic list: 1998.
National Library of Medicine, Bethesda
Packard FR (1921) Life and times of Ambroise Paré, 1510–1590. Paul B Hoeber, New York
Physicians' Health Study Research Group Steering Committee (1988) Preliminary report: findings from the aspirin component of the ongoing Physicians' Health Study. N Engl J Med 318:262–264
Porta M (ed) (2014) Dictionary of epidemiology, 5th edn. Oxford University Press, New York,
376pp
Upton G, Cook I (2014) Dictionary of statistics. Oxford University Press, New York, 496pp
4 Clinical Trials, Ethics, and Human Protections Policies
Jonathan Kimmelman
Contents
Origins of Research Ethics  56
Conception of Trials  57
Design of Trials  59
  Risk/Benefit  59
  Justification of Therapeutic Procedures: Clinical Equipoise  59
  Justification of Demarcated Research Procedures  60
  Riskless Research, High Risk Research, Comparative Effectiveness Trials, and Ethics  61
  Maximizing Efficiencies  62
Justice  62
  Fair Subject Selection  62
  Inclusion  63
Trial Inception  64
  Respect for Persons  64
  Elements of Valid Informed Consent  64
  Research Without Informed Consent  65
  Informed Consent Documents  65
  Independent Review  66
Conduct  66
  Trials and Ethics Across Time  66
Reporting  66
  Publication and Results Deposition  66
  Methods and Outcome Reporting  67
  The Afterlife of Trials  67
Synthesis  68
Cross-References  68
References  68
J. Kimmelman (*)
Biomedical Ethics Unit, McGill University, Montreal, QC, Canada
e-mail: [email protected]
Abstract
Clinical trials raise two main sets of ethical challenges. The first concerns
protecting human beings when they are used in scientific experiments. The
second concerns protecting the welfare of downstream users of medical evidence
generated in trials. The present chapter reviews core ethical standards and prin-
ciples governing the conception, design, conduct, and reporting of clinical trials.
This review concludes by suggesting that even the most technical decisions about
design and reporting embed numerous moral judgments about how to serve the
interests of research subjects and downstream users of medical evidence.
Clinical trials are experiments on human beings. They involve two elements
that make them ethically sensitive undertakings. First, the very reagent used in the
experiment – the human being – has what philosophers call moral status. That is,
human beings are sentient and self-aware, they have preferences and plans, and
they have a capacity for suffering. Human beings are thus entitled to having their
interests respected and protected when they are themselves the research reagents.
Second, clinical trials are aimed at supporting decision-making in health care.
Life and death decisions are ultimately based on the evidence we generate from
human experiments. Human beings who use that evidence deserve protection
from scientific findings that are misleading, incomplete, or biased.
In what follows, I provide a very condensed overview of the ethics of clinical
trials. Most writings on the ethics of clinical trials have centered on the protection
of human volunteers, treating issues of research integrity as an afterthought, if at
all. Two core claims ground this review – claims that perhaps differentiate it from
similar overviews of human research ethics. The first is that scientific integrity is a
complement to human protections; even the most technical decisions about
design and analysis are laden with implicit ethical judgments. The second is
that clinical trials present ethical challenges across their full life cycle – not merely in the brief window when a trial is open for enrollment, toward which human protection regulations are directed. This review is thus organized
according to the life cycle of a clinical trial.
Keywords
Clinical trials · Research ethics · Human protections · Medical evidence ·
Research oversight
Origins of Research Ethics
The moral and regulatory framework for protecting human subjects has its origins in
the aftermath of World War II. The US prosecution of 23 Nazis in the so-called Nazi
Doctors’ trial led to the first formalized policy on human research ethics, the
“Nuremberg Code” (Annas and Grodin 1995). At least in North America, however,
the Nuremberg Code went largely unheeded for two decades. Revelations of various
research abuses (Beecher 1966), including those surrounding the Tuskegee Syphilis
study (an observational study of African American men with middle to late stage
syphilis that had run continuously from 1932) led the US Congress to establish the
National Commission for the Protection of Human Subjects of Biomedical and
Behavioral Research. A key task for this committee was to articulate the basic moral
principles of human research. The product of this effort came to be known as the
Belmont Report, and regulations established in the USA from this effort include 45
CFR 46 and the Food and Drug Administration’s (FDA) equivalent, 21 CFR 50. The
first section of the former, which covers the general requirements of human pro-
tections, is sometimes called the “Common Rule”; it was revised in 2018 (Menikoff
et al. 2017). Various other jurisdictions and bodies have articulated their own
policies as well, including the World Health Organization (1982 with several
revisions since) (World Health Organization and Council for International Organi-
zations of Medical Sciences 2017), the World Medical Association (1964 with
numerous revisions since) (General Assembly of the World Medical Association
2014), and the Canadian Tri-council (1998 with one revision) (Canadian Institutes of
Health Research et al. 2018). Though policies around the world vary around the
edges, they share a core consensus in expressing principles and policies articulated
by the Belmont Report. These principles are respect for persons (implemented by
obtaining informed consent or restricting risk for persons lacking capacity); benef-
icence (implemented by independent establishment of a favorable balance of risk/
benefit); and justice (implemented by comparing a trial population to the target
population of the knowledge).
The Belmont Principles provide orientation points for thinking about the ethics of
a trial. But they are not exhaustive (Kimmelman 2020). Nor do regulations address
all duties attending to the conduct of clinical trials. I will return to these gaps
periodically.
Conception of Trials
The Belmont Report contains a suggestive and widely overlooked statement: “Rad-
ically new procedures of this description should, however, be made the object of
formal research at an early stage in order to determine whether they are safe and
effective.” Nevertheless, there are no regulations or policies that govern the choice of
what research questions to address in trials. As far as ethical oversight and drug
regulation are concerned, there is no moral distinction between a trial testing a me-too
drug for male pattern baldness and a trial testing a promising new treatment for
pediatric glioma.
Yet clearly, the resources available for research are finite, and some research
questions are more deserving of societal investment than other questions. This is
illustrated by the often quoted claim that 90% of the world’s resources are committed
to addressing health issues that afflict only 10% of the world population (the so-
called “10–90% gap”) (Flory and Kitcher 2004). It is also illustrated by the historic
and unfair exclusion of certain populations from medical research (such as children,
women, persons living in economically deprived settings, racial minorities, pregnant
women, and elderly populations), or the persistence of medical uncertainty surround-
ing widely deployed treatments (e.g., the value of PCI for treatment of angina) (Al-
Lamee et al. 2018).
Four general considerations might be offered for selecting research hypotheses.
First, researchers should direct their attention towards questions that are unresolved.
This may seem obvious. However, many trials address questions that have already
been adequately resolved. One particularly striking example was the persistence of
placebo-controlled trials testing the drug aprotinin, long after its efficacy had been
decisively established (Fergusson et al. 2005). Drug companies often run trials that
are primarily aimed at promoting a drug rather than testing an unresolved hypothesis
(Vedula et al. 2013). Second, researchers should only test hypotheses that are
sufficiently mature. For example, researchers generally ought not to initiate phase
1 trials unless there are compelling preclinical studies to motivate them; they should
generally not pursue phase 3 studies if there are insufficient grounds to settle on a dose
or treatment schedule for testing (again, there are many examples of trials that have
been launched absent compelling evidentiary grounds) (Kimmelman and Federico
2017). The best way to ground a claim that a medical hypothesis merits evaluation in
clinical trials is with a systematic review (Savulescu et al. 1996; Chalmers and
Nylenna 2014; Nasser et al. 2017); some jurisdictions require systematic review
before trial conduct (Goldbeck-Wood 1998).
Third, researchers should prioritize clinical questions that are likely to have the
greatest impact on health and well-being. To some extent researchers’ priorities are
constrained by their field, logistical considerations, and funding options. Neverthe-
less, they can exercise some discretion within these constraints. All else being equal,
researchers ought to favor trials involving conditions that cause greater morbidity or
mortality (whether because of prevalence or intensity of morbidity) and that afflict
unfairly disadvantaged or excluded populations.
Finally, researchers should not initiate trials unless there are reasonable pros-
pects for findings being incorporated into downstream decisions. For example,
phase 1 trials should generally meet a more demanding review standard if there is
no sponsor to carry encouraging findings forward in a phase 2 trial. Many
exploratory trials suggesting the promise of approved drugs in new indications
are never advanced into more rigorous clinical trials (Federico et al. 2019).
Research that is not embedded within a coordinated research program presents a
variety of problems, including a concern about disseminating potentially biased
research findings that are incorporated into clinical practice guidelines. The pre-
sent author has argued that abortive research programs raise questions about the
social value of many exploratory trials – especially in the post-approval research
context (Carlisle et al. 2018).
Design of Trials
Risk/Benefit
All major policies on research ethics require that risks be favorably balanced against benefits to society, in the form of improved knowledge, and benefits to subjects (if any). Benefits include direct medical benefits of receiving medical interventions tested in trials (if any) and those expected from addressing a research hypothesis. Inclusion benefit (the benefit patients might receive from extra medical attention they receive when entering trials, regardless of treatment assignment) is generally considered irrelevant for establishing a favorable risk/benefit (King 2000).
How, then, are researchers and oversight bodies to operationalize the notion of a
favorable risk/benefit balance? The Belmont Report urges an assessment of risk and
benefit that is systematic, evidence based, and explicit. One of the most useful
approaches is “component analysis” (Weijer and Miller 2004). Clinical trials typi-
cally involve a mix of potentially burdensome exposures, including treatment with
unproven drugs, venipuncture, imaging or diagnostic procedures, and/or tissue
biopsies. Component analysis involves dividing a study into its constituent pro-
cedures, and evaluating the risk/benefit for each individual procedure. Importantly,
benefits associated with one procedure cannot be used to “purchase” risk for other
procedures. For example, the burdens of a painful lumbar puncture cannot be
justified by appealing to the therapeutic advantages offered by access to a novel
treatment in a trial.
In performing component analysis, procedures can be sorted into two categories.
Some procedures, like withholding an established effective treatment, implicate care
obligations and are thus termed “therapeutic procedures.” Other procedures, like
venipunctures to monitor a metabolite, are performed solely to advance a research
objective; these are called “demarcated research procedures.” Each has a separate
process for justifying risk.
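Component analysis can be caricatured as a per-procedure test with no cross-subsidy between components. The sketch below is a deliberate simplification with hypothetical procedures and judgments; it only illustrates that each component must pass its own test.

```python
from dataclasses import dataclass

@dataclass
class Procedure:
    name: str
    therapeutic: bool      # does it implicate care obligations?
    risk_justified: bool   # judged by that component's own standard

study = [
    Procedure("study drug vs. active comparator", True, True),
    Procedure("venipuncture to monitor a metabolite", False, True),
    Procedure("research-only lumbar puncture", False, False),
]

# No "purchasing" of risk: a demarcated research procedure cannot
# borrow justification from the trial's therapeutic components.
acceptable = all(p.risk_justified for p in study)
print(acceptable)   # False: the lumbar puncture must stand on its own
```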
treatment does not deprive that patient of a standard of care that patient would
otherwise receive. On the other hand, asking a patient with relapse remitting multiple
sclerosis to forgo disease-modifying treatment for a year would fall below standard
of care. Placebo-controlled trials of this duration in multiple sclerosis would gener-
ally be unethical (Polman et al. 2008).
On one account, clinical equipoise captures three imperatives for risk/benefit in research. First, it preserves a physician's duty of care when they participate in
research. As such, physicians can recruit patients without compromising their
fiduciary obligations to patients. Second, the principle of clinical equipoise estab-
lishes a standard for acceptable risk in clinical trials by benchmarking risk/benefit
in trials to generally accepted standards in medicine. Third, clinical equipoise
establishes a standard for scientific value. A trial is only ethical insofar as it is a
necessary step towards resolving uncertainty in the expert community. This has
subtle implications. For example, it means small, underpowered trials are ethically
questionable (unless they are a necessary step towards resolving medical uncer-
tainty, as in the case of phase 2 trials, or designed expressly to be incorporated in
future meta-analyses) (Halpern et al. 2002), since “positive” but underpowered
trials might be sufficient to encourage further trials, but will generally be insuffi-
cient to convince the expert medical community about a treatment’s advantage
over standard of care.
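The arithmetic behind the underpowered-trial point is simple. A sketch with hypothetical numbers (two-arm comparison of means, normal approximation):

```python
from math import sqrt
from scipy.stats import norm

def power_two_sample(n_per_arm, delta, sd, alpha=0.05):
    """Approximate power of a two-arm comparison of means."""
    se = sd * sqrt(2 / n_per_arm)
    z_alpha = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_alpha - delta / se)

# 20 patients per arm chasing a modest effect yields power of about
# 0.26: even a "positive" result is unlikely to settle the question.
print(power_two_sample(n_per_arm=20, delta=5, sd=12))
```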
Though clinical equipoise was first articulated in the context of randomized trials
testing new interventions, it can logically be extended to single-arm studies that
use historical controls as comparators. The concept of clinical equipoise is not
without critics (reviewed in London 2007), and its operationalization – like many
ethical concepts – can pose challenges (e.g., how much residual uncertainty is
necessary for a trial to be ethical). Just the same, no other concept comes close to
binding the moral dimensions of trials to their methodology and the obligations of
those who conduct them.
Riskless Research, High Risk Research, Comparative Effectiveness Trials, and Ethics
Both extremes of risk pose challenges to the assessment and evaluation of risk in research. Some trials, like early phase trials testing novel strategies, or trials of
aggressive treatments in pre-symptomatic patients, present high degrees of risk and
uncertainty. Many patients are willing to undertake extraordinary levels of risk, and
for patients who have exhausted treatment options, a “standard of care” may be
difficult to define for establishing clinical equipoise. Some might argue that, in such
circumstances, investigators and ethics review committees should defer to well-
informed preferences of research volunteers. However, risk in trials can impact
others outside of trials (e.g., third parties), or undermine public confidence (Hope
and McMillan 2004). For example, a major debacle or a series of negative trials in a
novel research arena can undermine support for parallel research efforts, as occurred
with gene therapy in the late 1990s. Though ethics policies and oversight systems do
not generally instruct investigators to consider how their trials might affect parallel
investigations, some commentators argue that researchers bear duties to steward
research programs and refrain from activities that might damage them (London et al.
2010).
At the other extreme are seemingly riskless studies. One category of riskless
studies is “seeding trials”: trials that involve well-characterized drugs and that are
aimed primarily at marketing by habituating doctors to prescribing them (Andersen
et al. 2006) or by generating a publication that can function legally as an advertise-
ment through reprint circulation (US Department of Health and Human Services,
Food and Drug Administration 2009; Federico et al. 2019) rather than resolving a
scientific question. Most human protections policies have little to say about seeding
trials, because their risks are so low that little to no scientific value is needed to
justify them. Seeding trials are nevertheless an obvious breach of scientific integrity
(Sox and Rennie 2008; London et al. 2012). Such studies not only sap scarce human
capital for research, but subvert the aims of science (which is aimed at belief change
through evidence, not habituation or attentional manipulation) and undermine the
credibility of the research enterprise.
Comparative effectiveness and usual care–randomized trials represent a second
category of seemingly riskless studies. In these studies, patients are randomly
assigned to standards of care in order to determine whether one standard of care is
better or noninferior (many such studies do not use any demarcated research pro-
cedures). Even when such studies use primary endpoints like mortality or major
morbidity, they are often viewed as “riskless” insofar as all patients are receiving the
same treatments within trials that they would receive outside of trials (Lantos and
Feudtner 2015). Whether “usual care” randomized trials are necessarily minimal risk
is hotly debated among research ethicists. The present author would argue that they
should not be understood as minimal risk (Kane et al. 2020). First, by using a morbid
primary endpoint, researchers are openly declaring they are uncertain as to whether
one standard is better than another on a clinically meaningful measure. Second, it is
impossible to exclude the possibility that a patient who opts to enter a usual care
Maximizing Efficiencies
Justice
Inclusion
A second major salient for expansion of the justice principle in the 1990s was
inclusion. By the 1990s, it had become increasingly clear that certain populations
had been excluded – often systematically – from research, unfairly depriving these
populations of medical evidence for managing their conditions. These populations
have variously included gay men, African Americans, women, children, pregnant
women, and the elderly (Palmowski et al. 2018).
Major policy reforms in the USA at funding agencies and with drug regulation
have encouraged greater inclusion of (and analysis of subgroups for) children (US
Department of Health and Human Services, Food and Drug Administration 2005),
women (Elahi et al. 2016), and racial minorities (US Department of Health and
Human Services, Food and Drug Administration 2016). Though ethical review of
trial protocols does not typically focus on inclusion and representativeness, it is now
widely recognized that, absent a compelling scientific or policy rationale, clinical
trial investigators should strive to maximize the representativeness of the
populations they recruit into trials – particularly in later phase trials that are aimed
at directly informing regulatory approvals and/or health care. Even with these policy
reforms, there are suggestions that certain populations continue to be underrepre-
sented in clinical research relative to incidence of disease in these populations
(Dickmann and Schutzman 2017; Ghare et al. 2019). Studies that do enroll diverse
populations often do not report stratified analyses, potentially frustrating the aim of
broader inclusion.
Trial Inception
Only after a trial has been deemed to fulfill the above expectations is informed
consent relevant. All major policies require that investigators offer prospective
research subjects the opportunity to consider a study’s risks, burdens, and benefits
against their preferences, values, and goals. This consent, expressed at the outset of
screening and enrollment, must be ongoing for the duration of a clinical trial.
Elements of Valid Informed Consent
Valid informed consent is said to consist of three core elements (Faden and
Beauchamp 1986). The first is capacity. Prospective research participants must
have the cognitive and emotional resources to render informed judgments about
trial participation. Generally, capacity is a clinical judgment. In cases where there are
concerns, there are tools for assessing capacity to participate in research. Some
populations, like children, trauma victims, or persons with dementia, lack compe-
tence to provide informed consent. Under such circumstances, there are other pro-
visions for respecting persons (see below).
The second element of valid informed consent is understanding. Prospective
research subjects must receive, comprehend, and appreciate information that is
material to their decision to enroll in a trial. Information includes (but is not limited
to): risk/benefit, study procedures, study purpose, and alternatives to participation.
There is a very large literature showing that patients often report inability to recall
basic information about study features. In particular, many patients struggle to
accurately understand the probability of benefit (therapeutic overestimation)
(Horng and Grady 2003) or the way research participation may constrain ability to
pursue individualized care (therapeutic misconception) (Appelbaum et al. 1987).
The third element of a valid informed consent is voluntariness. Prospective research
participants should be free of controlling influences, such as coercion (i.e., threatening
to make an individual worse off, or threatening to withhold something that is owed to
the individual) or undue manipulation (i.e., alterations to choice architecture, disclosure
processes, or interactions) that encourage compliance. Some forms of manipulation are
considered ethical – at least for certain routine research settings. Healthy volunteer
phase 1 studies often use financial payment to manipulate an individual’s enrollment in
a trial. Key to judging whether a manipulation is “undue” is whether it involves an
offer that is disrespectful (Grant and Sugarman 2004) (i.e., offering to pay an individual
to override a moral commitment) or whether an offer is irresistible. Compensation, that
is, covering expenses associated with lost wages, parking, or travel, is different from
inducement and does not involve manipulation.
Research Without Informed Consent
There are several circumstances where human research can be ethically and legally
conducted without valid informed consent of research subjects. One is in studies
involving persons lacking decisional competence. Generally, three protections are
established for such populations. First, demarcated research risk is limited to min-
imal risk or minor increase over minimal risk (though policies vary). Second,
surrogate consent is sought from parents or guardians (in the case of children) or
from a designated agent (e.g., an individual designated as such in an advance
directive, or a family member) for incapacitated adults. Third, where applicable,
assent (i.e., agreement and cooperation) is sought from the research subject.
A second circumstance where consent can be waived, at least in some jurisdictions, is emergency research (e.g., trauma or resuscitation trials). As above,
demarcated research procedures cannot exceed minimal risk. Because surrogate
consent cannot typically be obtained in emergency research, there are provisions
for public disclosure and community consultation before such studies are launched
(Halperin et al. 2007).
Many proposals have circulated about expedited or waived informed consent,
particularly in the context of usual care trials. One such example is Zelen’s consent,
which pre-randomizes patients and bypasses informed consent for those assigned to
treatments that are identical to those they would receive had they not enrolled in a trial.
This particular design is generally subject to strong ethical criticisms, since patients who
are randomly assigned to standard treatments are denied the opportunity to consent to
research participation (Hawkins 2004). However, other similar trial designs have been
proposed that, according to some, correct these ethical deficiencies while preserving
expedience. The reader is directed elsewhere for discussions (Flory et al. 2016).
Informed Consent Documents
One of the main vehicles for informed consent is the informed consent document
(ICD). ICDs typically contain a description of key disclosure elements of a study.
ICDs are widely criticized for their readability, their length, and their ineffectiveness
in supporting understanding among research participants. However, ICDs can be better understood as a supplement to face-to-face discussions, which are much more effective at achieving understanding (Flory and Emanuel 2004). They also provide Institutional Review Boards (IRBs, described in the next section) with a proxy for what will be covered in these discussions. Effort spent fussing over wording, if redirected
towards an appraisal of risk and benefit, would probably be a better investment for
members of IRBs.
Independent Review
As Henry Beecher noted in his 1966 exposé (Beecher 1966), physicians harbor
divided loyalties when they conduct clinical research. Judgments about risk and
benefit, informed consent, and fair subject selection are refereed prospectively by
submitting trial protocols to independent review bodies (in the USA, these commit-
tees are called Institutional Review Boards or IRBs; in Canada they are called
“Research Ethics Boards” or REBs). Various policies stipulate the composition of
REBs, as well as the range of issues they should (or should not) consider. Different
models of REB review have emerged in the past decade or so, including for-profit REB review, centralized review mechanisms, and specialized review mechanisms
for fields like gene therapy or cancer trials (Levine et al. 2004).
Before trial launch, REB approval must be obtained. All design elements and
planned analyses for the trial should be pre-specified in a trial protocol. The protocol
should have been reviewed for scientific merit. And main design details, hypotheses,
and planned analyses should be registered prospectively in a public database like
clinicaltrials.gov.
Conduct
Trials occur over time. New information emerges from within a trial as it unfolds, or
from concurrent research or adverse events documented outside of trials. This
information can alter a study’s risk/benefit balance, necessitating an alteration of
study design, reconsenting, and sometimes halting a trial. Similarly, slow recruit-
ment can compromise a study’s risk/benefit balance, since under-accrual can stymie
the ability of a trial to achieve the quantum of social value that was projected during
ethical review, and the options available outside the trial can change. Accordingly,
risk/benefit must always be monitored as a study proceeds. In studies involving
higher levels of risk, this duty typically devolves to investigators and data safety monitoring boards (DSMBs). DSMBs confront myriad policy, ethical, and statistical
challenges; the reader is directed elsewhere for further discussions of trial monitor-
ing (DeMets et al. 2005).
Reporting
Once a trial is complete, the fulfillment of the risk/benefit established at trial outset
requires that results be disseminated to relevant knowledge users. Until recently,
there were few expectations and regulations on trial reporting. Numerous studies
have shown that many clinical trials are never published. For example, the present
author’s own work showed that only 37% of pre-license trials of drugs that stalled in
clinical development were published within 5 years of trial closure (Hakala et al.
2015). Many policies, like the Declaration of Helsinki or Canada's Tri-Council Policy Statement, articulate a requirement of deposition of results for all clinical research.
The US FDA also requires deposition of clinical trial results in clinicaltrials.gov
within 12 months of completion of primary endpoint collection (US Department of
Health and Human Services 2016). However, many clinical trials are exempt from
this requirement, including phase 1 trials as well as trials testing nonregulated
products (e.g., surgeries, psychotherapies (Azar et al. 2019), or any research not
pursued as part of an IND). Some funders, institutions, and journals have policies
intended to address these gaps.
Methods and Outcome Reporting
The main way findings in trials are disseminated is through publication. Trial reports should also provide a frank and transparent description of methods and results.
This entails at least three considerations. First, methods should be described in
sufficient detail to support valid inferences. Methods in the report should be consis-
tent with the study protocol. Second, results should be reported in full, and consistent
with planned analyses. For example, all planned subgroup analyses should be
reported; any new subgroup analyses should be labeled as post hoc analyses.
Third, study reports should explain limitations and what new results mean in the
context of existing evidence. There is a wealth of literature showing these three
aspirations are not always fulfilled in trials. Regarding the first, systematic reviews
show that many trials do not adequately describe methods such as how allocation
was concealed or how randomization sequences were generated (Turner et al. 2012),
or have reported primary outcomes that are inconsistent with those stated in trial
protocols (Mathieu et al. 2009). Regarding complete reporting, safety outcomes are
often not well reported in trials (Phillips et al. 2019). Lack of balance in reports is
suggested by the frequent use of “spin” in trial reports (Boutron et al. 2014), or by the
selective presentation of positive subgroup analyses in study abstracts (Kasenda
et al. 2014).
safeguards are in place (Mello et al. 2018). Model safeguards for patient privacy and
research integrity are described elsewhere (Mello et al. 2013).
Synthesis
The above review is, by necessity, cursory and leaves many dimensions of clinical
trial ethics unaddressed. These include questions about protecting third parties like
caregivers in research (Kimmelman 2005), the ethics of incidental findings (Wolf
et al. 2008), and ancillary care obligations (Richardson and Belsky 2004). New trial
methodologies like adaptive trial designs (Bothwell and Kesselheim 2017) or clus-
ter-randomized trials (Weijer et al. 2011) pose challenges for implementing the
ethical standards described above.
It is tempting to view human protections and research ethics as a set of considerations
that are only visited once a clinical trial has been designed and submitted for ethical
review. However, decisions about what hypotheses to test, how to test them, how the
trial is conducted, and how to report results are saturated with ethical judgments. Most of
these judgments occur absent clear regulatory guidance, or outside the gaze of research
ethics boards. In that way, every scientist participating in the conception, design,
reporting, and uptake of clinical research is practicing research ethics.
Cross-References
▶ ClinicalTrials.gov
▶ Clinical Trials in Children
▶ Consent Forms and Procedures
▶ International Trials
▶ Reporting Biases
References
Al-Lamee R, Thompson D, Dehbi H-M et al (2018) Percutaneous coronary intervention in stable
angina (ORBITA): a double-blind, randomised controlled trial. Lancet 391:31–40. https://doi.
org/10.1016/S0140-6736(17)32714-9
Andersen M, Kragstrup J, Søndergaard J (2006) How conducting a clinical trial affects physicians’
guideline adherence and drug preferences. JAMA 295:2759–2764. https://doi.org/10.1001/
jama.295.23.2759
Annas GJ, Grodin MA (eds) (1995) The Nazi doctors and the Nuremberg Code: human rights in
human experimentation, 1st edn. Oxford University Press, New York
Appelbaum PS, Roth LH, Lidz CW et al (1987) False hopes and best data: consent to research and
the therapeutic misconception. Hast Cent Rep 17:20–24
Azar M, Riehm KE, Saadat N et al (2019) Evaluation of journal registration policies and prospec-
tive registration of randomized clinical trials of nonregulated health care interventions. JAMA
Intern Med. https://doi.org/10.1001/jamainternmed.2018.8009
Bauchner H, Golub RM, Fontanarosa PB (2016) Data sharing: an ethical and scientific imperative.
JAMA 315:1238–1240. https://doi.org/10.1001/jama.2016.2420
Beecher HK (1966) Ethics and clinical research. N Engl J Med 274:1354–1360. https://doi.org/10.
1056/NEJM196606162742405
Bothwell LE, Kesselheim AS (2017) The real-world ethics of adaptive-design clinical trials. Hast Cent Rep 47:27–37. https://doi.org/10.1002/hast.783
Boutron I, Altman DG, Hopewell S et al (2014) Impact of spin in the abstracts of articles reporting
results of randomized controlled trials in the field of cancer: the SPIIN randomized controlled
trial. J Clin Oncol 32:4120–4126. https://doi.org/10.1200/JCO.2014.56.7503
Canadian Institutes of Health Research, Natural Sciences and Engineering Research Council of
Canada, Social Sciences and Humanities Research Council of Canada, Secretariat on Respon-
sible Conduct of Research (Canada) (2018) Tri-Council policy statement: ethical conduct for
research involving humans
Carlisle B, Federico CA, Kimmelman J (2018) Trials that say “maybe”: the disconnect between
exploratory and confirmatory testing after drug approval. BMJ 360:k959. https://doi.org/10.
1136/bmj.k959
Chalmers I, Nylenna M (2014) A new network to promote evidence-based research. Lancet
384:1903–1904. https://doi.org/10.1016/S0140-6736(14)62252-2
Crouch RA, Arras JD (1998) AZT trials and tribulations. Hast Cent Rep 28:26–34. https://doi.org/
10.2307/3528266
DeMets DL, Furberg CD, Friedman LM (eds) (2005) Data monitoring in clinical trials: a case
studies approach, 2006 edition. Springer, New York
Dickmann LJ, Schutzman JL (2017) Racial and ethnic composition of cancer clinical drug trials:
how diverse are we? Oncologist. https://doi.org/10.1634/theoncologist.2017-0237
Dixon-Woods M, Jackson C, Windridge KC, Kenyon S (2006) Receiving a summary of the results
of a trial: qualitative study of participants’ views. BMJ 332:206–210. https://doi.org/10.1136/
bmj.38675.677963.3A
Elahi M, Eshera N, Bambata N et al (2016) The Food and Drug Administration Office of Women’s
Health: impact of science on regulatory policy: an update. J Women’s Health 25:222–234.
https://doi.org/10.1089/jwh.2015.5671
Faden RR, Beauchamp TL (1986) A history and theory of informed consent. Oxford University
Press, New York
Federico CA, Wang T, Doussau A et al (2019) Assessment of pregabalin postapproval trials and the
suggestion of efficacy for new indications: a systematic review. JAMA Intern Med 179:90–97.
https://doi.org/10.1001/jamainternmed.2018.5705
Fergusson D, Glass KC, Hutton B, Shapiro S (2005) Randomized controlled trials of aprotinin in
cardiac surgery: could clinical equipoise have stopped the bleeding? Clin Trials 2:218–232.
https://doi.org/10.1191/1740774505cn085oa
Flory J, Emanuel E (2004) Interventions to improve research participants’ understanding in
informed consent for research: a systematic review. JAMA 292:1593–1601. https://doi.org/10.
1001/jama.292.13.1593
Flory JH, Kitcher P (2004) Global health and the scientific research agenda. Philos Public Aff
32:36–65. https://doi.org/10.1111/j.1467-6486.2004.00004.x
Flory JH, Mushlin AI, Goodman ZI (2016) Proposals to conduct randomized controlled trials
without informed consent: a narrative review. J Gen Intern Med 31:1511–1518. https://doi.org/
10.1007/s11606-016-3780-5
Freedman B (1987) Equipoise and the ethics of clinical research. N Engl J Med 317:141–145.
https://doi.org/10.1056/NEJM198707163170304
General Assembly of the World Medical Association (2014) World Medical Association Declara-
tion of Helsinki: ethical principles for medical research involving human subjects. J Am Coll
Dent 81:14–18
Ghare MI, Chandrasekhar J, Mehran R et al (2019) Sex disparities in cardiovascular device evaluations: strategies for recruitment and retention of female patients in clinical device trials. JACC Cardiovasc Interv 12:301–308. https://doi.org/10.1016/j.jcin.2018.10.048
Glickman SW, McHutchison JG, Peterson ED et al (2009) Ethical and scientific implications of the
globalization of clinical research. N Engl J Med 360:816–823. https://doi.org/10.1056/
NEJMsb0803929
Goldbeck-Wood S (1998) Denmark takes a lead on research ethics. BMJ 316:1185. https://doi.org/
10.1136/bmj.316.7139.1185j
Grant RW, Sugarman J (2004) Ethics in human subjects research: do incentives matter? J Med
Philos 29:717–738. https://doi.org/10.1080/03605310490883046
Hakala A, Kimmelman J, Carlisle B et al (2015) Accessibility of trial reports for drugs stalling in
development: a systematic assessment of registered trials. BMJ 350:h1116. https://doi.org/10.
1136/bmj.h1116
Halperin H, Paradis N, Mosesso V et al (2007) Recommendations for implementation of community
consultation and public disclosure under the Food and Drug Administration’s “exception from
informed consent requirements for emergency research”: a special report from the American
Heart Association Emergency Cardiovascular Care Committee and Council on Cardiopulmo-
nary, Perioperative and Critical Care: endorsed by the American College of Emergency Physi-
cians and the Society for Academic Emergency Medicine. Circulation 116:1855–1863. https://
doi.org/10.1161/CIRCULATIONAHA.107.186661
Halpern SD, Karlawish JHT, Berlin JA (2002) The continuing unethical conduct of underpowered
clinical trials. JAMA 288:358–362
Hawkins JS (2004) The ethics of Zelen consent. J Thromb Haemost 2:882–883. https://doi.org/10.
1111/j.1538-7836.2004.00782.x
Hey SP, Kimmelman J (2014) The questionable use of unequal allocation in confirmatory trials.
Neurology 82:77–79. https://doi.org/10.1212/01.wnl.0000438226.10353.1c
Hope T, McMillan J (2004) Challenge studies of human volunteers: ethical issues. J Med Ethics
30:110–116. https://doi.org/10.1136/jme.2003.004440
Horng S, Grady C (2003) Misunderstanding in clinical research: distinguishing therapeutic mis-
conception, therapeutic misestimation, and therapeutic optimism. IRB 25:11–16
International Council for Harmonisation (1996) Guideline for good clinical practice. https://
database.ich.org/sites/default/files/E6_R2_Addendum.pdf
Kane PB, Kim SYH, Kimmelman J (2020) What research ethics (often) gets wrong about minimal risk. Am J Bioeth 20:42–44. https://doi.org/10.1080/15265161.2019.1687789
Kasenda B, Schandelmaier S, Sun X et al (2014) Subgroup analyses in randomised controlled trials:
cohort study on trial protocols and journal publications. BMJ 349:g4539. https://doi.org/10.
1136/bmj.g4539
Kimmelman J (2005) Medical research, risk, and bystanders. IRB Ethics Hum Res 27:1
Kimmelman J (2020) What is human research for? Reflections on the omission of scientific integrity
from the Belmont Report (accepted). Perspect Biol Med 62(2):251–261
Kimmelman J, Federico C (2017) Consider drug efficacy before first-in-human trials. Nature
542:25–27. https://doi.org/10.1038/542025a
Kimmelman J, Weijer C, Meslin EM (2009) Helsinki discords: FDA, ethics, and international drug
trials. Lancet 373:13–14. https://doi.org/10.1016/S0140-6736(08)61936-4
King NMP (2000) Defining and describing benefit appropriately in clinical trials. J Law Med Ethics
28:332–343. https://doi.org/10.1111/j.1748-720X.2000.tb00685.x
Lantos JD, Feudtner C (2015) SUPPORT and the ethics of study implementation: lessons for
comparative effectiveness research from the trial of oxygen therapy for premature babies.
Hast Cent Rep 45:30–40. https://doi.org/10.1002/hast.407
Levine C, Faden R, Grady C et al (2004) “Special scrutiny”: a targeted form of research protocol
review. Ann Intern Med 140:220–223
Loder E, Groves T (2015) The BMJ requires data sharing on request for all trials. BMJ 350:h2373.
https://doi.org/10.1136/bmj.h2373
London AJ (2007) Clinical equipoise: foundational requirement or fundamental error? In: The
Oxford handbook of bioethics. Oxford University Press, Oxford, pp 571–596
London AJ, Kimmelman J, Emborg ME (2010) Beyond access vs. protection in trials of innovative
therapies. Science 328:829–830. https://doi.org/10.1126/science.1189369
London AJ, Kimmelman J, Carlisle B (2012) Rethinking research ethics: the case of postmarketing
trials. Science 336:544–545. https://doi.org/10.1126/science.1216086
Mathieu S, Boutron I, Moher D et al (2009) Comparison of registered and published primary
outcomes in randomized controlled trials. JAMA 302:977–984. https://doi.org/10.1001/jama.
2009.1242
Mello MM, Francer JK, Wilenzick M et al (2013) Preparing for responsible sharing of clinical trial
data. N Engl J Med 369:1651–1658. https://doi.org/10.1056/NEJMhle1309073
Mello MM, Lieou V, Goodman SN (2018) Clinical trial participants’ views of the risks and benefits
of data sharing. N Engl J Med. https://doi.org/10.1056/NEJMsa1713258
Menikoff J, Kaneshiro J, Pritchard I (2017) The common rule, updated. N Engl J Med 376:613–615.
https://doi.org/10.1056/NEJMp1700736
Nasser M, Clarke M, Chalmers I et al (2017) What are funders doing to minimise waste in research?
Lancet 389:1006–1007. https://doi.org/10.1016/S0140-6736(17)30657-8
Palmowski A, Buttgereit T, Palmowski Y et al (2018) Applicability of trials in rheumatoid arthritis
and osteoarthritis: a systematic review and meta-analysis of trial populations showing adequate
proportion of women, but underrepresentation of elderly people. Semin Arthritis Rheum. https://
doi.org/10.1016/j.semarthrit.2018.10.017
Partridge AH, Winer EP (2002) Informing clinical trial participants about study results. JAMA
288:363–365. https://doi.org/10.1001/jama.288.3.363
Phillips R, Hazell L, Sauzet O, Cornelius V (2019) Analysis and reporting of adverse events in
randomised controlled trials: a review. BMJ Open 9:e024537. https://doi.org/10.1136/bmjopen-
2018-024537
Polman CH, Reingold SC, Barkhof F et al (2008) Ethics of placebo-controlled clinical trials in
multiple sclerosis: a reassessment. Neurology 70:1134–1140. https://doi.org/10.1212/01.wnl.
0000306410.84794.4d
Richardson HS, Belsky L (2004) The ancillary-care responsibilities of medical researchers. An
ethical framework for thinking about the clinical care that researchers owe their subjects. Hast
Cent Rep 34:25–33
Savulescu J, Chalmers I, Blunt J (1996) Are research ethics committees behaving unethically?
Some suggestions for improving performance and accountability. BMJ 313:1390–1393. https://
doi.org/10.1136/bmj.313.7069.1390
Sox HC, Rennie D (2008) Seeding trials: just say “no”. Ann Intern Med 149:279–280
Turner L, Shamseer L, Altman DG et al (2012) Consolidated standards of reporting trials (CON-
SORT) and the completeness of reporting of randomised controlled trials (RCTs) published in
medical journals. Cochrane Database Syst Rev 11:MR000030. https://doi.org/10.1002/
14651858.MR000030.pub2
U.S. Department of Health and Human Services (2016) 42 CFR 11: clinical trials registration and
results information submission
U.S. Department of Health and Human Services, Food and Drug Administration (2005) Guidance
for industry: how to comply with the Pediatric Research Equity Act
U.S. Department of Health and Human Services, Food and Drug Administration (2009) Good reprint
practices for the distribution of medical journal articles and medical or scientific reference
publications on unapproved new uses of approved drugs and approved or cleared medical devices
U.S. Department of Health and Human Services, Food and Drug Administration (2016) Collection
of race and ethnicity data in clinical trials
Vedula SS, Li T, Dickersin K (2013) Differences in reporting of analyses in internal company
documents versus published trial reports: comparisons in industry-sponsored trials in off-label
uses of gabapentin. PLoS Med 10:e1001378. https://doi.org/10.1371/journal.pmed.1001378
Weijer C, Miller PB (2004) When are research risks reasonable in relation to anticipated benefits?
Nat Med 10:570. https://doi.org/10.1038/nm0604-570
Weijer C, Grimshaw JM, Taljaard M et al (2011) Ethical issues posed by cluster randomized trials in
health research. Trials 12:100. https://doi.org/10.1186/1745-6215-12-100
Wolf SM, Paradise J, Caga-anan C (2008) The law of incidental findings in human subjects research: establishing researchers' duties. J Law Med Ethics 36:361–383, 214. https://doi.org/10.1111/j.1748-720X.2008.00281.x
World Health Organization, Council for International Organizations of Medical Sciences (2017)
International ethical guidelines for health-related research involving humans. CIOMS, Geneva
Yordanov Y, Dechartres A, Porcher R et al (2015) Avoidable waste of research related to inadequate
methods in clinical trials. BMJ 350:h809. https://doi.org/10.1136/bmj.h809
History of the Society for Clinical Trials
5
O. Dale Williams and Barbara S. Hawkins
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Background: 1967–1972 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Steps in the Creation of the Society for Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Meetings of Coordinating Center Personnel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Coordinating Center Models Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
National Conference on Clinical Trials Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Journals of the Society for Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
International Participation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
Abstract
This chapter provides a synopsis of the events leading to the creation of the
Society for Clinical Trials (SCT). The Society was officially incorporated in
September 1978 and celebrated its 40th anniversary during its annual meeting
in New Orleans May 19–22, 2019.
O. D. Williams (*)
Department of Biostatistics, University of North Carolina, Chapel Hill, NC, USA
Department of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
e-mail: [email protected]
B. S. Hawkins
Johns Hopkins School of Medicine and Bloomberg School of Public Health, The Johns Hopkins
University, Baltimore, MD, USA
e-mail: [email protected]
Keywords
Clinical trials · Greenberg report · Coordinating center · CCMP · Models Project ·
National Conference · Directors
Introduction
The Society for Clinical Trials, Inc. (SCT) is a professional society for advocates,
designers, and practitioners of clinical trials, regardless of medical specialty or area of
expertise. SCT was incorporated in September 1978. It was created with the following purposes:
• To promote the development and exchange of information for design and conduct
of clinical trials and research using similar methods
• To provide a forum for discussion of philosophical, ethical, legal, and procedural
issues involved in the design, organization, operations, and analysis of clinical
trials and other epidemiological studies that use similar methods (Society for
Clinical Trials Board of Directors 1980).
Background: 1967–1972
In the early 1970s, there was a growing awareness that clinical trials would play a key role in the development and implementation of improved strategies for addressing important public health and medical issues.
At that time, only a few large multicenter trials had been undertaken, including the
University Group Diabetes Program (UGDP) and the Coronary Drug Project (CDP),
both sponsored by the National Institutes of Health (NIH). Experience in these
projects highlighted the theoretical, organizational, and operational challenges
such studies presented. It was clear that clinical trials were a valuable tool for evaluating and comparing interventions; however, there was significant concern among sponsors and practitioners about their cost, their management, and the time required to reach a conclusion regarding the effectiveness and safety of the interventions under evaluation. Donald S. Fredrickson (1924–2002; Director, National Heart
Institute 1966–1974; NIH Director 1975–1981) summarized these issues eloquently
in an address to the New York Academy of Science on January 23, 1968. In this address, for which the full text is available (Fredrickson 1968), he described field trials as indispensable ordeals that are necessary for avoiding perpetual uncertainty.
In 1967, the National Heart Institute (later the National Heart, Lung, and Blood
Institute [NHLBI]) created the Heart Special Project Committee (chaired by Bernard
G. Greenberg, Chair of Biostatistics, University of North Carolina) to review the
approach to cooperative studies. The resulting report, known as the Greenberg
Report, was presented to the National Advisory Heart Council later in the same
year. The Greenberg Report (Heart Special Project Committee 1988) highlighted
three essential components:
Steps in the Creation of the Society for Clinical Trials
So the stage was set for the creation of an organization of designers and others
engaged in multicenter clinical trials. Key steps in this process were:
Meetings of Coordinating Center Personnel
It may be surprising that the first step listed is meetings of coordinating center
personnel. However, as noted above, the Greenberg Report indicated that coordi-
nating centers were essential components of cooperative studies, such as large-scale
multicenter clinical trials. The annual meetings were facilitated by NHLBI under the
leadership of Robert I. Levy. Levy understood that coordinating center capabilities
and the pool of relevant expertise needed to be expanded to meet current and future
needs for successful conduct of large studies. These meetings began in 1973 and
continued until 1981; they initiated a sequence of events that ultimately led to the
creation of the SCT.
The initial 1.5-day meeting of coordinating center personnel was held in May 1973
in Columbia, Maryland. More than 60 people participated, including 9 from NIH;
participants represented 11 institutions. This initial meeting was mostly a “show and
tell” by investigators from coordinating centers for major NHLBI-sponsored multi-
center studies:
Table 1 Locations and host organizations for meetings of personnel from clinical trial coordinat-
ing centers, 1973–1981
Year Meeting location Host organization
1973 Columbia, MD University of Maryland
1974 – –
1975 Plymouth, MN University of Minnesota
1976 Houston, TX University of Texas
1977 Chapel Hill, NC University of North Carolina
1978 Washington, DC George Washington University
1979 Boston, MA
1980 Philadelphia, PA Society for Clinical Trials
1981 San Francisco, CA Society for Clinical Trials
–, no meeting held
operations and cost, and survival analysis. Also, the Coordinating Center Models
Project was introduced. The guest speaker was Levy, who addressed “Decision
Making in Large-Scale Clinical Trials.” Organizers of these meetings already had
adopted the format of typical scientific society meetings.
Coordinating Center Models Project
Two important activities occurred in 1976. One was NHLBI funding of the
Coordinating Center Models Project (CCMP [1976–1979]). The CCMP purpose
was to study existing coordinating centers for large, multicenter trials with the aim of
establishing guidelines and standards for organization and operations of coordinat-
ing centers for future multicenter trials. The seven CCMP reports are available from
the National Technical Information Service (Coordinating Center Models Project
Research Group. Coordinating Center Models Project 1979a, b, c, d, e, f, 1980).
Also during 1976, Williams, Meinert, and others met with Robert S. Gordon, Jr., special assistant to the NIH Director (Fredrickson), and other key leaders at NIH. Later that year, a group consisting of Fred Ederer (National Eye Institute), Meinert, Williams, and Harold P. Roth (National Institute of Arthritis, Metabolism, and Digestive Diseases) met with the NIH Clinical Trials Committee. This group
proposed that a professional society that addressed the general issues of clinical trials
be created and asked for the Committee’s support. The Committee members
expressed interest in the concept but indicated that evidence for widespread partic-
ipatory support was lacking. Instead, the Committee proposed holding a conference
to assess the level of support for such a society. As a result, a Planning Committee
was formed under the leadership of Roth.
Although thus far we have described activities in the USA sponsored primarily
by the NHLBI, other sponsors and practitioners of clinical trials were interested
in creating a forum for sharing experiences, methods, and related developments.
The National Cancer Institute (NCI) created the Clinical Trials Cooperative Group
Program in 1955; the National Cancer Act of 1971 enhanced the role of these
cooperative groups and their coordinating centers. In 1962, the U.S. Veterans
Administration (VA; now the U.S. Department of Veterans Affairs) established
four regional research support centers for the VA Cooperative Studies Program
(VA CSP) under the leadership of Lawrence Shaw (https://www.vascp.research.va.
gov/CSP/history.asp). In 1967 and 1970, findings from a major trial of antihyper-
tensive agents conducted by a VA Cooperative Studies Group were published
(Veterans Administration Cooperative Study Group on Antihypertensive Agents
1967, 1970). In 1972, two of the regional research support centers were designated to house CSP coordinating centers to support "multicenter clinical trials that evaluated novel therapies or new uses of standard treatments" (https://www.vascp.research.va.gov/CSP/history.asp). Two more CSP coordinating centers were established
during the next 6 years. In the United Kingdom (UK), the Medical Research Council
had sponsored randomized trials, following the landmark trial of streptomycin for
tuberculosis (Streptomycin in Tuberculosis Trials Committee 1948). Thus, the group of sponsors and practitioners interested in such a forum extended well beyond the NHLBI and the USA.

National Conference on Clinical Trials Methodology
The 1977 National Conference on Clinical Trials Methodology was held in Building
1 on the NIH campus on October 3 and 4, 1977. Somewhat to the surprise of almost
everyone involved in its planning, the conference attracted more than 700 partici-
pants from around the USA who represented much of NIH and other current and
potential sponsors of clinical trials. Attendees were welcomed by Gordon and
Fredrickson; the program included presentations within broad topics:
One of the more important sessions was on communications, which addressed the
question “Should mechanisms be established for sharing among clinical trial inves-
tigators experience in handling problems in design, execution and analysis?” The
discussion leaders were Roth and Genell Knatterud, with Louis Lasagna, Meinert,
and Barbara Hawkins. Harold Schoolman and Fred Mosteller also contributed. The
conference and this session on communications played a key role in the creation
of the SCT. The conference proceedings were published 2 years later (Roth and
Gordon 1979).
Soon after this conference, in September 1978, the Society for Clinical Trials was incorporated. The members of the initial board of directors are listed in Table 2.
Table 2 Members of the initial board of directors of the Society for Clinical Trials
Thomas C. Chalmers, MD, Chair, Mt. Sinai School of Medicine
Harold O. Conn, MD, Yale University School of Medicine and West Haven VA Hospital
Fred Ederer, MA, National Eye Institute, National Institutes of Health
Robert S. Gordon, Jr., MD, National Institutes of Health
Curtis L. Meinert, PhD, The Johns Hopkins University
Christian R. Klimt, MD, DrPH, University of Maryland and Maryland Medical Research Institute
Paul Meier, PhD, University of Chicago
Charles G. Moertel, MD, Mayo Clinic
Thaddeus E. Prout, MD, The Johns Hopkins University and Greater Baltimore Medical Center
Harold P. Roth, MD, National Institute of Diabetes and Digestive Diseases, NIH
Maurice J. Staquet, MD, Representative of the International Society for Clinical Biostatistics
O. Dale Williams, PhD, University of North Carolina
One of the first acts of the Board was to develop plans for the first meeting of the Society.
A Program Committee (Williams, Chair) was created; as noted above, this meeting
was planned and undertaken in conjunction with the seventh annual meeting of
coordinating center personnel. The result was a four-day meeting May 5–8, 1980, in
Philadelphia. Sponsors included NHLBI, NEI, the National Institute of Neurological Disorders and Stroke (NINDS), the National Institute of Allergy and Infectious Diseases (NIAID), and the Maryland Medical Research Institute. This important meeting included three key presentations (Combined Annual Scientific Sessions Society for Clinical Trials 1980):
The second annual meeting, also with Williams as Program Committee Chair,
was held in conjunction with the eighth and final annual meeting of coordinating
center personnel.
Journals of the Society for Clinical Trials
The Society's first official journal, Controlled Clinical Trials, began publication in 1980. In 2004, it was succeeded as the Society's official journal by Clinical Trials, published by Sage. Steven N. Goodman was the originating editor of the new journal from 2004 to 2013 and was succeeded by Colin Begg in 2013.
International Participation
The Society has been enhanced by membership and meeting attendees from outside
the USA, essentially from its outset. In fact, Maurice J. Staquet was a member of the
initial board of directors as a representative of the International Society for Clinical
Biostatistics. Also, meetings have been held in other countries. The first was the 7th meeting, held in Montreal, Canada, May 1986. The 11th meeting was held in Toronto, Canada, May 1990; the 12th was the first joint meeting with the International Society for Clinical Biostatistics, held in Brussels, Belgium, July 1991; the 18th was the second joint meeting with the International Society for Clinical Biostatistics, held in Boston, July 1997; the 21st was held in Toronto, Canada, April 2000; the 24th was the third joint meeting with the International Society for Clinical Biostatistics, held in London, England, July 2003; the 28th was held in Montreal, Canada, May 2007; the 32nd in Vancouver, Canada, May 2011; the 37th in Montreal, Canada, May 2016; and the 38th in Liverpool, England, May 2017.
Summary and Conclusion
By 1981, the Society for Clinical Trials was fully functional; Controlled Clinical Trials
was in publication. Beginning in 1980, scientific meetings of the Society have been
held annually. Programs of annual meetings and abstracts of contributed presentations,
along with other information, are online at https://www.sctweb.org; many have been
published in the official SCT journals. Although clinical procedures and information
technology have evolved since 1981, many of the practical issues persist; for example,
the methods for promoting and monitoring data quality have evolved but the need for
data of high quality remains. As in other areas of clinical trials methodology, the
practices to achieve a desired result may change over time, but the principles are
permanent. Thus, the goals of the SCT remain pertinent to today’s clinical trialists.
Many different individuals contributed importantly to the creation and early
operation of the SCT. Of those mentioned above, three are especially important.
Fredrickson saw the need for an entity such as the SCT and supported it from the
highest levels of NIH. Gordon was tireless in meeting with interested advocates for
a professional society for clinical trialists, eliciting and securing support across
NIH, and helping to lead the overall creation effort. Levy saw a compelling need
for enhanced coordinating center capability and made sure the meetings of coordi-
nating center personnel that led directly to the creation of the Society were supported
and well organized. Many other individuals deserve significant credit as well,
but these three are especially deserving of recognition for their important and
timely contributions.
References
Abstracts. Combined Annual Scientific Sessions Society for Clinical Trials and seventh annual
symposium for coordinating clinical trials. May 6–8, 1980, Marriott Inn, Philadelphia, PA
(1980) Control Clin Trials 1:165–178
Coordinating Center Models Project Research Group. Coordinating Center Models Project
(1979a, March 1) A study of coordinating centers in multicenter clinical trials. Design
and methods, vols 1 and 2. NTIS Accession No. PB82-143730 and PB82-143744. National
Technical Information Services, Springfield
Coordinating Center Models Project Research Group. Coordinating Center Models Project
(1979b, March 1) A study of coordinating centers in multicenter clinical trials. RFPs
for coordinating centers: a content evaluation. NTIS Accession No. PB82143702. National
Technical Information Services, Springfield
Coordinating Center Models Project Research Group. Coordinating Center Models Project
(1979c, August 1) A study of coordinating centers in multicenter clinical trials. Terminology.
NTIS Accession No. PB82-143728. National Technical Information Services, Springfield.
Bibliographic resource for clinical trials. April 1, 1980. NTIS Accession No. PB87-??????
Coordinating Center Models Project Research Group. Coordinating Center Models Project
(1979d, September 1) A study of coordinating centers in multicenter clinical trials. Phases of
a multicenter clinical trial. NTIS Accession No. PB82-143751. National Technical Information
Services, Springfield
Coordinating Center Models Project Research Group. Coordinating Center Models Project
(1979e, September 1) A study of coordinating centers in multicenter clinical trials. Enhancement
of methodological research in the field of clinical trials. NTIS Accession No. PB82-143710.
National Technical Information Services, Springfield
Coordinating Center Models Project Research Group. Coordinating Center Models Project
(1979f, June 1) A study of coordinating centers in multicenter clinical trials. CCMP manuscripts
presented at the annual symposia on coordinating clinical trials. NTIS Accession No. PB82-
143694. National Technical Information Services, Springfield
Coordinating Center Models Project Research Group. Coordinating Center Models Project
(1980, September 1) A study of coordinating centers in multicenter clinical trials. Coordinating
centers and the contract process. NTIS Accession No. PB87-139101? National Technical
Information Services, Springfield
Fredrickson DS (1968) The field trial: some thoughts on the indispensable ordeal. Bull N Y Acad
Med 44(2):985–993
Heart Special Project Committee (1988) Organization, review, and administration of cooperative
studies (Greenberg report): a report from the Heart Special Project Committee to the National
Advisory Heart Council. Control Clin Trials 9:137–148. [Includes a list of members of the Heart
Special Project Committee]
https://www.vascp.research.va.gov/CSP/history.asp. Accessed 12 Aug 2019
Roth HP, Gordon RS (1979) Proceedings of the national conference on clinical trials methodology.
Clin Pharmacol Ther 25(5, pt 2):629–765
Society for Clinical Trials Board of Directors (1980) By-laws. Control Clin Trials 1(1):83–89
Streptomycin in Tuberculosis Trials Committee (1948) Streptomycin treatment of pulmonary
tuberculosis. Br Med J 2:769–782
Veterans Administration Cooperative Study Group on Antihypertensive Agents (1967) Effects of
treatment on morbidity in hypertension. Results in patients with diastolic blood pressures
averaging 115 through 129 mm Hg. JAMA 202(11):116–122
Veterans Administration Cooperative Study Group on Antihypertensive Agents (1970) Effects of
treatment on morbidity in hypertension. II. Results in patients with diastolic blood pressures
averaging 90 through 114 mm Hg. JAMA 213(7):1143–1152
Part II
Conduct and Management
Investigator Responsibilities
6
Bruce J. Giantonio
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Investigators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Overall Responsibilities of Investigators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Research Study Design and Conduct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Conduct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Safeguards to Protect Research Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Adverse Event Reporting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Safety Oversight . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
IRB Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Informed Consent Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Protocol Noncompliance and Research Misconduct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Protocol Noncompliance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Research Misconduct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Regulations and Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
B. J. Giantonio (*)
The ECOG-ACRIN Cancer Research Group, Philadelphia, PA, USA
Massachusetts General Hospital, Boston, MA, USA
Department of Medical Oncology, University of Pretoria, Pretoria, South Africa
e-mail: [email protected]
Abstract
The research atrocities committed during World War II using human subjects
prompted the development of a body of regulations, beginning with the Nurem-
berg Code, to ensure that human subjects’ research is safely conducted and
prioritizes the rights of the individual over the conduct of the research. The
resultant regulations guiding human subjects’ research affect protocol design,
the selection of participants, safety reporting and oversight, and the dissemination
of research results. The investigator conducting research on human subjects must
be familiar with those regulations to meet his/her responsibility to protect the
rights and welfare of research participants.
Keywords
Belmont Report · The Common Rule · Delegation of tasks · Drug accountability ·
Good clinical practice · Informed consent process · Institutional review board
(IRB) · Investigator · Noncompliance · Scientific misconduct
Introduction
Definitions
Research
Investigators
The definitions of investigator used in the existing guidelines and policies vary but in
general describe any individual responsible for the conduct of research involving
human subjects. The conduct of clinical research can be limited to one research site
or performed across hundreds of sites, and the term investigator (or principal inves-
tigator) can apply to the person responsible either for the study as a whole, or for an
individual site. For purposes of clarity, and unless otherwise stated, we will use the
term investigator to encompass the investigator of single-site research, the lead
investigator of multi-site research, and site-specific investigators of multi-site research.
Design
Four of the seven requirements for the ethical conduct of clinical research
(Emanuel et al. 2000) relate to the design of the research: value, scientific validity,
fair subject selection, and a favorable risk-benefit ratio (Table 1).
To justify exposing participants to potential harm, the proposed research must provide generalizable knowledge that contributes to the common good. There must be uncertainty within the medical community, or "clinical equipoise," regarding the question being asked by the research, and the methodology for obtaining that knowledge must be appropriate to the question and rigorously applied. Risks to participants must be minimized, and subject selection must be done such that both the risks and benefits of the research are fairly distributed, with subjects excluded only for valid safety or scientific reasons.
Conduct
interest, financial, or otherwise, that could influence their judgment for the inclusion
of subjects in the research or the interpretation of the findings.
Safety Oversight
research and expertise in the particular disease being studied, the majority of whom
are unaffiliated with the specific research project and are free from other conflicts of
interest.
Both IRBs and DSMCs are provided with safety data during the conduct of the
study. Any severe or unanticipated adverse event is to be reported in a timely manner
and according to requirements included in the protocol; all others are submitted
according to a preplanned review schedule. Data and Safety Monitoring Committees
review not only adverse effects but also outcomes data at preplanned intervals to
ensure that the continuation of the research is justified. The reports from Data and
Safety Monitoring Committees are usually submitted to the IRB of record for the
specific research project.
IRB Requirements
Unless exempt from review, investigators are responsible for interactions with the
IRB, including initial IRB approval, approval for any modifications to the research,
safety reporting, and, as required, continuing review of the research.
In order for the IRB to effectively evaluate and monitor a clinical research project,
investigators must provide the IRB of record with sufficient information to make
their determinations regarding the initiation of the clinical research and its ongoing
conduct. This includes new safety and outcomes information, deviations from study-specified or IRB requirements (section "Protocol Noncompliance and Research
Misconduct” below), as well as unanticipated problems involving risks to subjects
or others.
Additionally, the investigator is responsible for informing the IRB of record of
any significant new finding that emerges during the conduct of the research that
might affect a participant’s willingness to continue to participate in the research.
Based on the significance of the new findings, the investigator may be responsible for providing them to enrolled participants, and modifications to the study to account for the new findings may require a suspension of accrual.
It is important to note that the inclusion of vulnerable populations (such as
children, students, prisoners, and pregnant women) in clinical research may require
additional IRB-approved safeguards for their participation, and the investigator is
responsible for implementing those safeguards.
Informed Consent Process
The informed consent process and the consent form are required to ensure that a
potential participant in a clinical research project has enough information about the
specific project, and about clinical research in general, to make an informed and
autonomous decision about their participation.
For clinical research that requires the participant’s informed consent, the inves-
tigator is responsible for ensuring that consent is obtained and documented as
approved by the IRB and according to federal, state, and local requirements. In
addition, each participant is to be provided a copy of the informed consent document
when written consent is required.
Investigators are required to allow for monitoring and auditing of the research by
the IRB of record, sponsors, and any applicable regulatory agencies.
The investigator is responsible for reporting to the IRB any instances of protocol
noncompliance and misconduct. Reporting to the sponsor and other regulatory
agencies may be required as well. In addition, the suspension or termination of an
IRB approval may also require reporting to the sponsor and to the appropriate
federal, state, and local regulatory agencies.
Protocol Noncompliance
Research Misconduct
Summary and Conclusion
The responsibilities of the clinical research investigator can cover the lifespan of the study and are intended to protect the rights and welfare of the research participants and
the integrity of the research itself. And while much of the work of clinical research is
delegated to others, the overall responsibility remains with the investigator. Non-
compliance with the requirements of the research protocol can compromise both
participant safety and study integrity.
Key Facts
Regulations and policies from which the investigator responsibilities are derived:
a. The Belmont Report: US Department of Health and Human Services, Office for
Human Subjects Research: Belmont Report. Ethical Principles and Guidelines for
the Protection of Human Subjects of Research.
http://www.hhs.gov/ohrp/humansubjects/guidance/belmont.html
b. US Code of Federal Regulations
eCFR – Code of Federal Regulations
The two relevant sections of the Code of Federal Regulations that apply to the
conduct of clinical research are Title 21 CFR Food and Drugs and 45 CFR Public
Welfare.
i. Title 21 CFR: Food and Drugs
https://gov.ecfr.io/cgi-bin/text-idx?SID=027d2d7bd97666fc9f896c580d5039dc
&mc=true&tpl=/ecfrbrowse/Title21/21tab_02.tpl
This section establishes many of the regulations concerning the investigation of
new agents and devices and forms the basis of the policies of the Food and Drug
Administration. Relevant sections of 21 CFR are:
21 CFR 312.50: General Responsibilities of Investigators
21 CFR 812.100: General Responsibilities of Investigators (Devices)
21 CFR 812.110: Specific Responsibilities of Investigators (Devices)
21 CFR 11: Electronic Records; Electronic Signatures
21 CFR 50: Protection of Human Subjects
21 CFR 54: Financial Disclosure by Clinical Investigators
21 CFR 56: Institutional Review Boards
ii. Title 45 CFR: Public Welfare
Section 45 CFR 46: “Regulation for the Protection of Human Subjects in
Research”
https://gov.ecfr.io/cgi-bin/text-idx?SID=6ddf773215b32fc68af87b6599529417
&mc=true&node=pt45.1.46&rgn=div5
45 CFR 46 applies to all research involving human subjects that is conducted or
funded by US Department of Health and Human Services (DHHS) and has been
widely adopted as guidance for clinical research outside of that funded by DHHS.
Subpart A: The Common Rule
Subpart B: Additional protections for pregnant women, human fetuses and
neonates
Subpart C: Additional protections for prisoners
Subpart D: Additional protections for children
c. Good Clinical Practice Guidelines (GCP ICH-E6R2)
The International Council for Harmonisation of Technical Requirements for Phar-
maceuticals for Human Use (ICH) was established in 1990 with the stated mission to
achieve greater harmonization worldwide to ensure that safe, effective, and high-
quality medicines are developed and registered in the most resource-efficient manner.
GCP ICH-E6R2 is one of the products created by this organization to define
“good clinical practices” for clinical research. These guidelines are commonly used to guide clinical research around the world. Of note, the use of the term “good clinical
practice” in this context is distinct from that applied to day-to-day patient care.
https://www.ich.org/products/guidelines/efficacy/efficacy-single/article/integrated-
addendum-good-clinical-practice.html
d. CIOMS International Ethical Guidelines for Biomedical Research
The Council for International Organizations of Medical Sciences (CIOMS) is an
international nongovernmental organization in an official relationship with the World
Health Organization (WHO). The guidelines focus primarily on rules and principles
to protect humans in research and to reliably safeguard the rights and welfare of
humans.
https://cioms.ch/shop/product/international-ethical-guidelines-for-health-related-
research-involving-humans/
e. Additional Guidance Documents
Attachment C: https://www.hhs.gov/ohrp/sachrp-committee/recommendations/
2013-january-10-letter-attachment-c/index.html
FDA Guidance for Industry on Investigator Responsibilities: https://www.fda.
gov/downloads/Drugs/.../Guidances/UCM187772.pdf
Cross-References
References
Baer AR, Devine S, Beardmore CD, Catalano R (2011) Clinical investigator responsibilities. J
Oncol Pract 7:124–128. https://doi.org/10.1200/JOP.2010.000216
Emanuel EJ, Wendler D, Grady C (2000) What makes clinical research ethical? JAMA
283:2701–2711. https://doi.org/10.1001/jama.283.20.2701
Centers Participating in Multicenter Trials
7
Roberta W. Scherer and Barbara S. Hawkins
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Roles and Functions of Resource Centers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Coordinating Centers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Other Resource Centers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Clinical Centers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
The Implementation Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Participant Enrollment Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Treatment and Follow-Up Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Participant Closeout Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Post-Funding Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Abstract
Successful conduct of multicenter trials requires many different types of activities,
implemented by different types of centers. Resource centers are those involved in
planning the trial protocol, overseeing trial conduct, and analyzing and interpreting
trial data. They include clinical and data coordinating centers, reading centers, central
laboratories, and others. Clinical centers prepare for the trial at their setting and
accrue, treat, and follow up study participants. Each center has specific responsibilities, which are tied to the trial phase and wax and wane over the course of the trial. Activities during the planning phase are mostly the purview of the clinical and data coordinating centers, which are responsible for obtaining funding and designing a trial that will answer the specific research question being asked. The initial design phase and the protocol development and implementation phase see both resource centers and clinical centers making preparations for the trial to be conducted. The main responsibilities of clinical centers during the participant recruitment, treatment, and follow-up phases are to recruit, randomize, treat, and follow study participants and collect and transmit study data to the data coordinating center. The resource centers manage drug or device distribution, receive and manage data, and monitor trial progress. Clinical centers complete closeout visits during the participant closeout phase, while resource centers complete final data management activities, data analysis, and interpretation. The termination phase finds investigators from all centers involved in manuscript writing activities. Collaboration among all centers during all phases is essential for the successful completion of any multicenter trial.

Keywords
Resource center · Clinical center · Coordinating center · Reading center · Central laboratory · Multicenter trial · Trial phase

R. W. Scherer (*)
Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
e-mail: [email protected]

B. S. Hawkins
Johns Hopkins School of Medicine and Bloomberg School of Public Health, The Johns Hopkins University, Baltimore, MD, USA
e-mail: [email protected]
Introduction
In this chapter, we discuss the types of centers that form the organizational units of
a multicenter trial. As implied by the term multicenter, different types of centers are typically required to conduct a multicenter trial, each with specific responsibilities
but which together perform all required functions. Resource centers are those with
expertise and experience in performing specific tasks and include groups such as
coordinating centers, data management centers, central laboratories, reading centers,
and quality control centers among others. Distinct from the resource centers are
clinical centers, whose main function is the accrual, treatment, and follow-up of
study participants, thus forming an integral part of any multicenter trial.
Roles and Functions of Resource Centers
Resource centers may be established to provide expertise to multiple clinical trials, and possibly other
types of research studies, or may be organized to serve a specific trial or group
of trials.
Coordinating Centers
All multicenter trials have at least one center with overall responsibility for
the scientific conduct of the trial. Two types of coordinating centers are common:
clinical coordinating centers and data or statistical coordinating centers. In multi-
national trials, there may be multiple regional or national coordinating centers
(Brennan 1983; Alamercery et al. 1986; Franzosi et al. 1996; Kyriakides et al.
2004; Larson et al. 2016).
for clinical coordinating centers are assigned to the (data) coordinating center. The
senior statistician for the trial typically is located at the data coordinating center and
may be the principal investigator (or a co-investigator) for the funding award to this
center. In some cases, an epidemiologist or a person with other related expertise may
be the principal investigator.
Typically, the expected principal investigator and/or the senior trial statistician
participates in the design of the trial and preparation of the trial protocol (Williford
et al. 1995).
Coordinating centers often serve as the trial communications center and informa-
tion source for investigators, trial leadership, other trial personnel, and sponsor.
Personnel at these centers provide expertise regarding research design and methods
and, often, experience gained from participation in other clinical trials. They oversee
the treatment allocation (randomization) process and serve as the scientific con-
science of the investigative group. This trial resource center has primary responsi-
bility for assembling and maintaining an accurate, complete, and secure trial
database and for analysis of data accumulated for the trial.
Typical responsibilities of data coordinating centers include:
• Select and implement the information technology methods that will be used for
the trial (McBride and Singer 1995).
• Design and implement the methods for collecting and recording the data required
to address the goals of the trial (Hosking et al. 1995).
• Design and implement the randomization schema and the methods for assigning trial participants to treatment arms and communicating the assignment to participants and trial personnel as required (a minimal sketch follows this list).
• Develop and monitor methods for masking trial personnel and participants to
treatment assignment as required to preserve the integrity of the trial.
• Develop methods for storing, managing, securing, and reporting accumulating
trial data reported by clinical centers, other resource centers, and committees
assigned responsibility for coding events or other aspects of participant
follow-up.
• Provide for regular communication with personnel at clinical centers and other
resource centers regarding protocol issues and data anomalies.
• Develop methods for assessing and reporting data quality, including data provided by participants, clinical center personnel, and personnel at other resource centers (Gassman et al. 1995); a sketch of simple edit checks also follows this list.
• Develop methods for reporting accumulated data for groups assigned to monitor
the progress of the trial, data quality, and comparative safety and effectiveness of
trial treatments (McBride and Singer 1995).
• Cooperate with external monitors of the coordinating center (Canner et al. 1987).
• Participate in preparation of manuscripts to report trial methods and outcomes.
In particular, the trial statisticians typically are responsible for performing all
data analyses included in reports, including selecting and describing appropriate
methods of statistical analysis and verifying all data reported and their
interpretation.
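To make the allocation item above concrete, the following is a minimal sketch of stratified permuted-block randomization, one common way a coordinating center implements treatment assignment. The language (Python), the names BlockRandomizer and make_block, and the block size are illustrative assumptions rather than anything prescribed in this chapter; a production system would add allocation concealment, an audit trail, and integration with the trial database.

```python
# Minimal sketch of stratified permuted-block randomization.
# All names and parameters here are illustrative, not a prescribed design.
import random
from collections import defaultdict

def make_block(arms, reps, rng):
    """Return one randomly permuted block, e.g. ['A', 'B', 'B', 'A'] for reps=2."""
    block = list(arms) * reps
    rng.shuffle(block)
    return block

class BlockRandomizer:
    def __init__(self, arms=("A", "B"), block_reps=2, seed=2024):
        self.arms = arms
        self.block_reps = block_reps      # block size = len(arms) * block_reps
        self.rng = random.Random(seed)    # fixed seed makes the schedule reproducible
        self.pending = defaultdict(list)  # stratum -> assignments remaining in block

    def assign(self, stratum):
        """Assign the next participant enrolled in `stratum` to a treatment arm."""
        if not self.pending[stratum]:
            self.pending[stratum] = make_block(self.arms, self.block_reps, self.rng)
        return self.pending[stratum].pop(0)

# Example: six enrollments across two clinical centers used as strata,
# guaranteeing near-balance of arms within each center.
r = BlockRandomizer()
for i, site in enumerate(["site01", "site01", "site02", "site01", "site02", "site02"]):
    print(f"participant {i + 1} at {site}: arm {r.assign(site)}")
```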
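For the data-quality item above, the sketch below illustrates the kind of automated edit checks coordinating centers commonly apply to incoming data: range checks on single fields and cross-field consistency checks that generate queries back to the clinical center. The field names, plausible-value limits, and query wording are invented for illustration; they are not drawn from this chapter or from Gassman et al. (1995).

```python
# Minimal sketch of automated edit checks on one incoming visit record.
# Field names and limits are invented for illustration only.
from datetime import date

RANGE_CHECKS = {
    "systolic_bp": (70, 250),  # mm Hg, plausible-value window
    "age_years":   (18, 100),
}

def edit_check(record):
    """Return a list of human-readable data queries for one visit record."""
    queries = []
    for field, (lo, hi) in RANGE_CHECKS.items():
        value = record.get(field)
        if value is None:
            queries.append(f"{field}: missing value")
        elif not lo <= value <= hi:
            queries.append(f"{field}: {value} outside expected range {lo}-{hi}")
    # Cross-field consistency: a follow-up visit cannot precede randomization.
    if record["visit_date"] < record["randomization_date"]:
        queries.append("visit_date precedes randomization_date")
    return queries

record = {
    "systolic_bp": 300,                       # out of range -> generates a query
    "age_years": 54,
    "randomization_date": date(2023, 5, 1),
    "visit_date": date(2023, 4, 28),          # inconsistent -> generates a query
}
for q in edit_check(record):
    print("QUERY:", q)
```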
To meet these responsibilities, staff of the data coordinating center must include
personnel with expertise in several areas. There is no formula that applies to every
trial. Besides statistical and information technology expertise at various levels, staff typically include clerical and other support personnel. It is essential that
coordinating center personnel be able to interact effectively with trial investigators
and personnel at other trial centers. When some of the trial data are collected
by coordinating center personnel, for example, through telephone interviews
for patient-reported outcomes or central long-term follow-up of participants for
outcomes (Peduzzi et al. 1987), the personnel may include telephone interviewers.
Regardless of the expertise or roles of individual coordinating center personnel,
all must be trained in the trial protocol and procedures.
The manuals of operations/procedures prepared for individual trials are useful
resources for identifying the many responsibilities assigned to data coordinating
centers, the ways in which personnel and investigators at those centers have met their
responsibilities, and the organizational structure of data coordinating centers.
In clinical trial networks, a single coordinating center may serve all trials,
or subgroups of personnel may be designated to participate in individual trials
(Blumenstein et al. 1995). A data coordinating center also may be created to
participate in a single multicenter trial. The organization of the coordinating center
depends on the trial setting and, often, the trial sponsor and funding source.
In the United Kingdom, clinical trials units (another name for coordinating centers)
currently undergo registration to assure that they meet requirements regarding (1)
expertise, continuity, and stability; (2) quality assurance; (3) information systems;
and (4) statistical input (McFadden et al. 2015). The importance of development and
documentation of standard operating procedures (SOPs) for these units is empha-
sized in the UK registration process (McFadden et al. 2015) and by others
(Krockenberger et al. 2008).
Because coordinating center responsibilities evolve during the course of a trial, it is
useful to consider changes in responsibilities by trial phase. Phases of a trial, adapted
from the Coordinating Center Models Project [CCMP], are defined in Table 1.
Common coordinating center responsibilities by trial phase are summarized in Table 2.
Other Resource Centers
Other resource centers required for an individual trial depend on the goals of the trial and the need for standardization of trial procedures.
Table 1 Phases of a multicenter clinical trial. (Adapted from Coordinating Center Models Project
Report No. VI)
Planning and pilot phase: Ends with submission of funding application(s) to sponsor
Initial design phase: Ends with funding for the trial
Protocol development and implementation phase: Ends with initiation of participant
recruitment
Participant enrollment phase: Ends with completion of participant recruitment
Treatment and follow-up phase: Ends with initiation of participant closeout
Participant closeout phase: Ends with completion of participant closeout
Termination phase: Ends with termination of funding for the trial
Post-funding phase: Ends with completion of all trial activities and publications
Comment
Phases of an individual trial are not always clearly defined regarding beginning and ending dates
or events. Activities of a phase may overlap with those of another; hence the end of each phase is
defined above by an “event.” Activities of the trial centers vary by trial phase; in fact, the
coordinating center investigators must be constantly planning for the next phase
Resource centers may be created to serve an individual trial or serve multiple trials that have similar needs. Some
of the responsibilities assigned to resource centers have been:
Table 2 Common responsibilities of [data] coordinating centers and clinical centers by trial phase. (Adapted from the Coordinating Center Models Project and other sources)

[Data] coordinating centers | Clinical centers (a hyphen marks an empty cell)

Planning and pilot phase
Participate in literature review/meta-analyses to assess and document need for contemplated trial | -
Participate in design and analysis of pilot studies to assess feasibility of trial design and methods | Conduct pilot studies
Meet with investigators expected to direct other resource centers | -
Visit one or more clinical centers expected to participate in trial | Engage in discussions regarding possible participation in trial

Initial design phase
Estimate required sample size | -
Outline the data collection schedule | -
Outline quality assurance and monitoring procedures | -
Outline data analysis plans | -
Outline data intake and editing procedures | -
Prepare funding proposal for the [data] coordinating center | Review proposed clinical center budget
Participate in preparation of qualifications of clinical centers and the selection process | -
Work with the proposing study chair to coordinate the overall funding application package | -
Coordinate development of a draft manual of procedures for the trial | Provide input on manual of procedures if asked

Protocol development and implementation phase
Register or assure registration of trial in a clinical trials registry | Determine feasibility of integrating trial into clinic setting
Develop patient treatment allocation (randomization) procedures | Constitute study team and select primary clinical coordinator
Develop computer software and related procedures for receiving, processing, editing, and analyzing study data | Complete Good Clinical Practice and ethics training
Develop and test study data forms and methods used for completion and submission by clinical center personnel | -
Oversee development of interfaces for data transmission between individual resource centers and clinical centers | Organize infrastructure including telephone, computer and Internet, courier services
Coordinate drug or device distribution | Procure equipment, including refrigerators or freezers, lockable cabinets, and any special equipment
Train clinical center personnel in the data collection and transmission process | Attend training meeting
Implement training certification in the trial protocol at clinical centers | Complete certification requirements
Distribute study forms and related study material for use in next phases of the trial | -
Designate one or more coordinating center investigators to serve on each trial committee | Institute interdepartmental communication pathways, including pharmacy, radiology, laboratory, etc.
Modify and refine manual of procedures and distribute to all trial centers | Organize filing system and binders, including those for essential documents
Develop and document internal procedures for coordinating center operations and responsibilities | Organize space, including interview or exam rooms, storage areas for confidential materials, and drugs or devices
Act as a repository for official records of the trial: minutes of meetings, committee reports, etc. | -
When agreed with sponsor, reimburse clinical centers and other resource centers and others based on the funding award | Complete budgeting and contractual negotiations
Participate in creating the application for local and/or study-wide institutional review boards/ethics committees | Incorporate trial protocol and informed consent statement template into local ethics board template and submit for approval
Develop and implement dedicated website with trial information suitable for access by multiple stakeholders | -

Participant recruitment phase
Develop templates for recruitment materials | Develop recruitment materials or use templates developed by coordinating center; submit to local ethics review board
- | Implement recruitment activities
Administer treatment assignments. Periodically check (1) baseline comparability of treatment arms and (2) characteristics of participants versus target population and eligibility criteria | Screen potential study participants; complete eligibility testing on potential study participants
Implement editing procedures to detect data deficiencies | Complete all baseline data collection forms to determine eligibility of potential study participant
Develop monitoring procedures and prepare data reports to summarize performance of participating clinical centers with patient recruitment | Obtain formal informed consent from study participant and complete randomization process
Develop monitoring and reporting procedures to detect evidence of adverse or beneficial effects of trial treatments | -
Respond to requests for reports and data analyses from within the trial organization | -
Implement and lead quality assurance and monitoring program | -
Schedule and participate in site visits to clinical centers and other resource centers | Participate in site visits, and respond to queries by site visitors
Prepare progress reports for trial sponsor | -
Prepare, or collaborate in preparing, any requests for continued or supplemental funding by the sponsor | -
Prepare a manuscript to describe the trial design | -

Participant treatment and follow-up phase
Monitor drug or device distribution | Administer treatment as assigned by randomization, including accountability activities for study medications or devices
Monitor treatment adherence | Complete required documents, or provide materials related to treatment adherence
Prepare periodic reports of the data concerning adverse and beneficial effects of trial treatments | -
Monitor and report adverse events to sponsors as required | Identify and report all serious adverse events as required by trial and federal agencies and local ethics committees
Prepare periodic reports of the performance of all trial centers | Identify and resolve all protocol deviations
- | Schedule all study visits, including any logistical issues (e.g., travel for participant(s), scheduling radiology or surgery, etc.)
Evaluate data handling procedures and modify as necessary | Complete all study follow-up visits and associated data collection forms, and transmit data to data coordinating center
Analyze baseline and related data for publication, as appropriate | Respond to data queries from coordinating center
Prepare materials for investigator meetings | Attend all investigator group meetings
Prepare summary of trial results for individual participants for use in closeout discussion and final data collection | -
Develop and test data forms for patient closeout phase | -
Initiate/lead searches for participants lost to follow-up by clinical centers | -
Work with trial leadership and investigators to develop a publication plan | -
Coordinate participant closeout process | -
- | Complete annual progress reports for ethics review boards

Participant closeout phase
Collect participant closeout data | Complete closeout study visit
Coordinate and monitor progress with participant closeout | -
Monitor adherence to closeout procedures | -
Develop plans for final checks on completeness and accuracy of trial database | Complete all remaining data queries from coordinating center
Develop and test analysis programs for any additional data summaries or analyses | -
Develop plan for final disposition of final trial database and accumulated materials | -
Participate in reorganization of trial for final phases, including disengagement of clinical centers | Complete all center closeout activities, including final reports
Continue to participate in preparation of manuscripts to disseminate trial findings and methods | -
Coordinate and monitor progress with trial manuscripts | Participate in manuscript writing, as required

Termination phase
Perform final quality checks of trial database | -
Implement plans for documentation and disposition of final database and other trial records | -
Advise clinical center personnel regarding disposition of local trial records | Archive or dispose of all study documents as required by sponsor
Continue activities regarding preparation and publication of manuscripts | Participate as co-author on study publications, if requested
Monitor collection and disposal of unused study medications and supplies | Complete accountability of study medications or devices and all study materials
Undertake final efforts to determine or confirm the vital status of all trial participants | -
Provide writing teams with data analyses and summaries needed to complete manuscripts | Present study findings at a conference, as requested
Circulate manuscripts for review by trial investigators prior to submission for publication | Review final manuscripts
Central laboratories likewise have played important roles in many multicenter trials (Cooper et al. 1986; Rosier and Demoen 1990; Davey 1994; Davis 1994; Dijkman and Queron 1994; Harris 1994; Wilkinson 1994; Lee and Hill 2004; Sheintul et al. 2004; Strylewicz and Doctor 2010). The integration of local and central roles in an
international multicenter trial has been described by Nesbitt et al. (2006).
Table 3 Example of resource centers and selected responsibilities in the Collaborative Ocular
Melanoma Study
Study chairman’s office
Organize planning meetings of prospective center investigators
Identify potential clinical centers and resource centers
Interact with sponsor to assess feasibility of support
Prepare core study funding applications in collaboration with coordinating center investigators
Design and develop informational materials for prospective study participants
Design and disseminate materials to inform community oncologists and ophthalmologists
about the trials
Schedule meetings of investigators and committees; develop meeting agendas in collaboration
with coordinating center investigators and committee chairs
Monitor adherence of investigators to study policy for presentation and publication of study
data
Participate in preparation of manuscripts to disseminate trial findings
Advise the coordinating center investigators regarding issues outside their areas of expertise
Coordinating center
Coordinate preparation and submission of funding applications to sponsor
Coordinate study communications
Enroll and randomize eligible participants
Maintain the COMS Manual of Procedures
Develop methods for and coordinate data collection
Maintain the COMS database
Monitor data quality at other centers and internally
Analyze and report accumulating data to appropriate groups
Maintain study documents
Coordinate preparation of manuscripts to report trial findings
Archive the final database and documentation after study completion
Ophthalmic echography reading center
Train clinical center echographers in study methods
Confirm diagnosis of choroidal melanoma based on photographs from baseline echographic
examination
Monitor quality of echography
Measure tumor height to monitor changes after brachytherapy
Assess topographic features of tumors
Photograph reading center
Train clinical center photographers in study methods
Monitor quality of photography
Confirm diagnosis of choroidal melanoma based on characteristics observed on photographs
and fluorescein angiograms of tumor
Describe changes in boundaries of tumor base
Describe retinopathy following brachytherapy
Pathology center
Describe tumor characteristics based on external and microscopic examinations of enucleated
eyes
Provide technical processing of enucleated eyes sent from clinical centers
Coordinate activities of the Pathology Review Committee
Radiological physics center
Participate in development of radiotherapy protocols with radiation oncologist study co-chair
Assess accuracy of clinical center calculation of radiation dose, and notify clinical center in case
of disagreement
Disseminate instructive findings to clinical center personnel
Sponsor: National Eye Institute
Monitor overall study progress
Observe adherence to study goals
Image analysis and interpretation centers were used in early multicenter trials for standardized interpretation of electrocardiographic tracings from participants (Prineas and Blackburn 1983; Rautaharju et al. 1986). Over time, manual review and coding has largely been replaced with automated methods (Goodman 1993). Resource
centers with similar roles for other types of images have become common as new
imaging methods have been developed and implemented clinically to monitor the
effects of treatment. They have been widely used in oncology (Chauvie et al. 2014;
Gopal et al. 2016) and ophthalmology trials (Siegel and Milton 1989; Danis 2009;
Price et al. 2015; Toth et al. 2015; Domalpally et al. 2016; Rosenfeld et al. 2018) but
also in trials in other medical settings (Desiderio et al. 2006; Ahmad et al. 2016).
Central pharmacies/procurement and distribution centers also have a long
history in multicenter clinical trials. They have been established to aid with masking
of treatment assignments in pharmaceutical trials and with distribution of supplies
to clinical centers and participants (Fye et al. 2003; Martin et al. 2004; Peterson et al.
2004; Rogers et al. 2016).
Adjudication centers or committees have been created for many trials to confirm
or code outcomes reported by clinical center personnel or trial participants (Moy
et al. 2001; Pogue et al. 2009; Marcus et al. 2012; Barry et al. 2013). The need for
central adjudication of outcomes such as death has been debated (Granger et al.
2008; Meinert et al. 2008).
Other types of resource centers have been less common but have played impor-
tant roles in multicenter clinical trials to date (Glicksman et al. 1985; Kempson 1985;
Sievert et al. 1989; Carrick et al. 2005; Henning 2012; Shroyer et al. 2019).
A registry of resource centers with various types of expertise and experience with
participation in multicenter clinical trials would be a useful resource for designers
of future trials. Likewise, a registration system similar to that used in the United Kingdom for clinical trials units could be adapted to other types of resource centers.
Regardless of the role of a resource center, interactions with the (clinical and data)
coordinating centers and clinical center personnel are required to assure that trans-
mission of information and materials from clinical centers to the resource center is
timely and accurate and that the information transmitted to the trial database is linked
to the correct trial participant and examination.
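As a concrete illustration of this linkage requirement, a minimal sketch follows (not from the chapter; the field names participant_id and visit_code and all values are hypothetical), showing how incoming resource-center records might be checked against the schedule of expected participant examinations:

```python
# Sketch of a linkage check run on records arriving from a resource center
# (e.g., a reading center). All identifiers and field names are hypothetical.

expected_visits = {          # (participant, visit) pairs known to the trial database
    ("P-0001", "BASELINE"),
    ("P-0001", "M06"),
    ("P-0002", "BASELINE"),
}

def unmatched_records(incoming):
    """Return records whose (participant, visit) key is not on the schedule."""
    return [rec for rec in incoming
            if (rec["participant_id"], rec["visit_code"]) not in expected_visits]

batch = [
    {"participant_id": "P-0001", "visit_code": "M06", "tumor_height_mm": 4.2},
    {"participant_id": "P-0009", "visit_code": "M06", "tumor_height_mm": 3.1},
]
for rec in unmatched_records(batch):
    # Unmatched records trigger a query back to the resource center rather
    # than being loaded into the trial database.
    print("query to resource center:", rec["participant_id"], rec["visit_code"])
```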
Clinical Centers
Clinical centers are a unique type of “resource” center. Without committed and fully functioning clinical centers, a multicenter trial is doomed to failure. Clinical centers
or sites are the engines that generate the data needed to answer the research question
posed by the trial. The historical model for a clinical center has been a single
academic center, but other workable models have emerged with the inclusion of
clinical practice sites (Submacular Surgery Trials Research Group 2004; Dording
et al. 2012), nontraditional sites such as nursing homes (Kiel et al. 2007), and
international centers (Perkovic et al. 2012). In some cases, with the organization
of the clinical trial networks, a network of dedicated clinical sites may contribute to
multiple related trials (Beck 2002; Sun and Jampol 2019). Even though clinical
centers may be located at any one of these types of sites, all are responsible for
administrative functions, patient interactions, data management functions, and inter-
actions with coordinating and other resource centers. Similarly, personnel within
a clinical center can vary depending on the complexity of the trial being conducted,
but all clinical centers have a principal investigator who is usually the clinician
actively involved in the trial and a clinical coordinator, who handles day-to-day
functions. Overall, clinical centers communicate with, and are responsive to, the clinical coordinating or data coordinating center and often other resource centers.
The first responsibility of the clinical center is to learn about the incipient trial. As the
principal investigator, the clinical center director must have a thorough understand-
ing of the research question, the trial design, the operations, and the possible impact on the clinic if the center decides to join the investigative group. The primary
clinical coordinator optimally is selected at this phase of the trial, and the clinical
director and coordinator together perform many of the tasks necessary to integrate
the trial into the clinic setting.
Following clinical center selection, the clinical center enters into negotiations
for contractual or funding arrangements. Because operational failures often stem from insufficient allocation of resources (Melnyk et al. 2018), it is important to ensure
sufficient funding for all phases of the trial, including both start-up and final visit
data collection. Certainly, given the time commitment required by a trial, either time
protected from clinical responsibilities or appropriate reimbursement for research
time is helpful, if not necessary, for the principal investigator (Herrick et al. 2012).
Different models for clinical center funding exist, ranging from fixed funding based on personnel effort and anticipated costs to capitation, with reimbursement based on completion of all data entry for enrollment and study visits by participants. A combination of some fixed up-front costs and further reimbursement based on completed visits also has been used successfully (Jellen et al. 2008).
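As a toy illustration of that hybrid model (all amounts are hypothetical and not drawn from Jellen et al. 2008), the reimbursement owed to a center is a fixed start-up payment plus a per-visit rate applied to completed, data-complete visits:

```python
# Hybrid clinical center payment model: fixed up-front payment plus
# capitated reimbursement per completed study visit. Amounts are hypothetical.

FIXED_STARTUP = 15_000.00   # one-time start-up payment
RATE_PER_VISIT = 450.00     # payment per completed, data-complete visit

def site_payment(completed_visits: int) -> float:
    """Total reimbursement owed to a clinical center to date."""
    return FIXED_STARTUP + RATE_PER_VISIT * completed_visits

print(site_payment(120))    # -> 69000.0
```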
Responsibility for protocol adherence ultimately rests with the principal investigator, but
ensuring that it is achieved falls primarily to the clinic coordinator.
Ethical Approval
Before participating in the trial, each institution must obtain approval from its local ethics board or a commercial review board. Often the coordinator or
regulatory monitor drafts the required materials for ethics review by using templates
of the study protocol and consent forms provided by the data or clinical coordinating
center. Continuing interactions with the ethics committee include obtaining approval
for any amendments to the protocol, approval for ancillary studies, notification
of any serious adverse events that occur, and submission of annual progress reports.
If there is a data monitoring committee for the trial as a whole, clinical centers also
submit the report of this committee following each meeting.
Organization Binders/Files
Prior to trial initiation, the coordinator or regulatory monitor typically sets up all
the requisite binders and trial files, whether paper-based or electronic, and organizes
all correspondence. Binders include those for essential documents as mandated
by Good Clinical Practice, a current manual of procedures or handbook, required
logbooks, ethics board correspondence, a scheduling book, and study participant
binders, among others. The coordinator maintains currency of these documents and
files as the trial progresses.
Training and Certification
Before the trial begins, and as new staff join, clinical center staff members go through the process of certification to participate in the trial. Usually there are
requirements for certification that must be met to allow data collection or treatment
administration within the trial. The purpose of certification is to demonstrate
knowledge of the trial and competency in the role the staff member will hold.
Requirements may include reading the manual of procedures, demonstrating under-
standing of the trial objectives, and design and skill in administration of tests
or questionnaires. Certification may also require submission of “dummy” data
collection forms or data collected using trial materials (e.g., an audiotape for
counseling). For surgical procedures, documentation of experience or submission
of videotapes may be required. With staff replacements, training and certification
continue during the trial, especially for longer trials (Mobley et al. 2004). While
the trial is ongoing, staff may be trained by an existing certified staff member or
a member of the clinical coordinating center or data coordinating center who visits
the clinical center for this purpose. Other options include special training meetings,
webinars with didactic lectures and/or videotapes, or slide sets or videotapes from
previous training meetings.
Recruitment
Following ethics review board approval, accrual of study participants, or recruit-
ment, proceeds. Recruitment materials, such as educational brochures or poster
templates, are used to facilitate identification of potential study participants. These
materials are either prepared using templates provided by study resource centers or
may be developed at the clinic and specifically aimed at the local population (Mullin
et al. 1984). Other recruitment activities include disseminating information through
grand rounds or local presentations or directly using letters or personal meetings with
colleagues or other persons who have access to the potential patient population.
Informed Consent and Randomization
Each study participant (or authorized representative) signs the informed consent statement; a copy is given to the participant, and a copy is kept for the clinic
and is stored in a binder or file designated for that purpose. Randomization and
treatment assignment typically occurs only after informed consent has been obtained
and documented. Randomization is often achieved via an online system, although in
small or single-center studies, sealed opaque-coded envelopes may be issued to
provide the assigned treatment. Documentation related to randomization is stored in
the participant binder or file.
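A minimal sketch of what such a randomization system implements is shown below, assuming permuted blocks; the arm labels, block size, and seed are illustrative choices, not recommendations from this chapter:

```python
# Permuted-block randomization: each block contains a balanced mix of arms,
# so assignments stay balanced throughout accrual. Parameters are illustrative.
import random

def permuted_blocks(n_participants, arms=("A", "B"), block_size=4, seed=2024):
    rng = random.Random(seed)         # fixed seed makes the schedule reproducible
    per_arm = block_size // len(arms)
    schedule = []
    while len(schedule) < n_participants:
        block = list(arms) * per_arm  # balanced within each block
        rng.shuffle(block)
        schedule.extend(block)
    return schedule[:n_participants]

# Whether delivered by an online system or pre-sealed opaque envelopes, the
# next assignment is revealed only after informed consent is documented.
print(permuted_blocks(10))
```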
Adverse Events
Adverse events that occur during a trial, particularly serious adverse events, must be
handled correctly and in a timely manner. Appropriate medical care for the study
participant overrides the trial protocol, especially for serious adverse events or
events that may be related to a trial intervention. The assigned treatment may need
to be discontinued and appropriate documentation prepared. All regulatory bodies,
including trial sponsors, local ethics board, and pertinent regulatory agencies, must
be notified within the time frame designated by the trial protocol.
Data Collection
Data are the backbone of clinical research, mandating proper data collection and
management. Data may be collected using paper forms, directly onto electronic
media, or indirectly through review of medical charts, taped interviews, or other
methods. Data sources include participant self-report, scheduled interviews or exami-
nations, or imaging or laboratory findings. Collection of external data also may be
required, e.g., death certificates for determination of cause of death, hospital records, or
operative notes from surgical procedures. Data are collected at various points during a
participant’s sojourn in the trial and possibly in multiple ways. At screening and
eligibility assessment, eligibility criteria are verified, and baseline data used to assess
change over follow-up may be collected. Treatment administration requires data related
to treatment adherence or compliance and associated treatment-related adverse events.
Outcome data and adverse events are collected during follow-up. Clinical center staff
must ensure faithful and accurate data collection. Data forms should be checked for
completeness and accuracy prior to transmission to the data coordinating center.
Transmission of data may be implemented using various modes (e.g., paper forms,
online, submission of electronic files) but should be completed expeditiously and
accurately while maintaining confidentiality. In addition to collecting participant-related
data, the clinic also manages other study-related data, such as drug accountability or
biospecimen collection and shipment. There may also be specific data forms dealing
with study-related events, such as protocol deviations or adherence to treatment.
Data Management
Although the database may include checks to promote accurate entry, and double-data entry may be employed, there still may be missing, out-of-range, or inconsistent data items.
Errors may occur within a single form, across forms, or across visits. Typically, the
data coordinating center routinely reviews the data and queries the clinical center about
perceived errors. Clinic staff then review each query and, if an error has occurred,
correct the paper forms or electronic record and transmit the corrected item to the data
coordinating center. These data management activities typically take more time at the
beginning of a trial or during slow recruitment when more errors occur (Taekman et al.
2010). The amount of effort for all data management activities requires a substantial
time commitment by the coordinator (Goldsborough et al. 1998).
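A minimal sketch of the automated edit checks behind such queries follows; the field names, permissible ranges, and cross-field rule are illustrative assumptions, not values from any particular trial:

```python
# Routine edit checks a data coordinating center might run before issuing
# queries to a clinic. Ranges and field names are hypothetical.

RANGES = {"systolic_bp": (70, 250), "weight_kg": (30, 250)}

def edit_checks(form):
    queries = []
    for field, (lo, hi) in RANGES.items():
        value = form.get(field)
        if value is None:
            queries.append(f"{field}: missing value")
        elif not lo <= value <= hi:
            queries.append(f"{field}: {value} outside expected range {lo}-{hi}")
    # Cross-field consistency: diastolic pressure should be below systolic.
    if form.get("diastolic_bp") and form.get("systolic_bp"):
        if form["diastolic_bp"] >= form["systolic_bp"]:
            queries.append("diastolic_bp >= systolic_bp: please verify")
    return queries

print(edit_checks({"systolic_bp": 300, "diastolic_bp": 80, "weight_kg": None}))
```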
Site Visits
Site visits are a quality assurance measure that typically includes a review of the clinic
setting, implementation of the trial protocol, and routine audits of the data. Having a site
visit scheduled often provides an incentive for a clinical center to make sure all
documents are current and well-organized. During the site visit, the site visitors typically
will observe administration of a clinical procedure, determine that adequate space and
required facilities are available to conduct the trial, review that essential documents
(paper or electronic) are current and that signed informed consent documents are
available for each study participant, and conduct a data audit. During an audit, data in
the database are compared with those recorded on paper forms or in other ways.
Discrepancies are treated as data errors and require review and correction. Common
problems encountered during site visits are inadequate consent documentation, prob-
lems with drug accountability, and protocol nonadherence (Turner et al. 1987).
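Conceptually, the audit step is a field-by-field comparison of database values against the source documents; the sketch below is illustrative only, with hypothetical records and field names:

```python
# Site-visit data audit: compare the trial database against source documents
# and report discrepancies for review and correction. Records are hypothetical.

def audit(database_record, source_record):
    """Return the fields on which database and source disagree."""
    discrepancies = {}
    for field, source_value in source_record.items():
        if database_record.get(field) != source_value:
            discrepancies[field] = {"database": database_record.get(field),
                                    "source": source_value}
    return discrepancies

db  = {"participant_id": "P-0001", "visit": "M06", "weight_kg": 82.0}
src = {"participant_id": "P-0001", "visit": "M06", "weight_kg": 28.0}
print(audit(db, src))  # -> {'weight_kg': {'database': 82.0, 'source': 28.0}}
```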
Study Meetings
Goals of full investigative group meetings vary, but such meetings generally are designed to build collaboration between investigators and coordinators. Topics covered may include
trial progress; problem-solving, especially during the recruitment phase; and issues
related to implementing the protocol in the clinic setting. Protocol amendments may
be discussed and reviewed, as well as performance reports and accumulating base-
line data. Continuing education also may be provided, by either engaging outside
speakers or having a trial investigator provide updates on the scientific literature in
the trial topic area.
Post-Funding Phase
The responsibilities of both types of centers change depending on the phase of the trial. Collaboration
among all resource centers and clinical centers is essential as they aim toward the
common goal of successfully completing a multicenter trial.
Key Facts
Cross-References
References
Ahmad HA, Gottlieb K, Hussain F (2016) The 2 + 1 paradigm: an efficient algorithm for central
reading of Mayo endoscopic subscores in global multicenter phase 3 ulcerative colitis clinical
trials. Gastroenterol Rep (Oxf) 4(1):35–38
Alamercery Y, Wilkins P, Karrison T (1986) Functional equality of coordinating centers in a
multicenter clinical trial. Experience of the International Mexiletine and Placebo Antiarrhythmic
Coronary Trial (IMPACT). Control Clin Trials 7(1):38–52
Barry MJ, Andriole GL, Culkin DJ, Fox SH, Jones KM, Carlyle MH, Wilt TJ (2013) Ascertaining
cause of death among men in the prostate cancer intervention versus observation trial. Clin
Trials 10(6):907–914
Beck RW (2002) Clinical research in pediatric ophthalmology: the Pediatric Eye Disease
Investigator Group. Curr Opin Ophthalmol 13(5):337–340
Blumenstein BA, James KE, Lind BK, Mitchell HE (1995) Functions and organization of
coordinating centers for multicenter studies. Control Clin Trials 16(2 Suppl):4s–29s
Brennan EC (1983) The Coronary Drug Project. Role and methods of the Drug Procurement and
Distribution Center. Control Clin Trials 4(4):409–417
Canner PL, Gatewood LC, White C, Lachin JM, Schoenfield LJ (1987) External monitoring of
a data coordinating center: experience of the National Cooperative Gallstone Study. Control
Clin Trials 8(1):1–11
Carrick B, Tennyson M, Lund B (2005) Managing a blood repository for use by multiple ancillary
studies in the Women’s Health Initiative. Clin Trials 2(Suppl 1):S73
Chauvie S, Biggi A, Stancu A, Cerello P, Cavallo A, Fallanca F, Ficola U, Gregianin M, Guerra UP,
Chiaravalloti A, Schillaci O, Gallamini A (2014) WIDEN: a tool for medical image manage-
ment in multicenter clinical trials. Clin Trials 11(3):355–361
Chung KC, Song JW (2010) A guide to organizing a multicenter clinical trial. Plast Reconstr Surg
126(2):515–523
Cooper GR, Haff AC, Widdowson GM, Bartsch GE, DuChene AG, Hulley SB (1986)
Quality control in the MRFIT local screening and clinic laboratory. Control Clin Trials 7(3
Suppl):158s–165s
Danis RP (2009) The clinical site-reading center partnership in clinical trials. Am J Ophthalmol 148
(6):815–817
Davey J (1994) Managing clinical laboratory data flow. Drug Inf J 28:397–402
Davis JM (1994) Current reporting methods for laboratory data at Zeneca Pharmaceuticals.
Drug Inf J 28:403–406
Desiderio LM, Jaramillo SA, Felton D, Andrews LA, Espeland MA, Tan JC, Bryan NR, Perry J,
Liu DF (2006) A multi-institutional imaging network: application to Women’s Health Initiative
Memory Study. Clin Trials 3(2):193–194
Dijkman JHM, Queron J (1994) Feasibility of Europe-wide specimen shipments. Drug Inf
J 28:385–389
Dinnett EM, Mungall MM, Kent JA, Ronald ES, Gaw A (2004) Closing out a large clinical trial:
lessons from the prospective study of pravastatin in the elderly at risk (PROSPER). Clin Trials 1
(6):545–552
Domalpally A, Danis R, Agron E, Blodi B, Clemons T, Chew E (2016) Evaluation of geographic
atrophy from color photographs and fundus autofluorescence images: Age-Related Eye Disease
Study 2 report number 11. Ophthalmology 123(11):2401–2407
Dording CM, Dalton ED, Pencina MJ, Fava M, Mischoulon D (2012) Comparison of academic and
nonacademic sites in multi-center clinical trials. J Clin Psychopharmacol 32(1):65–68
Farrell B (1998) Efficient management of randomised controlled trials: nature or nurture. BMJ 317
(7167):1236–1239
Fiss AL, McCoy SW, Bartlett DJ, Chiarello LA, Palisano RJ, Stoskopf B, Jeffries L, Yocum A,
Wood A (2010) Sharing of lessons learned from multisite research. Pediatr Phys Ther 22(4):
408–416
Franzosi MG, Bonfanti I, Garginale AP, Nicolis E, Santoro E, N. Investigators (1996) The role
of a regional data coordinating centre (RDCC) in a multi-national large phase-II trial. Control
Clin Trials 17(Suppl 2S):104S–105S
Fye CL, Gagne WH, Raisch DW, Jones MS, Sather MR, Buchanan SL, Chacon FR, Garg R, Yusuf
S, Williford WO (2003) The role of the pharmacy coordinating center in the DIG trial. Control
Clin Trials 24(6 Suppl):289s–297s
Gassman JJ, Owen WW, Kuntz TE, Martin JP, Amoroso WP (1995) Data quality assurance,
monitoring, and reporting. Control Clin Trials 16(2 Suppl):104s–136s
Glicksman AS, Reinstein LE, Laurie F (1985) Quality assurance of radiotherapy in clinical trials.
Cancer Treat Rep 69(10):1199–1205
Goldsborough IL, Church RY, Newhouse MM, Hawkins BS (1998) How clinic coordinators spend
their time. Appl Clin Trials 7(1):33–40
Goodman DB (1993) Standardized and centralized electrocardiographic data for clinical trials.
Appl Clin Trials 2(6):34, 36, 40–41
Gopal AK, Pro B, Connors JM, Younes A, Engert A, Shustov AR, Chi X, Larsen EK, Kennedy DA,
Sievers EL (2016) Response assessment in lymphoma: concordance between independent
central review and local evaluation in a clinical trial setting. Clin Trials 13(5):545–554
Granger CB, Vogel V, Cummings SR, Held P, Fiedorek F, Lawrence M, Neal B, Reidies H,
Santarelli L, Schroyer R, Stockbridge NL, Feng Z (2008) Do we need to adjudicate major
clinical events? Clin Trials 5(1):56–60
Habig RL, Thomas P, Lippel K, Anderson D, Lachin J (1983) Central laboratory quality control in
the National Cooperative Gallstone Study. Control Clin Trials 4(2):101–123
Hainline A Jr, Miller DT, Mather A (1983) The Coronary Drug Project. Role and methods of the
Central Laboratory. Control Clin Trials 4(4):377–387
Harris RAJ (1994) Clinical research aspects of sampling, storage, and shipment of blood samples.
Drug Inf J 28:377–379
Hawkins BS, Gannon C, Hosking JD, James KE, Markowitz JA, Mowery RL (1988) Report from
a workshop: archives for data and documents from completed clinical trials. Control Clin Trials
9(1):19–22
Henning AK (2012) Starting a genetic repository. Clin Trials 9(4):523
Herrick LM, Locke GR 3rd, Zinsmeister AR, Talley NJ (2012) Challenges and lessons learned in
conducting comparative-effectiveness trials. Am J Gastroenterol 107(5):644–649
Hosking JD, Newhouse MM, Bagniewska A, Hawkins BS (1995) Data collection and transcription.
Control Clin Trials 16(2 Suppl):66s–103s
Jellen PA, Brogan FL, Kuzma AM, Meldrum C, Meli YM, Grabianowski CL (2008) NETT
coordinators: researchers, caregivers, or both? Proc Am Thorac Soc 5(4):412–415
Kempson RL (1985) Pathology quality control in the cooperative clinical cancer trial programs.
Cancer Treat Rep 69(10):1207–1210
Kiel DP, Magaziner J, Zimmerman S, Ball L, Barton BA, Brown KM, Stone JP, Dewkett D,
Birge SJ (2007) Efficacy of a hip protector to prevent hip fracture in nursing home residents: the
HIP PRO randomized controlled trial. JAMA 298(4):413–422
Krockenberger K, Luntz SP, Knaup P (2008) Usage and usability of standard operating procedures
(SOPs) among the coordination centers for clinical trials (KKS). Methods Inf Med 47(6):
505–510
Kyriakides TC, Babiker A, Singer J, Piaseczny M, Russo J (2004) Study conduct, monitoring and
data management in a trinational trial: the OPTIMA model. Clin Trials 1(3):277–281
Larkin ME, Lorenzi GM, Bayless M, Cleary PA, Barnie A, Golden E, Hitt S, Genuth S (2012)
Evolution of the study coordinator role: the 28-year experience in Diabetes Control and
Complications Trial/Epidemiology of Diabetes Interventions and Complications (DCCT/
EDIC). Clin Trials 9(4):418–425
Larson GS, Carey C, Grarup J, Hudson F, Sachi K, Vjecha MJ, Gordin F (2016) Lessons learned:
infrastructure development and financial management for large, publicly funded, international
trials. Clin Trials 13(2):127–136
Lee JY, Hill A (2004) A multicenter lab sample tracking system. Clin Trials 2:252–253
Marcus P, Gareen IF, Doria-Rose P, Rosenbaum J, Clingan K, Brewer B, Miller AB (2012) Did death certificates and a mortality review committee agree on lung cancer cause of death in the National Lung Screening Trial? Clin Trials 9(4):464–465
Martin DE, Pan J-W, Martin JP, Beringer KC (2004) Pharmacy management for randomized
pharmacotherapy trials: the MATRIX web data management system. Clin Trials 1(2):248
McBride R, Singer SW (1995) Interim reports, participant closeout, and study archives. Control
Clin Trials 16(2 Suppl):137s–167s
Shroyer ALW, Quin JA, Wagner TH, Carr BM, Collins JF, Almassi GH, Bishawi M, Grover FL,
Hattler B (2019) Off-pump versus on-pump impact: diabetic patient 5-year coronary artery
bypass clinical outcomes. Ann Thorac Surg 107(1):92–98
Siegel D, Milton RC (1989) Grading of images in a clinical trial. Stat Med 8(12):1433–1438
Sievert YA, Schakel SF, Buzzard IM (1989) Maintenance of a nutrient database for clinical trials.
Control Clin Trials 10(4):416–425
Strylewicz G, Doctor J (2010) Evaluation of an automated method to assist with error detection in
the ACCORD central laboratory. Clin Trials 7(4):380–389
Submacular Surgery Trials Research Group (2004) Clinical trial performance of community- vs
university-based practice in the Submacular Surgery Trials: SST report no. 2. Arch Ophthalmol
122:857–863
Sun JK, Jampol LM (2019) The Diabetic Retinopathy Clinical Research Network (DRCR.net) and
its contributions to the treatment of diabetic retinopathy. Ophthalmic Res 62:225–230
Taekman JM, Stafford-Smith M, Velazquez EJ, Wright MC, Phillips-Bute BG, Pfeffer MA,
Sellers MA, Pieper KS, Newman MF, Van de Werf F, Diaz R, Leimberger J, Califf RM
(2010) Departures from the protocol during conduct of a clinical trial: a pattern from the data
record consistent with a learning curve. Qual Saf Health Care 19(5):405–410
Toth CA, Decroos FC, Ying GS, Stinnett SS, Heydary CS, Burns R, Maguire M, Martin D, Jaffe GJ
(2015) Identification of fluid on optical coherence tomography by treating ophthalmologists
versus a reading center in the comparison of age-related macular degeneration treatments trials.
Retina 35(7):1303–1314
Turner G, Lisook AB, Delman DP (1987) FDA’s conduct, review, and evaluation of inspections of
clinical investigators. Drug Inf J 21(2):117–125
Wilkinson M (1994) Carrier requirements for laboratory samples. Drug Inf J 28:381–384
Williford WO, Krol WF, Bingham SF, Collins JF, Weiss DG (1995) The multicenter clinical trials
coordinating center statistician: more than a consultant. Am Stat 49(2):221–225
Qualifications of the Research Staff
8
Catherine A. Meldrum
Contents
Introduction
History of Clinical Research Staff
Staff Qualifications
Training
Credentialing Organizations in Clinical Research
Summary
Key Facts
Cross-References
References
Abstract
Through the use of clinical trials, the global research community has paved the way for new medical interventions and groundbreaking new therapies for patients. In the past 60 years we have embraced rigorous scientific standards for assessing and improving our therapeutic knowledge and practice. These scientific standards dictate that a research workforce possessing knowledge, skills, and abilities is crucial to the success of a study. The many challenging (and ever-changing) rules and regulations inherent in the clinical research arena also require that diligent, trained staff be available to conduct the study. While it is well known that the Principal Investigator is ultimately responsible for the conduct of the study, it is generally a team of individuals who carry out the daily operations of the study. This chapter discusses the qualifications and training of other research staff members and why these are important for achieving success in clinical trials.
C. A. Meldrum (*)
University of Michigan, Ann Arbor, MI, USA
e-mail: [email protected]; [email protected]
Keywords
Clinical research staff · Qualifications · Education · Clinical research coordinator
Introduction
The Principal Investigator (PI) of a research study is the primary individual responsible for the overall research study, though we are keenly aware that it takes a village to actually complete a successful research study. It is not unusual for many required study tasks to be delegated from the PI to members of the research team, but who are the members of the research team and, more importantly, how have they been trained? Do the study members possess the experience and skill needed to assure compliance with guidelines and regulations set forth by regulatory agencies and/or institutions?
Unlike many professions, the professional research staff role is still in its infancy in terms of development. While the nursing profession, widely recognized and licensed, flourished after 1860 when the first school of nursing was opened in Europe, the creation of research staff roles began less than 40 years ago. Because this career pathway is at such an early stage of development, there has been little consensus on standardized job descriptions for the profession. There is no state licensure required for this type of work, nor is there any expectation that baseline experience is the same across institutions. The actual roles and titles of the research staff conducting a clinical trial also vary widely. Titles may include Clinical Research Coordinator, Study Nurse, Clinical Research Nurse, Research Nurse Coordinator, Clinical Research Assistant, Study Coordinator, and many more.
In nursing, the use of research nurses in oncology trials was common, though their roles within the research project were not well defined. Many nurses worked with oncology patients clinically but lacked clear direction or training to work with patients in a research capacity. An oncology research nurse may have complemen-
tary functions and roles with oncology chemotherapy nurses, but there are unique
characteristics required of the research nurse that are not applicable to the oncology
chemotherapy nurse (Ocker and Pawlik Plank 2000). In 1982, the Oncology Nursing
Society sought to standardize job descriptions for oncology nurses who were
involved in research studies (Hubbard 1982). Subsequently, as clinical trials grew
in both numbers and complexity, there was a further push to define the scope of
practice in the role of the research nurse. In 2007, Clinical Center Nursing at the National Institutes of Health launched an effort to help define Clinical Research Nursing (CRN). Later, in 2009, the first professional organization for research nurses, the International Association of Clinical Research Nurses (IACRN), was founded. This organization is not specific to oncology nurses but supports the professional development of nurses who specialize in any research domain. Thus, regardless of the title for this occupation or the research domain, nurses engaged in clinical trials develop, coordinate, and implement research and administrative strategies vital to the successful management of research studies.
Though it was common to use nurses in the research enterprise in oncology, many other areas outside of oncology do not rely solely on a Clinical Research Nurse/Research Nurse Coordinator. Many staff hired to complete a research project are not trained as nurses. In fact, it is common to hire varied ancillary personnel to get the research study/project done. Without industry-wide job description standards, much of the initial workforce simply “fell into the role.” As noted, early on it was quite common for an investigator to utilize a nurse to assist in research. The nurse would begin to take on other responsibilities within research until the work grew into a recognized research role. Eventually, other staff who may have worked in allied health or even in a clerical position began to take on additional duties, again eventually inheriting the role of study coordinator. Within the past decades there has been tremendous growth in clinical research, and with it the need for more individuals to conduct clinical research has grown. No longer could on-the-job training alone be the answer, nor could pulling additional staff in at random times to assist in a research study be deemed sufficient, as those individuals may lack the proper training. A job title for this profession was clearly needed. This led to several job titles and descriptions for research professionals, as noted above, but probably the most common are Study Coordinator and Clinical Research Coordinator (CRC). Today many of the titles are interchangeable, though for the purposes of this chapter we will refer to the research professional as a Clinical Research Coordinator (CRC).
A CRC works under the direction of the Principal Investigator (PI); however, the background of the CRC can be quite diverse. Traditionally, a high percentage of CRCs possessed a nursing background (Davis et al. 2002; Spilsbury et al. 2008), and given the nursing curriculum and nurses' experience with patients, this could be considered a well-suited career move for a nurse. Transitioning from a bedside nursing position to a CRC position, while not difficult, does require additional training, but the registered nurse (RN) already possesses some of the major attributes required in the CRC role. Certainly, understanding of medical terminology, documentation skills, and good people skills help facilitate the transition. Individuals with a medical background other than nursing, such as pharmacy, respiratory therapy, or another allied health field, also understand medical terminology and are likely skilled in working with patients. More recently, individuals with these types of medical backgrounds have become more abundant in the research community as they pursue advanced training to work in the field of research. Individuals who would like to work as a CRC but have a nonmedical background may require more training to work effectively in the clinical research arena, but this can be achieved.
Although it is evident that many personnel are needed to effectively and efficiently carry out clinical research, the regulations require that the Principal Investigator (PI) ensure that all study staff are adequately trained and maintain up-to-date knowledge about the study (FDA 2000). Frequently, Co-Investigators (Co-Is) are available to support the principal investigator in the management and leadership of the research project, but the actual day-to-day operations of the study are carried out by personnel other than the PI or Co-I. In most studies the PI (or a delegate) hires personnel to assist in carrying out the study, yet what really needs to be considered is: do the individuals hired have the appropriate training and skills necessary to fulfill the high demands of a clinical research study? This can be challenging, as many times it may not even be the PI doing the actual hiring of the candidate. The individual doing the hiring needs to be fully aware of the needs of the study and be able to assess whether the potential candidate can meet those needs.
Staff Qualifications
As with any job position, you have to find the person who best suits your needs. There are certainly times when an entry-level candidate can assist in performing the activities needed for the research study, but usually a highly experienced staff member is needed to oversee the study, especially if it is a complex study. The overall question facing the researcher (or PI) is: what type of individual do I need to complete the study successfully? Several broad topics should be considered when hiring staff.
Hiring can also be complicated, since the length of time a research study is ongoing varies widely; thus, the person you hire for one research study may not be the same type of individual you need for the next. A PI must consider the long-term needs of their research enterprise and gauge what the staff composition should look like to fulfill those goals.
For most health care professions, entry-level requirements include a focused didactic curriculum, usually from an academic institution, followed by some hands-on experience. This has not been the case for entry-level staff in clinical research: academic institutions have not offered entry-level degree programs in this discipline. There is also no license mandated to practice in clinical research as is required in other medical disciplines, though research staff may obtain professional certification credentials within the field. Frequently, entry-level people in clinical research perform such tasks as data entry, data management, and basic patient care tasks such as taking a blood pressure, measuring height and weight, or performing a manual count of returned medications. As one gains more experience, they may gradually move up the clinical ladder with added responsibilities and commonly take on the title of Clinical Research Coordinator (or some facsimile of this).
CRCs work in a variety of settings, such as private and public institutions; device, pharmaceutical, and biotechnology companies; private practice; Clinical Research Organizations (CROs); Site Management Organizations (SMOs); and varied independent organizations involved in clinical research. Being a CRC is a multifaceted role with many responsibilities, placing the coordinator at the center of the research enterprise. Previous work has demonstrated that one research coordinator may be responsible for between 78 and 128 different activities (Papke 1996). With the growth in clinical research studies, still more activities may be required of the research coordinator in the future. The additional complexities of research ethics that have evolved over time, and the regulatory and economic pressures that continue to mount, create the need for a skilled individual in this role (NIH 2006). Though the duties of clinical research staff vary by institution, they likely include some or all of the following:
While the above list is not exhaustive, it illuminates the need for clinical research staff to have expert clinical skills and well-developed thinking skills. To achieve the best possible outcomes for research participants and the overall research process, staff must be well versed in the regulatory, ethical, and scientific domains of clinical research. Thus, it is crucial that the PI ensure that the study employs at least one highly skilled individual to conduct a clinical trial. This leads us to the question: how does staff obtain training?
Training
A delegation log serves to ensure that the research staff member performing study-related tasks/procedures has been appropriately trained and authorized by the investigator to perform such tasks. Although a delegation log is not federally mandated, it may be a Sponsor requirement and must be completed and maintained throughout the trial. ICH GCP guidance (E6 4.1.5) requires that an investigator maintain a list of qualified staff to whom the investigator has delegated study-related activities.
In 2012, a Clinical & Translational Science Awards Program (CTSA) taskforce found insufficient training and lack of support among CRCs employed at Clinical and Translational Science Institutes (CTSIs) (Speicher et al. 2012). They also observed low job satisfaction within this field. Recognizing the evolving demands of the clinical research enterprise across the nation, the Task Force's study reiterated the need for support and educational development while recognizing that there are insufficient numbers of adequately trained and educated staff for these roles.
One year after the CTSA study, clinical research was formally accepted as a profession by the Commission on Accreditation of Allied Health Education Programs (CAAHEP), though at the time of this writing the occupational description, job description, employment characteristics, and educational programs available for the role are still not available on the CAAHEP website (CAAHEP 1994).
Increasing numbers of clinical research studies and newer technologies create demand for an even newer skillset in the research workforce, and this is where professional competencies come into play. In 2014, the Joint Task Force (JTF) for Clinical Trial Competency published a landmark piece defining the standards for professionalism in the research industry (Sonstein et al. 2014). This universal Core Competency Framework has undergone three revisions, with the most recent publication in October 2018 (Sonstein et al. 2018). It now incorporates three levels (Fundamental, Skilled, and Advanced) at which roles, assessments, and knowledge can be assessed within the eight domains. The domains are: Scientific Concepts and Research Design; Ethical and Participant Safety Considerations; Investigational Products Development and Regulation; Clinical Study Operations; Study and Site Management; Data Management and Informatics; Leadership and Professionalism; and Communication and Teamwork. A diagram of the framework is provided below (Fig. 1).
Credentialing Organizations in Clinical Research
To date, two organizations credential research staff: the Association of Clinical Research Professionals (ACRP) and the Society of Clinical Research Associates (SOCRA). Both organizations are international in scope. SOCRA currently has chapters in six countries outside the United States (Belgium, Brazil, Canada, Nigeria, Poland, Saudi Arabia), and certification testing can be done at PSI testing centers throughout the world. ACRP is located in more than 70 countries, with about 600 testing centers available internationally.
Credentialing is achieved by way of an examination through both organizations. Interestingly, both organizations require at least 2 years of documented clinical research experience before a candidate is eligible to sit for the certification examination.
Fig. 1 JTF core competency domains. (JTF, Joint Task Force for Clinical Trial Competency,
Sonstein and Jones 2018)
Summary
While there is still no standard education level for a position in clinical research, it is often associated with a bachelor's degree and some type of clinical trial research experience. Increased technology has essentially pushed the clinical research enterprise to expect that clinical research staff have a higher skillset than in previous decades. Certainly, having a trained workforce ultimately affects the integrity of clinical research. The research field has come a long way in education and job description roles for research staff, but there is still a long road ahead. The field is evolving toward somewhat more standardized job descriptions, even with the historical lack of clear, consistent definitions for research staff titles and their responsibilities. This evolution should allow transparency that standardizes job classifications and education expectations while providing new opportunities for advancement in the clinical research profession. This will ultimately allow the profession to mature, increasing workforce development and job satisfaction within the profession. If universal competencies, certifications, and accepted job descriptions are not adopted, the matter may end up in the hands of governments, which could move to license this professional group.
Key Facts
Cross-References
References
ACRP (2015) A New Approach to Developing the CRA Workforce. Retrieved January 30, 2019
from https://www.acrpnet.org/resources/new-approach-developing-cra-workforce/
Association of Clinical Research Professionals (2018) CRC Certification. Available at: https://
www.acrpnet.org/certifications/crc-certification/
Multicenter and Network Trials
9
S. Baksh
Contents
Introduction
Multicenter Clinical Trials
  Formation
  Trial Leadership
  Design Considerations
  Coordination of Study Activities and Logistics Between Sites
Clinical Trial Networks
Summary and Conclusions
Key Facts
Cross-References
References
Abstract
Multicenter clinical trial designs offer a unique opportunity to leverage the
diversity of patient populations in multiple geographic locations, share the burden
of resource acquisition, and collaborate in the development of research questions
and approaches. In a time of increasing globalization and rapid technological
advancement, investigators are better able to conduct such projects seamlessly,
benefiting investigators, sponsors, and patient populations. Regulatory agencies
have embraced this shift towards the use of multicenter clinical trials in product
development and have issued statements and guidance documents promoting
their utility and offering best practices. Some governmental health agencies
have even formed clinical trial networks to facilitate the use of multicenter
clinical trials to answer a broad range of clinical questions related to a disease or disease area. This chapter will cover design considerations, data coordination, and clinical trial networks.
S. Baksh (*)
Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
e-mail: [email protected]
Keywords
Multicenter clinical trials · Clinical trial networks · Trial consortiums ·
Cooperative group clinical trials
Introduction
The conduct of clinical trials sometimes requires multiple clinical sites in order to
complete studies in a timely manner and maximize the external generalizability of
trial results. Increased globalization and inherent improvements in global coordina-
tion of data and research activities have made this study design the preferred option
when large study populations, generalizable study results, and fast turnaround are the
primary goals. Multicenter clinical trials allow for the streamlining of trial resources,
collaborative consensus for research decisions, greater precision in study results,
increased generalizability and external validity, and a wider range of population
characteristics. Multicenter clinical trials conducted in various regions of the world
may also bring clinical care options that would otherwise not be available to study
participants in lower- and middle-income countries. There are special considerations
for multicenter clinical trials that must be incorporated into protocols, statistical
analysis plans, and data monitoring plans. Additionally, data collection, data man-
agement, and treatment guidance must be coordinated across study sites to comply
with local standards.
Multicenter clinical trial designs have been in use for several decades, necessi-
tating additional guidance on trial conduct from national and multinational regula-
tory and funding agencies. Many jurisdictions, such as Brazil, China, the European Union,
and the United States, use consensus documents, such as those produced by the
International Conference on Harmonisation (ICH), as the basis for their own guid-
ance on the conduct of clinical trials. For instance, the South African Good Clinical
Practice Guidelines draw from source documents developed by ICH, the Council
for International Organizations of Medical Sciences, the World Medical Association,
and UNAIDS, but heavily emphasize the importance of incorporating the local
South African context into the design of multicenter clinical trials conducted in
South Africa (Department of Health 2006). In the United States, since the passage of
the Kefauver-Harris Amendments in 1962, multicenter clinical trials conducted in
foreign countries have been used for regulatory submissions. However, the United
States Food and Drug Administration (United States Food and Drug Administration
2006a, b, 2013) and the United States National Institutes of Health (NIH) (National
Institutes of Health 2017) have only recently developed guidelines for trialists who
are either submitting multicenter clinical trial data for regulatory approval or being
funded for a multicenter clinical trial through the federal government. Each of the
Institutes of the NIH has also developed specific guidelines for multicenter clinical
trial grants under its purview. These guidelines address the nuances of trial
conduct, coordination, data analysis, and ethical considerations across many trial
designs conducted in a multicenter setting. They, too, are heavily based on those
developed by ICH, as are the guidelines of countries such as Brazil, Singapore,
Canada, and Korea.
The ICH E17 Guideline on general principles for planning and design of
multi-regional clinical trials outlines principles for the conduct of
multicenter clinical trials for submission to multiple regulatory agencies (The Inter-
national Conference on Harmonisation of Technical Requirements for Registration
of Pharmaceuticals for Human Use 2017). The document discusses important
considerations in study design, such as regional variability in the measured treat-
ment effect, choice of study population, dosing and comparators, allowable con-
comitant medications, and statistical analysis planning. Additionally, E17 highlights
the benefit of incorporating multi-regional clinical trials into the global product
development plan to decrease the need for replication in various regions for each
submission. Study investigators should consider the regulatory requirements of
different regions, outcome definitions, treatment allocation strategies, and subpopu-
lations of interest when designing multicenter clinical trials across different regions.
This may require consultation with multiple regulatory agencies in the design of the
trial. Safety reporting should also conform to the local requirements for all study
sites. By coordinating study activities to meet the regulatory requirements in differ-
ent regions, sponsors can efficiently leverage study results for timely reviews of
investigational products.
The ICH has created additional guidance on the potential impact of ethnic
differences across study sites that should be considered when conducting multicenter
clinical trials in different countries (The International Conference on Harmonisation
of Technical Requirements for Registration of Pharmaceuticals for Human Use
1998). The E5 Guideline discusses intrinsic and extrinsic factors that have the
potential to modify the association between treatment and safety, efficacy, and
dosing. Characterization of treatment effect may differ based on factors such as
genetic polymorphism, receptor sensitivity, socioeconomic factors, or study end-
points. The clinical data package should have sufficient documentation of pharma-
cokinetics, pharmacodynamics, safety, and efficacy in the study population for each
region in which the trial will be submitted for regulatory consideration. In the
absence of such documentation, additional bridging data that assess the sensitivity of the treatment
effect to specific ethnic factors unique to the target population in a particular region
can help regulators extrapolate from study results accordingly. While the ICH E5
Guideline was developed for multicenter clinical trials in an international context, it
can be applied to any multicenter clinical trial with heterogeneity in study
populations across clinical sites.
Multicenter clinical trials can also be conducted via clinical trial networks. This
mechanism allows for collaborative research and the alignment of research priorities
among a core group of investigators. Networks are typically organized around a
specific disease area and consist of investigators with common research initiatives.
Clinical trial networks can serve to advance innovation in research methodologies.
Formation
This chapter will delve into the conduct of multicenter clinical trials within the
United States, with selected comparisons to contexts in other countries. The exam-
ples presented here are typical of an NIH-funded, investigator-initiated, multicenter
clinical trial. As there are other models for multicenter clinical trials such as industry
trials for regulatory approval, this chapter will highlight other notable design aspects
when applicable. For the purposes of this chapter, the principal investigator (PI) is
the lead clinical scientist who has received funds from government or private entities
for the conduct of a multicenter clinical trial. The funder is the provider of financial
support for the clinical trial. The sponsor is the responsible party for the clinical trial
and may or may not be the same party as the funder.
Multicenter clinical trials are trials conducted under a single protocol that
utilize multiple clinical sites in different geographical locations to recruit participants
and answer a specific clinical question. The clinical site principal
investigators typically contribute to the study leadership and collaborate in the
development and refinement of the study question and design. Communication
among the sites, and between the sites and the study PI, is typically coordinated by a
data coordinating center under the direction of the PI.
Multicenter clinical trials understandably require a considerable amount of coordi-
nation, both administrative and functional. The decision about which clinical sites
to include in a clinical trial can be made during the study planning phase and
finalized after the start of the trial. During this time, the PI may invite clinical sites to
apply to join the multicenter clinical trial. PIs can choose to invite individual clinical
trial sites within their professional networks, existing clinical trial networks, or sites
identified through clinical trial site directories or trial registries. During the applica-
tion process, potential clinical sites are asked to describe their proposed study team
and clinical site resources and to confirm their ability to conduct clinical research
activities and to coordinate ethical approvals across sites. These applications can
be accompanied by a site assessment visit, where site monitors can visit the potential
clinical site to inspect the facilities and capabilities of the applicant site. These visits
can be useful in the design phase of the study, when the site monitors consider which
research activities may or may not be feasible for each clinical site and what tasks
might better be completed centrally. This is also an opportunity for the site monitors
to ascertain the types of patients a clinic typically receives, what proportion would
meet study eligibility criteria, and discuss appropriate recruitment goals with the
potential clinical site investigator. Site monitors may also use the site visit as an
opportunity to conduct a risk assessment of the site’s ability to complete study
recruitment and quality goals.
Once the PI and the clinical site decide to pursue collaboration on the clinical trial,
they enter into a contract that outlines the rights, roles, and responsibilities of each
party. The contract may also address payment schedules to the sites, any resource
transfers to or sharing with the individual sites, data storage and security liability,
and event reporting responsibilities. If there are specimen collections in the clinical
trial, the contract might also specify who holds ownership for those specimens and
material transfer agreement details.
Earlier collaboration and consensus between the clinical sites and the PI in the
development of the study, and greater investment by the clinical sites in the trial, are
two benefits of engaging and onboarding potential sites early in the planning
phase. The Wrist and Radius Injury Surgical Trial (WRIST) group highlighted
three techniques they employed in consensus building during their trial planning
phase: focus group discussion, nominal group technique, and the Delphi method
(Chung et al. 2010; Van De Ven and Delbecq 1974). Each of these methods requires
a different level of structure in reaching consensus. The PI for the clinical trial must
assess whether the clinical site investigators have existing relationships
with each other and whether some voices might carry more weight than
others. For example, in a focus group discussion on increasing study recruitment,
dominant voices may stifle democratic decision-making, and less vocal investigators
may have fewer opportunities to voice their ideas, resulting in a net loss of
innovation in problem-solving. Pre-existing relationships
may lend themselves to an established group dynamic that may or may not
accommodate the addition of new voices. The nominal group technique is better
suited for this type of situation in that it requires participation from all members
(Van de Ven and Delbecq 1972). In the study recruitment example, by offering
everyone an opportunity to share their ideas, innovation around recruitment strat-
egies can be readily shared and amplified through voting by others in the group. It
can be difficult to implement, however, since it involves face-to-face meetings in
order to prioritize and vote on decisions. Additionally, PIs should consider whether
each investigator has an equal opportunity to voice his/her opinion in the course of
designing and implementing the clinical trial. This includes ensuring that each
investigator has his/her research interests considered for incorporation into the
study objectives and is given an equal chance at authorship for manuscripts
resulting from the trial. If the investigator group consists of researchers with
varying levels of experience and seniority, the ideas of those more junior may be
lost in the conversation. In this scenario, the Delphi method might be a more
appropriate means of reaching consensus, as this method uses anonymity to
minimize the effect of dominant voices (Dalkey 1969). Using the recruitment
strategy example, this technique might further allow for the amplification of
novel strategies, regardless of who presents the idea.
Trial Leadership
Design Considerations
There are a few design considerations unique to multicenter clinical trials. First,
when a study statistician develops the randomization scheme for a multicenter
clinical trial, he/she typically stratifies the randomization by clinical site. This
reduces potential bias due to measured and unmeasured differences across clinical
sites. By stratifying by clinical site, the investigators control for the
potential interaction between clinical site and the primary outcome measure (Senn
1998; Zelen 1974). Stratification by clinical site maximizes the probability of
balanced numbers of participants assigned to each treatment arm in the study. Without
this balance, there is potential for bias if one clinical site experiences different
outcomes on average than other clinical sites. One can imagine a situation where
clinical sites might have catchment areas with different socioeconomic statuses,
patient demographics, and clinical characteristics. All of these could potentially
affect baseline risk for the primary outcome in the study population at each site.
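To make the mechanics concrete, the following is a minimal sketch of site-stratified, permuted-block randomization; the site names, block size of four, and 1:1 two-arm (A/B) allocation are hypothetical illustrations, not details drawn from any particular trial.

import random

def stratified_block_schedule(sites, n_per_site, block_size=4, seed=2024):
    """Permuted-block randomization stratified by clinical site (1:1, arms A/B)."""
    rng = random.Random(seed)  # fixed seed so the schedule is reproducible and auditable
    half_block = ["A", "B"] * (block_size // 2)
    schedule = {}
    for site in sites:
        allocations = []
        while len(allocations) < n_per_site:
            block = half_block[:]
            rng.shuffle(block)  # permute treatments within each block
            allocations.extend(block)
        schedule[site] = allocations[:n_per_site]
    return schedule

# Every site's list is balanced after each completed block of four participants.
print(stratified_block_schedule(["Site 01", "Site 02"], n_per_site=8))

Because each stratum receives its own blocked sequence, the treatment arms stay close to balance within every site even if sites recruit at very different rates.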
The second point of consideration is the target number of randomizations for each
site. After a site has agreed to participate, the data coordinating center typically
establishes recruitment goals for each site participating in the study to ensure that the
overall recruitment goal for the study is met. These site-specific recruitment goals
should consider the clinical capacity at each site, the length of study visits and
contacts, the full-time equivalents dedicated to the study at each site, the recruitment goals
of other sites in the study, the timeline for completion of study recruitment, and the
flexibility around adding additional sites to the study. Different recruitment goals
across sites should not bias results or result in less precision, especially if random-
ization is stratified by clinical site (Senn 1998). Recruitment goals at each site should
be roughly similar, with allowances for faster and slower recruitment at individual
sites; these goals need not be uniform, nor should they be driven by extremes.
Third, multicenter clinical trial designs can benefit from built-in flexibility for
adding sites at any point during the trial. To facilitate this process, PIs should
carefully consider the burden that site start-up procedures place on both the data
coordinating center and the prospective sites. While adding sites can increase
enrollment and statistical power, the benefits of adding a site as a collaborator
should outweigh the administrative and logistical hurdles of doing so. Having a
start-up package of forms for
clinical sites to complete, a mini handbook of start-up activities, and a start-up
training can ease this burden and allow for transparency.
Sites involved in a multicenter clinical trial should all have the resources or capacity
to acquire the resources necessary for the execution of all study procedures. This
includes key personnel, study materials, and regulatory infrastructure (if applicable).
PIs should bear in mind these potential limitations at each site as they design the
study. In extreme cases, this may mean that study budgets have to account for
infrastructure support to ensure that each site has the minimum required resources to
conduct study activities.
Coordination of Study Activities and Logistics Between Sites
Through the course of a multicenter clinical trial, the data coordinating center
works to ensure uniformity in study procedures, data collection, and adverse event
and protocol deviation reporting across sites. To accomplish this, they coordinate a
number of study logistics in an orchestrated manner. This begins when a site is
chosen to join the clinical trial and has agreed to participate. For example, in the
United States, all clinical sites are asked to join a single institutional review board
(sIRB) designated for the multicenter clinical trial. As of 2016, all NIH-sponsored
multicenter clinical trials are required to use an sIRB of record for their ethical
review (National Institutes of Health 2016). This move was intended to streamline
the review of studies, promote consistency of reviews, and alleviate some of the
burdens on investigators (Ervin et al. 2016). There are situations in which a site may be
unable to join an sIRB (e.g., a foreign jurisdiction or highly restrictive local regula-
tions). If a site agrees to rely on the sIRB for the study, they must complete a reliance
agreement documenting this arrangement between the sIRB of record and their site.
The letter of indemnification corresponding to this reliance outlines the scope of
reliance, claims, and governing laws. Australian regulatory agencies have endorsed a
similar approach through the National Mutual Acceptance (NMA) system. Through
this agreement, health departments across Australian states and territories agree to
recognize the ethical reviews conducted in member states for multicenter trials.
Similarly, the government of Ontario, Canada has supported Clinical Trials Ontario
to streamline the ethical review of study protocols across the province. While a
single ethical review may not be possible for protocols for all multicenter clinical
trials, streamlining these activities when possible has been endorsed by sponsors and
regulatory agencies.
After a site has received ethical approval from either the study’s sIRB or their
local institutional review board, then the data coordinating center can work with the
clinic staff to prepare for initiating the study at their site. The data coordinating
center may hold a training session for all certified clinic staff to orient them to the
study protocol and data entry system. In preparation for this training session, clinic
staff may be asked to review the protocol as well as any handbook (i.e., manual of
procedures or standard operating procedures). This smaller training session during
the onboarding process is an opportune time for clinic staff to clarify any technical
issues with the protocol or identify any difficulties with data entry. By conducting
this training with every site, the data coordinating center reinforces uniformity in
study activities across sites. The coordinators introduce the data collection instru-
ments at this time and familiarize clinic staff with the formatting requirements of the
data as well as any nuances of the data system. Data collection instruments are
standardized across the entire study and do not differ between clinical sites; however,
sites are allowed to maintain their own local records of study participants. As the
study proceeds, the data and/or clinical coordinating center may hold regular tele-
conferences or webinars with clinic staff to communicate important study changes,
assess and triage any challenges, and solicit feedback from sites. This is also an
opportunity for sites to learn from the experience of the other sites.
Given the potential for a large number of clinical sites, data coordinating centers
may utilize risk-based approaches to remote and on-site data monitoring. This is a
multipronged strategy for monitoring that prioritizes the most important aspects of
patient safety, study conduct, and data reporting (Organization for Economic
Co-operation and Development 2013; United States Food and Drug Administration
2013). Key features of risk-based monitoring include a statistical approach to central
monitoring, electronic access to source documents, timely identification of systemic
issues, and greater efficiency during on-site monitoring. This risk-based monitoring
plan is usually developed after a risk assessment of critical data and procedures to be
monitored throughout the trial both remotely and on-site. In contrast to regular visits
to all clinical sites with 100% data audits, this approach to monitoring allows study
sponsors to effectively use resources to centralize data quality checks and use site
visits as an opportunity to further investigate any data anomalies, observe clinic and
study activities, and gather feedback about ease of data procedures and the data
system. This may mean that monitors conduct source data verification on selected
data items, a random sample of data forms, or a hybrid approach of source data
verification of 100% of key data collection instruments and a sample of the
remaining forms. This is another opportunity for study monitors to reinforce unifor-
mity across sites in the conduct of study procedures. These on-site monitoring visits
are seen as particularly useful at the beginning of a trial, with supplementary
centralized monitoring through the duration of the trial. If clinics are found to be
“higher risk” with regard to errors, then additional on-site monitoring visits and
re-training can be arranged in a targeted manner.
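As an illustration of such a hybrid plan, the sketch below selects forms for source data verification; the form types, the "high"/"low" site risk labels, and the 10%/30% sampling fractions are assumptions for the example, not regulatory requirements.

import random

KEY_FORMS = {"eligibility", "primary_outcome", "adverse_event"}  # always verified

def select_forms_for_sdv(forms, site_risk, seed=7):
    """forms: iterable of (site, form_type, form_id); site_risk: site -> 'low' or 'high'."""
    rng = random.Random(seed)
    selected = []
    for site, form_type, form_id in forms:
        if form_type in KEY_FORMS:
            selected.append((site, form_type, form_id))  # 100% of key forms
        else:
            # Sample the remaining forms, more heavily at higher-risk sites.
            fraction = 0.30 if site_risk.get(site) == "high" else 0.10
            if rng.random() < fraction:
                selected.append((site, form_type, form_id))
    return selected

A site flagged as higher risk simply has its sampling fraction raised, which concentrates monitoring effort where errors are most likely.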
Clinical Trial Networks
Multicenter clinical trials can be conducted within clinical trial networks, also
referred to as trial consortiums or cooperative group clinical trials. Clinical trial
networks can be organized around a common clinical or disease area, can span
multiple countries, and can be publicly or privately funded. Table 1 lists several
clinical trial networks around the world, their sponsors, and their missions. They range in
the specificity of their missions, and in some cases, their goals have evolved since their
establishment. Some focus on therapeutic testing for
neglected diseases or diseases that have high mortality and few to no treatment
options. Many of the institutes within the United States National Institutes of Health
sponsor clinical trial networks to accelerate research in priority areas.
Clinical trial networks in Europe may build on the existing relationships between
governments in the European Union (EU) and apply for European Research Infra-
structure Consortium (ERIC) designation. This allows for clinical trial networks to
be legally recognized across all EU member states, to fast-track the development of
an international organization, and to be exempt from Value Added Tax (VAT) and
excise duty. Countries outside of Europe are also allowed to join ERICs. Clinical
trial networks interested in this designation must provide evidence that they have the
infrastructure necessary to carry out the intended research, that the research is a
value-add to the European Research Area (ERA), and that the venture is a joint
European endeavor.
Fig. 1 Clinical Trial Network Structure Examples. (a) Clinical trial network with one sponsor who
works directly with the centralized trial network management to direct and coordinate research at
multiple clinics. (b) Clinical trial network with multiple public and private sponsors who work
directly with the centralized trial network management to direct and coordinate research nodes,
focusing on different diseases, that then coordinate research at multiple clinical sites. (c) Clinical
trial network with public sponsors from different governments who work directly with the
centralized trial network management to direct and coordinate research at multiple clinics in one
country; typically seen with one northern country sponsoring government and one southern
country sponsoring government, with research conducted in the southern country
Shared infrastructure that feeds into a data repository can serve to inform several trials with minimal
associated administrative overhead (Massett et al. 2019). Finally, clinical trial
networks have a formal system for building consensus around research priorities,
resource allocation, and network leadership (Organization for Economic
Co-operation and Development 2013). Developing the protocol for reaching con-
sensus is essential when clinical trial networks are large and multinational with
differing regulatory oversight and clinical standards.
Clinical trial networks offer many benefits to various stakeholders. They provide
an opportunity for investigators with similar research interests to exchange ideas,
develop novel trial methodologies for their clinical area, share resources, and
leverage a large pool of potential participants to push their field forward (Bentley
et al. 2019; Davidson et al. 2006). In cases where there is national buy-in from local
governments, this becomes an important public-private partnership to accelerate
national research agendas in an efficient manner. Initiating trials within an established
infrastructure of research groups that have existing relationships with each other and
with the sponsor, organizational competence, and administrative support contributes
to this efficiency. The inherent structure of clinical trial networks lends itself to
comparative effectiveness research that can inform government reimbursement
decisions and potential guideline changes. Smaller clinical sites wishing to develop
relationships with certain sponsors can benefit from joining clinical trial networks as
well. Sub-studies are also easier to execute for clinical sites with limited resources by
leveraging existing infrastructure. Lastly, all clinical sites, regardless of capacity, can
benefit from the increased exchange of ideas through the frequent meetings.
Despite these benefits, clinical trial networks can carry limitations worth consid-
ering. There is a common perception that enrollment is the standard metric of success
within a network. As such, payment structures may be based on the number of
participants enrolled at each site, with little consideration of overhead costs. This
could be a deterrent for investigators from academic institutions with high overhead
costs or for study investigators with study protocols requiring high resource utiliza-
tion, as investigators are dependent on their institution’s cooperation and support of
the endeavor. Clinical trial networks can also fall victim to inadequate staffing, with
consequences more substantial than they would be in a single clinical research group
(Baer et al. 2010). Additionally, participation in a clinical trial network may mean
involvement in multiple clinical trials; however, not all of these may lead to
significant credit or publications for every investigator (Bentley et al. 2019). The
decision for investigators to participate in a clinical trial network should weigh these
benefits and limitations against their home institution, existing patient pool, and the
potential for professional growth in their group and contribution to science.
Key Facts
Cross-References
▶ Data Capture, Data Management, and Quality Control
▶ Single Versus Multicenter Trials
▶ Institutional Review Boards and Ethics Committees
▶ Responsibilities and Management of the Clinical Coordinating Center
▶ Selection of Study Centers and Investigators
▶ Trial Organization and Governance
References
Baer AR, Kelly CA, Bruinooge SS, Runowicz CD, Blayney DW (2010) Challenges to National
Cancer Institute-Supported Cooperative Group clinical trial participation: an ASCO survey of
cooperative group sites. J Oncol Pract 6(3):114–117. https://doi.org/10.1200/jop.200028
Bentley C, Cressman S, van der Hoek K, Arts K, Dancey J, Peacock S (2019) Conducting clinical
trials – costs, impacts, and the value of clinical trials networks: a scoping review. Clin Trials
16(2):183–193. https://doi.org/10.1177/1740774518820060
Chung KC, Song JW, WRIST Study Group (2010) A guide to organizing a multicenter clinical trial. Plast
Reconstr Surg 126(2):515–523. https://doi.org/10.1097/PRS.0b013e3181df64fa
Dalkey NC (1969) The Delphi method: an experimental study of group opinion. RAND Corporation,
Santa Monica. https://www.rand.org/pubs/research_memoranda/RM5888.html
Davidson RM, McNeer JF, Logan L, Higginbotham MB, Anderson J, Blackshear J, . . . Wagner GS
(2006) A cooperative network of trained sites for the conduct of a complex clinical trial: a new
concept in multicenter clinical research. Am Heart J 151(2):451–456. https://doi.org/10.1016/j.
ahj.2005.04.013
Daykin A, Selman LE, Cramer H, McCann S, Shorter GW, Sydes MR, . . . Shaw A (2016) What are
the roles and valued attributes of a Trial Steering Committee? Ethnographic study of eight
clinical trials facing challenges. Trials 17(1):307. https://doi.org/10.1186/s13063-016-1425-y
Department of Health (2006) Guidelines for good practice in the conduct of clinical trials with
human participants in South Africa. https://www.dst.gov.za/rdtax/index.php/guiding-documents/south-africangood-clinical-practice-guidelines/file
Ervin AM, Taylor HA, Ehrhardt S (2016) NIH policy on single-IRB review – a new era in
multicenter studies. N Engl J Med 375(24):2315–2317. https://doi.org/10.1056/
NEJMp1608766
European Commission (2009) Report from the Commission to the European Parliament and the
council on the application of Council Regulation (EC) No 723/2009 of 25 June 2009 on the
community legal framework for a European Research Infrastructure Consortium (ERIC). (COM
(2014) 460 final). European Commission, Brussels. Retrieved from https://ec.europa.eu/info/
sites/info/files/eric_report-2014.pdf
Liu G, Chen G, Sinoway LI, Berg A (2013) Assessing the impact of the NIH CTSA program on
institutionally sponsored clinical trials. Clin Transl Sci 6(3):196–200. https://doi.org/10.1111/
cts.12029
Massett HA, Mishkin G, Moscow JA, Gravell A, Steketee M, Kruhm M, . . . Ivy SP (2019)
Transforming the early drug development paradigm at the National Cancer Institute: the
formation of NCI’s Experimental Therapeutics Clinical Trials Network (ETCTN). Clin Cancer
Res 25(23):6925–6931. https://doi.org/10.1158/1078-0432.Ccr-19-1754
McCrae N, Douglas L, Banerjee S (2012) Contribution of research networks to a clinical trial of
antidepressants in people with dementia. J Ment Health 21(5):439–447. https://doi.org/10.3109/
09638237.2012.664298
National Institutes of Health (2016) Final NIH policy on the use of a single institutional review
board for multi-site research. Bethesda. Retrieved from http://grants.nih.gov/grants/guide/
notice-files/NOT-OD-16-094.html
National Institutes of Health (2017) Guidance on implementation of the NIH policy on the use of a
single institutional review board for multi-site research. Bethesda. Retrieved from https://grants.
nih.gov/grants/guide/notice-files/NOT-OD-18-004.html
Organization for Economic Co-operation and Development (2013) OECD recommendation on
the governance of clinical trials. Retrieved from http://www.oecd.org/sti/inno/oecdrecommendationonthegovernanceofclinicaltrials.htm
Senn S (1998) Some controversies in planning and analysing multi-centre trials. Stat Med
17(15–16):1753–1765. https://doi.org/10.1002/(sici)1097-0258(19980815/30)17:15/16<1753::
aid-sim977>3.0.co;2-x; discussion 1799–1800
The International Conference on Harmonisation of Technical Requirements for Registration of
Pharmaceuticals for Human Use (1998) ICH E5(R1) ethnic factors in the acceptability of foreign
clinical data. Available from: https://www.ema.europa.eu/en/documents/scientific-guideline/iche-5-r1-ethnic-factors-acceptability-foreign-clinical-data-step-5_en.pdf
The International Conference on Harmonisation of Technical Requirements for Registration of
Pharmaceuticals for Human Use (2017) ICH E17 general principles for planning and design of
multi-regional clinical trials
Smith P, Morrow R, Ross D (eds) (2015) Trial governance. In: Field trials of health interventions:
a toolbox, 3rd edn. Oxford University Press, Oxford
United States Food and Drug Administration (2006a) Guidance for clinical trial sponsors –
establishment and operation of clinical trial Data Monitoring Committees. Rockville. Retrieved
from https://www.fda.gov/media/75398/download
United States Food and Drug Administration (2006b) Guidance for industry – using a centralized
IRB review process in multicenter clinical trials. Retrieved from https://www.fda.gov/
regulatory-information/search-fda-guidance-documents/using-centralized-irb-review-process-
multicenter-clinical-trials
United States Food and Drug Administration (2013) Guidance for industry – oversight of clinical
investigations – a risk-based approach to monitoring. Silver Spring. Retrieved from https://
www.fda.gov/media/116754/download
Van de Ven AH, Delbecq AL (1972) The nominal group as a research instrument for exploratory
health studies. Am J Public Health 62(3):337–342. https://doi.org/10.2105/ajph.62.3.337
Van De Ven AH, Delbecq AL (1974) The effectiveness of nominal, Delphi, and interacting group
decision making processes. Acad Manag J 17(4):605–621. https://doi.org/10.2307/255641
Zelen M (1974) The randomization and stratification of patients to clinical trials. J Chronic Dis
27(7–8):365–375. https://doi.org/10.1016/0021-9681(74)90015-0
Principles of Protocol Development
10
Bingshu E. Chen, Alison Urton, Anna Sadura, and
Wendy R. Parulekar
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Administrative Information: SPIRIT Checklist Items 1–5d . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Background and Rationale and Objectives: SPIRIT Checklist Items 6–7 . . . . . . . . . . . . . . . . . 154
Trial Design: SPIRIT Checklist Item 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Participants, Interventions, and Outcomes: SPIRIT Checklist Items 9–17b . . . . . . . . . . . . . . . 155
Participant Timeline, Sample Size Recruitment (Items 13–15) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
Assignment of Interventions (for Controlled Trials): SPIRIT Checklist Items 16–17b . . . . . . 159
Data Collection/Management and Analysis: SPIRIT Checklist Items 18a–20c . . . . . . . . . . . 160
Monitoring: SPIRIT Checklist Items 21–23 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Quality Assurance (Monitoring/Auditing) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Ethics and Dissemination: SPIRIT Checklist Items 24–31c . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Abstract
Randomized clinical trials are essential to the advancement of clinical care by
providing an unbiased estimate of the efficacy of new therapies compared to
current standards of care. The protocol document plays a key role during the life
cycle of a trial and guides all aspects of trial organization and conduct, data
collection, analysis, and publication of results.
Several guidance documents are available to assist with protocol generation.
The SPIRIT (Standard Protocol Items: Recommendations for Interventional
Trials) Statement comprises a checklist of essential items for inclusion in a clinical
trial protocol.
Keywords
SPIRIT Statement · International Conference on Harmonization · Declaration of
Helsinki
Introduction
The protocol serves as the reference document for the conduct, analysis, and
reporting of a clinical trial which must satisfy the requirements of all stakeholders
involved in clinical trial research including trial participants, ethics committees,
regulatory and legal authorities, funders, sponsors, public advocates, as well as the
medical and scientific communities that are the direct consumers of the research
findings.
An inadequate or erroneous protocol has significant consequences. A deficient
protocol may result in delayed or denied regulatory or ethical approval, risks to the
safety of study subjects, investigator frustration and poor accrual, inconsistent
implementation across investigators, as well as increased workload burden and
financial costs due to unnecessary amendments. Ultimately, the trial results may
not be interpretable or publishable.
The purpose of this chapter is to outline the general principles of protocol
development with an emphasis on use of standard definitions and criteria for
protocol content where applicable. Essential reading for this chapter is the SPIRIT
2013 Statement (Chan et al. 2013a) and accompanying Elaborations and Explana-
tions paper (Chan et al. 2013b). The SPIRIT Initiative was launched in 2007 to
address a critical gap in evidence-based guidance documents for protocol generation.
Using systematic reviews, a formal Delphi consensus process, and face-to-face
meetings of key stakeholders, a 33-item checklist relating to protocol content was
generated and subsequently field tested prior to publication. Although the SPIRIT
checklist was primarily developed as a guidance document for randomized clinical
trials, the principles and application extend to all clinical trials, regardless of design.
The reader is also directed to the SPIRIT-PRO extension, which builds on the
methodology of the SPIRIT Statement and provides recommendations for protocol
development when a patient-reported outcome is a key primary or secondary
outcome (Calvert et al. 2018).
Key principles that underpin the content of high-quality clinical trial protocols
relate to the originality and relevance of the primary research hypothesis contained
therein, use of design elements to adequately test the hypothesis, and inclusion of
appropriate measures to protect the rights and safety of trial participants. Guidance
documents generated by the International Conference on Harmonization (ICH) are
useful references and address multiple topics of interest such as the E6 Good Clinical
Practice (GCP) and E8 General Considerations for Clinical Trials (https://www.ich.
org/products/guidelines/efficacy/efficacy-single/article/integrated-addendum-good-
clinical-practice.html). Another important reference document is the Declaration of
Helsinki which was developed by the World Medical Association and represents a
set of principles that guides the ethical conduct of research involving humans
(https://www.wma.net/policies-post/wma-declaration-of-helsinki-ethical-principles-
for-medical-research-involving-human-subjects).
What follows is a brief summary of key protocol content topics annotated with the
associated SPIRIT checklist items. Additional comments to assist with comprehen-
sion or use of SPIRIT protocol items are included as appropriate.
Administrative Information: SPIRIT Checklist Items 1–5d
The administrative information relates to protocol title, unique trial registry number,
amendment history, and contact information for trial conduct from scientific, oper-
ational, and regulatory perspectives. The title should indicate the trial phase, inter-
ventions under evaluation, and the disease setting/trial population (Fig. 1).
The Declaration of Helsinki (revised 2008) mandates the registration of all
clinical trials in a publicly accessible database before recruitment of the first subject.
Trial registration in a primary register of the WHO International Clinical Trials
Registry Platform (ICTRP) or in ClinicalTrials.gov has been endorsed by the
International Committee of Medical Journal Editors (http://www.icmje.org/recom
mendations) since both registries meet the criteria of access to the public at no
charge, oversight by a not-for-profit organization, and inclusion of a mechanism to
ensure the validity of the registration data.
Fig. 1 Sample protocol title page, listing the study chair, steering committee, biostatistician,
collaborating research organizations, regulatory sponsor, and support providers (grant agencies
and pharmaceutical companies)
Background and Rationale and Objectives: SPIRIT Checklist Items 6–7
The justification for a research study is the single most important component of
any clinical trial protocol. A trial that will not contribute meaningfully to the advance-
ment of healthcare and research represents a waste of resources and is unethical,
regardless of adherence to checklists and standards for research involving human
subjects.
The background section should summarize the current literature about the
research topic and the hypothesis that will be addressed by the clinical trial. A
review of ongoing trials addressing the same or similar research questions will
demonstrate non-duplication of research efforts. Finally, explicit statements regard-
ing the anticipated impact of the trial results – either positive or negative – provide a
powerful justification for trial conduct. This section should be updated as required to
reflect important advances in knowledge as they relate to the research question,
especially if they result in changes to trial design or conduct.
The objectives of the trial enable the research hypothesis to be tested and are
listed in order of importance as primary and secondary objectives. The primary
objective links directly to the statistical design of the trial, which allows the results
to be interpreted within a pre-specified set of statistical parameters (see
Section Statistical Methods). Secondary objectives are selected to provide additional
information to support interpretation of the primary analysis data and typically focus
on additional measures of efficacy, safety, and tolerability associated with a given
intervention. Tertiary objectives are exploratory in nature and may address prelim-
inary research questions related to disease biology or response to treatment.
Trial Design: SPIRIT Checklist Item 8
The trial design is driven by the research hypothesis under evaluation. For example,
new therapeutic strategies with the potential for greater disease control compared to
standard of care may be tested in a parallel group superiority trial; a non-inferiority
trial may be suitable to test a therapy associated with less toxicity or greater ease of
delivery for which a small loss of efficacy may be acceptable. In addition to a
description of the trial framework, the protocol must clearly indicate the randomi-
zation allocation ratio. Deviation from the usual 1:1 allocation may be justified by
the desire for a more in-depth characterization of treatment-associated safety and
tolerability and may increase participant willingness to be enrolled in a specific trial
if there is a greater chance of receiving a new treatment compared to standard of care.
Crossover in treatment administration is another important aspect of trial design and
is used frequently in studies of chronic diseases which are relatively stable, and
therapeutic interventions result in amelioration but not cures of the condition, e.g.,
pain syndromes or asthma. Patients are randomly allocated to a predefined sequence of
treatments administered over successive time periods, and the outcome of interest is
measured at the end of each treatment period.
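As a small illustration, the sketch below assigns participants to the two sequences of a 2×2 (AB/BA) crossover in a balanced, randomized order; the participant identifiers and treatment labels are hypothetical.

import random

def assign_crossover_sequences(participant_ids, seed=11):
    """Assign each participant to sequence AB or BA, keeping the two sequences balanced."""
    rng = random.Random(seed)
    half = len(participant_ids) // 2
    sequences = [("A", "B")] * half + [("B", "A")] * (len(participant_ids) - half)
    rng.shuffle(sequences)  # randomize which participant gets which sequence
    return dict(zip(participant_ids, sequences))

# Each participant receives both treatments; the outcome is measured at the end
# of each period.
print(assign_crossover_sequences([f"P{i:03d}" for i in range(1, 7)]))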
A design of increasing interest and use is the pilot study. This type of trial is
conducted using the same randomization scheme and interventions as the planned
full trial but on a smaller scale. The goal of the pilot study is to gather information regarding trial conduct such
as ability to randomize patients, administer the therapeutic interventions, or measure
the outcome measure(s) of interest, but not to estimate relative treatment efficacy
between the interventions (Lancaster et al. 2004; Whitehead et al. 2014).
Participants, Interventions, and Outcomes: SPIRIT Checklist Items 9–17b
Participants
The population selected for trial participation must meet specific criteria to ensure
safety and enable the primary and secondary objectives to be met.
For trials testing drug interventions, adequate organ function is based on the
known pharmacokinetic and pharmacodynamic properties of the drug. Surgical or
radiotherapy trials may require additional tests of fitness for the required intervention
including lung function tests, adequacy of coagulation, and ability to tolerate an
anesthetic.
A patient is enrolled on a trial based on the assumption that he/she will contribute
meaningful information to the outcome measure(s) with a small loss of data due to
trial dropouts or withdrawals. The eligibility criteria should ensure that enrolled
patients can contribute data to enable the trial objectives to be met. For example, a
trial examining the impact of a therapeutic intervention on pain response must enroll
symptomatic patients with a specific pain threshold; trials evaluating the ability of
interventions to control or shrink disease must enroll patients with a quantifiable disease burden,
e.g., radiological evidence of cancer in a trial evaluating the anticancer activity of
different therapies. Given the significant resources required for the conduct of a
randomized trial, a natural tendency is to include as many outcome measures as
possible to maximize the yield of the data generated by the trial. This approach is not
recommended since it increases the burden of study conduct and participation and
the risk of noncompliance with data submission and may negatively impact accrual
if enrollment is contingent on ability to provide data on multiple outcome measures.
The overall false-positive rate is inflated when multiple hypotheses are tested, and
proper adjustment for multiple testing is required; for example, ten independent tests
each conducted at α = 0.05 carry a family-wise error rate of 1 − 0.95^10 ≈ 0.40.
Patient-reported outcomes (PROs) are frequently included in trials of therapeutic
interventions to provide a patient perspective on their health status during the course
of a trial. Specific eligibility criteria related to participation in PRO data collection,
e.g., language ability, comprehension of questionnaires, and access to electronic
devices for direct patient to database submissions, should be adequately described in
the eligibility criteria. Similarly, criteria related to other research objectives such as
submission of biological tissue for analyses related to disease prognosis or predictors
of response to therapeutic interventions or health utility questionnaires for economic
analyses are included in the eligibility criteria as appropriate. The mandatory versus
optional nature of the criteria for tissue submission and patient-reported outcome
must be stipulated.
A long-standing criticism of clinical trials is that the results may have limited real-
world applicability due to the highly selected patient population enrolled. Linked to
this concern is the issue of screen failures, i.e., patients who are appropriate candi-
dates to participate in the trial but cannot be enrolled due to inability to meet the
stringent eligibility criteria. In response to this concern, efforts are underway by
research and advocacy organizations to broaden criteria to allow greater participation
in clinical trials by removing barriers such as the presence of comorbidities, organ
dysfunction, prior history of malignancies, or minimum age (Gore et al. 2017; Kim
et al. 2017; Lichtman et al. 2017).
Interventions
The treatment strategies under evaluation must be clearly described in the protocol,
to allow participating centers to administer the intervention safely and the medical
community to reproduce it should it be adopted or used
on a wider basis. A basic trial schema included early in the protocol document
provides a visual illustration of the interventions (Fig. 2).
For drug trials, dose calculation and guidelines regarding administration and dose
modification are provided. A tabular format is a convenient method to illustrate the
dose modification requirements mandated by specific laboratory values and/or
adverse events. In addition, guidance on dose modifications should a patient
experience multiple adverse events with conflicting recommendations regarding
dose adjustments is essential information for the protocol. Nondrug interventions
require a comparably detailed description.
Fig. 2 Basic trial schema: the patient population, with stratification factors, is randomized to
Arm 1 or Arm 2 and followed for the primary outcome measure
Outcomes
The outcome measures selected in a clinical trial are of paramount importance – they
form the basis for data collection, statistical analysis, and results reporting. An
appropriate outcome measure must have a biologically plausible and clinically
relevant link to the intervention(s) under evaluation and be objectively and reliably
measured and reported using appropriate nomenclature.
Standardization of outcome measures has been identified by the research com-
munity as a goal to improve the general interpretability of the results of individual
trials and to enhance the integration and analysis of results from multiple trials. The
COMET (Core Outcome Measures in Effectiveness Trials) initiative is an example
of a collaborative effort to define a minimum core set of outcomes to be measured
and reported in clinical trials (http://www.comet-initiative.org). In addition to
providing guidance regarding disease-specific outcome measures, the COMET
initiative represents a rich resource of relevant methodologies for interested
researchers.
Composite outcome measures are often used to evaluate the efficacy of therapeu-
tic interventions and deserve specific mention. As with single-item outcome mea-
sures, the individual components of a composite measure must be clearly defined
and evaluable. In addition, the hierarchy of importance of the individual components
of a composite outcome measure must be prospectively identified to assist with data
collection and reporting. For example, if disease worsening can be defined by
radiological investigations or measurement of a blood-based marker, guidance for
reporting must be included in the protocol should both outcome events occur
simultaneously.
Perhaps the most important and challenging criterion to satisfy when selecting an
outcome measure relates to clinical benefit or meaningfulness. If the ultimate goal of
a therapeutic intervention is to live longer or better, the outcome measure must be
correlated with clinical benefit. Overall survival is considered the gold standard
outcome measure for trials testing therapeutic interventions for life-threatening
diseases but may be challenging to measure and interpret if death occurs years
after enrollment in a trial or if multiple efficacious therapies are administered after
the intervention of interest has failed to control the disease. Use of an intermediate,
clinically meaningful outcome measure may be justified in circumstances when
overall survival measurement is not feasible, especially when the alternative out-
come measure is a validated surrogate for overall survival, e.g., metastasis-free
survival in early prostate cancer (Xie et al. 2017).
Participant Timeline, Sample Size Recruitment (Items 13–15)
The participant timeline specifies the schedule of assessments needed to characterize the efficacy
and safety profile of the treatment intervention; the latter must be symmetric between
arms to avoid biased assessment of treatment efficacy. Classification of investiga-
tions by disease and treatment trajectory is a logical way to convey the information
to trial participants, i.e., prior to randomization, treatment phase, and follow-up
phase after the treatment has been completed or discontinued. Only essential inves-
tigations should be included in a protocol to minimize the burden of participation on
patients and healthcare facilities. An important principle guiding protocol develop-
ment relates to alignment of study assessments to usual care. Tests or interactions
within the healthcare systems that deviate from current practice will increase the risk
of noncompliance of participants with protocol-mandated assessments and may lead
to incomplete data collection and an impact on enrollment. To minimize this risk, the
protocol schedule of assessments and follow-up should be shared with prospective
participants for review prior to trial initiation.
Sample Size
The sample size justification is directly linked to the trial hypothesis and primary
objective. The statistical and clinical assumptions that inform the sample size
calculation must be clearly stated. The relevant information includes identification
of the primary outcome measure, expected primary outcome in the control group,
and the targeted difference in the primary outcome measure between treatment
groups, the primary statistical test, type I and II error rates, and measures of precision.
In general, a minimal clinically important difference (MCID) should be used in the sample
size calculation. Sample size adjustments for missing data and/or interim analyses
should be detailed. Additional important details to include in this section relate to the
planned duration of accrual and follow-up required to compile sufficient data to
enable the primary analysis.
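As a worked illustration of such a justification, the sketch below computes the per-arm sample size for a 1:1 superiority comparison of two proportions under the usual normal approximation; the 60% control rate, 10-percentage-point MCID, 5% two-sided type I error, and 80% power are hypothetical values chosen for the example.

import math
from scipy.stats import norm

def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for comparing two proportions."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided type I error
    z_beta = norm.ppf(power)           # 1 minus type II error
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2)

print(n_per_arm(0.60, 0.70))  # 354 per arm, before inflating for dropout or interim looks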
Recruitment
The success of a trial is directly related to its ability to meet the pre-specified accrual
target. Given the tremendous effort and resource required to conduct a randomized
trial, every effort must be made to ensure the enrollment of consenting patients in a
timely manner. Details of recruitment plans are included in the protocol and will vary
with the patient population and interventions of interest, participating research
networks, and duration of accrual. Oversight measures to ensure adequacy of accrual
are described in this section.
Assignment of Interventions (for Controlled Trials): SPIRIT Checklist Items 16–17b
The single most powerful design aspect of a controlled clinical trial is the process of
randomization or random assignment of enrolled subjects/patients to protocol treat-
ments. The purpose of randomization is to reduce the biasing impact of known and
unknown factors on treatment comparisons as a means of isolating the treatment
effect on patient outcome. Multiple methods of randomization exist, blocked
randomization being among the most common.
Data Collection/Management and Analysis: SPIRIT Checklist Items 18a–20c
Data Collection
Data collection must align with the protocol specifications and thus not exceed what
has been approved by regulators, ethics boards, and consenting patients. Several
principles guide data collection during trial conduct: protection of identity and
confidentiality of trial participant data, adequacy of data to meet the primary and
secondary objectives of the trial, use of standard criteria to collect and report data,
and non-duplication of data collection unless justified and pre-specified in the
protocol document. The protocol must specify the data points of interest, methods
of collection, and frequency of reporting. Standard dictionaries for data collection
and reporting should be used where available, e.g., TNM (tumor, lymph node,
metastases) system for solid tumor cancer staging in oncology trials. Use of vali-
dated questionnaires or other instruments to enable accurate measurement of out-
come measures will ensure consistency of reporting and enhance the quality and
interpretation of the statistical analyses, e.g., EORTC QLQ-C30 questionnaire for
global quality of life evaluation in cancer patients (Aaronson et al. 1993) (Table 1).
Data Management
To demonstrate adherence to guidelines and regulations for database compilation,
storage, and access, the protocol or associated documents must detail the infrastruc-
ture and oversight procedures for data management. This includes information
regarding how trial conduct will be monitored at participating sites to enable data
verification, ethics compliance, and review of pharmacy documentation for drug
trials.
Guidance documents for retention of essential documents at participating sites
should be cited as appropriate. For example, ICH GCP 4.9.5 guidance refers to the
number of years that essential documents must be retained at an investigative site;
GCP 4.9.7 outlines investigative site obligations to allow direct access to trial-related
documents by oversight bodies such as a regulatory authority, research ethics
board, or monitors/auditors (https://ichgcp.net/4-investigator/).
The integrity of a database is related to the quality of data contained therein. To
ensure the accurate, complete, and timely submission of high-quality data by
participating sites, data collection forms should include clear instruc-
tions and unambiguous data entry fields. Submitted data should be consistent with
source records. Data management guidebooks are useful tools to address topics such
as data entry and editing, methods to record unknown data, how to add comments,
and how to respond to queries. Specific trial-related procedures can also be detailed
in such guidebooks.
Statistical Analysis
The statistical analysis section must describe the planned analyses in sufficient detail
to allow replication of the analysis and interpretation of the trial results by the scientific and
clinical community. Inclusion of an experienced statistical member/team in the
protocol writing, trial conduct, and analysis phases is essential to meet these goals.
The parameters of interest include the outcome measure to be compared; the
population whose data will be included, e.g., all randomized versus eligible; and
the statistical methods used to analyze the data. Details regarding the use of
censoring and methods to deal with missing data should also be included. When
stratification is used at randomization, the statistical test for the primary hypothesis
should account for the stratification factors (e.g., a stratified Cochran-Mantel-Haenszel
test for response rate and a stratified log-rank test for a time-to-event outcome). For
example, a clinical trial comparing a new therapy to standard of care with respect to
overall survival may utilize a time-to-event analysis. Appropriate statistical
methods to analyze the survival experience of all randomized patients grouped by
assigned treatment include graphical display using the Kaplan-Meier method and
comparison using an appropriate log-rank test (Rosner 1990) with additional explor-
atory comparisons adjusted for prognostic covariates (Cox 1972).
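As a concrete illustration of such an analysis, the sketch below uses the open-source Python lifelines package. The dataset, file name, and column names (months, died, arm, age, stratum) are hypothetical, and a real trial analysis would follow a validated statistical analysis plan rather than this minimal example.

```python
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

# Hypothetical analysis dataset: one row per randomized patient.
# months = follow-up time, died = event indicator (0/1),
# arm = assigned treatment (0 = standard of care, 1 = new therapy).
df = pd.read_csv("trial_survival.csv")

# Kaplan-Meier survival curves by assigned treatment (all randomized patients).
ax = None
for arm, grp in df.groupby("arm"):
    kmf = KaplanMeierFitter()
    kmf.fit(grp["months"], event_observed=grp["died"], label=f"arm {arm}")
    ax = kmf.plot_survival_function(ax=ax)

# Log-rank comparison of the two arms.
a = df[df["arm"] == 0]
b = df[df["arm"] == 1]
res = logrank_test(a["months"], b["months"],
                   event_observed_A=a["died"], event_observed_B=b["died"])
print(f"log-rank p = {res.p_value:.4f}")

# Exploratory Cox model (Cox 1972) adjusted for a prognostic covariate,
# stratified on the factor used at randomization.
cph = CoxPHFitter()
cph.fit(df[["months", "died", "arm", "age", "stratum"]],
        duration_col="months", event_col="died", strata=["stratum"])
cph.print_summary()
```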
Subgroup analyses are of great interest to the clinical community to understand the
treatment effect of a given intervention on different populations defined by specific
covariates such as those related to disease burden, exposure to prior treatment, or
patient characteristics. Given the exploratory nature of subgroup analyses, these
should be prospectively justified and defined in the protocol with the appropriate
statistical tests to determine if there is an interaction between treatment and subgroup.
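One common way to pre-specify such a test is to add a treatment-by-subgroup interaction term to the regression model used for the primary analysis. Continuing the hypothetical lifelines example above (the biomarker column is assumed, and the R-style formula interface is available in recent lifelines versions):

```python
from lifelines import CoxPHFitter

# df as in the previous sketch, plus a pre-specified 0/1 subgroup
# indicator (here a hypothetical "biomarker" column).
cph = CoxPHFitter()
cph.fit(df[["months", "died", "arm", "biomarker"]],
        duration_col="months", event_col="died",
        formula="arm * biomarker")  # expands to arm + biomarker + arm:biomarker
cph.print_summary()  # the arm:biomarker term tests the treatment-subgroup interaction
```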
Analyses of secondary outcome measures should inform interpretation of the
primary analysis and, ultimately, the research hypothesis. Sufficient details regarding
these analyses to justify their inclusion in the protocol and the associated data
collection plans are required. Using quality of life as an example, the specific
questionnaire/domains of interest, definition of meaningful change in score(s),
time point of data collection for analysis, and methods to control the type I error
due to multiplicity of testing should be outlined in the statistical section (Calvert
et al. 2018).
Data Monitoring
Oversight of data is integral to the regulatory, safety, and ethical obligations for any
trial. It is expected that all phase III randomized trials will be monitored in real time.
Efficacy
Interim analyses of the primary outcome measure allow for early termination of a
clinical trial if extreme differences between the treatment arms are seen. Given the
potential for misleading results and interpretations due to multiple analyses of
accumulating data (Geller and Pocock 1987), all prospectively planned interim
analyses must be described in detail in the statistical section. The description will
include the timing or triggers for the interim analyses, the nominal critical p-values
for rejecting the null and alternative hypotheses that may lead to early disclosure of
results or termination of the trial, and required statistical adjustment to preserve the
overall type I error of the trial.
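For illustration, an O'Brien-Fleming-type boundary spends very little type I error at early looks and approaches the nominal level at the final analysis. The Python sketch below reproduces the widely tabulated nominal critical values for five equally spaced analyses at an overall two-sided alpha of 0.05; the constant 2.040 is the published final-look value for this particular design, and deriving constants for other designs requires validated group-sequential (error-spending) software.

```python
from math import sqrt
from scipy.stats import norm

K = 5        # number of equally spaced analyses (4 interim + 1 final), assumed design
C = 2.040    # tabulated O'Brien-Fleming constant for two-sided alpha = 0.05, K = 5

for k in range(1, K + 1):
    z_k = C * sqrt(K / k)      # boundary: very strict early, near-nominal at the end
    p_k = 2 * norm.sf(z_k)     # two-sided nominal critical p-value at look k
    print(f"look {k}/{K}: reject if |Z| > {z_k:.3f} (nominal two-sided p < {p_k:.5f})")
```

Run as written, this prints boundaries of roughly 4.56, 3.23, 2.63, 2.28, and 2.04, i.e., nominal p-values of about 0.000005 at the first look rising to about 0.041 at the final look, which together preserve the overall two-sided type I error near 0.05.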
Harm
Safety monitoring is continuous during trial conduct and is multifaceted. It includes
adverse event reporting, laboratory and organ-specific surveillance testing such as
EKGs, as well as physical examinations of enrolled trial participants. An adverse
event is defined by the ICH E2A guideline as any untoward medical occurrence in a
patient or clinical investigation subject administered a pharmaceutical product and
which does not necessarily have to have a causal relationship with this treatment
(www.ich.org). Key components of this definition relate to the temporal association
of the untoward sign, symptom, or disease with the pharmaceutical product, regard-
less of causality. The ICH E2A guideline further defines the term serious adverse
event as any medical experience that results in death, is life-threatening, requires
inpatient hospitalization or prolongation of existing hospitalization, results in per-
sistent or significant disability/incapacity, or is a congenital anomaly/birth defect.
For protocol development, these definitions of adverse event and serious adverse
events apply to any medical procedure, not just pharmaceutical products.
To enable safety oversight in a given trial, the lexicon for adverse event classi-
fication and submission timelines by research personnel must be referred to in the
protocol and provided in a companion document or appendix. One example of such a
lexicon is the Common Terminology Criteria for Adverse Events (CTC AE) devel-
oped by the US National Cancer Institute and widely utilized in oncology and
non-oncology clinical trials (www.ctep.cancer.gov/protocoldevelopment). These
criteria provide standard wording and severity ratings for adverse events, grouped
by organ or system class. An essential component of safety reporting relates to
Quality Assurance
The quality management and quality assurance process is essential to the successful
conduct of clinical trials to ensure human subject protection and the integrity of trial
results. Systems should be in place to manage quality in all aspects of the trial
through all stages. The quality assurance process should be defined in the trial
protocol and be supported by standard operating procedures and plans. Details of
the plan must comply with applicable regulations and guidelines, health authority
expectations, and sponsor standard operating procedures. This includes details
regarding visit frequency, scope of review, and extent of compliance and source
data assessment. Risk factors to consider in development of the plan include but are
not limited to population, phase of trial, safety profile of agent, trial objectives and
complexity, accrual, performance history, and regulatory filing intent.
Quality assurance may include monitoring, either central or on-site, and auditing
activities. Per GCP these activities may be risk adapted. GCP 1.38 defines monitor-
ing as “the act of overseeing the progress of a clinical trial, and of ensuring that it is
conducted, recorded, and reported in accordance with the protocol, standard operat-
ing procedures, Good Clinical Practice, and the applicable regulatory requirement(s),”
whereas GCP 1.6 defines audit as “a systematic and independent examination of
trial related activities and documents to determine whether the evaluated trial
activities were conducted, and the data were recorded, analyzed, and accurately
reported according to the protocol, sponsor’s standard operating procedures, Good
Clinical Practice, and the applicable regulatory requirements.” Quality assurance
activities may include reviews at participating sites and vendors as well as internal
reviews of sponsor procedures. The objectives are to verify patient safety, to verify
the accuracy and validity of reported data, and to assess compliance with
regulations/guidelines and standard operating procedures. In general, components
of review relate to informed consent, protocol compliance and source data verifi-
cation, ethics and essential documents that form part of the trial master file (which
includes standard operating procedures and training records), and handling of
investigational medicinal product as applicable.
Ethics
An ethical trial is one that addresses an important research question while protecting
the safety, rights, and confidentiality of trial participants. The protocol must include
sufficient detail to reflect adherence to regulatory and guidance documents
pertaining to these principles. This includes adherence to the Declaration of Helsinki
and other reference documents such as the Tri-Council guidelines (Tri-Council
Policy Statement: Ethical Conduct for Research Involving Humans, December
2014. Retrieved from http://www.pre.ethics.gc.ca/pdf/eng/tcps2-2014/TCPS_2_
FINAL_Web.pdf) regarding research in vulnerable populations as defined by the
ability to make independent decisions or susceptibility to coercion. The protocol
should contain specific wording regarding enrollment of vulnerable individuals.
ICH GCP Section 4.8 provides guidance on the informed consent process. This
includes the requirement for an ethics committee-approved, signed informed consent
document prior to enrollment in the trial, the need to identify the most responsible
parties in a trial from compliance and liability perspectives, the use of a translator to
obtain informed consent, the methods to consent a participant who cannot read, and
the obligation to disclose new information to a trial participant during trial
conduct. Guidance regarding pregnancy reporting and follow-up is also required if
applicable to the trial population.
ICH GCP 4.8 also provides guidance regarding the explanations of the trial to be
included in the consent document. The explanations cover topics related to the
experimental nature of the research and the research question; the treatments under
evaluation and likelihood of assignment; trial-mandated interventions and categori-
zation of which are experimental versus nonexperimental; the risks and benefits of
trial participation including exposure of unborn embryos, fetuses, or nursing infants
to protocol therapies; the existence of alternative treatment options; the trial sample
size; and anticipated duration.
Topics relating to legal and ethical oversight of the trial must also be addressed in
the consent document including the roles of regulatory and ethics bodies in trial
conduct, the voluntary nature of trial participation, the rights of an enrolled partic-
ipant including the ability to withdraw consent to participate or submit data to the
sponsor, compensation for injuries should they occur, and the protection of confi-
dentiality, including which trial-related organizations will have direct access to original
patient data and how data will be stored. Specific contact information for all trial-related
questions or in the case of emergency is also provided.
Optional consents are utilized if there is a nonmandatory aspect of trial conduct in
which enrolled patients can participate. An example of an optional consent is one
that allows banking of tissue samples for future biomarker analyses related to disease
prognosis or predictors of response to the treatment strategies under investigation in
a trial.
Practically speaking, a consent must be written in clear, nontechnical language
aimed at a general readership rather than a research-savvy or legally trained partic-
ipant. Content-specific sections should be clearly identified using appropriate
headings and inclusive of the critical information required for an informed decision
regarding trial participation to be made. In reality, the process of ensuring that
consent is informed extends beyond obtaining a written signature on the consent
document. Adequate time and resources must be available prior to and after the
actual signature is obtained to respond to questions and provide information regard-
ing the clinical trial. The actual consent document is retained as a permanent part of
the healthcare record and is a useful resource for continued dialogue during the entire
trajectory of the trial including the analysis, publication, and dissemination
process (Resnick 2009; www.fda.gov/patients/clinical-trials-what-patients-need-
know/informed-consent-clinical-trials).
Dissemination
Dissemination of results of a trial is usually done via presentations at scientific
meetings and/or a peer-reviewed research manuscript published in a scientific
journal. The International Committee of Medical Journal Editors (ICMJE)
has established four general criteria for authorship in a medical journal that
must be met for all named individuals on a submitted manuscript (http://www.
icmje.org/): (1) substantial contributions to the conception or design of the work, or the
acquisition, analysis, or interpretation of data for the work; (2) drafting the work or
revising it critically for important intellectual content; (3) final approval of the version
to be published; and (4) agreement to be accountable for all aspects of the work in
ensuring that questions related to the accuracy or integrity of any part of the work are
appropriately investigated and resolved.
Conclusion
The protocol is the pivotal guidance document for a clinical trial that communicates
the essential details of the research plan to trial participants and organizations
involved in research oversight. A well-written protocol is internally consistent,
organizes its content into clearly designated sections, and uses unambiguous wording.
Guidance documents for protocol development and content
are useful resources for all stakeholders involved in clinical trial research.
References
Aaronson NK, Ahmedzai S, Bergman B, Bullinger M, Cull A, Duez NJ, Filiberti A, Flechtner H,
Fleishman SB, de Haes JC, Klee M, Osoba D, Razavi D, Rofe PB, Schraub S, Sneeuw K,
Sullivan M, Takeda F (1993) The European Organization for Research and Treatment of Cancer
QLQ-C30; a quality-of-life instrument for use in international clinical trials in oncology. J Natl
Cancer Inst 85:365–376
Altman DG, Bland JM (1999) How to randomise. BMJ 319:703–704
Calvert M, Kyte D, Mercieca-Bebber R, Slade A, Chan AW, King MT, The SPIRIT-PRO Group,
Hunn A, Bottomley A, Regnault A, Chan AW, Ells C, O’Connor D, Revicki D, Patrick D,
Altman D, Basch E, Velikova G, Price G, Draper H, Blazeby J, Scott J, Coast J, Norquist J,
Brown J, Haywood K, Johnson LL, Campbell L, Frank L, von Hildebrand M, Brundage M,
Palmer M, Kluetz P, Stephens R, Golub RM, Mitchell S, Groves T (2018) Guidelines for
inclusion of patient-reported outcomes in clinical trial protocols: the SPIRIT-PRO extension.
JAMA 319(5):483–494
Chan AW, Tetzlaff JM, Altman DG (2013a) SPIRIT 2013 statement: defining standard protocol
items for clinical trials. Ann Intern Med 158:200–207
Chan AW, Tetzlaff JM, Gotzsche PC (2013b) SPIRIT 2013 explanation and elaboration: guidance
for protocols of clinical trials. BMJ 346:e7586. https://doi.org/10.1136/bmj.e7586
Cox DR (1972) Regression models and life tables (with discussion). J R Stat Soc Ser B
34:187–220
Geller NL, Pocock SJ (1987) Interim analyses in randomized clinical trials: ramifications and
guidelines for practitioners. Biometrics 43(1):213–223
Gore L, Ivy SP, Balis FM, Rubin E, Thornton K, Donoghue M, Roberts S, Bruinooge S, Ersek J,
Goodman N, Schenkel C, Reaman G (2017) Modernizing clinical trial eligibility: recommen-
dations of the American Society of Clinical Oncology–Friends of Cancer Research Minimum
Age Working Group. J Clin Oncol 35:3781–3787
Kernan WN, Viscoli CM, Makuch RW, Brass LM, Horwitz RI (1999) Stratified randomization for
clinical trials. J Clin Epidemiol 52(1):19–26
Kim ES, Bruinooge SS, Roberts S, Ison G, Lin NU, Gore L, Uldrick TS, Lictman SM, Roach N,
Beavre JA, Sridhara R, Hesketh PJ, Denicoff AM, Garrett-Mayer E, Rubin E, Multani P,
Prowell TM, Schenkel C, Kozak M, Allen J, Sigal E, Schilsky RL (2017) Broadening eligibility
criteria to make clinical trials more representative: American Society of Clinical Oncology and
Friends of Cancer Research joint research statement. J Clin Oncol 35:3737–3744
Lancaster GA, Dodd S, Williamson PR (2004) Design and analysis of pilot studies: recommenda-
tions for good practice. J Eval Clin Pract 10:307–312
Lichtman SM, Harvey RD, Smit MAD, Rahman A, Thompson MA, Roach N, Schenkel C,
Bruinooge SS, Cortazar P, Walker D, Fehrenbacher L (2017) Modernizing clinical trial eligi-
bility criteria: recommendations of the American Society of Clinical Oncology–Friends of
Cancer Research Organ Dysfunction, Prior or Concurrent Malignancy, and Comorbidities
Working Group. J Clin Oncol 35:3753–3759
Pocock SJ, Simon R (1975) Sequential treatment assignment with balancing for prognostic factors
in the controlled clinical trial. Biometrics 31(1):103–115
Resnick DB (2009) Do informed consent documents matter? Contemp Clin Trials 30(2):114–115
Rosner B (1990) Fundamentals of biostatistics, 3rd edn. PWS-Kent, Boston
Whitehead AL, Sully BG, Campbell MJ (2014) Pilot and feasibility studies: is there a difference
from each other and from a randomised controlled trial? Contemp Clin Trials 38(1):130–133
Xie W, Regan MM, Buyse M, Halabi S, Kantoff PW, Sartor O, Soule H, Clarke NW, Collette L,
Dignam JJ, Fizazi K, Parulekar WP, Sandler HM, Sydes MR, Tombal B, Williams SG, Sweeney
CJ (2017) Metastasis-free survival is a strong surrogate of overall survival in localized prostate
cancer. J Clin Oncol 35(27):3097–3104
Zelen M (1974) The randomization and stratification of patients to clinical trials. J Chronic Dis
27:365–375
Procurement and Distribution of Study
Medicines 11
Eric Hardter, Julia Collins, Dikla Shmueli-Blumberg, and
Gillian Armstrong
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
Procurement of Investigational Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Investigational Product Procurement Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
Use of a Generic Drug . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Considerations for IP Procurement/Manipulation in Blinded Trials . . . . . . . . . . . . . . . . . . . . . . . 173
Impact of IP-Related Factors: Controlled Substances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
Identification of Qualified Vendors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Packaging Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
Manufacturing and Packaging Considerations for Blinded Trials . . . . . . . . . . . . . . . . . . . . . . . . . . 176
International Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
Distribution of Investigational Product . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Documents to Support Release of IP to Qualified Sites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Use of Controlled Substances in Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
IP Inventory Management for Complex Study Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
IP Accountability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Shipping and Receipt of IP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Quality Assurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Abstract
Compared to clinical trials involving a new (unapproved for human use) drug or
biologic, clinical trials utilizing an approved, commercially available medication
introduce a new set of variables surrounding procurement and distribution, all of
which are fundamental to successful trial implementation.
Numerous procurement factors must be considered, including the identification
of a suitable vendor, manufacturing of a matching placebo, and expiration dating,
all of which can become more intricate when the study increases in complexity by
involving factors like active comparators, drug tapering regimens, and research
sites in more than one country. Distribution is a similarly complex operation,
which involves adherence to regulatory requirements and consideration of
aspects such as blinded study designs or utilization of additional safeguards
with the use of controlled substances. This chapter will review the basic factors
to be taken into consideration during the planning and operational stages of a
clinical trial involving a marketed medication and provide examples of how to
manage these factors, all of which are aimed at ensuring compliance with both
applicable local and international laws and with guidance documents aimed at
protecting the rights, safety, and well-being of trial participants.
Keywords
Investigational product (IP) · Placebo · Current Good Manufacturing Practices
(cGMPs) · Blinded/blinding · Manipulation · Procurement · Controlled
substance · Vendor · Accountability · Distribution
Introduction
Fig. 1 Examples of clinical trials using marketed medication usually requiring an IND or regional
equivalent
This chapter will outline the main components and points to consider for IP procurement (includ-
ing sourcing, manipulation of dosage forms, and compounding) and distribution
(tracking, restocking, and destruction) throughout the life of a clinical trial.
The extent to which a study drug must be manipulated for the clinical trial will
dictate selection of not only an initial source of the commercially available drug
but also the requirement for all other IP-related vendors or suppliers. Thus, it is
critically important for the Sponsor to decide upon all IP-related protocol aspects
during the planning phase, prior to selecting a supplier, and to make minimal
changes to the protocol that can impact IP during study conduct. For small open-
label clinical studies where IP is administered once, IP procurement may be as
simple as the on-site physician ordering the medication from an appropriate
commercial vendor or pharmacy and dispensing to participants, tracking lot
numbers as per institutional practices. However, IP procurement requirements
can quickly become more complicated for later phase trials (phase 2 and 3),
which can last longer, require blinded medication, and/or have many participat-
ing sites (national and international). This complexity can be compounded further
by IP-driven storage requirements (controlled, refrigerated, or frozen
medication).
Once the initial protocol design is finalized, the identification of a suitable
commercially approved drug or biologic is the first step in procurement planning.
This should be a dosage form (tablet, capsule, liquid, etc.), strength, and formula-
tion (oral, injectable, topical, etc.) suitable for use in the study, given the proposed
schedule and route of administration, with factors such as color, taste, shape, etc.,
as well as the availability of immediate-release or extended-release formulations,
taken into consideration (if appropriate). Pricing of each of the available options
for the study should then be performed to allow an initial check against the study
budget. If the proposed clinical trial is being conducted by an academic institution
or public health agency, it can be worthwhile to approach the pharmaceutical
company manufacturing the drug to ask about any programs through which the
IP needed to conduct the trial may be obtained for free or at a lower cost. In such
situations, the company donating the drug may dictate the packaging/labeling for
their IP and the process that must be utilized to supply medication to the study
sites. A detailed agreement should be in place regarding the provision of IP; the
requirement, if any, for clinical sites to return unused medication to the manufac-
turer; the ability of the trial Sponsor to cross-reference the manufacturer’s inves-
tigational or marketing application as required; and any specific safety reporting
related to product quality that the manufacturer requires for their post-marketing
obligations.
Use of a Generic Drug
In many countries around the world, innovative drugs are protected from generic
intrusion by a patent or a period of marketing exclusivity. In the USA, the former is a
legal protection obtained through and afforded by the US Patent and Trademark
Office, while the latter is provided to a manufacturer by FDA. Both have an ability
to prohibit competitors from seeking approval of a drug or biologic therapeutically
equivalent to the innovator drug, which limits drug availability and can keep
prices high. International regulatory authorities, such as Health Canada and
the European Medicines Agency (EMA), have similar data exclusivity protections.
Having only a single source of IP can not only make IP prohibitively expensive but
may also delay or halt the study if there is a market shortage of the drug.
If neither a patent nor marketing exclusivity applies, purchasing options may
increase to potentially include generic versions of an IP. Prior to marketing
authorization, generic drugs must be considered therapeutically equivalent to the
innovator drug. The FDA considers drugs pharmaceutical equivalents if they contain
the same active ingredients, are of the same dosage form and route of administration,
are formulated to contain the same amount of active ingredient, and meet the
same compendial or other applicable standards (i.e., strength, quality, purity, and
identity). Generic drugs will differ in characteristics such as shape, scoring config-
uration, release mechanisms (for immediate- or extended-release formulations),
packaging, excipients (including colors, flavors, preservatives), expiration dating,
and, within certain limits, labeling, all important factors to take into consideration
when selecting a generic version of an approved drug to use as an IP. In the USA,
therapeutically equivalent generic drugs will receive an “A” rating in the FDA
Approved Drug Products with Therapeutic Equivalence Evaluations book (also
known as the Orange Book).
Considerations for IP Procurement/Manipulation in Blinded Trials
Blinded studies will require additional consideration, as the drug or comparator will
be manipulated prior to being used in the trial, e.g., covered with another color to
obscure identifying markers (i.e., inking or debossing) to allow the manufacture of
a matching placebo (▶ Chaps. 43, “Masking of Trial Investigators” and ▶ 44,
“Masking Study Participants”). A blinded study design is used to reduce bias and
involves the study participants being unaware of which treatment assignment or
study group they are randomized to (single-blind) or all parties (the Sponsor,
investigator, and participant) being unaware of a participant’s treatment assignment
(double-blind). For placebo-controlled studies, a commercially sourced IP must be
disguised, and a matching placebo must be manufactured, if not already available
from the commercial IP manufacturer as part of their study support. In comparison,
open-label medication trials do not disguise the IP, as both the participants and
investigators are aware of the assigned treatment. IP can be obtained and managed in
a more straightforward fashion, both during drug procurement and throughout the
implementation of open-label clinical trials.
Generic medications, each of which vary in shape, color, and/or debossing/
imprinting (for tablets), can provide additional challenges or potential benefits
for blinding in a clinical trial, as certain shapes/sizes may be easier to insert
into a capsule, to replicate in a placebo, or to disguise for blinded use; e.g., a tablet
that has a letter or logo printed in ink on the surface can be easier to disguise by
overspraying than a tablet with a similar marking which is debossed, since the latter
contains a gap that must be filled.
Identification of Qualified Vendors
Once a marketed product has been selected, securing a reliable IP source for the
entire duration of the study is the most important next step. Key parameters to
consider are the lead time for procuring the IP, quantity available, the time needed
to manipulate it (e.g., spray coating, repackaging), and the expiration date of the IP
available. Lead time and available quantity could be subject to change pending a
potential shortage of drug. The time needed to get the IP ready to ship to sites is
dependent on the degree of manipulation, whereas expiration date of the IP is
directly tied to its stability profile, with drug wholesale companies usually providing
their “oldest” stock for shipment over stock which can remain on their shelves
longer. While the use of generic medication may reduce initial costs, availability
over an extended period may still become an issue for longer trials, necessitating IP
restocking. Further, generic drugs can be removed from the market without warning,
affecting the entire supply chain for a clinical trial. Thus, it is important to consider
the longevity of generic manufacturing (i.e., the likelihood of continuation of IP
manufacture for the duration of the trial) prior to selecting a manufacturer.
In a blinded trial, the purchased IP (and active comparator, if one is available)
must be manipulated (and the matching placebo manufactured) prior to study start.
For example, an IP in tablet form may be obscured via discoloration (e.g., over-
spraying) to cover identifying markers/debossing and to allow the manufacture of a
matching placebo. An injectable medication, however, may only require relabeling
if the color can be matched with a placebo. The extent to which purchased IP is
manipulated for the clinical trial will dictate the selection of an appropriate supplier
or vendor for this manipulation, set the timeline from procurement to shipping to the
clinical site, and also set expectations for IP-related data to be collected during the
study. For example, manipulation of a study drug may require release and ongoing
stability testing to ensure its continued identity, purity, and potency.
Packaging Considerations
If, for the purposes of blinding, a commercial drug product is inserted into a capsule,
sprayed to change color, etc., the impact of these changes should be taken into
consideration and testing performed to ensure that the quality attributes of the IP,
patient safety, and study integrity are all maintained. For example,
depending on the dissolution rate of the tablet, placing it into an opaque gelatin capsule
for blinding purposes may change the rate of drug release and therefore impact the
onset of drug action, which can be important if a drug has a narrow therapeutic
window, so performing re-encapsulation (emptying current capsules and transferring
content to a new capsule) may be a preferable option. An IP that is sensitive to light
should be repackaged under appropriate conditions, utilizing amber/opaque bottles or
suitable blister packages. While placebo has no expectation of potency, it typically still
needs to undergo testing for characteristics such as sterility, appearance, and odor
during stability studies. Neither drug nor placebo should be released for use in the
study until it meets all applicable release testing requirements, with testing usually
performed according to local Pharmacopeia monographs (standards for identity, qual-
ity, purity, strength, packaging, and labeling).
International Clinical Trials
A central consideration for international studies surrounds the regulatory status of
the commercial product chosen (and the
active ingredient) in each country (e.g., approved for marketing, approved but no
longer available, etc.), as this influences the ability to import IP and the requirement
for a clinical trial application locally. For example, a multinational study utilizing IP
which is FDA-approved may only require IRB oversight in the USA, without an
IND application, provided the study meets all criteria in 21 CFR 312.2. However, if
the same drug does not have marketing approval in Canada, it will require
full reporting to Health Canada under a Clinical Trial Application (CTA), an
assessment by the study Research Ethics Board, and an environmental assessment
by Environment Canada. Such regulatory approvals, or lack thereof, influence drug
sourcing options. In the scenario described above, IP would need to be exported
from the US manufacturer and imported into Canada. This requires prior approval of
the CTA and appropriate labeling of the exported drug, along with sign-off from an
importing agent in Canada (who must be a Canadian resident); otherwise, the IP will
be seized at the border by the Canadian Border Services Agency that works in
conjunction with Health Canada.
Similarly, IP manufacturing requirements may differ between countries. While
ICH member states typically overlap in this regard, small differences may result in
additional levels of compliance. For example, an IP manufactured in the USA and
intended for import to a European Union member state (e.g., Germany and France),
for a clinical trial, will require release by a qualified person (QP). The QP is
responsible for verifying that the IP meets a certain degree of cGMP compliance
for import into the country and thus will likely require access to IP batch records to
determine cGMP adherence. In some instances, the QP may wish to assess the batch
manufacture in person, depending on the risk of the activities undertaken during the
manufacturing process. It should be noted that, while the above scenario remains
plausible, mutual recognition agreements often exist between competent authorities
(CAs), typically between ICH members. These agreements effectively state that the competent
regulatory authority from an importing country will defer to a cGMP inspection of
the competent regulatory authority from the exporting country (or from another ICH
country, if such an inspection has been performed), without necessitating additional
inspection. Therefore, choice of a commercially available drug manufactured by a
company that has already obtained marketing authorizations for it in the countries to
be used in the clinical trial may lead to a quicker study start-up, as the local CAs will
be familiar with the IP and only the manipulation for the clinical trial will need to be
described.
Documents to Support Release of IP to Qualified Sites
As per ICH E6 GCP, many documents must be generated and be on file prior to
study start, which is often considered the initial shipment of IP to a clinical site.
These documents include those relating to the release of IP by the manufacturer,
for example, a Certificate of Analysis, which ensures that the IP fulfills the quality
attributes set for it. A subset of documents is also collected from the site
and includes documents related to the investigator’s ability to conduct the trial
(documentation of relevant qualifications and training), the favorable review of the
study protocol and other documents by the IRB or Independent Ethics Committee
(IEC), (▶ Chap. 36, “Institutional Review Boards and Ethics Committees”) and an
agreement to follow the study protocol, including the requirements for the reporting
of adverse events. These documents should be reviewed for their accuracy and
suitability to support study conduct prior to authorizing IP shipment and will often
include ensuring that the site has the appropriate documents and training for han-
dling IP, including IP disposition logs. If the IP is a controlled medication, the
applicable local registrations for the site to receive and prescribe controlled sub-
stances, such as DEA registration in the USA, are particularly important and will
have to be provided to the central distributing facility to ensure they comply with the
facility’s SOPs.
In a clinical trial with a small number of sites, it may be feasible to collect and
manage these regulatory documents using a paper system; however, in a larger-scale,
multicenter clinical trial, the collection and management of regulatory documents
throughout the life of the study is more challenging and complex. When multiple
sites are involved, the use of an electronic trial master file (eTMF) and a linked
clinical trial management system (CTMS) can help facilitate this task and assist
the Sponsor in maintaining compliance with applicable regulations (Zhao et al.
2010). For example, some systems can automatically trigger alerts and notifications
for upcoming expiration dates of documents filed in the system, as well as generate
reports that can be used to quickly identify any missing documents. Only once
all required documents have been collected and the required national and local
approvals are in place can the Sponsor or designee supply an investigator or
institution with IP. Setting clear expectations early in study start-up for the number
and quality of documents to be collected from the site prior to IP shipment is key to
on-time study start. Also, routine frequent monitoring of site documents, including
IP logs, at the site will ensure that the documents are maintained and are inspection
ready at the end of the study.
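As an illustration of the kind of automated check such systems perform, the Python sketch below flags missing regulatory documents and documents nearing expiration. Every site name, document name, and date here is hypothetical, and a production eTMF would implement this logic inside validated software.

```python
from datetime import date, timedelta

# Hypothetical document index: (site, document) -> expiration date.
documents = {
    ("Site 01", "Medical license, PI"): date(2025, 6, 30),
    ("Site 01", "IRB approval"):        date(2024, 11, 1),
    ("Site 02", "IRB approval"):        date(2026, 2, 15),
}
# Hypothetical set of documents required on file before IP shipment.
required = {"Medical license, PI", "IRB approval", "Form FDA 1572",
            "GCP training certificate"}

def etmf_report(documents, required, today, window_days=60):
    """Flag missing documents and documents expiring within window_days."""
    sites = {site for site, _ in documents}
    for site in sorted(sites):
        on_file = {doc for s, doc in documents if s == site}
        for doc in sorted(required - on_file):
            print(f"{site}: MISSING {doc}")
        for (s, doc), exp in documents.items():
            if s == site and exp <= today + timedelta(days=window_days):
                print(f"{site}: {doc} expires {exp.isoformat()}")

etmf_report(documents, required, today=date(2024, 10, 1))
```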
DEA Schedule | Description | Example Substance(s)
I | No currently accepted medical use in the US, a lack of accepted safety for use under medical supervision, and a high potential for abuse | Heroin, lysergic acid diethylamide (LSD), marijuana (cannabis)
II/IIN | High potential for abuse which may lead to severe psychological or physical dependence | Hydromorphone, oxycodone (Schedule II); amphetamine (Adderall®) (Schedule IIN (stimulants))
III/IIIN | Potential for abuse less than substances in Schedules I or II; abuse may lead to moderate or low physical dependence or high psychological dependence | Tylenol with Codeine®; buprenorphine; ketamine
IV | Low potential for abuse relative to substances in Schedule III | Alprazolam (Xanax®); diazepam (Valium®)
V | Low potential for abuse relative to substances listed in Schedule IV; consist primarily of preparations containing limited quantities of certain narcotics | Cough preparations containing not more than 200 milligrams of codeine per 100 milliliters or per 100 grams (Robitussin AC®)
Source: US DEA Diversion Control Division 2018.
Fig. 2 List of US DEA categories or “schedules.” Drugs are categorized into schedules based on
their acceptable medical use and abuse or dependency potential (United States Department of
Justice Drug Enforcement Administration, Diversion Control Division 2018)
IP Inventory Management for Complex Study Designs
Inventory management is driven primarily by study design, the specific drugs used
(special handling requirements, e.g., controlled substances, frozen/refrigerated), the
number of clinical sites, and the countries involved. For example, as mentioned,
IP for an open-label study conducted at a single site can be prescribed locally
and dispensed as needed to the participants following institutional procedures.
Conversely, more intricate study designs, particularly those involving blinded IP,
increase the complexity of the drug supply process. To accommodate these com-
plexities, electronic systems have evolved to include more streamlined, auditable,
and user-friendly drug assignment and distribution processes. Systems such as a
CTMS, IRT (interactive response technologies), and/or electronic data capture
(EDC) systems (▶ Chap. 13, “Design and Development of the Study Data System”)
are used alone or in conjunction with each other, interfacing and automatically
updating to assign IP to a participant, track site inventory, and request resupplies.
The key benefits to these systems are the ability to monitor the IP in real time, the
immediate allocation of IP (bottle, kit, or single dose) available at the site to a
participant according to an overall randomization scheme, tracking of expiration of
IP, and the possibility for automatic replenishment of IP once a predetermined
threshold is reached. In blinded studies, the EDC can bypass the requirement for any
direct staff involvement in resupply. One prerequisite for these systems is careful
planning in the setup of not only the system but also the IP itself, as the Sponsor must
ensure that any system-specific information, such as a bottle or kit identifying
number and corresponding barcode for electronic systems, is included to allow the
IP to be tracked. Overall, these systems can help minimize last-minute supply
requests, oversupply at a single site, and waste at clinical sites.
Fig. 3 Key control measures impacting clinical trials with controlled substances (adapted from
Woodworth 2011)
Site Name | Report Date | Supply Name | Current Inventory | Threshold | Expiration Date
Research Site A | 9-Aug-18 | XR-NTX Medication Kit | 10 | 5 | 8/31/2018
Research Site A | 9-Aug-18 | Suboxone 4mg strips | 750 | 400 | 12/31/2019
Research Site A | 9-Aug-18 | Suboxone 8mg strips | 300 | 350 | 11/30/2019
Research Site B | 9-Aug-18 | XR-NTX Medication Kit | 8 | 5 | 8/31/2018
Research Site B | 9-Aug-18 | Suboxone 4mg strips | 615 | 400 | 12/31/2019
Research Site B | 9-Aug-18 | Suboxone 8mg strips | 425 | 350 | 11/30/2019
Research Site C | 9-Aug-18 | XR-NTX Medication Kit | 4 | 5 | 5/31/2019
Research Site C | 9-Aug-18 | Suboxone 4mg strips | 375 | 400 | 12/31/2019
Research Site C | 9-Aug-18 | Suboxone 8mg strips | 428 | 350 | 11/30/2019
Fig. 4 Example Inventory Form for IP/study medication. IP levels at or below threshold or past
their expiration date appear in red
IP Accountability
One method for utilizing electronic IP management systems is to ask research staff
to report IP inventory periodically (e.g., weekly) directly in the EDC system. These
data are subsequently pulled into specifically programmed reports, which can
be reviewed to identify reorder needs based on predetermined thresholds and
usage at each site. Color-coding of these reports is useful for quick visual identifi-
cation of supplies that are nearing expiration or are below the desired threshold.
Of note, to maximize efficiency, inventory forms may also include study supplies
other than the IP (e.g., blood draw equipment) (not reflected in figure) (Fig. 4).
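A minimal pandas sketch of the threshold and expiry logic behind a report like Fig. 4 follows; the rows are copied from the figure for illustration, and the 30-day expiry window is an assumption rather than a value from this chapter.

```python
import pandas as pd

# Inventory snapshot mirroring a few rows of Fig. 4 (illustrative values).
inv = pd.DataFrame([
    ("Site A", "XR-NTX Medication Kit",  10,   5, "2018-08-31"),
    ("Site A", "Suboxone 4mg strips",   750, 400, "2019-12-31"),
    ("Site C", "XR-NTX Medication Kit",   4,   5, "2019-05-31"),
    ("Site C", "Suboxone 4mg strips",   375, 400, "2019-12-31"),
], columns=["site", "supply", "on_hand", "threshold", "expires"])
inv["expires"] = pd.to_datetime(inv["expires"])

report_date = pd.Timestamp("2018-08-09")
inv["below_threshold"] = inv["on_hand"] <= inv["threshold"]
inv["expiring_soon"] = inv["expires"] <= report_date + pd.Timedelta(days=30)

# Rows needing attention are the ones a report like Fig. 4 would color red.
print(inv[inv["below_threshold"] | inv["expiring_soon"]])
```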
Shipments should be carefully timed, and the appropriate amount of IP (taking site
storage capacity and projected enrollment rate/target into account) should be distrib-
uted with each shipment to minimize the amount of IP and other supplies left unused
at the site while ensuring there are sufficient supplies on site for active study
participants. The latter can often be predicted based on participant enrollment rates
at the site and the IP distribution schedule delineated in the study protocol. The ideal
ratio and time will reduce shipping costs and waste.
Drug accountability is more than simply counting pills; it goes hand in hand
with inventory management and refers to the record keeping associated with the
receipt, storage, and dispensation of an investigational product. When done
correctly, it should provide a complete and accurate accounting of drug handling
from initial receipt on site to final disposition (e.g., utilization in the study, return,
or destruction). While the study Sponsor is responsible for procurement of
study medication (Sponsor Requirements) (including the processes delineated at
the start of this chapter), once at the sites, it is the responsibility of the principal
investigator (PI) to maintain adequate records of the product’s handling and dispen-
sation (ICH GCP 4.6.1) (▶ Chap. 6, “Investigator Responsibilities”). In accordance
with ICH GCP, the PI can choose to delegate responsibility of IP accountability to an
“appropriate pharmacist or another appropriate individual” of whom they have
oversight (ICH GCP 4.6.2). It should be ensured that this delegation and the
qualifications of the individual are documented appropriately.
By nature, working with human research subjects introduces participant-level
error with regard to study drug accountability, particularly when IP is “sent home”
with the participant. For example, participants might forget doses, misplace or lose
some or all of the study IP, share IP with others, or sell it illegally. To account for
these situations, clinical trial protocols often require participants to return any
unused IP at designated intervals throughout the study (e.g., weekly, monthly)
before providing them with more. Once IP is returned, qualified study staff
complete an inventory (e.g., number of remaining capsules/tablets) to evaluate
medication compliance and perform IP accountability (the latter is particularly
important when the IP is a controlled substance). To address the expected level of
participant error in IP management during a clinical trial, there are certain strate-
gies that study Sponsors can employ to attempt to increase medication adherence
and enhance subsequent drug accountability. These methods may range from basic
paper-and-pencil record keeping, such as providing participants with a small
calendar to mark the dates and times that they took their medication, to leveraging
technology-based options. One such option involves real-time medication adher-
ence monitoring by using a “smart” pill bottle, which may perform tasks such as
indicating or recording whether the patient took their medication on schedule (e.g.,
through a glowing light or via a counting tool built into the bottle cap), measuring
the time that elapses between doses using a stopwatch function, or even sending
automatic medication reminders to patients via text message on their smartphone
(Choudhry et al. 2017). Another technology-based option is providing patients
with a QR (quick response) code to access easy-to-understand pharmacist counsel-
ing videos, which guide patients through how to take the medication, any potential
side effects, etc. (Yeung et al. 2017). As with all materials provided to a patient
during a clinical trial, such tools require IRB review; an IRB may also request additional
oversight of compliance and require that a participant take daily videos of them-
selves taking the study medication and securely share these with the study team for
compliance measurement purposes. Potential study participants will be informed
of such a mechanism during the informed consent process (▶ Chap. 21, “Consent
Forms and Procedures”).
Enhancing medication adherence is not only beneficial for study data reliability,
but it also plays a role in overall IP accountability. As such, it is the responsibility of
the study Sponsor to become familiar with the available options for participant IP
adherence and select a method (or methods) that is appropriate for the population
being studied and makes sense within the larger context of the clinical trial. The PI
must also ensure proper security and storage of the investigational drug while on site
(ICH GCP 4.6). Even when research pharmacists or vendors are involved, the PI
retains this responsibility (▶ Chap. 6, “Investigator Responsibilities”). Finally, drug
Fig. 5 Example study drug inventory and tracking log for a clinical trial
Quality Assurance
The Sponsor is responsible for ensuring that adequate quality assurance procedures
are followed throughout the drug procurement and distribution process (“Sponsor
Requirements”). Specific to IP receipt and accountability, ICH GCP guidelines
indicate that “the Sponsor should ensure that written procedures include instruc-
tions that the investigator/institution should follow for the handling and storage of
investigational product(s) for the trial and documentation thereof” (ICH GCP
5.14.3). Detailed procedures (e.g., in the study protocol or pharmacy manual)
should address the proper receipt, handling, storage, dispensing, and final disposi-
tion of the investigational product. Once IP is on site, the PI is required to follow all
applicable laws and regulations regarding IP administration and distribution to
participants, such as adhering to the dosing guidelines in the investigator’s bro-
chure or package insert and keeping detailed records of the IP (e.g., lot number,
amount) that is distributed to each study participant. Those tasked with observing
the research sites in person to ensure that the study procedures are being properly
followed, often termed clinical trial monitors, should also periodically ensure that
procedures related to IP, such as adequate IP storage, temperature monitoring
during shipment, and appropriate documentation for receiving and dispensing
drug throughout the trial, are in place and are being followed.
Key Facts
References
Choudhry N, Krumme A, Ercole P et al (2017) Effect of reminder devices on medication adherence:
the REMIND randomized clinical trial. JAMA Intern Med 177(5):624–631
Dragic L, Lee E, Wertheimer A et al (2015) Classifications of controlled substances: insights from
23 countries. Innov Pharm 6(2):Article 201
State of California Department of Justice (2018) Research advisory panel. https://oag.ca.gov/
research. Accessed 06 Sep 2018
United Nations (1961) Single convention on narcotic drugs. https://www.unodc.org/pdf/conven
tion_1961_en.pdf. Accessed 04 Oct 2018
United Nations (1971) Convention on psychotropic substances. https://www.unodc.org/pdf/conven
tion_1971_en.pdf. Accessed 04 Oct 2018
United Nations (1988) Convention against illicit traffic in narcotic drugs and psychotropic sub-
stances. https://www.unodc.org/pdf/convention_1988_en.pdf. Accessed 04 Oct 2018
United States Department of Justice Drug Enforcement Administration, Diversion Control Division
(2018) Control substance schedules. https://www.deadiversion.usdoj.gov/schedules/. Accessed
04 Oct 2018
United States Drug Enforcement Administration (2018) Drug scheduling. https://www.dea.gov/
drug-scheduling. Accessed 04 Oct 2018
US Department of Health and Human Services Food and Drug Administration Center for Drug
Evaluation and Research (2017) Guidance for industry: re-packaging of certain human drug
products by pharmacies and outsourcing facilities. https://www.fda.gov/downloads/Drugs/Guid
ances/UCM434174.pdf. Accessed 19 Oct 2018
Woodworth T (2011) How will DEA affect your clinical study? J Clin Res Best Pract 7(12):1–9
World Health Organization (WHO) (2018) Substances under international control. http://www.who.
int/medicines/areas/quality_safety/sub_Int_control/en/. Accessed 06 Sep 2018
Yeung D, Alvarez K, Quinones M et al (2017) Low-health literacy flashcards & mobile video
reinforcement to improve medication adherence in patients on oral diabetes, heart failure, and
hypertension medications. J Am Pharm Assoc 57(1):30–37
Zhao W, Durkalski V, Pauls K et al (2010) An electronic regulatory document management system
for a clinical trial network. Contemp Clin Trials 31:27–33
Selection of Study Centers and
Investigators 12
Dikla Shmueli-Blumberg, Maria Figueroa, and Carolyn Burke
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Site and Investigator Selection Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Site Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Facility Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Administrative Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Recruitment Potential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
Regulatory and Ethics Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
Investigator Responsibilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Investigator Qualification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Investigative Team Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Site and Investigator Selection Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Surveys and Questionnaires . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
Site Qualification Visit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
Abstract
Site and investigator selection has traditionally been the result of a comprehensive
process by which a study sponsor and/or designated representative, often a
contract research organization (CRO), evaluates prospective investigative teams
and associated clinical sites for clinical trial participation. A list of criteria is often
compiled and used by the sponsor to grade site and investigator suitability for
study participation.
In implementing a study, sponsors and site teams become partners in achieving
study goals. For longer-term or complex studies, this partnership can become
extensive. Sponsors and site teams must mutually invest in respective stakeholder
perspectives and approaches to answer research questions. Site teams interested
in the research question, with sufficient and qualified staff, and with access to the
desired participant population, are key to conducting sound research. Sponsors
that appreciate site perspectives, consider site operations and logistics in protocol
design, support efforts to mitigate site challenges, communicate updates on
broader perspectives of study activities, and offer fair compensation for site
resources are key to implementing a successful study. Conversely, dis-
connected relationships between investigative teams and study sponsors can
disrupt the timetable for research, contributing to compromised morale, cost
overruns, and increased variability in administering the protocol, resulting in
reduced ability to detect treatment differences, and may result in an unsuccessful
trial.
A site and investigator selection process should be designed to ensure that both
sponsors and site teams thoroughly evaluate whether the protocol design,
resources, sponsor/team relationships, general timelines, and site facilities are
compatible in achieving the goals of the study.
Keywords
Site · Investigator · Sponsors · Investigator selection plan · Site selection ·
Recruitment · Investigator qualification · Investigative team · Qualification visit
Introduction
Site and investigator selection is not a one-size-fits-all activity. It is not unusual for
a site to be labeled a “good site” or a “bad site” in research, yet the criteria for such
conclusions are ill-defined. What qualities do those designations represent, and how
are those qualities best assessed? Perhaps more important is the philosophy that
there is no universally “good” or “bad” site, but rather the partnership or “fit”
between the site team and the sponsor in executing a specific protocol at a specific
site can be better, or worse, given the site resources and protocol requirements. A site
team that successfully contributed to achieve study goals in one study may not
necessarily have the same success in a subsequent, similar study. A sponsor that may
have been a positive partner to a study team on a previously successful study may not
be a positive partner under a subsequent study.
Site teams and sponsors with established relationships from prior partnerships can
use their experiences to evaluate potential future partnerships. These relationships
often are not devoid of subjective considerations, but objective measures should be
incorporated into evaluations to the extent possible.
As a study protocol evolves, sponsors consider the site facilities and teams that may
best complement the goals of the study. The protocol context can have a significant
impact on site characteristics and serve to narrow the field of potential sites quickly.
A documented site and investigator selection plan can be useful in defining the site
and investigator characteristics that are expected to complement the study
requirements.
Investing time developing a site and investigator selection plan encourages the
sponsor to review the protocol with perspective for anticipated needs and challenges
associated with subject recruitment, site and subject compensation, quantity and
location of sites (e.g., single country or international, rural or urban), site type (e.g.,
academic, commercial, private practice), study visit schedules, operations and
logistics, staffing experience and credentials, equipment, and applicable regulations.
Given the scope of potential factors to evaluate, a comprehensive plan helps
sponsors and site teams evaluate their compatibility objectively and efficiently,
and can mitigate subjective factors that introduce bias into selection. For example,
an objective and transparent plan can limit the potential for hurt feelings rooted in
pre-existing relationships. Plans should also include documented timepoints at which
the effect of the plan, once applied, is evaluated, so that any areas of the site and
investigator selection approach requiring modification can be identified.
Once a site and investigator selection plan has been developed, sponsor and site
teams should have common understanding and insight into potential study require-
ments before considering partnering in a research project. At minimum, a detailed
protocol synopsis, if not an initial draft of the protocol, should be available for site
teams to evaluate feasibility and potentially offer perspectives and experience that
could support further, more robust, protocol development. With access to detailed
information about a prospective study, site teams may be able to enhance the
integrity of a site application by offering objective and specific examples
of resources and past performance. In turn, this may create opportunities for more
candid review and discussion among both site and sponsor teams for evaluating
suitability for a study. Considerations that could be mutually applied to both the
sponsor and site team could include quantitative categories associated with level of
engagement, response time, adherence to timelines, and resolutions to action items/
troubleshooting initiatives. At the site level, considerations may include protocol
and regulatory compliance, subject recruitment and retention, the number of data
queries generated and the time to query resolution, and data completion. Even the best
possible site teams will “fail” if the protocol requirements are not suitable for their
site or if the budget is inadequate to support their efforts.
Site Selection
Facility Resources
Key considerations in assessing the suitability of sites for a research study include
evaluation of the research space, which may include exam room features, secure and
appropriate areas to store study drug or devices, specialized equipment needs,
availability of and access to a pharmacy, and laboratory and imaging capabilities.
Important factors to evaluate include the availability of adequate infrastructure,
staff availability (e.g., hours/days of week), staff depth (e.g., coverage for key staff
on leave, attrition management), and staff credentials and expertise such as clinicians
representing a disease specialty. Alternatively, if there is a possibility of supporting
capacity-building activities such as staff training at a site, then there could be more
flexibility regarding this criterion. Ideally, the facility would be located in close
proximity to the subject population of interest, or at least be easily accessible via
public transportation. Other important facility-related considerations include
accounting for institutional standards of care, attitudes and participation of the various departments
in the facility, the ability to support data management operations, Internet connec-
tivity, immediate and long-term record storage capabilities, and other study-specific
operational concerns.
[Figure: primary areas of site selection – administrative considerations, recruitment
potential, and regulatory and ethics requirements]
Planning for who, what, when, where, and how study requirements will be completed will
help site teams evaluate their processes and either identify areas for modification to
accommodate the subject experience or determine that they will not be able
to satisfy the study requirements.
In addition to the subject experience, the same considerations should be applied to
improve overall efficiency at the site. For example, if a protocol requires laboratory
sample analysis within 2 h of a subject’s arrival in an Emergency Room, the site team
will need to identify means to collect that sample shortly after consent, get the
sample to the lab quickly, and place a stat order for analysis. The sample collection
and analysis are already challenging, but if the laboratory happens to be on the other
side of an academic campus, or site policies require the sample to be transported by
specific site personnel who may not be immediately available, then there is an
increased risk of failing to meet the protocol requirement of sample analysis within 2 h.
Sites that have a central laboratory facility and equipment resources consistent
with protocol requirements (e.g., an MRI machine of a specific Tesla rating) and can
follow study-wide standard operating procedures (SOPs) and/or guidelines may be more timely in
preparing for site initiation than sites that do not have such resources.
Administrative Considerations
Study costs may differ between research sites for a variety of reasons, including the
presence or absence of a national healthcare system in each country, regional
standards of care, or the routine patient care costs that public programs such as
Medicare or private insurance carriers in that area will cover. Individual study budgets may not be known
during the site selection process, but investigating key budget-related questions (e.g.,
institutional overhead fees) might be informative even in the most preliminary phase
of the process.
Policies and legal requirements of both the sponsor and the site should be
explored. Time requirements for negotiating budgets and contract terms should
be considered. Sites with extensive legal contract and budget reviews may be less
favorable depending on study timelines. Additionally, there may be requirements
(e.g., protection from liability) either from the sponsor or at the site that cannot be
accommodated, instantly ruling out study participation. Alternatively, requirements
that can be accommodated but require negotiation may prolong time needed between
site selection and site initiation, ultimately impacting the overall project budget and
timelines. Such factors will also impact timelines for implementing protocol and
contract amendments after the study is initiated.
Recruitment Potential
Research questions cannot be answered without study subjects who are engaged in
the trial and willing to support the activities needed to answer those questions. The more
collaborative the sponsor and site team are in encouraging and supporting subject
recruitment and retention, the better.
Many trials in the past have failed to reach their enrollment goal within the
anticipated timeframe, or at all. A primary factor to consider in site selection is
whether the site team has access to a target population with the condition of interest
whose members are willing to participate in a trial. Recruitment potential is a necessary, but
insufficient, stand-alone criterion for selection. Sites may have a history of reaching
their recruitment goals, but a careful selection process must also assess whether
subjects were good research subjects, once enrolled, by reviewing compliance and
retention rates.
The investigator should be able to demonstrate that they have adequate recruit-
ment potential for obtaining the required number of subjects for that study. This
could be based on retrospective data, such as showing the number of patients who
have come in for a similar treatment at that facility over the past 12 months.
Ideally, site and sponsor teams will have experience in compiling and executing
a recruitment and retention plan centered around the subject experience. Even
without such experience, it’s never too late to try a new approach. Considerations
for such a plan include streamlined study activity workflow, potential compensation,
subject access to transportation, availability of childcare, snack/meal options during
visits, flexible scheduling hours, personnel dispositions, general hospitality, and
even facility aesthetics that will better support recruitment and retention initiatives.
Investigator Selection
Investigators hold a key role in clinical trials, and the success or failure of a trial
may hinge in part on finding a suitable individual to fill this important position.
It is important for sponsors and prospective investigators to have a mutual under-
standing of the roles and responsibilities of a site investigator and the investigative
team in ensuring study subject privacy and safety and protocol and regulatory
compliance during a research project. Without an appreciation for protocol and
regulatory requirements prior to implementing a study, investigative teams can
experience significant compliance and operational difficulties.
Investigator Responsibilities
• Knowledge of the study protocol, ensuring other research staff are informed about
the protocol, and conducting their roles in accordance with the processes and
procedures outlined in the current version of that document.
• Maintaining proper oversight of the study drug, device, or investigational product
including documenting product receipt, handling, administration, storage, and
destruction or return. An investigator has a responsibility to inform potential
study subjects when drugs or devices are being used for investigational purposes.
• Reporting safety events throughout the implementation of a clinical trial (21 CFR
312.64). Investigators should document adverse events (AEs), untoward medical
occurrences associated with the use of a drug in humans, that occur during the
study and, sometimes, even for a period after study closure. They must carefully
follow the protocol and federal guidelines (e.g., from OHRP and the FDA) for the
appropriate procedures for reporting AEs and serious adverse events (SAEs) as needed.
9. COMMITMENTS (Section 9 of Form FDA 1572)
I agree to conduct the study(ies) in accordance with the relevant, current protocol(s) and will only make changes in a protocol after
notifying the sponsor, except when necessary to protect the safety, rights, or welfare of subjects.
I agree to inform any patients, or any persons used as controls, that the drugs are being used for investigational purposes and I will
ensure that the requirements relating to obtaining informed consent in 21 CFR Part 50 and institutional review board (IRB) review
and approval in 21 CFR Part 56 are met.
I agree to report to the sponsor adverse experiences that occur in the course of the investigation(s) in accordance with 21 CFR
312.64. I have read and understand the information in the investigator’s brochure, including the potential risks and side effects of the
drug.
I agree to ensure that all associates, colleagues, and employees assisting in the conduct of the study(ies) are informed about their
obligations in meeting the above commitments.
I agree to maintain adequate and accurate records in accordance with 21 CFR 312.62 and to make those records available for
inspection in accordance with 21 CFR 312.68.
I will ensure that an IRB that complies with the requirements of 21 CFR Part 56 will be responsible for the initial and continuing
review and approval of the clinical investigation. I also agree to promptly report to the IRB all changes in the research activity and all
unanticipated problems involving risks to human subjects or others. Additionally, I will not make any changes in the research without
IRB approval, except where necessary to eliminate apparent immediate hazards to human subjects.
I agree to comply with all other requirements regarding the obligations of clinical investigators and all other pertinent requirements in
21 CFR Part 312.
Investigator Qualification
The US Code of Federal Regulations specifies that clinical trial investigators should
be qualified by training and experience as “appropriate experts to investigate the
drug” (21 CFR 312.53). The ICH guidelines include a similar assertion, stating that
investigators should be qualified by education, training, and experience to oversee
the study and provide evidence of meeting all qualifications and relevant regulatory
requirements (ICH, GCP 4.1). When the study intervention involves use of an
investigational product (IP) or devices, then part of being qualified involves being
familiar with that product. The investigator should be thoroughly familiar with
the use of the IP as described in the protocol as well as having reviewed the
Investigator’s Brochure (for an unapproved product) or Prescribing Information
(for approved products) which describe the pharmacological, chemical, clinical,
and other properties of the IP. Appropriate general and protocol-specific training
can help ensure that an investigator is adequately qualified to conduct the study.
The FDA does not specifically require that a lead investigator have a medical degree
(e.g., MD, DO), but often he or she does. If the principal investigator (PI) is not a
physician, a physician should be listed as a sub-investigator to make trial-related
medical decisions. This is
consistent with GCP standards that state that medical care given to trial subjects
(ICH GCP 2.7) and trial-related medical decisions (ICH GCP 4.3.1) should be the
responsibility of a qualified physician.
Investigator Equipoise
In research there is an expectation of equipoise, that is, uncertainty about the relative
effectiveness of the various intervention groups, which is necessary to run the trial
ethically. The nature
of the relationship between a patient and physician changes once a physician enrolls a
patient in a clinical trial, thereby creating a potential conflict of interest (Morin et al.
2002). For example, even if the physician believes that one of the treatment arms or
groups is more likely to be successful, all investigators have a responsibility to follow
the study protocol and most importantly the randomization plan. Physicians who have
equipoise do not have an inherent conflict of interest when suggesting study enrollment
and randomization to their patients. Those without equipoise may inject conscious or
subconscious bias in treatment or protocol administration.
Investigator Motivation
Other investigator qualifications are more subjective in nature and difficult to
quantify or document, such as motivation, leadership style, and investigator engage-
ment. Motivation may arise from the desire to work on cutting-edge research or
develop or test products that could ultimately improve the health of patients around
the world. Other motivating factors include financial benefits as well as prestige and
recognition in the professional and scientific community. Some have asserted that
enthusiasm and scientific interest in the research question are the most important
qualifications for potential principal investigators (Lader et al. 2004). A passionate
PI leading the study team is likely to evoke enthusiasm and determination which will
be valuable for successful implementation of the trial.
Most studies will have a research coordinator who supports the daily operations
of the study such as scheduling subjects for visits, interacting with subjects at the site
and possibly conducting some of the assessments, ensuring data is accurately
collected and reported, and maintaining the site regulatory files. Other site staff
may include physicians and other medical clinicians, pharmacists, phlebotomists,
counselors, and other support personnel. Investigators can increase the likelihood for
a strong and productive team by fostering an environment of cooperation based on
the site staff’s shared mission of implementing a successful trial. Establishing clear
roles and responsibilities for each staff member is also important, and investigators
are required to maintain a list of all staff and their delegated trial-related duties (ICH
GCP 4.1.5) in addition to ensuring that the activities are delegated appropriately to
trained and qualified site staff members. An engaged investigator who spends
sufficient time at the research site will be able to monitor and evaluate the group
dynamics throughout the trial and ensure that morale remains high during difficult
times (e.g., low subject recruitment rates) and that staff have sufficient time to
complete their work. Other strategies for maintaining a strong site team include
providing immediate feedback to staff about their performance, having backups for
key staff roles, and providing opportunities for staff to review and provide input on
the protocol and manual of operations prior to study start-up to ensure their input is
incorporated in final versions.
Provided that potential challenges can be mitigated or overcome, there can be many
advantages to selecting a site team with research history and experience. There may
also be advantages in selecting site teams with limited or no previous research
experience, and such sites should not be overlooked on the basis of this criterion alone.
Once the ideal criteria and characteristics for a site and study team for a project have
been determined, sponsors will need to find efficient and effective means to share
information about the project and collect relevant information from site teams.
This is often accomplished via a combination of utilities and interpersonal
interaction.
There are several ways for sponsors to solicit site information about prospective
study site teams. Sponsors may elect to cast a broad net and open the selection
process to any study team interested in participation. They may also elect to
implement a more strategic approach using some of the following resources:
• Sponsor databases
• Participation in similar studies listed on ClinicalTrials.gov (see Fig. 3)
• Research network memberships/registries
• Prior research partnerships
• Professional organization databases
• Literature searches
While a process that includes an “all are welcome” site recruitment approach may
be an easy way to achieve a specific number of desired sites, and expedite study start-
up, this may increase the risk of compromising longer-term goals of the study and
can result in ineffective resource allocation. The site and investigator selection plan
can help sponsors more clearly define the total number of sites, the type of site and
site team, and the corresponding qualifications and training expectations that would
ideally support study activities.
With specified ideal qualifications, sponsors and site teams can conduct prelim-
inary review of requirements to quickly determine “go/no go” conclusions for
pursuing a study partnership. Once a site team elects to pursue study participation,
they can highlight specific areas of their staffing, facilities, subject recruitment
population, and any prior training and experience to illustrate why they are a good
fit for the study. Sponsors can also compare the desired qualifications and charac-
teristics for the study with submitted site application information to evaluate the site
teams that are most compatible with study goals.
Surveys and Questionnaires
Common tools used to collect information about prospective study sites include
surveys, questionnaires, and interviews. Depending on the design of the chosen
method(s), the information collected can be integral in determining site team and
study compatibility, or it can expend staffing time and resources without adding
much value to the selection process. The tools intended to be used to collect
information from sites should be included as part of the sponsor site and investigator
selection plan.
The overall purpose of any method chosen needs to be clearly defined. This will
help ensure that information relevant to study participation and to proposed
outcome criteria will be offered by prospective study teams. For example, if a
sponsor site selection plan indicates interest primarily in site teams with research
experience in multiple sclerosis, but the method used to collect information from
sites doesn’t specifically address this level of detail, then the sponsor team may
receive information from site teams that reflect research experience, but without
specific reference to disease-area expertise. The sponsor then will not have a
complete picture from which to determine whether a site team is compatible with
the study. Additional time and effort for clarification may be needed, or, in the
interest of time, the sponsor may move on to other proposals and overlook a
potentially ideal site team, simply because the design of the tool used to solicit the
information did not elicit sufficient detail. The design of the collection tool will
likely be a blend of both objective
and subjective information. The questions posed should reflect the four primary
areas discussed in section “Site Selection” of this chapter: facility resources,
administrative considerations, recruitment potential, and regulatory and ethics
requirements. The survey questions can be grouped by area for ease of completion
and can include specific study-relevant items beyond the four primary areas
described above.
A more objective design that corresponds with a scoring system for each type of
response could limit potential bias in the site selection process. At the least, there
should be a predetermined agreement on the particularly important items so that the
sites can be appropriately scored or ranked based on those criteria (Figs. 4 and 5).
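A weighted scoring scheme of this kind is straightforward to implement. The sketch below is a minimal illustration of the idea only; the criteria, weights, and site responses are hypothetical and are not taken from this chapter.

```python
# Minimal weighted-scoring sketch for ranking candidate sites.
# Criteria, weights, and site responses are invented examples.

CRITERIA_WEIGHTS = {
    "recruitment_potential": 0.35,  # predetermined "particularly important"
    "staff_experience": 0.25,       # items carry the largest weights
    "facility_resources": 0.25,
    "regulatory_history": 0.15,
}

def score_site(responses: dict) -> float:
    """Combine per-criterion scores (0-10) into one weighted score."""
    return sum(CRITERIA_WEIGHTS[c] * responses[c] for c in CRITERIA_WEIGHTS)

sites = {
    "Site A": {"recruitment_potential": 8, "staff_experience": 6,
               "facility_resources": 7, "regulatory_history": 9},
    "Site B": {"recruitment_potential": 5, "staff_experience": 9,
               "facility_resources": 8, "regulatory_history": 8},
}

# Rank sites from highest to lowest weighted score.
for name in sorted(sites, key=lambda s: score_site(sites[s]), reverse=True):
    print(f"{name}: {score_site(sites[name]):.2f}")
```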
Objective items are those that can be quantified and verified, such as measures of
staffing, facilities, and past recruitment and retention performance. Subjective
responses, on the other hand, usually offer more insight into the dynamics of the
site team: the team’s demeanor and motivation to participate in the study, its
empathy and compassion for the study population, and any creative means the site
team has used to improve efficiencies and recruit and retain study subjects.
Examples of subjective items include:
• Have any of the potential site staff worked together before? Share information
about methods of communication at the site to ensure study requirements and
updates are distributed.
• Describe a problem you encountered with a previous study and what approach
was taken to address it.
Prior to distributing the information collection tool to site teams, sponsor teams
should consider how information from sites will be received and reviewed for
potential partnership. Consideration should be given to having sites
provide masked information to the sponsor to eliminate as much bias in the selection
process as possible. If this is clarified as part of the design process, the sponsor can
be more forthright in advising prospective site teams of what to expect once a site
proposal has been submitted. Some questions to consider include:
• Will the information collected be kept confidential solely with this sponsor team?
• How will the information be stored (e.g., paper/electronic files, a site database)?
• Will the information collected be considered for this study alone or for this study
and other future potential studies with this sponsor?
• What process will be used to select site teams for study participation?
• Are additional activities expected after review of preliminary information (e.g.,
interviews, follow-up on site visits)?
• What process will be used to advise site teams of sponsor selections?
• What are the timelines for distribution and expected returned information from
site teams?
• Who will be available to respond to inquiries from site teams as they attempt to
complete the requested information?
Sponsor teams that consider these questions in advance, and that describe the process
and expectations for evaluating prospective study partnerships, may be more likely to
collect timely, thoughtful, and comprehensive responses from site teams.
Site Qualification Visit
Despite all the effort required to compile a site information collection utility,
and all the technology available to help people connect via phone, email, chat,
social media, and video, in-person interactions offer the greatest opportunity to
evaluate whether the sponsor and site teams can establish an effective partnership.
Summary
As has been described throughout this chapter, the site selection process is
complex and dynamic. Sponsors and site teams who elect to conduct a trial
together are entering, at minimum, into a short-term partnership with each other
and must be interdependent to achieve the goals of the study. Sponsors must
consider a variety of factors and embark upon both remote and in-person means
to learn more about prospective study sites. Investigators who participate in a
study have significant responsibility for ensuring appropriate staffing and
corresponding training and qualifications, regulatory compliance, protocol com-
pliance, and, most importantly, protecting the safety and rights of the subjects who
participate in a study. Investigative teams must weigh their responsibilities against
the study requirements provided by the sponsor. Ultimately, both sponsors and site
teams must evaluate whether they can be compatible and can achieve the goals of
the research project.
Key Facts
1. The sponsor and study teams should be viewed as partners with a common goal of
identifying a good fit between a study and an investigator and site staff.
2. Some of the important considerations when selecting a site for a clinical trial
include facility resources, administrative considerations, recruitment potential,
and regulatory and ethics requirements.
3. Selection of an investigator may include objective quantifiable considerations
such as previous studies and publications and specific area of expertise, as well as
more subjective qualifications such as motivation and leadership style.
4. There are a variety of ways for sponsors to solicit site information about pro-
spective study site teams, such as through site surveys and questionnaires and site
qualification visits.
References
Anderson C, Young P, Berenbaum A (2011) Food and Drug Administration guidance: supervisory
responsibilities of investigators. J Diabetes Sci Technol 5(2):433–438. https://doi.org/10.1177/
193229681100500234
Lader MD, Cannon CP, Ohman EM et al (2004) The clinician as investigator: participating in
clinical trials in the practice setting. Circulation 109(21):2672–2679. https://doi.org/10.1161/01.
CIR.0000128702.16441.75
Morin K, Rakatansky H, Riddick F et al (2002) Managing conflicts of interest in the conduct of
clinical trials. JAMA 287(1):78–84. https://doi.org/10.1001/jama.287.1.78
Design and Development of the Study Data
System 13
Steve Canham
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
Descriptive Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
System Components: The Naming of Parts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
The Rise of the eRDC Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
Deployment Options and Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Constructing the Study Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Working from the Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Working Up the Functional Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
“User Acceptance Testing” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
Final Approval of the Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
Using Data Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
Validating the Specification and Final Release . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
Development and Testing Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
The Study Data System in the Longer Term . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Change Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
Exporting the Data for Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
Abstract
The main components of a typical modern study data system are described,
together with a discussion of associated workflows and the options for deployment,
such as PaaS (platform as a Service) and SaaS (software as a service), and their
implications for data management. A series of recommendations is made about
how to create a study-specific system, by developing a functional specification from the study
protocol and data management plan and validating the built system before release.
S. Canham (*)
European Clinical Research Infrastructure Network (ECRIN), Paris, France
e-mail: [email protected]
Keywords
Clinical Data Management System · CDMS · Study definition · Electronic remote
data capture · eRDC · Functional specification · Validation · Data standards ·
Change management · Data extraction
Introduction
The study data system is designed to collect, code, clean, and store a study’s data,
deliver it for analysis in an appropriate format, and support its long-term manage-
ment. In fact, a “study data system” is rarely a single system – it is normally a
collection of different hardware and software components, some relatively generic
and others study specific, together with the procedures that determine how those
components are used and the staff that operate the systems, all working together to
support the data management required within a study.
This chapter has three parts. The first provides a descriptive overview and looks at
the typical components of a study data system, the associated data processing
workflows, and the main options for deployment. The second is a more prescriptive
account of how a data system should be designed, constructed, and validated for an
individual study, while the third discusses two longer term aspects of system use:
managing change and delivering data for analysis.
Descriptive Overview
System Components: The Naming of Parts
Any study data system has to be study specific, collecting only the data items
required by a particular study, in the order specified by the assessment schedule.
But any such system also has to meet a more generic set of requirements, to
guarantee regulatory compliance and effective, safe, data management. The func-
tionality required includes:
• The provision of granular access control, to ensure users only see the data they are
entitled to see (in most cases, only the data of the participants from their own site).
• Automatic addition of audit trail data for each data entry or edit, with preservation
of previous values.
• The ability to put logic checks on questions, so that impossible, unusual, or
inconsistent values can be flagged to the user during data entry.
• Conditional “skipping” of data items that are not applicable for a particular
participant (as identified by data entered previously).
• Support for data cleaning – usually by built-in dialogues that allow central and site
staff, and monitors, to easily exchange queries and responses within the system.
• The ability to accurately extract part or all of the data, in formats that can be
consumed by common statistical programs.
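The last four of these requirements lend themselves to simple illustration. The sketch below shows, schematically, how a logic check and a skip condition might behave at data entry; the field names, rules, and function signatures are hypothetical and are not drawn from any particular CDMS.

```python
# Hypothetical sketch of two generic CDMS behaviors: a logic check that
# flags impossible or unusual values, and a conditional skip rule.
from datetime import date

def check_visit_date(visit_date: date, entry_date: date) -> list:
    """Return flags to show the user during data entry."""
    flags = []
    if visit_date > entry_date:
        flags.append("Visit date cannot be later than the date of data entry.")
    elif visit_date.year < 2000:
        flags.append("Unusually early visit date - please confirm.")
    return flags

def is_item_skipped(form_data: dict, item: str) -> bool:
    """Skip items that do not apply, based on data entered previously."""
    if item == "pregnancy_test" and form_data.get("sex") == "male":
        return True
    return False

print(check_visit_date(date(2026, 1, 10), date(2025, 12, 1)))  # flagged
print(is_item_skipped({"sex": "male"}, "pregnancy_test"))      # True
```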
These requirements mean that a study data system almost always has as its core a
specialist Clinical Data Management System (CDMS), a software package that
provides the generic functions listed above, but whose user interface can be adapted
to reflect the requirements of specific studies. Such systems are usually purchased on
a commercial basis and may be installed locally or hosted externally.
A single CDMS installation can and usually does support multiple studies. It makes
use of a database for storing the data and provides a set of front-end screens for
inputting and querying it. The database is normally one of the common relational
systems (e.g., SQL Server, Oracle, MySQL), but it will be automatically configured by
the CDMS to store both the study design details and the clinical, user, and audit data.
The user interface screens are normally web pages, which means that as far as end
users are concerned, the CDMS system is “zero-footprint”: it does not require any
local installation at the clinical sites or the use of a dedicated laptop. An end user at a
clinical site accesses the system, remotely and securely, by simply going to a pre-
specified web page. From there they can send the data immediately back to the central
CDMS, where it is transferred to the database. For security and performance reasons,
the database is normally on a different server, with tightly controlled access, whereas
the web server is necessarily “outward facing” and open to the web (see Fig. 1).
The study-specific part of the system is essentially a definition and is often
therefore referred to simply as the “study definition.” It is stored within and
referenced by the CDMS, and stipulates all the study-specific components, for
example, the sites, users, data items, code lists, logic checks, and skip conditions.
It also defines the order and placement of items on the data capture screens (usually
referred to as “eCRFs,” for electronic case report forms), and how the eCRFs are
themselves arranged within the “study events” (or “visits”), i.e., the distinct time
points at which data is collected. Even though a single CDMS installation usually
contains multiple study definitions, it controls access so that users only ever see the
study (or studies) they are working on, and the data from their own site.
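To make the shape of a study definition concrete, the sketch below models one as a nested structure of events, eCRFs, and items. This is a schematic illustration only: the study name, sites, and items are invented, and real definitions carry many more attributes (versions, user roles, edit checks, and so on).

```python
# Schematic study definition: study events contain eCRFs, which contain
# items. All names and values are invented for illustration.
study_definition = {
    "study": "EXAMPLE-001",
    "sites": ["site_01", "site_02"],
    "events": [
        {
            "name": "Baseline",  # a "study event" or "visit"
            "forms": [
                {
                    "name": "Demographics",  # an eCRF
                    "items": [
                        {"id": "sex", "type": "code",
                         "code_list": ["male", "female"]},
                        {"id": "age", "type": "integer",
                         "range": (18, 90)},  # a simple logic check
                    ],
                },
            ],
        },
        {"name": "Week 4 Follow-up", "forms": []},
    ],
}
```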
Almost all systems also allow a study definition to be exported and imported as a
file. This allows the definition to be easily moved between different instances of the
same CDMS – for example, when transferring a study definition from a development
to a production environment. If the file is structured using an XML schema called the
Operational Data Model, or ODM (CDISC 2020a), an international standard devel-
oped by CDISC (the Clinical Data Interchange Standards Consortium), then it is
Fig. 1 The main components of a modern study data system. The CDMS stores one or more study
definitions and is usually installed on a web server, which presents each study’s screens to
authorized users via a secure internet link. The CDMS is also connected to a database, usually on
a separate server, for data storage
sometimes possible to transfer the study definition between different data collection
systems (e.g., between collaborators). “Sometimes” rather than “always” because,
unfortunately, not all CDMSs fully support ODM export/import, and there are some
elements of a study definition (such as automatic consistency checks on data) for which
ODM still provides only partial support.
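To give a flavor of the standard, the sketch below parses a heavily simplified ODM-style fragment with Python’s standard library. The fragment omits the CDISC namespace and many attributes that a conformant ODM file requires, so it should be read as an illustration of the idea rather than valid ODM.

```python
# Reading form and item definitions from a simplified ODM-style fragment.
# Real ODM files use the CDISC namespace and many required attributes.
import xml.etree.ElementTree as ET

odm_fragment = """
<ODM>
  <Study OID="EXAMPLE-001">
    <MetaDataVersion OID="v1" Name="Version 1">
      <FormDef OID="F.DEMOG" Name="Demographics">
        <ItemRef ItemOID="I.SEX"/>
        <ItemRef ItemOID="I.AGE"/>
      </FormDef>
      <ItemDef OID="I.SEX" Name="Sex" DataType="text"/>
      <ItemDef OID="I.AGE" Name="Age" DataType="integer"/>
    </MetaDataVersion>
  </Study>
</ODM>
"""

root = ET.fromstring(odm_fragment)
for form in root.iter("FormDef"):
    print("Form:", form.get("Name"))
for item in root.iter("ItemDef"):
    print("  Item:", item.get("Name"), "->", item.get("DataType"))
```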
In terms of features, almost all CDMSs are technically compliant with clinical
trial regulations, especially GCP and 21 CFR Part 11: e.g., they allow granular access
control, they provide automatic audit trail data, the internal timestamps are
guaranteed to be consistent, etc. Without this technical compliance, they would
stand little chance in the marketplace. What makes a system fully compliant,
however, is the way in which it is used: the set of standard operating procedures
and detailed work instructions that govern how the system is set up and functions
in practice, together with the assumed competence of the staff operating those
systems. For this reason, both “policies and procedures” and “central IT and DM
staff” are included as elements within Fig. 1.
The study-specific system that users interact with, the product of the study
definition and the underlying CDMS, is sometimes known as a Clinical Data
Management Application (CDMA). Although in practice “study definition,” or
even “study database,” is probably more common, in this chapter, the more accurate
“CDMA” is used to refer to the software systems supporting a specific study, and
“study definition” is restricted to the detailed specification that defines the CDMA’s
features.
Researchers are normally much more engaged with the details of the CDMA
than with the underlying systems, but they should at least be satisfied that the
systems used for their trial are based upon an appropriate CDMS and that there
is a mature set of procedures in place that govern its consistent and regulatory
compliant use. This is one of the many reasons why the operational manage-
ment of a trial is best delegated to a specialist trials unit, which might be a
department within a university, hospital, research institute or company, or an
independent commercial research organization, or CRO (for simplicity, in this
chapter, all of these are referred to as a “trials unit”). Not only will such a unit
already be running or managing one or more CDMSs, they will also be able to
provide the expertise to develop the study specific part of the system safely and
quickly.
CDMSs differ in their ease of use and setup, for instance, in creating study
definitions, extracting data, or generating reports, and in the additional features that
they may contain.
The great majority of CDMSs in use are commercial systems, available from a
wide variety of vendors. In early 2020, a search on a software comparison site listed
58 different systems that offered both electronic data capture and 21 CFR Part 11
compliance (Capterra 2020), and that list was far from comprehensive. Vendors
range from large multinationals to small start-ups, and license costs vary over at least
an order of magnitude, from a few thousand to several hundred thousand dollars per
study, though as discussed below costs can also depend on the deployment models
used. There are also a few open source CDMSs: OpenClinica (2020) and RedCap
(2020) are the two best-known; both are available in free-to-install versions (as well
as commercial versions that provide additional support), and both have enthusiastic
user communities.
There are also some local CDMSs, built “in-house,” particularly in academic
units, although they are becoming less common. CDMSs are increasingly complex
and costly to build and validate, and effective ongoing support requires an invest-
ment in IT staff that is beyond the budget of most noncommercial units. Local
systems can also be over dependent on local programmers and become more difficult
to maintain if key staff leave. Although there is no question that home-grown
CDMSs can function well, there is an increased risk in using such systems. Sponsors
and researchers who find themselves relying on such systems need to be confident
that they are fully validated, and that they are likely to remain supported for the
length of the trial.
The Rise of the eRDC Workflow
The dominant web-based workflow for collecting clinical trial data, as depicted in
Fig. 1, is known as electronic remote data capture, or eRDC (EDC and RDC are
also used, and in most contexts mean the same thing). Since the early 2000s, eRDC
has slowly supplanted the traditional paper-based workflow, where paper CRFs were
sent through to the central trials unit or CRO, by post or courier, to be transcribed
manually into a central CDMS. Early papers extolling the benefits of eRDC were
often written by the CDMS vendors (e.g., Mitchel et al. 2001; Green 2003), who had
obvious vested interests. Despite this, the cost and time benefits of eRDC have
driven gradual adoption, especially for multi-site trials and in geographical areas
where reliable internet infrastructure is available. The advantages include:
• Removing the transcription step, and thus the time lag between the arrival of a
paper CRF and loading its data into the system, and eliminating transcription
errors. It therefore removes the need for expensive checks on data transcription,
such as double data entry.
• Speeding up data queries – the “dialogue” between site and central data manage-
ment can proceed securely on-line, rather than by sending queries and responses
manually. This can be especially important when chasing down queries in
preparation for an analysis.
• Allowing safety signals to be picked up more quickly. In addition, some systems
can generate emails if adverse events of a particular severity or type are recorded
(a sketch of such a trigger follows this list).
• Making it possible to reject “impossible” data (e.g., a visit date later than the date
of data entry) and thus force an immediate revision during data entry. In
a paper-based system, the need to reflect a paper CRF’s contents, however bizarre,
means that this type of data error must be allowed and then queried, or subject to
“self-evident correction” rules.
• Making it easier and clearer to tailor systems to the particular requirements of a
site, or a particular study participant (e.g., based on gender, treatment, or severity
of illness), by using skipping logic rather than sometimes complex instructions on
a paper CRF.
• Avoiding CRF printing costs and time.
• Allowing the data collection system to be more easily modified, for instance, in
the context of an adaptive trial.
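As a concrete illustration of the safety-signal point above, the sketch below shows the logic of a hypothetical alert trigger. The event fields, severity values, and notification hook are all invented and do not correspond to any particular CDMS.

```python
# Hypothetical safety-signal trigger: notify central staff when an
# adverse event of sufficient severity is saved at a site.
ALERT_SEVERITIES = {"severe", "life-threatening", "fatal"}

def on_adverse_event_saved(event: dict, notify) -> None:
    """Call the notification hook for events meeting alert criteria."""
    if event.get("severity") in ALERT_SEVERITIES or event.get("serious"):
        notify(f"SAE alert: participant {event['participant_id']}, "
               f"term '{event['term']}', severity {event['severity']}")

# Example wiring; a real system would send an email to safety staff.
on_adverse_event_saved(
    {"participant_id": "S01-0042", "term": "anaphylaxis",
     "severity": "severe", "serious": True},
    notify=print,
)
```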
By 2009, a Canadian study found 41% eRDC use for phase II–IV trials (El Emam
et al. 2009), and anecdotal evidence suggests eRDC use has continued to rise
considerably since then, with many units now only using eRDC for data collection.
Forty-two of the 49 (86%) UK noncommercial trials units that applied for registration
status in 2017, i.e., most of the university-based trials units in the country, explicitly
mentioned using eRDC-based systems, even if they did not always indicate they
were using eRDC for every trial (personal communication, UKCRC, Leeds).
Furthermore, empirical studies have now confirmed some of the benefits claimed
for eRDC (Dillon et al. 2014; Blumenberg and Barros 2016; Fleischmann et al.
2017). Not all those benefits are relevant to single site studies, but even here the same
systems can be used, albeit normally within an intranet environment.
The main disadvantage of eRDC is that it demands that a large group of staff,
across the various clinical sites, are trained to use both the CDMS system and
specific CDMAs, and there is a greater level of general user management. A user-
initiated, automatic “forgotten password?” facility is therefore a nontrivial feature
of any CDMS, avoiding an otherwise inordinate amount of staff time spent
managing requests simply to reenter the system.
Where paper-based trials are still run, they use essentially the same system for
their data management, except that the CDMS’s end users will be in-house data entry
staff rather than clinical site staff. Paper-based trials are still used, for instance, in
areas where internet access is patchy or unreliable, but eRDC is now the default
workflow for collecting clinical site data. Participant questionnaires (e.g., on quality
of life measures) have traditionally been collected on paper and then input centrally,
though in recent years there has been much interest in replacing these with ePRO
(electronic patient reported outcomes) systems, e.g., using smart phones, that can
connect directly to a CDMS. A review is provided by Yeomans (2014), though some
potential problems with ePRO, from a regulatory compliance perspective, are
highlighted by Walker (2016).
Deployment Options and Implications
Traditionally, a CDMS would be installed and run directly by the trials unit or CRO,
with hardware in server rooms within the trials unit’s own premises or at least under
their direct control. That scenario allows the unit to have complete control over their
systems and infrastructure, making it much easier to ensure that everything is run
according to specified procedures and that all staff understand the specialized
requirements of clinical trials systems.
This can be a relatively expensive arrangement, however, and may not sit well
with the centralizing tendencies of some larger organizations. It is also sometimes
difficult to retain the specialist IT staff required. It has therefore become increasingly
common to find the CDMS hosted in the central IT department of a hospital,
university, or company. The trials unit staff still directly access the CDMS across
the local network and can develop study definitions as well as oversee data man-
agement. They often also access and manage the linked databases, but data security,
server updates, and other aspects of IT housekeeping are carried out by “central IT.”
The servers are provided as PaaS, or “platform as a service,” i.e., they are set up to
carry out designated functions, as database or web servers, and the customer, the
department managing the trials, manages those functions (see Fig. 2).
This arrangement may be more efficient, but it does require that all parties are
very clear about who does what and that clear communication channels are in place.
From the point of view of trial management, the central IT department is an
additional subcontractor supporting the trial. It shifts the day-to-day responsibility
of many IT tasks (backups, server updates, firewall configuration, maintaining anti-
malware systems, user access control, etc.) out of the trials unit, but it does not
change the fundamental responsibility of the unit, acting on behalf of the sponsor, to
assure itself that those tasks are being carried out properly.
As stressed by the quality standards on data and IT management established by
ECRIN, the European Clinical Research Infrastructure Network (ECRIN 2020), this
oversight is not a “one-off” exercise – the requirement is for continuous monitoring
and transparent reporting of changes (Canham et al. 2018). For example, trials unit
staff do not need to know the details of how data is backed up but should receive
regular reports on the success or otherwise of backup procedures. They do not need
to know the details of how servers are kept up to date, or logical security maintained
through firewalls, but they do need to be satisfied that these processes are happening,
are controlled and documented, and that any issues (e.g., security breaches) are
reported and dealt with appropriately.
This problem, of “quality management in the supply chain,” becomes even more
acute when considering the increasingly popular option of external CDMS hosting.
In this scenario, the CDMS is managed by a completely different organization –
most often the CDMS vendor. The trials unit staff now access the system remotely to
carry out their study design and data management functions, with the external system
presenting the CDMS to the unit as “software as a service” or SaaS. This scenario is
popular with many system vendors, because it allows them to expand their business
model beyond simple licensing to include hosting services, and in many cases offer
additional consultancy services to help design and build study systems. It also means
that they only have to support a single version of their product at any one time, which
can reduce costs. In fact, some CDMS vendors now insist on this configuration, and
only make their system available as SaaS.
But in many cases, the delegation chain is extended still further, as shown in Fig. 3,
because the software vendor may not physically host the system on its own servers,
instead subcontracting hosting to a third-party cloud provider. Even so, the SaaS
model has several attractions:
• It provides a very good way for trials units and CROs to experiment with different
CDMSs, without the costs and demands of installing and validating them locally.
• The burden of validating the CDMS is transferred to the organization controlling
its installation, usually the software vendor. The trials unit/CRO still needs to
satisfy itself that such validation is adequate, but that is cheaper and quicker than
doing it themselves.
• It empowers sponsors, who have greater ability to insist that a particular CDMS
system is used, regardless of who is carrying out the data management.
• It removes any suspicion that the trials unit or CRO, and through them the
sponsor, can secretly manipulate the data – data management is always through
the CDMS’ user interface, where all actions can be audited, and never by direct
manipulation of the data in the database.
Sponsors, researchers, and trial teams, therefore, need to ensure that when
functionality is subcontracted, these issues have been dealt with, so that they are
confident that appropriate oversight is taking place all the way down the supply
chain. This need not involve detailed scrutiny, but it does mean selecting and
building up relationships and trust with a specialist trials unit or CRO, and being
confident that they have not just good technical systems available but also a
comprehensive quality management system in place.
Constructing the Study Definition
Overview
Whoever is managing and monitoring the CDMS and its underlying IT infrastruc-
ture, there is no doubt that the study-specific part of the study data system, the “study
definition” or CDMA, is the responsibility of the study management team. The team
aspect is important – even though the sponsor retains overall responsibility, success-
ful development of a CDMA requires expertise and input from a wide variety of
people: investigators, statisticians, study managers, data and IT staff, quality man-
agers and site-based end users.
The process of creating a CDMA is summarized in Fig. 4. It has two distinct and
clearly defined phases – development and validation – both of which involve
iterative loops. The development phase takes the protocol and the data management
plan as the main input documents and creates a full functional specification for the
CDMA. It does so by organizing input from the various users of the system and
consumers of the data, and iteratively developing the specification until all involved
are happy that it will meet their requirements. Very often prototype systems are built
against the developing specification to make the review process easier but, in any
case, a system built to match the approved specification must be available at the end
of the development phase. The validation phase takes that system and checks,
systematically and in detail, that it does indeed match the agreed study definition.
Once that check is complete, the CDMA can be released for use.
Both phases should be terminated by a formal and clearly documented approval
process. At the end of the first, development, phase, all those involved in creating the
specification should sign to indicate that they are happy with it – the result should
therefore be a multidisciplinary and dated sign-off sheet for the specification. At the
end of the second, validation, phase, someone (often a data manager or operational
manager) needs to sign to indicate that the validation has been successfully com-
pleted and that the system can be released.
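Parts of the validation phase can often be scripted. The sketch below shows the shape of an automated test confirming that the built system enforces a logic check from the approved specification; it assumes a pytest-style runner and reuses the hypothetical visit-date check sketched earlier. In practice, validation is usually a mix of documented manual test cases and automated scripts.

```python
# Hypothetical automated validation test: confirm that the built system
# enforces a logic check written into the approved study definition.
from datetime import date

def check_visit_date(visit_date: date, entry_date: date) -> list:
    # Stand-in for the check as implemented in the CDMA under test.
    return ["future visit date"] if visit_date > entry_date else []

def test_future_visit_date_is_flagged():
    assert check_visit_date(date(2026, 1, 2), date(2026, 1, 1)) != []

def test_valid_visit_date_passes():
    assert check_visit_date(date(2026, 1, 1), date(2026, 1, 1)) == []
```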
Working from the Protocol
CDMA development has to start with the study protocol, because that document
specifies the key outcome and safety measures to be captured, implying the individ-
ual data points that need to be collected, and the assessment schedule that determines
when the data should be collected.
One way to convert a protocol into a study definition would be to simply ask the
investigator and/or statistician to specify the data points needed to carry out the
required analyses, either by setting out a formal set of analysis data requirements, or
more simply by annotating the protocol document. Perhaps because the time avail-
able to both investigators and statisticians is often limited, neither approach seems
very common, though anecdotal evidence suggests that when it is used, it can be
very effective. What often happens, in practice, is that experienced data management
staff take the protocol and, sometimes using previous trials that have covered similar
topics, construct either a spreadsheet with the data items listed, or mock paper CRFs,
annotated with additional details such as the data item type or range limits for values,
or – very often – both, with the spreadsheet providing more details than can be easily
shown on an annotated paper form. These are then presented for review to the
multidisciplinary study management team.
The use of mock paper CRFs is undoubtedly effective, not least because most
people find it easier to review a paper CRF than a series of screens or a spreadsheet,
especially in the context of a meeting. It can, however, increase the danger of
Fig. 4 The workflow for CDMA development. There are two iterative loops: the first, usually
longer, results in the approval of a functional specification and the construction of a prototype
system, and the second results in the approval of that system for production use once it has been
validated. (Adapted from Canham et al. 2018)
collecting data that is not strictly required to answer the questions posed by a study,
but which is included only because it was part of a previous, similar study, or
because there is a vague feeling that it might be “possibly useful one day.”
Collecting too much data in this way runs counter to data minimization, an
important principle of good practice emphasized in the EU’s General Data Protection
Regulation (GDPR): “Personal data must be adequate, relevant and
limited to what is necessary in relation to the purposes for which those data are
processed” (GDPR Rec.39; Art. 5(1)(c)) (Eur-Lex 2020). At least within the EU,
collecting unnecessary data may therefore be illegal as well as unethical.
There are circumstances where data can be legitimately collected for purposes
other than answering the immediate research question – for instance, to obtain a
disease-specific “core dataset,” to be integrated with similar datasets from other
sources in the future. But if that is the case, it should be explicitly mentioned within
study information sheets, so that a participant’s consent is fully informed and
encompasses the collection of such data.
One effective way of reducing the risk of collecting unused or unusable data is to
ensure the study statistician reviews the CDMA’s data points towards the end of the
development process. Far better for spurious data points to be removed before the
study begins, rather than being collected, checked, and queried, only for the statis-
tician – after they receive the extracted dataset – to protest that they would never
make use of that data.
A second document that feeds into CDMA design is the Data Management Plan,
or DMP. All trials should have such a plan, either as a section within the Trial Master
File (TMF) or as a separate document referenced from the TMF. Although a trials
unit or CRO would be expected to have a set of generic SOPs covering different
aspects of data management, there will almost always be study specific aspects of
data management that need planning and recording, and these should be described
within the DMP, which therefore forms part of the input to the design process.
A key aspect of study design is the balance between different methods of ensuring
data quality. Modern CDMS can include sophisticated mechanisms for checking
data, allowing complex and conditional comparisons between multiple data items on
different eCRFs and study events, for instance, to check for consistency between
visits, plausible changes in key values, and adherence to schedules. The fact that
these complex checks can be designed, however, does not necessarily mean that they
should always be implemented. The more complex a check, the more difficult it is to
implement and the harder it is to validate. There is also a possibility that data entry
may become over-interrupted, and take too long, if the system flags too many
possible queries during the input process.
The alternative to checking data as it is input is to check it afterwards, by
exporting the data and analyzing it using statistical scripts. For complex checks,
this has several advantages:
• It usually allows simpler, more transparent design of the checks than the often
convoluted syntax required within a CDMS.
• It is easier for the checks to be reviewed, e.g., by another statistician, and
validated.
• It gives the statisticians, as the consumers of the data, greater knowledge of and
confidence in the checks that have been applied.
• Most importantly, it allows checks to be made across the study subject popula-
tion, for instance, when identifying outliers (CDMS based checks can only
usually be applied within a single individual’s data).
The final point, coupled with the need for statistical monitoring to compare site
performance (to help manage a risk based monitoring scheme), means that some
level of “central statistical monitoring” of data quality is almost always required (an
exploration of the use of central statistical monitoring is provided by Kirkwood et al.
2013). The question is how far it should be extended to include the checks that might
otherwise be designed into the study definition. Clearly the availability of statistical
resource to help design (if not necessarily run) the checks will influence the approach
taken. There is also the issue that queries discovered using statistical methods need
to be fed back into the CDMS so that they can be transmitted to the sites, and
few CDMS allow this to be done automatically. Whatever the decision of the study
management team, it should be documented as part of the Data Management Plan
and then taken into account during the development of the functional specification.
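As an illustration of the kind of cross-participant check that is awkward to express inside a CDMS but straightforward in a script, the following minimal sketch (in Python with pandas) flags study-wide outliers and summarizes values by site. The file name, column names, and threshold are hypothetical, not taken from this chapter.

import pandas as pd

# Load an extracted data file (hypothetical name; assumed columns:
# site_id, subject_id, sbp).
df = pd.read_csv("vitals_extract.csv")

# Cross-participant outlier check: flag values far from the study-wide
# median, using a robust z-score based on the median absolute deviation
# (assumes the MAD is non-zero for this variable).
median = df["sbp"].median()
mad = (df["sbp"] - median).abs().median()
df["robust_z"] = (df["sbp"] - median) / (1.4826 * mad)
outliers = df[df["robust_z"].abs() > 3.5]
print(outliers[["site_id", "subject_id", "sbp"]])

# Site-level summary to support central statistical monitoring,
# e.g., spotting a site whose distribution differs from the others.
print(df.groupby("site_id")["sbp"].agg(["count", "mean", "std"]))

Any queries such a script raises still have to be entered into the CDMS, as noted above, before they can be transmitted to the sites.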
The phrase “User Acceptance Testing” has been put into quotes to emphasize that it
is such a misleading and potentially confusing phrase that it really should be
avoided. There are three significant problems with the term:
• Different people refer to different groups as “users.” IT staff often, but not
consistently, refer to data management staff as the users, but the data management
staff usually mean the end users of the system at the clinical sites.
• Whoever makes the final acceptance decision for a system, it is almost certainly
not the users – it is more likely to be a sponsor’s representative, the study project
manager, or the unit’s operational manager.
• Users, especially end users at the clinical site, rarely test anything. They may
inspect and return useful feedback – about eCRF design, misleading captions,
illogical ordering of data items, etc., but they can rarely be persuaded to system-
atically test a system, and one would not normally expect them to do so.
Having said all of that, “input from site-based users” can often be a useful thing to
factor into the end of the development phase. The system obviously needs to be up
and running and available to external staff, and system development should be in its
final stages – there is little point in asking end users to comment on anything other
than an almost completed system. Normally only a small subset of site-based staff
should be asked to comment, drawn from those that can be expected to provide
feedback in a timely fashion. Such feedback is best kept relatively informal – emails
listing queries and the issues found are usually sufficient.
The decision to use end-user feedback should be risk based. A simple CDMA that
is deployed only to sites that already have experience of very similar trials will have
little or no need for additional feedback from end users. But a CDMA that includes
novel features or patterns of data collection could benefit from site-user feedback. If
new sites are being used for a trial, especially if they are from a different country or
language group, then user feedback can be very informative in clarifying how the
eCRFs will be interpreted and in identifying potential problems.
The key point is that this is late input into the design and development phase
– it is not “testing.” It is obtaining feedback from the final group of stakeholders.
The difficulty that many trials units have is that they seek end-user feedback at
the same time as beginning the testing and validation phase. They sign off the
design as approved and start to test the system. Then feedback from end users
arrives and results in changes (most commonly over design issues but more
fundamental changes to data items may also be required) and then they have to
start the testing again. The work, and its documentation, expands and risks
becoming muddled. One of the basic principles of any validation exercise is
that it must be against a fixed target – hence the need to garner all comments,
from all stakeholders, and complete the entire design and development process
before validation begins.
One of the ways of making a study data system easier and quicker to develop, and of
making the resulting system and the data exported from it easier to understand, is to
establish conventions for the naming and coding of data items, and to stipulate
particular “controlled vocabularies” in categorized responses. That can provide a
consistency to the data items that can be useful for end users and a consistency to the
data that can be useful for statisticians.
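As a minimal illustration of what such conventions might look like in practice, the fragment below defines a reusable controlled vocabulary and two data-dictionary entries in Python. All names, codes, and range limits are invented for this sketch.

# A controlled vocabulary defined once and reused across eCRFs, so the
# same codes always carry the same meaning in the exported data.
YES_NO_UNKNOWN = {1: "Yes", 0: "No", 99: "Unknown"}

# Data-dictionary entries following a unit-wide naming convention
# (domain prefix plus short name), with type and range limits recorded.
DATA_ITEMS = {
    "VS_SBP": {
        "label": "Systolic blood pressure (mmHg)",
        "type": "integer",
        "range": (60, 260),
    },
    "MH_DIAB": {
        "label": "History of diabetes",
        "type": "coded",
        "vocabulary": YES_NO_UNKNOWN,
    },
}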
Consistency can be extended into a “house style” for the eCRFs, with a standard
approach to orientation, colors, fonts, graphics, positioning, etc. (so far as the CDMS
allows variation in these) and to the "headers" of the screens, which usually contain
administrative rather than clinical data (e.g., study/visit/form name). This simply
makes it easier for users to navigate through the system and more easily interpret
each screen, and to transfer experience gained in one study to the next.
Establishing conventions for data items can provide greater consistency within a
single trials unit, but at a time when there is increased pressure for clinical
researchers to make data (suitably de-identified) available to others, the real value
comes from making use of global, internationally recognized standards and con-
ventions, which allow data to be compared and/or aggregated much more easily
across studies.
Fortunately, a suite of such global standards already exists. These are the various
standards developed by CDISC, the Clinical Data Interchange Standards Consor-
tium. The key CDISC standards in this context are CDASH (CDISC 2020b), from
the Clinical Data Acquisition Standards Harmonization project, and the TA or
Therapeutic Area standards (CDISC 2020c). Both are currently used much more
within the pharmaceutical industry than the noncommercial sector. The FDA, in the
USA, and the PMDA, in Japan (though not yet the EMA in the EU), have stipulated
that data submitted in pursuance of a marketing authorization must use CDISC’s
Study Data Tabulation Model (SDTM), a standard designed to provide a consistent
structure to submission datasets. Creating SDTM structured data is far easier if the
original data has been collected using CDASH, which is designed to support and
map across to the submission standard.
Trials units in the noncommercial sector do not generally need to create and
document SDTM files, and consequently have been less interested in using CDASH,
although many academic units have experimented with using parts of the system.
The system is relatively simple conceptually, but it is comprehensive and growing,
and it does require an initial investment of time to appreciate the full breadth of data
items that are available and how they can be used. The nature and use of data
standards are treated in more detail in the chapter on the long-term management of
data and secondary use. The key takeaway for now is that an evaluation of CDASH
and its potential use within study designs is highly recommended.
Along with the CDISC standards and terminology, other “controlled vocabular-
ies” can also help to standardize trial data. For the coding of adverse events and
serious adverse events, the MedDRA system (MedDRA 2020) is a de facto standard.
Drugs can also be classified in various ways, though the ATC (Anatomical
Therapeutic Chemical) scheme is the best known (WHO 2020). Other systems
include the WHO’s ICD for disease classification, and MESH and SNOMED CT
for more general medical vocabularies, though in general, the larger the vocabulary
system, the more difficult it is to both integrate it with a CDMS and ensure that staff
can use it accurately.
MedDRA is the most widely used of all these systems and is, for example,
mandated for serious adverse event reporting in the EU. Its effective use requires
training, however, and a variety of study specific decisions need to be considered and
documented (in the DMP). For instance, what version of MedDRA should be used
(the system is updated twice a year) and how should upgrades, if applied, be
managed? How should composite adverse event reports (“vomiting and diarrhea,”
“head cold and coughing”) be coded? Probably most critically, which of the higher
order categories, used for summarizing and reporting the adverse events, should be
used when categorizing lower level terms? MedDRA is not a simple hierarchy, and a
lower level term can often be classified in different ways. A “hot flush” (or “flash”)
can be related to the menopause, hyperthyroidism, opioid withdrawal, anxiety, and
TB, among other things – so how should it be classified? The answer will normally
depend on the trial and the participant population, but where there is possible
ambiguity, a documented decision needs to be taken, so coding staff have the
required guidance. Such ambiguity is also the reason why MedDRA auto-coding
systems should be treated with caution, unless they can be configured or overridden
when necessary.
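The sketch below illustrates one way such documented coding decisions might be operationalized: composite reports are split into individual terms, and verbatim terms with more than one permitted classification are routed for manual review rather than auto-coded. The verbatim terms and category labels are purely illustrative, and no real MedDRA codes are used.

import re

# Study-specific coding decisions, documented in the DMP: verbatim
# terms mapped to an agreed higher-order category (labels illustrative).
CODING_DECISIONS = {
    "vomiting": "Gastrointestinal disorders",
    "diarrhea": "Gastrointestinal disorders",
}

# Terms agreed to be ambiguous for this study population: never
# auto-code these; send them to a trained coder instead.
AMBIGUOUS = {"hot flush", "hot flash"}

def code_report(verbatim: str):
    # Split composite reports such as "vomiting and diarrhea".
    terms = [t.strip().lower()
             for t in re.split(r"\band\b|,|;", verbatim) if t.strip()]
    coded, manual = [], []
    for term in terms:
        if term in AMBIGUOUS or term not in CODING_DECISIONS:
            manual.append(term)  # needs human review
        else:
            coded.append((term, CODING_DECISIONS[term]))
    return coded, manual

print(code_report("Vomiting and diarrhea"))
print(code_report("hot flush"))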
Once the functional specification has been approved, the prototype that has been
built upon it needs to be validated. During the build, or during successive pro-
totypes if that has been the approach taken, the basic functionality will have been
tested by the staff creating the system, but this is normally an informal process and
unlikely to have been documented. Validation, in contrast, requires a systematic,
detailed, and documented approach to testing all aspects of the system. It is
intended to verify that:
• The build has been implemented correctly – i.e., that it matches the specification,
and is therefore fit for purpose.
• The detailed logic built into the system, e.g., the consistency checks between data
items, or the production of derived values, works as expected.
Systems used for CDMA development and those used for CDMA production use
should be isolated from each other. Development and production environments have
very different user groups and (assuming there is no real data in the development/test
environment) different security requirements. The system should be developed on
machines specifically reserved for development and testing, and there should be no
possibility, however unlikely, of any problems in a developing CDMA spilling over
to affect any production system. Similarly, there should be no possibility of users,
including IT staff, inadvertently confusing development and production systems.
This allows the production servers to be kept in as simple and as “clean” a state as
possible, unencumbered by additional versions of the same study system, making
their management easier and providing additional reassurance that their validation
status is being maintained (see Fig. 5).
With the virtual machines that are now commonly used, “isolated from” means
logically isolated rather than necessarily on different physical hardware. That means
distinct URLs for the web-based components, distinct connection strings for data-
base servers, and different users and access control regimes on the different types of
server. It is also a good idea if development systems can be clearly marked as such on
screen (e.g., by the use of different colors and labels).
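A minimal sketch of how such logical isolation might be parameterized, with environment-specific URLs, connection strings, and an on-screen banner for non-production systems; all names and values are placeholders invented for illustration.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Environment:
    base_url: str          # distinct URL per environment
    db_dsn: str            # distinct database connection string
    banner: Optional[str]  # on-screen marker for non-production systems

ENVIRONMENTS = {
    "development": Environment(
        base_url="https://cdms-dev.example.org",
        db_dsn="postgresql://cdms_dev@db-dev.example.org/studydb",
        banner="DEVELOPMENT SYSTEM - NO REAL DATA",
    ),
    "production": Environment(
        base_url="https://cdms.example.org",
        db_dsn="postgresql://cdms@db-prod.example.org/studydb",
        banner=None,  # production screens carry no special marker
    ),
}

Keeping such parameters in one clearly separated structure (or in separate configuration files per environment) reduces the chance of users, or IT staff, inadvertently confusing the two systems.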
Note that if end-user input is required, so also is remote access to the development
and test system. In some IT infrastructures, this may be problematic and necessitate a
separate, third, testing environment, specifically for external access to a copy of the
developing system. This environment could therefore be used for the final stage of
system development, but once this was complete and the specification approved, it
would also be the obvious place in which to carry out validation.
This “Final Development” environment can also be useful for training purposes –
giving external users access to the fully validated system so that they can familiarize
themselves with it. Site-based users may want to input real data into the system
during this training phase, one reason why it should be under the same tight access
control within the IT infrastructure as the production system. The other benefit of a
secure testing/training environment is that the system can also be used for “backfill”
of data during revalidation exercises, as described in more detail in the section
“Change Management.”
Figure 5 illustrates such a combination of systems, but it is stressed that it is
only an example of many possible ways of arranging development, test, and
production environments. The optimum will depend on the server, security, and
access options available. If the initial development environment can handle exter-
nal users, then the A and B environments can be merged into one, as long as the
possibility of that environment having sensitive personal data in it is fully
considered.
It is possible to support training on a production server, by setting up a dummy
site within that system and initially only giving site-based users access to that site.
This can be simpler to manage than using a separate system, and it ensures that the training system will exactly match the production system's definition. It has the disadvantage, however, that all the data from the training "site" need to be excluded from the analysis dataset (during or after the extraction process).

Fig. 5 An example of one arrangement for development and production environments. The DB/web server combination in A is used for most of the development process but never contains any real data. The test environment in B can be linked to external users and is used to complete development. It can also support validation and later testing when backfill with real data may be useful. C is the "clean" production environment. Both B and C have tightly controlled access.
Change Management
Once the CDMA’s final specification has been approved, any further changes to that
specification will need to be considered within a formal change management pro-
cess, to ensure that all stakeholders are aware of proposed changes and can comment
on them, and that the changes are validated.
Any request for a change in the system should therefore be properly described and
authorized, so a paper or screen-based proforma needs to be available, to be
completed with the necessary specification and justification for the change. Changes
may be relatively trivial (a question caption clarified) or substantial (additional
eCRFs following a protocol amendment). Whoever initiates the change process,
staff should be delegated to assess the possible impacts of the change and identify
any risks that might be associated with it.
Risk-based assessment is the key to change management. The easiest way of
handling and documenting that process is to use a checklist that considers the
common types of potential impact. These can include:
• Impact on data currently in the database. Any change that dropped a data item or
a category from the system and “orphaned” existing data would not normally be
allowed and should be rejected. In fact, many CDMS would automatically block
such a change, though many do allow a field to be hidden or skipped within the
user interface.
Other changes may have less obvious consequences. For instance, if new options
are added to a drop-down to give a more accurate set of choices to the user, does
the existing data need to be reclassified? If so, how and by whom?
• Impact on validation checks and status. If a new consistency test is added, how
can the existing data be tested against it? If a new data item is added, does it need
new consistency checks to be run against other data? Detailed mechanisms are
likely to be system dependent but need to be considered and the resulting actions
planned.
• Impact on data extraction. In many cases, extraction will use built-in automatic
mechanisms, but if any processing/scripts are used within the extraction process,
will they be affected by the change? If additional fields are added, will that data
appear in the extracted datasets?
• Impact on metadata. A metadata file, or at least a “data dictionary,” should always
be available, for instance, to support analysis. Any change will render the current
metadata out-of-date and require the production of a new version.
• Impact on analysis. If the statistician has already rehearsed aspects of analysis and
has the relevant scripts prepared, how will any additional data items be included?
Will any hidden, unnecessary fields still be processed?
For example, changing an item's data type could make existing data invalid and is not normally allowed, or even possible, in many CDMSs. If it transpires that an integer field needs to hold fractional values, and thus must be changed into a real number field, it may therefore be necessary to add a new real field and hide or skip the original integer one. The database ends up with two fields holding data for the same variable, meaning that the statistician needs to combine them during the analysis (a minimal sketch of such a combination follows this list).
• Impact on site-based end users. The staff inputting the data need to be informed
of any change and its implications. When and how?
• Impact on system documentation and training. For substantial changes, simply informing end users is unlikely to be enough. Study-specific documentation and training may also need changing.
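Following on from the data-type example above, here is a minimal sketch of how the two fields might be combined at analysis time; the field names are invented, and pandas is assumed as the analysis tool.

import pandas as pd

# Extracted eCRF table in which "weight_int" was hidden after the
# change and "weight_real" collects new values (hypothetical names).
df = pd.DataFrame({
    "subject_id": ["001", "002", "003"],
    "weight_int": [72, 85, None],       # legacy integer field
    "weight_real": [None, None, 68.4],  # replacement real-number field
})

# Coalesce: prefer the new field, fall back to the legacy one.
df["weight"] = df["weight_real"].fillna(df["weight_int"])
print(df[["subject_id", "weight"]])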
Considering possible risks in this systematic way provides a solid basis for
identifying and documenting the possible sequelae of a proposed change, deciding
if the change should be allowed and, if it is allowed, identifying the follow-on
actions that will be required. Those actions are likely to include testing of the revised
system, with the test results documented and retained. It also means that the change
management process needs to involve statisticians as well as trial managers, and the
IT and/or data management staff who usually implement the change. The key staff
involved should explicitly “sign-off” the change.
Substantial system changes often result from protocol amendments and cannot
be released into the production version of the system until those amendments have
been fully approved. It can sometimes happen, though it is relatively rare, that a
requested change implies a change in the protocol, even when it has not been
presented or recognized as such. This is another reason for the change manage-
ment process to include review by experienced staff (usually the trial manager and
statistician), or even the whole trial management team, to ensure that any need for
protocol amendment is recognized and acted upon before the change is
implemented. In other words, change management should never be seen as a
purely technical process.
Implementation of any change should always occur in all the environments being
used – i.e., in the development environment, in any intermediate test and training
environment, and finally in the production system. The flow of changes should be
unidirectional, with, in each case, a revised study definition exported to the destination
system. It can be tempting, for a trivial change, to shortcut this process and just (for
example) change a caption in the production system. But this then risks being over-
written back to the previous version, when, following a more substantial change
elsewhere in the system, a new study definition is imported from the development
environment.
The testing required will occur in the development and any test system. For some
changes, it may be considered more realistic, and therefore safer, to test against the
whole volume of existing data, rather than just the small amount of dummy data that
usually exists within development environments. Backfilling the test/development
server, or at least one of them if there are multiple development environments, with
the current set of real data can therefore be a useful way of checking the impact of
changes on the current system. This does depend, however, on the test server having
a similar level of access control as the production system, otherwise there is a risk
that sensitive personal data is exposed more widely than it should be, and that
nonspecialist staff are unfairly exposed to sensitive data.
A coherent and consistent versioning system can help to support any change management process. All versions of the study definition should be clearly labelled and differentiated, for instance, by adapting the three-part "semantic versioning" system used for software (Semver 2020). In this scheme, a version number has three components, major.minor.patch, which could be incremented, respectively, for substantial changes (such as new eCRFs following a protocol amendment), smaller functional changes (such as a new consistency check), and trivial corrections (such as a clarified caption).
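A minimal sketch of how such a scheme might be implemented for study definition versions, using the interpretation of the three components suggested above; the change-type names are invented for this illustration.

from typing import Tuple

def bump(version: Tuple[int, int, int], change: str) -> Tuple[int, int, int]:
    """Return the next study-definition version for a given change type.

    change: "substantial" (e.g., new eCRFs after a protocol amendment),
            "functional"  (e.g., a new consistency check), or
            "trivial"     (e.g., a corrected caption).
    """
    major, minor, patch = version
    if change == "substantial":
        return (major + 1, 0, 0)
    if change == "functional":
        return (major, minor + 1, 0)
    if change == "trivial":
        return (major, minor, patch + 1)
    raise ValueError(f"unknown change type: {change}")

print(bump((2, 3, 1), "substantial"))  # -> (3, 0, 0)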
At the end of a trial, the data needs to be extracted for analysis, usually in a generic
format (csv files, CDISC ODM) or one tailored to a particular statistical package (e.g., SAS, Stata, SPSS, or R). Because most statistics packages can read csv or similar
text files, the ability to generate such files accurately is the key requirement.
Data extractions can take place before this of course, e.g., for interim safety
analysis by a data monitoring committee, for central statistical monitoring, and to
support risk-based monitoring decisions. In the noncommercial sector, trials may
also be extended into long-term follow-up, so that data is periodically extracted and
analyzed long after the primary analysis has been done and the associated papers
published.
The extraction process, especially when supplying data for the main analysis,
needs to be controlled and documented. An SOP should be in place outlining roles,
responsibilities, and the records required, often supported by a checklist that can be
used to document the readiness of the database for extraction. The checklist should
confirm that:
• All data has been signed off as correct by the principal investigators at sites.
• Serious adverse event data (transmitted via expedited reporting) has been recon-
ciled with the same data transmitted through standard data collection using eCRFs.
Any exceptions to any of the above should be documented. Most CDMS include
a “lock” facility which prevents data being added or edited, and this can be applied at
different levels of granularity, e.g., from an individual eCRF, to a whole participant,
to a clinical site, to the whole study. Once the issues listed above have been checked,
one would expect the entire database to be locked (with any later amendments to the
data rigorously controlled by an unlocking/relocking procedure which clearly
explains why the data amendments were necessary).
The extraction process results in a series of files, with the data items in each file traditionally matching a source eCRF, or a repeating question group within a CRF.
Although the data appears to be directly derived from the eCRFs, the extraction usually
requires a major transformation of the data, because in most cases, the data is stored
quite differently within the CDMS database. Internally most systems use what is called
an entity-attribute-value (EAV) model, with one data row for each data item, and often
with all the data, from all subjects, visits, and eCRFs, stored in the same table.
The EAV structure is necessary to efficiently capture the audit data that is a
regulatory requirement, to more easily support various data management functions
like querying, and to provide the flexibility that enables a single system to store the
data from different studies, each with a wide variety of eCRF designs. It is almost
never evident to the end users, who instead see the data points neatly arranged within
each eCRF, the system consulting the relevant study definition to construct the
screen and place the data items within it as required.
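To make the restructuring concrete, the sketch below pivots a toy EAV table into the one-table-per-eCRF layout that statisticians receive; the form, item, and column names are invented for illustration, and pandas is assumed.

import pandas as pd

# Toy EAV store: one row per data item, with all subjects, visits,
# and forms held together in the same table.
eav = pd.DataFrame({
    "subject_id": ["001", "001", "002", "002"],
    "visit":      ["baseline"] * 4,
    "form":       ["vitals"] * 4,
    "item":       ["sbp", "pulse", "sbp", "pulse"],
    "value":      ["132", "71", "118", "64"],
})

# Extraction: restructure to one table per eCRF, one column per item,
# leaving audit and status columns (not shown here) behind.
vitals = (
    eav[eav["form"] == "vitals"]
    .pivot(index=["subject_id", "visit"], columns="item", values="value")
    .reset_index()
)
print(vitals)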
When the data is extracted, the audit and status data for each item is usually left
behind, and the data is completely restructured as a table per eCRF or repeating
group as described above. This underlines the need for the validation of data
extraction, because not only is the output data central to the research, the process
by which it is created is complex. Extraction mechanisms will usually be tested
within the initial validation of the CDMS, but this often involves just a very small
data load from a simple test CDMA. Extractions from real CDMAs should undergo a
risk-based assessment of the need for additional, study specific, validation. The
validation does not usually need to be extensive or burdensome, but it is worth
checking (and documenting) that, for instance:
As more extractions are performed and checked, the level of confidence in the
system will grow, and the need for validation can become less, especially if a trial is similar to ones that have gone before. Data from sources outside the CDMS may also need to be added to the extracted datasets, for example:
• From collaborators: Although sometimes such data may be imported into the
CDMS, more often it will be imported by aggregating extracted records. Care
must be taken that the extractions are fully compatible.
• From treatment allocation records: Up to this point, this data may have been kept
separately to preserve blinding.
• From laboratories: It is usually simpler to add data from external laboratories at
this stage rather than trying to import it into the CDMS, but this is a study-specific
decision, and may depend on lab preference and the need to carry out range and
consistency checks on the data.
• From coding tools, because in some trials units and CROs, coding is done on the
extracted data rather than within the CDMS.
Exactly how this data is aggregated with that from the CDMS should be planned
and documented within the data management plan. It is important that a description
of the newly combined data is included within the metadata documents for the study,
so in these cases, if metadata is normally generated by the CDMS, it will need to be
supplemented by additional documents.
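A minimal sketch of such an aggregation step, merging a hypothetical laboratory file with an extracted eCRF table on subject identifier and visit; the file and column names are invented, and any real implementation would also need the compatibility checks described above.

import pandas as pd

# Extracted CDMS data and an external laboratory file (hypothetical).
crf = pd.read_csv("extract_vitals.csv")   # subject_id, visit, sbp, ...
labs = pd.read_csv("external_labs.csv")   # subject_id, visit, hba1c, ...

# Left-join so every CDMS record is kept; validate="one_to_one" makes
# pandas raise an error if the keys are unexpectedly duplicated.
combined = crf.merge(labs, on=["subject_id", "visit"],
                     how="left", validate="one_to_one")

# Document what arrived: records with no matching lab result should be
# reported back, not silently ignored.
missing = combined[combined["hba1c"].isna()]
print(f"{len(missing)} records lack lab data")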
The final analysis dataset, comprising the data from the CDMS and any additional
material integrated with it, needs to be safely retained. This is partly for audit or
inspection purposes, and partly to allow the reconstruction of any analysis using the
same extracted data, if that is ever required. In practice, it can be done by adding the
analysis dataset, in a folder, clearly labelled and date stamped according to an agreed
convention, into a read-only area of the local file system. A group (usually the IT staff,
who are deemed to be uninterested in the data content) has to have write privileges on
this area for the data to be loaded, but all other users, including the statisticians who
need to analyze the data, must take copies of the files if they wish to work on them.
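One minimal way to apply such a convention on a local file system, using only Python's standard library; the archive path and folder naming convention are illustrative only.

import shutil
import stat
from datetime import date
from pathlib import Path

ARCHIVE_ROOT = Path("/data/trials/STUDY123/analysis_datasets")  # illustrative

def retain_dataset(source_dir: str) -> Path:
    """Copy an analysis dataset into a date-stamped, read-only folder."""
    dest = ARCHIVE_ROOT / f"extract_{date.today().isoformat()}"
    shutil.copytree(source_dir, dest)
    # Make every retained file read-only for owner, group, and others.
    for path in dest.rglob("*"):
        if path.is_file():
            path.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return dest

In practice, the real protection rests on the file system's access control (only the designated group holding write privileges); a script like this merely applies the naming and permission conventions consistently.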
Though obviously not part of the CDMS, the procedures and infrastructure required
to implement the safe storage of the output data, so it is protected from accidental
modification, as well as any suspicion of intentional edit, are an important part of the
total study data system. They form the final link in the chain that begins with the study
protocol, stretches through system design, definition, and testing, moves on to months
or years of data collection, with maximization of data quality, and finally ends with the
primary function of the system – the delivery of data for analysis.
A study data system is centered around a specialist software tool – the Clinical Data
Management System or CDMS – that provides the core functionality required to
guarantee the regulatory compliance of data collection, plus the flexibility needed to
support a wide range of different study designs and data requirements. CDMSs or,
increasingly, externally hosted CDMS services, are usually purchased from special-
ist vendors. The CDMS is the core component but by no means the only one:
supporting sub-systems, e.g., for coding, file storage, backup, and metadata produc-
tion, may also be involved. The “system” also includes the competences of the staff
that operate it and, crucially, the set of policies and procedures that govern workflow.
It is these policies, more than the technical infrastructure, which determine the
quality of any study data system.
Procedures are especially important for supporting the workflows around devel-
oping and then validating the systems constructed for individual studies, ensuring
that these activities are done in a consistent, clear, reliable, and well-documented
fashion. They are also key to the systematic consideration and application of (for
example) data standards, systems for managing data quality, procedures for change
management, import and aggregation of externally derived data, preparation for data
extraction, and the extraction process itself.
The data flow of modern study data systems is now dominated by a web-based
approach (eRDC) that removes the need to install anything at the clinical site or
provide additional hardware, as was the case in the past. Over the last 20 years,
eRDC has almost entirely supplanted traditional paper based data transfer. There is
growing interest in extending this approach directly to the study participants, to capture data directly from them using smart phones or portable monitoring devices. The
major current trend in study data systems, however, is the growing use of externally
hosted systems, so that the coordinating center or trials unit, as well as the clinical
sites, access the system through the internet. This approach can bring greater
flexibility and reduced costs, but it carries potential risks, for example, around
communication, responsiveness, and quality control. Developing the technical and
procedural mechanisms to better manage these risks is one of the biggest challenges
facing vendors and users of study data systems today.
Key Facts
1. The core software component of any study data system is a specialist tool known
as a Clinical Data Management System, or CDMS, usually purchased on a
commercial basis.
2. A web-based data management workflow known as eRDC, for electronic remote
data capture, is used in the great majority of clinical studies.
3. The data system consists not only of the CDMS software but also of the people managing it and the policies and procedures that govern workflows and data flows.
4. Increasingly, study data systems are provided remotely, as “software as a
service” or SaaS.
5. SaaS offers advantages (e.g., reduced system validation load) but can carry
risks. A range of communication problems have been identified in SaaS
environments.
6. A trials unit retains the overall responsibilities for safe, secure, and regulatory
compliant data management, as delegated from the sponsor, even when some of
the functions involved are subcontracted to other agencies. Its quality manage-
ment strategy therefore needs to include mechanisms for monitoring the work of
these subcontractors.
7. Developing a successful study data system for any specific study requires a clear
separation between the development of a detailed specification for the required
system, requiring input and agreement from all important stakeholders, and a
second validation step, requiring detailed, systematic testing of the completed
system.
8. The use of data standards can reduce system development time and increase the
potential scientific value of the data produced.
9. The production version of a study data system should be maintained separately from the development and/or training versions of the same system, and be accessed using different parameters.
10. Proposed changes in the study data system need to be managed using a clear and
consistent risk-based change management system.
Cross-References
▶ Data Capture, Data Management, and Quality Control; Single Versus Multicenter
Trials
▶ Documentation: Essential Documents and Standard Operating Procedures
▶ Long-Term Management of Data and Secondary Use
▶ Patient-Reported Outcomes
▶ Responsibilities and Management of the Clinical Coordinating Center
References
Blumenberg C, Barros A (2016) Electronic data collection in epidemiological research, the use of
REDCap in the Pelotas birth cohorts. Appl Clin Inform 7(3):672–681. https://doi.org/10.4338/
ACI-2016-02-RA-0028. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5052541/. Accessed
31 May 2020
Canham S, Bernalte Gasco A, Crocombe W et al (2018) Requirements for certification of ECRIN
data centres, with explanation and elaboration of standards, version 4.0. https://zenodo.org/
record/1240941#.Wzi3mPZFw-U. Accessed 31 May 2020
Capterra (2020) Clinical trial management software. https://www.capterra.com/clinical-trial-man
agement-software. Accessed 31 May 2020
CDISC (2020a) The operational data model (ODM) – XML. https://www.cdisc.org/standards/data-
exchange/odm. Accessed 31 May 2020
CDISC (2020b) Clinical data acquisition standards harmonization (CDASH). https://www.cdisc.
org/standards/foundational/cdash. Accessed 31 May 2020
CDISC (2020c) Therapeutic area standards. https://www.cdisc.org/standards/therapeutic-areas.
Accessed 31 May 2020
Dillon D, Pirie F, Rice S, Pomilla C, Sandhu M, Motala A, Young E, African Partnership for
Chronic Disease Research (APCDR) (2014) Open-source electronic data capture system offered
increased accuracy and cost-effectiveness compared with paper methods in Africa. J Clin
Epidemiol 67(12):1358–1363. https://doi.org/10.1016/j.jclinepi.2014.06.012. https://www.
ncbi.nlm.nih.gov/pmc/articles/PMC4271740/. Accessed 31 May 2020
ECRIN (2020) The European Clinical Research Infrastructure Network. http://ecrin.org/. Accessed
31 May 2020
El Emam K, Jonker E, Sampson M, Krleža-Jerić K, Neisa A (2009) The use of electronic data
capture tools in clinical trials: web-survey of 259 Canadian trials. J Med Internet Res 11(1):e8.
https://doi.org/10.2196/jmir.1120. http://www.jmir.org/2009/1/e8/. Accessed 31 May 2020
Eur-Lex (2020) The general data protection regulation. http://eur-lex.europa.eu/legal-content/en/TXT/?uri=CELEX%3A32016R0679. Accessed 31 May 2020
Fleischmann R, Decker A, Kraft A, Mai K, Schmidt S (2017) Mobile electronic versus paper case
report forms in clinical trials: a randomized controlled trial. BMC Med Res Methodol 17:153.
https://doi.org/10.1186/s12874-017-0429-y. Published online 2017 Dec 1. https://www.ncbi.
nlm.nih.gov/pmc/articles/PMC5709849/. Accessed 31 May 2020
Green J (2003) Realising the value proposition of EDC. Innovations in clinical trials. September
2003. 12–15. http://www.iptonline.com/articles/public/ICTTWO12NoPrint.pdf. Accessed 31
May 2020
Kirkwood A, Cox T, Hackshaw A (2013) Application of methods for central statistical monitoring
in clinical trials. Clin Trials 10:703–806. https://doi.org/10.1177/1740774513494504. https://
journals.sagepub.com/doi/10.1177/1740774513494504. Accessed 31 May 2020
MedDRA (2020) Medical dictionary for regulatory activities. https://www.meddra.org/. Accessed
31 May 2020
Mitchel J, You J, Lau A, Kim YJ (2001) Paper vs web, a tale of three trials. Applied clinical trials,
August 2001. https://www.targethealth.com/resources/paper-vs-web-a-tale-of-three-trials. Accessed
31 May 2020
OpenClinica (2020). https://www.openclinica.com/. Accessed 31 May 2020
RedCap (2020). https://www.project-redcap.org/. Accessed 31 May 2020
Semver (2020) Semantic versioning 2.0.0. https://semver.org/. Accessed 31 May 2020
Walker P (2016) ePRO – An inspector’s perspective. MHRA Inspectorate blog, 7 July 2016. https://
mhrainspectorate.blog.gov.uk/2016/07/07/epro-an-inspectors-perspective/. Accessed 31 May 2020
WHO (2020) The anatomical therapeutic chemical classification system, structure and principles.
https://www.whocc.no/atc/structure_and_principles/. Accessed 31 May 2020
Yeomans A (2014) The future of ePRO platforms. Applied clinical trials, 28 Jan 2014. http://www.appliedclinicaltrialsonline.com/future-epro-platforms?pageID=1. Accessed 31 May 2020
14 Implementing the Trial Protocol

Jamie B. Oughton and Amanda Lilley-Kelly
Contents

Introduction 240
Protocol Development 240
Site Selection, Feasibility, and Set Up 241
   Site Characteristics 241
   Timing of Site Activation 242
Registration/Randomization System 243
Data Collection 243
Training 243
Risk and Monitoring 246
   Trial Risk Assessment 247
   Trial Monitoring Plan 247
Trial Oversight 248
   Project Team 249
   Trial Management Group (TMG) 249
   Independent Data Monitoring Committee (IDMC) 249
   Independent Trial Steering Committee (TSC) 249
Trial Promotion 251
   Trial Website 251
   Social Media 252
   Press 253
   Investigator Meeting 253
Summary and Conclusion 253
Key Facts 253
Cross-References 254
References 254
Abstract
This chapter outlines the steps required to bring a protocol to life as a clinical trial.
Developing the protocol is a multidisciplinary effort which must be approached in
an ordered and logical way with clear leadership. Site feasibility assessment, when performed badly, is a common reason for trial failure, and good site selection is crucial to capture a generalizable population. The key site feasibility assessment issues are outlined.
The chapter goes on to give advice on data collection and how this should be
planned alongside developing the trial protocol. Trial processes must be ade-
quately described and trial staff trained well to maximize efficiency and minimize
error. Strategies to identify and mitigate risks to participant safety and trial
integrity are discussed along with techniques that can be implemented to monitor
the identified risks. Typical trial oversight groups and processes are provided to
reach a structure that is effective and proportionate to the level of trial risk.
Finally, suggestions for how to manage trial promotion to maximize engagement
with investigators, potential participants and other stakeholders are discussed.
Keywords
Protocol · Feasibility · Monitoring · Oversight · Publicity · Training program ·
Competency
Introduction
This chapter attempts to bridge the gap between the finalized protocol and patient
accrual. It outlines necessary considerations for identifying and selecting participat-
ing centers to optimize trial delivery. Similarly, it covers the vital task of risk
assessment and monitoring and the establishment of effective oversight bodies.
Finally, the topic of trial promotion is discussed, with suggestions based on the
needs and resources of a trial.
Protocol Development
The protocol is the most important document in a clinical trial and sufficient time and
expertise must be allocated to its development. The nature of the trial will dictate
which specialties will contribute but typically this will include: clinical, regulatory,
laboratory, statistical, operations, funder, and safety. A member of the coordinating
team, for example, the project manager, should take responsibility for making sure
each party has reviewed the protocol at the appropriate development stage. Omis-
sions or mistakes can be costly in time taken making amendments and therefore it is
useful to obtain a final review from a member outside of the immediate trial team.
The most appropriate protocol structure must be chosen at the start of develop-
ment. For example, is it reasonable to contain all the required information in one document?
Site Characteristics
The site must have sufficient staff (e.g., trial coordinators, research nurses, data
managers, trial pharmacists) to deliver the trial. For some trials, it may be appropriate
to recruit additional staff and time must be allowed for this. There must be sufficient
enthusiasm from management and the site investigators for the trial to succeed.
Sponsors should be sensitive to the motivations of a site to participate in research, be
they prestige, financial, or patient-driven. Where there are likely to be barriers to set
up or recruitment (for example, excess treatment costs in the UK), these should be
highlighted by the coordinating center at an early stage in set up and addressed with
appropriate guidance and mitigation.
For international trials, a range of issues may become relevant. Is the treat-
ment environment (e.g., political, financial, or environment) likely to remain
stable during the lifetime of the trial? Are there factors that will limit the
delivery of a trial? For example, there are financial incentives towards hospi-
tal-based interventions in the United States or lack of refrigeration facilities in
some settings for a vaccine trial. International trials also bring significant
complexities due to multiple regulatory requirements; for example, in the United
States, each recruiting institution may require separate ethics approval, although
there has been a recent change for trials sponsored by the National Institutes of
Health (NIH) where single IRBs are encouraged/required. More detailed infor-
mation on setting up international trials is available elsewhere (Minisman et al.
2012; Croft 2020).
Timing of Site Activation

Once the sites have been selected, there needs to be a systematic approach to site set
up based on the resource available. It may be necessary to take a phased approach
rather than seeking to open all sites at the same time. This is often necessary because
of finite resources to complete all necessary tasks, e.g., on-site training, intervention
availability, or temporary low capacity at site.
It is important for sponsors and trial management teams to keep up to date and engage with national set-up and approval schemes to ensure prompt set up. For example, in England, the Health
Research Authority (HRA) made changes to the way research is initiated in the
National Health Service (NHS). The approach now is that the HRA give permission
on behalf of the NHS, potentially making it faster for individual hospitals to
participate.
Trial-wide approvals from the ethics committee and/or regulatory authority must
have been received before a site is permitted to start recruiting participants. Sponsors
must have a robust process to ensure both site level and trial level approvals have all
been received.
Registration/Randomization System
Data Collection
All trials require a robust system for collecting the data in a format that is transparent
and suitable for the analysis. Clinical trial data is almost always collated in an
electronic database nowadays, and many trials also use electronic data capture
systems to collect data from participating sites. It is necessary to have a firm idea
of the data items required before developing the database, and this usually begins
once the protocol has been finalized. The data collection process should be mapped
out from the source data to the final analysis. The source of the data will dictate the
format of data collection tools. For example, laboratory results can occasionally be
provided electronically as a table to the sponsor, whereas a physical examination
may require a paper/electronic form to report the result. The complexity of the
database will be dictated by the sources of data and risk assessment. The database
and data collection instruments should be finalized and fully tested before opening to
recruitment.
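A sketch of the kind of source-to-analysis mapping the paragraph describes, recorded here as a simple Python structure; the items, sources, and formats are invented examples, not requirements from this chapter.

# Each data item traced from source to analysis, agreed before the
# database is built (all entries illustrative).
DATA_FLOW = [
    {"item": "haemoglobin", "source": "local laboratory",
     "collection": "electronic table transfer", "analysis_variable": "HB"},
    {"item": "blood pressure", "source": "physical examination",
     "collection": "eCRF", "analysis_variable": "SBP"},
    {"item": "quality of life", "source": "participant questionnaire",
     "collection": "paper form, entered at site", "analysis_variable": "QOL_SCORE"},
]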
Training
Fig. 1 Flow diagram showing the stages of a clinical trial (Collett et al. 2017)
Are there particular principles that need to be covered to ensure the trial is conducted appropriately? For
example, will the research team include clinical disciplines that are not often
involved in clinical research, such as allied health professionals (e.g., paramedics,
radiotherapists, or dieticians)? If so, should the training include an overview of Good
Clinical Practice (GCP) to underpin their participation in trial activities?
Dependent upon the trial, there may also be a need to consider additional training
requirements relevant to the trial population. For example, if participants will be
recruited from an older population that may have a high incidence of cognitive
impairment, it is important to provide an overview of any local legislation governing
participants with limited cognitive capacity and the process of informed consent
within a vulnerable population. Establishing expertise required is essential to the
development of the content of training; however, it is important to be mindful of
existing training programs available – balancing trial requirements with practical-
ities. It is often best practice to direct researchers to additional sources of information
that can cover complex topics in greater detail than required for specific trials.
Once the team delivering the trial and the expertise and knowledge required has
been defined, the content and structure of the training program can be developed. A
key consideration for the content of the training is the time available to support
training – how long will it take for each member of the team to be properly trained in
their elements of the trial? If the trial involves frontline clinical staff, there may be a
need to adapt training to accommodate other commitments. If there is a large team of
clinicians that would like to train together, what is an acceptable / practicable
duration? These considerations need to be balanced with the key components of
training, ensuring that topics are covered in order of priority, being mindful of the logical flow of information and the burden of training on the target audience.

Fig. 2 Key trial elements and associated training topics: trial overview (background, trial design, inclusion/exclusion criteria, and additional elements such as process evaluation or economic analysis); participant recruitment (systems and processes for screening, consent, and eligibility assessments); registration/randomisation; the intervention (development, schedule, type, and process); follow-up (timeline, i.e., number of contacts); and data collection
To establish a logical flow of information, it is often best to link back to
considerations of key trial elements (Fig. 2) and the relevant information around
these topics. An example of these and associated topics to include are outlined in
Fig. 2 – these broad topics should be tailored to include trial specifics and time
allocated dependent on content. It is also important to consider the expertise required
within the training team to deliver these sessions, potentially using changes in trainer
as natural breaks to avoid audience burden. The content of training often evolves as
the program develops, and it is beneficial to gain input from the wider trial team to
review content as it develops.
Methods of delivery for the training should be considered during development of
the training content, taking into account the audience and amenities available. Often a slideshow
is developed that can be adaptable for presentations where facilities exist or as
preprinted handouts if not. However, other options include online presentations
that could be delivered remotely (i.e., webinars/video blogs) and which also have
the added benefit of being reusable and readily available.
As part of the training program, materials need to be developed to support
training. For example, training slides, reference data collection instruments specific
to the trial, trial promotional materials, and site-specific documentation (i.e., Inves-
tigator Site File – ISF). Training development also often highlights procedures that
are more complex and require additional supporting information to ensure standard-
ized completion throughout the trial, in the form of a guidance manual or standard
operating procedure (SOP) with these materials provided as part of the training pack.
Once the training package is developed, it is important to consider how attendance
at training will be documented, and how robust this process needs to be. A signed and dated list of attendees is often sufficient.
Table 1 Risk stratification in noncommercial trials. (Adapted from the Brosteanu article)

Trial categories based on associated risk | Examples of types of clinical trials
Type A: No higher than the risk of standard medical care | Trials involving licensed products, or off-label use if this use is established practice
Type B: Somewhat higher than the risk of standard medical care | Trials involving licensed products if they are used for a different indication, for a substantial dosage modification, or in combinations where interactions are suspected
Type C: Markedly higher than the risk of standard medical care | Trials involving unlicensed products
Trial Risk Assessment

For every trial, there must be an attempt to identify the potential hazards and an
assessment of the likelihood of those hazards occurring and resulting in harm. Risks
fall into two main categories: those that affect patient safety and those that affect the
integrity of the trial. Appropriate control measures should be documented for each risk.
It may be appropriate to include key elements of the risk assessment as part of the trial
protocol so that all stakeholders are fully informed. Key risks to participants should be
explained in the patient information or consent form. The risks described in the patient
information should be presented in the context of the disease and standard treatment.
Generally only risks that are common (between 1/1 and 1/100) or thought to be
particularly serious should be detailed in the patient information. Patient groups are
valuable to ensure patient information is appropriate and directed towards the needs of
patients.
Trial Monitoring Plan

The risk assessment should then be used to develop the trial monitoring plan. The
level of monitoring will depend upon the level of risk and the resources available.
Monitoring falls into two categories: on-site source data verification and central
monitoring. Pivotal trials that will be used as evidence to support a marketing application, and phase I trials, will usually involve substantial on-site monitoring, whereas an interventional trial evaluating two interventions already used in standard care, or one where endpoints can be collected centrally from routine data, may require none. On-site monitoring should be complemented by centralized monitoring, where
the sponsor or delegate is provided with source data by the site (e.g., the laboratory
or imaging reports) in order to validate key endpoints.
On-site monitoring can be separated again into two categories: planned visits and
triggered visits.
Trial Oversight
Each oversight group should operate under agreed terms of reference. For a large research group, it may be efficient to review similar trials at the same meeting (i.e., review multiple trials using the same committee).
Project Team
The staff responsible for carrying out the day-to-day management of the trial should
meet to review key performance indicators from accumulating data. The group
would usually include the data manager, on-site monitor, trial manager, statisti-
cian/methodologist, and team leader.
Trial Management Group (TMG)

The Trial Management Group (TMG) consists of the project team with the addition of the clinical investigators, patient representatives, and sub-study collaborators, with the objective of reviewing a higher-level summary than the project team meetings. As the TMG is made up of the staff running or leading the trial, it is not independent.
Independent Data Monitoring Committee (IDMC)

The Independent Data Monitoring Committee (IDMC) periodically reviews accumulating summaries of data with the purpose of intervening in the interests of trial participants. Unlike the project team or the TMG, the IDMC may review data presented by study arm. The group consists of disease-specific experts and statisticians experienced with the trial design, none of whom have any involvement in delivering the actual trial.
To avoid any uncertainty, it is recommended that the trial team/sponsor prepare
explicit guidelines outlining how the IDMC should operate (Sydes et al. 2004b). The
use of an IDMC should be briefly described in the trial results. Notably, only 18% of 662 RCTs in a review done in 2000 did so (Sydes et al. 2004a). However, this is likely to have improved substantially over the 17 years since, with the advent of initiatives such as CONSORT (Hopewell et al. 2008), which aim to standardize reporting.
Trial Steering Committee (TSC)

The purpose of the Trial Steering Committee (TSC) is to take advice from the IDMC and make key decisions for the trial. The TSC usually has the power to terminate the trial or require other actions to protect participants.
The Medical Research Council in the UK has published guidelines for appropriate oversight structures (MRC 2017), and a recent survey has suggested widespread compliance among academic trials units in the UK (Conroy et al. 2015).
Members of the independent groups should generally not be involved with the trial in any way, should be from outside the investigator's institution, and should ideally have an excellent knowledge of the relevant disease area. For large multisite trials, it may be necessary to consider international colleagues or those recently retired from practice (Fig. 3).

Fig. 3 Example oversight structure: funder, sponsor, and project team (trial/data manager, trial monitor, statistician)
Trial Promotion
At the start of a project, and throughout delivery, it is important to consider the end
impact of the results. Trial publicity and dissemination of information are essential to support publication in high-impact journals and ensure future research for patient benefit. The trial team, including any oversight committees, should develop a promotional strategy, which could include a schedule of press releases to support key milestones (e.g., launch, participant recruitment, analysis) disseminated by organizations associated with the trial (e.g., co-applicants, charitable organizations, disease-specific groups).
Large multisite trials often develop a brand identity. This starts with having a
trial name that is an accessible shorthand for everyone to refer to the research trial
quickly and easily. Trials with acronyms were more likely to be cited than those
without (Stanbrook et al. 2006). Convention dictates that the name is an acronym
using the letters from the trial’s full title, ideally with some link to the subject area,
though this is not essential. Others have written about what can be humorously known as acronymogenesis (Fallowfield and Jenkins 2002; Cheng 2006); the essential advice is to avoid anything that could discourage potential patients (e.g., RAZOR) or that could be perceived as coercive (e.g., HOPE, LIFE, SAVED, CURE, IMPROVED).
Following the trial acronym is often the trial logo. Those with access to a graphic designer can have more elaborate designs, but the trial acronym in a special font may be sufficient emphasis. A couple of examples of trial logos are displayed in Figs. 4 and 5.

Fig. 5 Example of trial logo (Oughton et al. 2017) (N.B. this was an antibody trial, hence the shape of the spacecraft)

Trial Website

A trial website is a helpful way for people to access trial information. This can be targeted towards investigators, with password protection as required, and/or towards
patients, to encourage interested individuals to volunteer for the trial and/or as an alternative means of providing information. It is very common for patients who have been invited into the trial to do their own internet research, so it is important that any publicly available online information complements patient materials, such as the patient information sheet. Websites have more flexibility in methods for presenting the information than a paper information sheet. The website can aid the dissemination of the results to participants by linking to published results or lay summaries of the findings. An exemplar of good practice in this area can be found for the INTERVAL blood donation frequency study (University of Cambridge 2017; Moore et al. 2014). Websites can be a vital channel of communication for sites, patients, and the media. Contents of official websites may need to be approved by an Ethics Committee/Institutional Review Board depending on local requirements.
With appropriate access controls, there are potential gains to be made from having
trial documents accessible to investigators via the website. The website can provide a
link to remote data capture systems and online registration/randomization services.
Training videos for investigators can be hosted on the website, as can participant questionnaires.
Social Media
The use of social media has increased over the last two decades. Patients frequently
use the internet and social media as a primary source of information. Patient support
groups often have a significant online presence, with forums to facilitate discussions
about a wide variety of topics. Some patients even blog about their experience as trial
participants.
Some have expressed concern that such discussions have the potential to compromise the integrity of clinical trials. However, a review of more than one million online posts found that
discussions of active clinical trials were rare and no discussions were identified that
risked unblinding of clinical trials (Merinopoulou et al. 2015). The authors go on to
recommend basic training for trial participants on the risks of social media discus-
sions and also that sponsors should consider periodic monitoring of social media
content.
There is the potential that participants may disclose adverse events on social
media that would otherwise go unreported. A systematic review (Golder et al. 2015) found that adverse events are identifiable within social media and that mild and symptom-related adverse events are overrepresented online when compared with traditional data sources; alternatively, milder side effects may be underrepresented in trial reporting. Undoubtedly, pharmaceutical compa-
nies are working to develop tools to aggregate social media data, but at present, this
approach is still in its infancy. There is currently only a regulatory responsibility to
report events that are reported to a sponsor/Marketing Authorization Holder rather
than to actively seek out events online. An active approach would also be confounded by the difficulty of matching a report to a specific research participant, and there would be ethical issues around perceived intrusive monitoring.
Press
To accomplish wider reach, a press release for either regional or national news
outlets could be considered. To maximize the chances of the story being published,
it may be helpful to include a patient-interest angle: for example, a patient who has done well in the phase I trial and is now excited that the trial has expanded to phase II. Photographs, or willingness to be photographed and interviewed, are key to success. Permission for the press release must be obtained from all those involved, and an institution's press office will often be able to provide support. If the press
release is directed towards recruiting patients, the relevant trial ethics committee
should give approval beforehand.
Investigator Meeting
Investigator meetings are useful to provide information about the trial, to foster a collective commitment to the trial aims, and as a way of recognizing investigators' commitment. They are often timed to occur before recruitment commences but can
also be valuable during the lifetime of the trial to provide updates or to help publicize
the results. The organization and resources required to host an investigator meeting
are significant, and it is therefore important to consider carefully what goals and
achievements are important for the meeting and to select an appropriate venue. Costs
can be minimized by holding investigator meetings alongside scientific conferences
where investigators are already likely to attend.
Key Facts
Cross-References
References
Brosteanu O et al (2009) Risk analysis and risk adapted on-site monitoring in noncommercial
clinical trials. Clin Trials 6(6):585–596
Cheng TO (2006) Some trial acronyms using famous artists’ names such as MICHELANGELO,
MATISSE, PICASSO, and REMBRANDT are not true acronyms at all. Am J Cardiol 98
(2):276–277
Collett L et al (2017) Assessment of ibrutinib plus rituximab in front-line CLL (FLAIR trial): study
protocol for a phase III randomised controlled trial. Trials 18(1):387
Conroy EJ et al (2015) Trial Steering Committees in randomised controlled trials: a survey of
registered clinical trials units to establish current practice and experiences. Clin Trials 12
(6):664–676
Croft J. International surgical trials toolkit. Cited 2020. Available from: https://internationaltrialstoolkit.co.uk/
European Commission (2014) Risk proportionate approaches in clinical trials: recommendations of the expert group on clinical trials for the implementation of regulation (EU) no 536/2014 on clinical trials on medicinal products for human use. Available from: http://ec.europa.eu/health/files/clinicaltrials/2016_06_pc_guidelines/gl_4_consult.pdf. Accessed 08 Aug 2017
Fallowfield L, Jenkins V (2002) Acronymic trials: the good, the bad, and the coercive. Lancet 360
(9346):1622
Golder S, Norman G, Loke YK (2015) Systematic review on the prevalence, frequency and
comparative value of adverse events data in social media. Br J Clin Pharmacol 80(4):878–888
Hopewell S et al (2008) CONSORT for reporting randomised trials in journal and conference
abstracts. Lancet 371(9609):281–283
Merinopoulou E et al (2015) Let's talk! Is chatter on social media amongst participants compromising clinical trials? Value Health 18(7):A724
MHRA (2011) Risk-adapted approaches to the management of clinical trials of investigational medicinal products. Ad-hoc Working Group and the Risk-Stratification Sub-Group. Cited 2019. Available from: https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/343677/Risk-adapted_approaches_to_the_management_of_clinical_trials_of_investigational_medicinal_products.pdf
Minisman G et al (2012) Implementing clinical trials on an international platform: challenges and
perspectives. J Neurol Sci 313(1–2):1–6
Moore C et al (2014) The INTERVAL trial to determine whether intervals between blood donations
can be safely and acceptably decreased to optimise blood supply: study protocol for a
randomised controlled trial. Trials 15:363
MRC (2017) Guidelines for management of global health trials involving clinical or public health interventions. Medical Research Council. https://mrc.ukri.org/documents/pdf/guidelines-for-management-of-global-health-trials/
Participant Recruitment, Screening, and Enrollment 15
P. Wermuth
Contents
Introduction
Definitions
  Recruitment
  Consenting
  Screening
  Enrollment
Planning Recruitment
  The "Recruitment Funnel"
  The Recruitment Rate
  Recruitment Planning Tools
Identifying Trial Candidates
  Recruitment Material
Screening
  Planning the Screening End
  Screening Tools
Enrollment
  Enrollment Strategies
  Enrollment Procedures
  Monitoring Enrollment
Retention
  Retention Strategy
  Examples of Retention Support
Recruitment Issues and Their Impact
  Risk Mitigation
  Issue Management
Summary and Conclusion
Key Facts
Cross-References
References
P. Wermuth (*)
Basel, Switzerland
e-mail: [email protected]
Abstract
Participant recruitment and retention are key success factors in a clinical
trial. Failure to enroll the required number of participants in a timely manner
can have significant impact on trial budget and timelines, and data gaps due to
under-recruitment or early dropout of participants may lead to misinterpretation
or unreliability of trial results.
Prior to the start of a trial, a thorough recruitment plan should be set up,
considering the screen failure rate, the potential need to replace early dropouts,
and the geographical distribution of participants, timelines, and budgetary con-
straints. Understanding how to calculate the recruitment rate based on the number
of enrolled participants per site per month will help in assessing the probability of
successful recruitment. In addition, recruitment planning tools and services such
as comparison with benchmarking data from analogous historical or ongoing
trials, simulation tools, and specialist service agencies may support the setup of
a robust recruitment plan. Risk factors with the potential of leading to under-
recruitment, over-recruitment, or recruitment of unsuitable participants should be
identified upfront to allow risk mitigation as far as possible, e.g., through protocol
amendments or increasing the number of participating centers. Evaluation of
the most appropriate channels to identify and contact trial candidates will ensure
optimal turnout in relation to the financial and resource investments. Strategies
for participant screening, enrollment, and retention are reviewed in this chapter.
Keywords
Recruitment · Recruitment plan · Screening · Enrollment · Recruitment rate ·
Retention · Benchmarking · Simulation · Recruitment issues
Introduction
Once the design of a clinical trial has been defined and the final trial protocol has
received all required approvals, after participating centers have been set up and
trained appropriately, and when all trial supplies (including study drug) are available,
the trial is ready to start being populated with participants.
Studies conducted on metadata or on data obtained from local or national
registries do not require the active involvement of any individuals, and recruitment
as such will not be needed for these studies. However, most studies, interventional
and non-interventional, require the identification and enrollment of individual trial
participants, carefully selected according to trial-specific eligibility criteria, to allow
collection of relevant data that will address the objective of the trial. Failure to get
the right participants enrolled into a trial, or to retain participants in the trial, may
have significant impact on the quality of the trial results, as reliability and power of
the trial outcome may decrease (Little et al. 2012). Likewise, failure to enroll
within an appropriate time frame can have significant impact on the trial budget if
additional costs and resources are required to bring recruitment back on track and, in
the case of new therapies being investigated, can cause costly delays in time to
market for these therapies. Consequently, planning for successful recruitment and
participant retention will need to start at the very beginning of the conceptualization
of a clinical trial.
This chapter looks into the details of recruitment including planning tools such
as benchmarking and simulation, highlighting the various challenges that are so
often experienced in recruitment and how to avoid them, and how to identify
candidates for a trial. It also describes methods that can be applied to support the
accrual of participants and their retention in the trial until its completion. For further
reading, refer to Anderson (2001), Bachenheimer and Brescia (2017), and Friedman et al. (2015).
Definitions
Recruitment
Describes the overall process from the point of identifying a candidate for a clinical
trial (either a volunteer or a person diagnosed with the disease or condition under
investigation), through the steps of obtaining their informed and written consent
and verifying their eligibility (“screening”), up to the candidate’s inclusion into the
clinical trial (including randomization and/or treatment assignment where applicable).
Consenting
Describes the process of informing a candidate of the specifics of the clinical trial
(including the objectives of the trial, potential benefits and risks, the assessments and
procedures involved and their impact on the participant, and the participant’s rights
and responsibilities) and of obtaining the candidate’s (or their legal representative’s)
consent to collect personal data and biological samples and to analyze and publish
the derived data. This process is mandatory according to ICH GCP E6(R1).
Screening
Describes the process of ensuring an identified candidate meets all trial inclusion and
exclusion criteria, including obtaining written consent and confirmation of the
candidate’s willingness and ability to adhere to the trial requirements. Screening
assessments requiring any type of intervention not considered routine procedure or
standard of care can only be performed after a participant’s consent has been
obtained. Candidates not meeting all of the eligibility criteria are considered screen
failures and cannot be enrolled into the trial.
Enrollment
Enrollment is the process of actually including an individual into the trial after
their identification and verification that all trial-specific eligibility criteria are met.
This can include randomization and/or treatment assignment where applicable and usually marks the point from which data can be collected (both prospectively and retrospectively). The number of enrolled participants typically does not include screen-failed individuals but will include participants who drop out of the trial for any reason at any time after enrollment (Fig. 1).

Fig. 1 Overview of the recruitment process for enrolled participants. Documentation of reportable adverse events may be required from the beginning of screening (retrospectively) up to the end of the clinical trial
Planning Recruitment
• Numbers: The trial protocol will define the number of participants required
according to the statistical sample size calculations. In addition, the expected
screen failure rate will need to be established as the number of candidates to be
screened in total will need to include the candidates not subsequently enrolled.
Also, the protocol may require replacement of enrolled participants who drop out
of the trial prior to reaching a specific milestone (e.g., the end of a certain
observation/treatment period or exposure duration). In this case, the expected
dropout rate should also be assessed, and the number of enrolled participants
increased accordingly (Fig. 2).
• Timelines: Restrictions on duration of the recruitment period can be defined by
budget and/or resource limitations, by ethical and/or regulatory requirements (e.g., post-approval commitments to health authorities), or by statistical requirements (e.g., occurrence of endpoints to be observed within a specific time frame).
Planning of timelines should also take into consideration the time needed to
identify and screen candidates (i.e., how often are candidates seen by trial
investigators; how long do trial-specific screening assessments take).
• Geographical distribution: Is the trial a single- or a multicenter trial, a local,
national, or international trial? Are there any specific geographic considerations from an epidemiological standpoint? Are there any logistical restrictions
to the distribution of the trial such as language constraints, limitations in
clinical research associate (CRA) monitoring resources, challenges in supply of
the investigational medicinal product, or differences in standard of care (e.g.,
availability of comparator treatment) that would influence the geographical
spread of the trial? Are there any regulatory requirements for acceptance of
data for licensing or marketing considerations (e.g., minimum number of
participants from a certain country required to allow filing)?
Example: Some countries tend to be strong and fast recruiters but may have long start-up timelines, potentially precluding their involvement in trials where the overall recruitment period is expected to be short. Conversely, other countries may be included because of their short start-up timelines, despite perhaps limited enrollment potential.
Fig. 2 No. of candidates to be screened = required no. of enrolled participants + expected no. of screen-failed candidates + expected no. of dropouts
The "Recruitment Funnel"

Typically, it has to be expected (and planned for) that not all identified candidates will end up being enrolled into a clinical trial, and similarly, not all enrolled participants will complete the trial. The phenomenon of the number of candidates decreasing between pre-screening and enrollment, and again after randomization, is often referred to as the "recruitment funnel."
The first sweep of candidates will fall off the radar during the pre-screening phase.
Often site staff overestimate the number of potential trial participants they
can contribute to a trial, not fully taking into consideration competitive trials
being conducted at their site on a similar patient population, or the site’s resource
constraints and the related limitations in their ability to oversee the often time-intensive
management of participants in a clinical trial. A further proportion of candidates
will drop out during the actual screening and consenting phase, as not
all candidates will meet all the eligibility criteria and/or are willing to enter a trial. A reason for this is that during initial protocol review, investigators might underestimate the stringency of the trial's eligibility criteria. For potential enrollment barriers, see Brintnall-Karabelas et al. (2011) and Lara et al. (2001). Thirdly, even after randomization, the number of participants is likely to decrease over time, through early dropouts due to various reasons such as adverse reactions experienced during the trial, withdrawal of consent, or participants being lost to follow-up. The dropout rate generally increases with longer trial duration (Fig. 3).

Fig. 3 The recruitment funnel (indicated numbers and percentages are examples): sites' anticipated screening potential 100; pre-screened candidates 70 (-30%); consented candidates 35 (-50%); enrolled participants 20 (-57%); participants completing the trial 17 (-15%)
The Recruitment Rate

The most often used metric in recruitment is the recruitment rate. The recruitment rate of a clinical trial, both for single-center and for multicenter trials, is defined by the following factors (Fig. 4):

Fig. 4 Recruitment rate = total no. of participants / no. of contributing sites / time unit
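Applied to hypothetical figures, this is simple arithmetic:

```python
# Hypothetical example: 68 participants recruited by 12 sites over 15 months.
recruitment_rate = 68 / 12 / 15  # participants per site per month
print(f"{recruitment_rate:.2f} participants/site/month")  # ~0.38

# Rearranged, the same relation gives a duration estimate: 18 sites
# recruiting at this rate would need about 10 months for the same target.
print(68 / (18 * recruitment_rate))  # ~10.0
```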
Recruitment Planning Tools

There are different methodologies and tools available that can be used to generate a differentiated approximation of the estimated recruitment rate for a planned trial. Ideally, two or more of these are used in combination, as this will allow the most complete coverage of possible eventualities and the development of the best possible recruitment strategy.
Benchmarking
Researching the trial landscape by identifying clinical trials (either historical or
currently running) that are analogous to the trial at hand will provide benchmarking
data on what can be expected with regard to recruitment metrics. Benchmarking data
of clinical trials can be obtained, for example, from government-mandated trial registries.
Simulation
Another useful tool is the simulation of recruitment by feeding different sets of
variables into a model and thus imitating different potential scenarios. Modifying
factors such as number of sites, recruitment rates, ramp-up times for individual sites,
etc. will help visualizing their impact on the recruitment duration and will allow
refinement of one’s assumptions (DePuy 2017) (Fig. 5).
Fig. 5 Visualization of simulation outputs for two different scenarios: scenario A = 12 activated sites; scenario B = 18 activated sites. Target sample size = 68; assumed screen failure rate = 20%
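A minimal sketch of such a simulation, assuming a simple model in which sites ramp up linearly and each screened candidate passes screening independently (all parameter values are hypothetical, loosely mirroring the Fig. 5 scenarios):

```python
import random

def months_to_target(n_sites, target=68, screen_fail_rate=0.20,
                     candidates_per_site_per_month=1.0, ramp_up_months=3,
                     n_runs=2000):
    """Median number of months until the enrollment target is reached."""
    results = []
    for _ in range(n_runs):
        enrolled, month = 0, 0
        while enrolled < target:
            month += 1
            # Sites come online gradually over the ramp-up period.
            active = n_sites * min(1.0, month / ramp_up_months)
            candidates = round(active * candidates_per_site_per_month)
            # Each candidate passes screening with probability 1 - screen_fail_rate.
            enrolled += sum(random.random() > screen_fail_rate
                            for _ in range(candidates))
        results.append(month)
    results.sort()
    return results[n_runs // 2]

print("Scenario A (12 sites):", months_to_target(12), "months")
print("Scenario B (18 sites):", months_to_target(18), "months")
```

Varying the inputs (number of sites, per-site rate, ramp-up time) makes their impact on recruitment duration visible, which is the refinement loop described above.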
Identifying Trial Candidates

With the advance and increasing adoption of genomic profiling and personalized health care, the concept of finding a trial for a patient, rather than finding patients for a trial, will become more prevalent, with computational platforms allowing real-time matching of patients to trials using genomic and clinical criteria (Fig. 6).

Fig. 6 Finding trials for patients: matching an investigated population against Trials A, B, and C
Recruitment Material
Once the source (or sources) and contact points have been established, communication
material to support recruitment can be developed accordingly. Content as well as
review and approval processes will differ depending on who the targeted recipients
are.
Note: any patient-facing material needs to be approved by IRBs/ECs.
Examples of written material for the promotion of a trial include flyers, cards,
posters, or ads with a trial overview and trial contact details. Target audiences can be
participating centers, referring physicians, the broad community, and/or preselected trial
candidates. Material supporting the informed consent process can be booklets or cards,
or short videos, with explanations and schemas of the scientific or medical background
of the trial, the underlying disease, or the treatments/medical procedures in question.
Other means of outreach include printed media (such as newspapers and
magazines), television, and radio broadcasts, as well as multimedia and online
platforms such as community web sites, forums, social media, apps, etc. (see, e.g.,
Katz et al. 2019) (Table 1).
The choice of formats used in a trial often depends on the budget and available
resources (e.g., consider the need for translations in multinational studies) but
should also reflect the specifics of the target audience to allow maximization of
the uptake of the information. To increase the success of the communication
campaign, it is often useful to use more than one means of communication (see, e.g., Kye et al. 2009).
It should be noted that the choice of communication pathway could have a pre-
selective impact on the trial candidates identified (e.g., limited access to online media
for certain social or age groups).
Examples:
• For a single-center phase I study with healthy volunteers, a useful approach could be to hand out study flyers and put up posters in nearby universities or sports centers.
• For a global phase III study in hemophilia (a well-connected and well-informed
community), ads on dedicated web sites such as association portals or patient
forums could be useful.
Often, recruitment material will be the primary tool to promote the clinical trial,
and therefore the relevance of the trial should be included in the communication.
Where possible a visual identity (e.g., trial logo) can be used that will help the trial
in standing out from others. However, it is important that the material, texts, and
visuals strictly promote only the trial and not the therapy or treatment under
investigation. A set of legal regulations needs to be followed when authoring communications around clinical trials. These regulations may vary locally but in general include points such as:
Screening
The objective of the screening process is to ensure only eligible candidates enter the
trial. Therefore, the assessments required to identify eligibility of candidates will
need to be defined, and the necessary processes and systems will need to be set up
accordingly. Questions to be addressed include the following:
• Are the required screening assessments standard of care, expected to be performed as part of routine clinical practice, or are they trial-specific and therefore requiring prior consent from candidates? Will eligibility tests be performed locally at the sites, or will they need to be performed centrally (e.g., due to limited availability of technology or the need for standardization for comparability)?
• What is the time frame in which screening assessments need to be performed?
Are there any time constraints in having to get the candidate’s eligibility con-
firmed (e.g., is there a maximum time window between diagnosis and start of
therapy during which screening will need to be completed)?
Documentation should be kept for candidates who do not end up being enrolled and for screen failures (including the reason for screen failure). Also, especially for multisite trials, a method for tracking screening activities should be implemented, allowing (ideally real-time) monitoring of progress against projections.
Planning the Screening End

The aim is to hit the protocol-required number of enrolled patients as closely as possible, for budgetary reasons but also in order not to unnecessarily expose participants to experimental therapy. Toward the end of the enrollment period for a trial, screening should be monitored carefully, as, for ethical reasons, all informed, screened, and eligible candidates should be allowed to enter the trial. Therefore, screening activities should stop once a sufficient number of candidates are in the screening pool to fill the remaining enrollment slots. Especially in fast-recruiting trials, this requires extrapolation of the screen failure rate observed previously.
Example: If the screen failure rate during the last period of enrollment in a trial
was 10%, and 9 participants are still required to complete the trial, screening should
stop when 10 candidates are in screening.
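The worked example reduces to a one-line calculation; a sketch with hypothetical names:

```python
import math

def screening_pool_target(slots_remaining, screen_fail_rate):
    # Stop screening once this many candidates are in the screening pool.
    return math.ceil(slots_remaining / (1 - screen_fail_rate))

print(screening_pool_target(9, 0.10))  # 10, as in the example above
```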
Screening Tools
• While for small trials screening progress can be tracked manually, larger
trials with several participating sites are often managed with the support of an
Interactive Voice or Web Response System (IxRS). This will allow sites to enter
into the system when a candidate has been identified (“screening call”) and then
again when the outcome of the screening assessments is established, either
leading to enrollment or to screen failure. Screen failure reasons, usually in
categories, can also be tracked this way. The use of an IxRS allows real-time
tracking of screening progress but can be costly and time-intensive to set up.
• A tool allowing close control of screening activities is the screening request form:
this is to be completed by a site once a candidate has been identified and to be
sent to the central trial team for approval. Screening assessments for a candidate
can only start once the screening request is approved by the central team.
Screening request forms are typically used in trials requiring only small numbers
of participants (e.g., phase I cohort studies with less than ten participants
per cohort) or when a certain lag time is required between enrollment of
individual participants (e.g., where tolerability of a treatment needs to be
established prior to enrollment of subsequent participants).
• The allocation of screening slots can also be useful in trials with small numbers
of participants. Here, the central trial team will allocate individual screening
slots to a small number of sites at a time, allowing only these sites to screen
one candidate at a time. Once a candidate is either enrolled or screen failed, a next
screening slot will be allocated to another site.
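A toy sketch of the screening-slot mechanism described in the last bullet (site names and slot counts are invented for illustration):

```python
from collections import deque

# Sites waiting for a screening slot; two slots are open concurrently.
waiting_sites = deque(["Site A", "Site B", "Site C", "Site D"])
open_slots = 2
active = [waiting_sites.popleft() for _ in range(open_slots)]

def slot_completed(site, outcome):
    """Free a site's slot (enrollment or screen failure) and pass it on."""
    active.remove(site)
    if waiting_sites:
        next_site = waiting_sites.popleft()
        active.append(next_site)
        print(f"{site}: {outcome}; slot reallocated to {next_site}")

slot_completed("Site A", "enrolled")       # slot passes to Site C
slot_completed("Site B", "screen failed")  # slot passes to Site D
```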
Enrollment
Enrollment Strategies
Enrollment Procedures
Monitoring Enrollment
• Close monitoring of enrollment progress on study and site level, and comparison
with projections as defined at trial start, will allow early detection of any devia-
tions from the recruitment plans. The setup of appropriate reports (e.g., through the IxRS, if used) should be included in the trial preparation activities, as they
should be available early on. Such reports will also be useful for regular reporting
of enrollment progress to trial stakeholders, e.g., participating sites, sponsor
management, and regulatory authorities.
• Specific attention should be given when getting close to the end of recruitment.
As described under the screening procedures, over-enrollment should be avoided,
both for budgetary reasons and to avoid unnecessary exposure of participants to
the treatment under investigation.
Retention
Retention Strategy
A retention strategy should aim at positively influencing the trial experience for
participants and at establishing a rapport between the participant and the trial and/
or the trial team. An integral part of the retention strategy will therefore include
communication with the trial participants beyond their enrollment into the trial, ideally
through different pathways and at various time points throughout the trial duration.
As local site staff are the main trial contact point for participants, keeping them engaged, well informed of the global trial progress, and fully trained on trial processes is key.
Note: any material distributed to participants will need prior approval by IRBs/
IECs. Also, local regulations will apply, including that the material cannot promote
the treatment under investigation, only the trial itself, and that the value of the
material cannot be seen as persuasive to participation in the trial. The material
should be strictly trial related and not exceed a certain financial value.
Examples of Retention Support

Supporting site staff:
• Participants’ visit reminders (e.g., phone calls or email notifications)
• Template forms to capture participant’s contact details (including those of
close relatives where provided by the participants)
• Pocket summaries, schedules, and charts providing an overview or quick refer-
ence to the protocol procedures
Recruitment Issues and Their Impact

Recruitment has been one of the most common challenges in clinical trials in the past, and the changing environment is not making it any easier. Factors contributing to the increased complexity are an increasingly demanding regulatory environment and
competitive pressure caused by the increasing number of clinical trials being run. Also,
the tendency to tailor studies to more and more specific target populations (i.e., the trend
toward individualized health care), as well as the patients’ increased literacy and desire to
be involved in their treatment decisions, requires differentiated planning of recruitment.
Recruitment issues can be grouped into three categories: under-recruitment, over-recruitment, and recruitment of unsuitable participants.
Risk Mitigation
Some potential risks to the planned recruitment can already be identified prior
to the start of a trial, during comparison with analogous trials (e.g., spread of
competitive trials), during protocol feasibility (e.g., sites’ capability to run the
trial), and during interaction with patient advocacy groups, scientific experts,
and advisory boards (e.g., trial alignment with current standard of care); see, e.g., National Institute of Mental Health (2005). While not all risks can be
avoided fully, some mitigating actions can be implemented, either through
modifications of the protocol or by adding (or exchanging) countries and sites
to the trial.
Examples of risks that may require mitigation:
• Exclusion of patients with active, controlled hepatitis will significantly reduce the
participant potential in some countries.
• A high frequency of radiology assessments or an excessive blood sampling burden might not be approved by some IRBs/ECs.
The common need for a speedy study setup also bears risks that can lead to an
enrollment backlog from the very beginning of the trial, including starting trial
recruitment preparations with a nonfinal protocol version, underestimating the
time needed for the setup of sites (e.g., contract negotiations taking longer than
anticipated), and not having contingency plans in place early on. Also, ensuring all
involved stakeholders such as different departments of a hospital and referring
institutions are informed of the upcoming trial and are able to communicate with
each other will help in avoiding any delays.
Engaging participating investigators early on in trial planning can positively
impact recruitment. Key opinion leaders, by their reputation in the community,
may influence their colleagues to promote the study protocol and enrollment into
the trial. Also, additional motivation might be gained if there are opportunities for
authorship on study-related publications for participating investigators (in alignment
with applicable publication authorship guidelines).
Issue Management
Once recruitment is at risk of getting off track or is off track already, different corrective
actions can be taken, with often a combination of several actions being the most
successful approach. Ideally, these actions (as well as the trigger points for their
implementation) are defined in the recruitment plan, but there should also be flexibility
to adapt the corrective actions to the current situation (Bachenheimer 2016).
Possible measures to address delays in recruitment include:
• Review of the protocol to identify areas that could hinder recruitment and assess whether these can be modified (e.g., less stringent eligibility criteria, simplified and/or less frequent study assessments)
• Increase incentive for participating investigators where and as possible (payment,
authorship on planned publications)
• Facilitate study participation (e.g., payment where allowed, reimbursement of expenses, material to help participants understand the informed consent form and the study specifics; see Fiminska 2014; Kadam et al. 2016)
Summary and Conclusion

Recruitment has been the main challenge in the management of clinical trials in the past and is expected to become even more difficult with an increasingly demanding regulatory environment and more trials being run in highly specific populations. Therefore,
thorough upfront planning of recruitment is key to ensure enrollment of the
right participants in time and within budget. The factors to be considered include
evaluation of the quality and number of sites needed, feasibility of the protocol,
and mitigation of any potential risks as much as possible through communication
with the key stakeholders. The recruitment plan should also include a strategy to
access candidates, to monitor enrollment progress, and to retain participants in
the trial. The use of supporting tools such as benchmarking and simulation should
be factored in when setting up the trial budget.
Key Facts
Successful recruitment depends on the selection of appropriate sites, the protocol design, and the communication with key stakeholders of the trial. Retention of trial participants for as long as mandated by the
protocol is important to minimize gaps in data collection.
Cross-References
References
Anderson DL (2001) A guide to patient recruitment: today’s best practices and proven strategies.
CenterWatch, Boston
Bachenheimer JF (2016) Adaptive patient recruitment for 21st century clinical research. Available
via Applied Clinical Trials. http://www.appliedclinicaltrialsonline.com/adaptive-patient-recruitment-21st-century-clinical-research. Accessed 08 Jan 2019
Bachenheimer JF, Brescia BA (2017) Reinventing patient recruitment: revolutionary ideas for
clinical trial success. Taylor and Francis BBK Worldwide, Needham MA, USA
Brintnall-Karabelas J, Sung S, Cadman ME, Squires C, Whorton K, Pao M (2011) Improving
recruitment in clinical trials: why eligible participants decline. J Empir Res Hum Res Ethics 6
(1):69–74. https://doi.org/10.1525/jer.2011.6.1.69
DePuy V (2017) Enrollment simulation in clinical trials. SESUG Paper LS-213-2017. Available via
https://www.lexjansen.com/sesug/2017/LS-213.pdf. Accessed 08 Jan 2019
Fiminska Z (2014) 5 tips on how to facilitate clinical trial recruitment. EyeForPharma. Available at
https://social.eyeforpharma.com/clinical/5-tips-how-facilitate-clinical-trial-recruitment.
Accessed 08 Jan 2019
Friedman LM, Furberg CD, DeMets D, Reboussin DM, Granger CB (2015) Fundamentals of
clinical trials. Springer International Publishing, Cham
Graham LA, Ngwa J, Ntekim O, Ogunlana O, Wolday S, Johnson S, Johnson M, Castor C,
Fungwe TV, Obisesan TO (2017) Best strategies to recruit and enroll elderly blacks into clinical
and biomedical research. Clin Interv Aging 13:43–50
Kadam RA, Borde SU, Madas SA, Salvi SS, Limaye SS (2016) Challenges in recruitment and
retention of clinical trial subjects. Perspect Clin Res 7(3):137–143
Katz B, Eiken A, Misev V, Zibert JR (2019) Optimize clinical trial recruitment with digital
platforms. Dermatology Times. Available via https://www.dermatologytimes.com/business/optimize-clinical-trial-recruitment-digital-platforms. Accessed 08 Jan 2019
Kye SH, Tashkin DP, Roth MD, Adams B, Nie W-X, Mao JT (2009) Recruitment strategies for a
lung cancer chemoprevention trial involving ex-smokers. Contemp Clin Trials 30:464–472
Lara PN Jr, Higdon R, Lim N, Kwan K, Tanaka M, Lau DHM, Wun T, Welborn J, Meyers FJ,
Christensen S, O’Donnell R, Richman C, Scudder SA, Tuscana J, Gandara DR, Lam KS (2001)
Prospective evaluation of cancer clinical trial accrual patterns: identifying potential barriers to
enrollment. J Clin Oncol 19(6):1728–1733
Little RJ, D'Agostino R, Cohen ML, Dickersin K, Emerson SS, Farrar JT, Frangakis C, Hogan JW,
Molenberghs G, Murphy SA, Neaton JD, Rotnitzky A, Scharfstein D, Shih WJ, Siegel JP, Stern
H (2012) The prevention and treatment of missing data in clinical trials. N Engl J Med 367
(14):1355–1360
National Institute of Mental Health (2005) Points to consider about recruitment and retention
while preparing a clinical research study. Available via https://www.nimh.nih.gov/funding/grant-writing-and-application-process/points-to-consider-about-recruitment-and-retention-while-preparing-a-clinical-research-study.shtml. Accessed 08 Jan 2019
National Institute on Aging (2018) Together we make the difference – National Strategy for
recruitment and participation in Alzheimer’s and related dementias clinical research. Available
via U.S. Department of Health & Human Services. https://www.nia.nih.gov/sites/default/files/2018-10/alzheimers-disease-recruitment-strategy-final.pdf. Accessed 08 Jan 2019
Thoma A, Farrokhyar F, McKnight L, Bhandari M (2010) How to optimize patient recruitment.
Can J Surg 53(3):205–210
Administration of Study Treatments and
Participant Follow-Up 16
Jennifer J. Gassman
Contents
Introduction
Administration of Study Treatments
  Introduction to Administration of Study Treatment
  Training
  Verification of Site Readiness
  Inclusion and Exclusion Criteria Focused on Treatment Administration
  Eligibility Checking and Randomization
  Getting the Treatment to the Participant
  Promoting Treatment Adherence
  Monitoring Treatment Adherence
  Monitoring Early Treatment Discontinuation and Tracking Reasons for Discontinuation
  The Role of the Study Team in Enhancing Treatment Adherence
  The End of Treatment
Participant Follow-Up
  Introduction
  Planning the Follow-Up Schedule During Trial Design
  Making Trial Data Collection as Easy as Possible for the Participant
  Training the Participating Site Staff on Follow-Up
  Retention Monitoring
  Factors Related to Predicting Retention
  The Role of the Study Team in Promoting Retention
Interrelationship Between Treatment Discontinuation and Dropouts
Summary and Conclusion
Key Facts
Cross-References
References
J. J. Gassman (*)
Department of Quantitative Health Sciences, Cleveland Clinic, Cleveland, OH, USA
e-mail: [email protected]
Abstract
After clinical trial participants have consented, provided baseline data, and been
randomized, each participant begins study treatment and follow-up. This chapter
covers administering a participant’s randomly assigned treatment regimen and
collecting the participant’s trial data through the end of their time in the study,
along with tracking and reporting data on timeliness and quality of treatment
administration and of follow-up visit attendance and trial data collection. Treat-
ment administration can include providing study medications or, in a lifestyle
intervention trial, teaching the participant to follow a diet, exercise, or smoking
cessation intervention. Trial data collection includes, for example, questionnaires
completed via smartphone, laboratory sample collection details and results of lab
analyses, imaging data, treatment adherence data, measurements taken at clinic
visits, and adverse event data. Monitoring participant follow-up and capturing
reasons why patients discontinue treatment or end follow-up early can aid in
interpretation of a trial’s results. Treatment administration, treatment adherence,
and participant follow-up metrics should be captured in real time, with the Data
Coordinating Center (DCC) providing continuous performance feedback to study
leadership and to the participating sites themselves. These aspects of trial conduct
are described in the context of a multicenter trial in which two or more clinical
sites enroll participants and study data are managed in a centrally administered
database. Timeliness and accuracy of study treatment administration is key to the
success of a trial. Participants providing required data according to the protocol-
defined schedule allow a trial to attain its goals.
Keywords
Run-in · Pill count · Treatment crossover · Unblinding · Treatment
discontinuation · Visit schedule · Visit window · Close out · Drop out (no longer
attending clinic visits) · Withdrawal of consent (no longer willing to provide data)
Introduction
Training
During staff training, study coordinators and other team members should be taught
how treatment is administered and the importance of treatment adherence (as well as
study visit schedule implementation and the importance of retention, covered in
section “Participant Follow-Up” of this chapter). Study coordinators and investiga-
tors experienced with the treatment being studied should be invited to present
segments of training related to treatment administration including blinding and
treatment challenges.
Regarding blinding, the training should include (1) who will know what treatment each participant receives, particularly with regard to whether the study is unblinded/open label; single-blinded, where the participant does not know their treatment but members of the study team do; or double-blinded, where neither the participant nor any member of the study team knows the participant's treatment assignment; (2) managing unblinding (see ▶ Chaps. 43, "Masking of Trial Investigators," and ▶ 44, "Masking Study Participants"); and (3) a discussion of the role of study blinding in preventing subconscious bias in outcome assessment.
Regarding challenges associated with the study treatment, study coordinators who are less familiar with the trial treatment will benefit from hearing from those with on-the-ground experience using this treatment or similar
treatments. For lifestyle intervention trials, the systems in place implementing the
experimental and control treatment arms should be covered in detail, including
the study coordinators’ role in encouraging adherence to the intervention. For
medication trials, coordinators should be trained to teach participants the requirements for their assigned treatment, e.g., whether a study drug should be taken on an empty stomach or with a meal, as well as how and when dosage is to be increased or reduced. At training, study coordinators should also be informed of
the study plans for monitoring and reporting treatment adherence. The Data
Coordinating Center (DCC) team should present templates of treatment admin-
istration and treatment adherence-related tables from planned feedback reports/
electronic weekly reports.
Once the trial is under way, coordinators at one site may be able to offer valuable
tips to those at other sites based on their experience with participants who have had
difficulties with treatment adherence. Study coordinators and other participating site
team members can learn additional strategies for improving treatment adherence
informally, as part of structured routine study coordinators’ web meetings or calls,
and at annual training/retraining meetings.
Study coordinators should be fully engaged in helping to ensure participants
receive their treatments on schedule and maintain the trial’s treatment blinding
as they work to enhance treatment adherence. Coordinators should also be
encouraged to engage and garner support from the participant’s family when
possible.
Training should also include a description of closeout in which a participant’s
treatment ends, remaining medications – if any – are returned, and study visits
cease.
Before a participating site’s subjects are enrolled and randomized, site readiness
should be assessed. A site “Ready to Enroll” table should be included in trial
monitoring (e.g., in the weekly report) showing whether each site has met require-
ments to begin consenting participants. These requirements will vary from study to
study but should include IRB/Ethics Committee approval, completion of staff training requirements, capture of needed site delivery addresses and staff contact information, (for medication trials that include a starter supply of medication) receipt of the initial supply of medication or information on how study drug is ordered, and clear plans for appropriate secure storage of medication and participant
study documents as required by local regulations. Participants should not be
consented until a site is ready to begin randomization/recruitment. Site initiation
visits are sometimes conducted to ensure site-readiness.
The trial protocol should ensure that participants who consent and are random-
ized to study treatment are likely to be able to comply with study treatment.
Treatment-related inclusion and exclusion criteria can help ensure participants
who are not likely to be able to comply with study treatment requirements are
not randomized. A first step in this process is to ensure that the potential
participant can safely follow the treatment; participants who are allergic to or
have had side effects when on the study treatment will generally be excluded, as
will participants who require treatment with a concurrent medication that is
contraindicated for a participant on the active treatment arm. A second step
would be to ensure that the participant is likely to follow the treatment regimen;
participants who have a history of nonadherence to the type of treatment being
administered will generally be excluded. A third step would be to ensure that the
participant will be available to take the treatment; for many studies, participants
who spend several months each year away are problematic. Study investigators
should consider whether, for example, a participant who spends winter in Florida
or a participant who is away at school during the academic year would be
appropriate for randomization and should adjust exclusion criteria accordingly.
(Note that this third criterion is also important for complete participant follow-
up, i.e., retention, as described in section “Participant Follow-Up.”)
Informed consent is, of course, part of a participant’s time line in a trial. Study
procedures may not be performed and data may not be submitted to the DCC
until the participant has consented. The timing of informed consent is covered
in this book’s ▶ Chap. 21, “Consent Forms and Procedures.” A trial partici-
pant’s data collection time line generally consists of screening, a baseline visit
or visits, eligibility confirmation, randomization, and subsequent follow-up
visits. Ideally, the participating site’s team will have at least two contacts
with a potential enrollee prior to randomization (Meinert 2012). This will
allow time for the participant to be certain they understand the requirements of the trial, are fully on board, and have had all their trial-related questions
answered, and time for the study team to fully consider the participant’s
suitability for the trial.
Trials in which the protocol includes baseline visits allow an opportunity to test the participant’s ability to follow trial requirements during a run-in, or test, period, and many trials include such a run-in test of trial requirements, as
described in this book’s ▶ Chap. 42, “Controlling Bias in Randomized Clinical
Trials.” For example, in the Frequent Hemodialysis Network Daily Trial, the
experimental treatment arm featured 1 year of six shortened dialysis sessions per
week versus the usual care control arm (three standard sessions per week).
During baseline, participants were required to visit the dialysis unit daily for 6
consecutive days. Three participants dropped out or were excluded (FHN Trial
Group et al. 2010) because of unanticipated difficulties in getting to their dialysis
unit 6 days a week. In the Frequent Hemodialysis Network Nocturnal Trial, each
participant’s home water supply needed to be evaluated for appropriateness of
use of in-home hemodialysis (Rocco et al. 2011). In FONT II (Trachtman et al.
2011), each participant’s Angiotensin Converting Enzyme (ACE) inhibitor/
Angiotensin Receptor Blocker (ARB) was followed through a series of baseline
visits to ensure the regimen was effective and stable. In medication trials,
participants sometimes take placebo medication during Baseline. A run-in period
is particularly useful if the treatment may be unacceptable to some patients
because of pill/capsule size or the number of pills that must be taken each day.
Such a Baseline run-in period may include specified adherence criteria such as
“Pill count must show 80% adherence to treatment during run-in,” as was
required in the COMBINE Trial (Ix et al. 2019) or may include a check-in
with the participant and a record of whether the participant reported any diffi-
culties taking the study medication. Once the Baseline period is complete, the
participant will be randomized to treatment if study eligibility confirmation, e.g.,
a “Ready to Randomize” interactive program, has verified that all required
baseline data have been collected and the site has verified that, logistically, the
participant is available to start their randomized treatment, e.g., the participant is
not currently traveling or hospitalized. At this point, the participant is random-
ized and irrevocably part of the study, and the site is notified of the patient’s
treatment assignment; in a blinded medication trial, the site would receive the
bottle or bin number of the medication to be provided to the participant. In
studies where randomization is carried out online, the treatment assignment
should be e-mailed to the study coordinator in addition to being displayed on
the randomization screen to ensure that the study coordinator can easily confirm
the assignment.
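As a concrete illustration of a protocol-defined run-in adherence criterion such as the 80% pill-count threshold described above, the following minimal sketch computes pill-count adherence over a run-in period and flags whether the criterion is met. The function, its inputs, and the fixed daily dosing are illustrative assumptions, not the COMBINE Trial’s actual procedures.

from datetime import date

def run_in_adherence_pct(pills_dispensed: int, pills_returned: int,
                         start: date, end: date, doses_per_day: int = 1) -> float:
    """Estimate percent of expected run-in doses taken, by pill count.

    Assumes a fixed daily dose; a real protocol would also handle dose
    changes, lost pills, and partially returned bottles.
    """
    expected = (end - start).days * doses_per_day
    if expected <= 0:
        raise ValueError("run-in interval must be positive")
    taken = pills_dispensed - pills_returned
    return 100.0 * taken / expected

# Example: 60 pills dispensed, 8 returned after a 28-day run-in at 2 pills/day.
adherence = run_in_adherence_pct(60, 8, date(2024, 1, 1), date(2024, 1, 29),
                                 doses_per_day=2)
print(f"Run-in adherence: {adherence:.1f}%, meets 80% criterion: {adherence >= 80.0}")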
In studies with drug treatments, each participating site will provide an address for
shipment of drug; at many sites, this will be the address of a hospital’s research
pharmacy. Drugs will come from their manufacturer or a study’s central pharmacy
(as described in this book’s ▶ Chap. 7, “Centers Participating in Multicenter Trials”)
and may be provided to the sites in bulk (for local blinded distribution) or in coded,
numbered bins or numbered bottles. Details on how drugs come to the participating
sites are in this book’s ▶ Chap. 11, “Procurement and Distribution of Study
Medicines.” On-site options for getting the treatment to the participant include having the participant pick their medications up from the site’s pharmacy or having
the study coordinator pick the medication up for the patient so the coordinator can
hand the medication to the participant. When possible, handoff by the study coor-
dinator is easier for the participant and ensures that the participant leaves the site
with study drug.
Participants should begin treatment as soon as possible after randomization. In
studies with a baseline period, final baseline eligibility data and baseline values for
study outcomes, including lab results or imaging, must be captured prior to random-
ization, so the participant timeline will include a final baseline visit shortly prior to
randomization. If all required results are expected to be available before the partic-
ipant leaves the clinic and arrangements can be made such that study drug is
available on-site at the clinic, it may be possible for a participant to be randomized
at the end of this last baseline visit and go home with medication that day. If
inclusion criteria include data resulting from images or lab tests done at the last
visit of baseline, eligibility cannot be determined until after the final baseline visit. In
such a case, procedures should include randomizing the participant as soon as
possible, but at a time when the participant can begin taking drug (i.e., when the
participant is in town and not in the hospital) and getting the treatment to the
participant as soon as possible after randomization. A participant’s appointment
schedule will be based on the date of randomization, not on the day he or she started
treatment, i.e., the target date for a 1-year follow-up visit should be 1 year from
randomization. The study protocol may include a visit held shortly post-
randomization in which the participant receives their study medication (often
referred to as the “Follow Up 0” or the “Week 0” visit as in the AASK Trial
(Gassman et al. 2003) or FSGS Trial (Gipson et al. 2011)). A face-to-face visit
will be required if the treatment must be delivered under medical supervision, e.g., an IV infusion. Alternatively, the protocol may allow for the treatment being delivered
to the participant or the participant picking the drug up from the site’s research
pharmacy. It may be helpful for a participant to interact with a study team member
when a treatment is provided, particularly under protocols in which there is some
complexity to treatment administration, e.g., in the case of a double-dummy design
where the participant must take two different types of medication or in the case
where it is critical that the medication be taken under specific circumstances, such as
on an empty stomach or with a meal. Whichever method is used, the date the
participant begins taking medication should be captured in the study database.
It may be shocking for those new to the conduct of clinical trials to learn that
sometimes participants who are randomized to a study treatment do not adhere to
their treatment assignment. A trial’s Coordinating Center and Steering Committee
should implement multiple systems to enhance treatment adherence. As a first step
for any medication trial, efforts should be made to ensure that the participant does
not simply forget to take their pills. This can be customized to the treatment
requirement. For example, if a pill is to be taken in the morning, the study coordi-
nator might review the participant’s morning routine and determine where the study
medication should be kept, e.g., next to the coffee pot. If a pill is to be taken multiple
times a day, it may be useful to provide the participant with a pillbox; 2 × 7 and 4 × 7 weekly pillboxes are readily available. Smartphone applications for reminders
are available (Dayer et al. 2013; Ahmed et al. 2018; Santo et al. 2016) and have been
used successfully in randomized clinical trial settings (Morawski et al. 2018). Some
medications have clear requirements for successful administration. For example,
phosphate binders must be taken with a meal containing phosphorus in order to
reduce the possibility of GI side effects, and all patients randomized to the COM-
BINE Study (Isakova et al. 2015) were reminded at each visit to make sure to take
their blinded study medication with a meal. Ensuring requirements such as this are
met can also prevent treatment unblinding; e.g., a participant who takes placebo
phosphate binders on an empty stomach will not experience GI side effects, whereas
a patient who takes active phosphate binders on an empty stomach likely will.
Study protocols and manuals of operations will include steps related to reducing or
temporarily stopping medication when a participant reports mild side effects poten-
tially related to treatment, seeing if the side effect goes away, and then up-titrating
back, possibly to a lower dose than before. Reducing or temporarily stopping medication may also
be helpful for a participant who suspects a symptom he or she is experiencing is caused
by the study medication, even if the study team sees no pathway by which the drug in
use could cause that symptom.
In a long-term study, investigators might consider having study coordinators suggest a brief “pill holiday” to allow participants who are at risk of ending participation to take a week or a month off their study drug. In long-term studies,
when participants have stopped medication (treatment discontinuation) for reasons
unrelated to the study drug and continue attending study visits, it is useful to ask the
participant at subsequent visits if he or she might now consider going back on the
medication, perhaps at a lower dose than previously.
Participants may refuse the treatment to which they have been randomized or
become a treatment crossover, i.e., a participant who has switched to another study
treatment arm. Such a participant is sometimes called a drop-in, defined by
Piantadosi (2017) as a participant who takes another treatment that is part of the
trial instead of the treatment he or she was randomized to, and can be followed for
study outcome. Drop-ins cause treatment effect dilution whereby the estimate of the
difference between the effect of experimental treatment and the control treatment is
reduced.
Finally, for many types of participants (e.g., adolescents, the elderly) and many
types of treatments (e.g., antihypertensives, antirejection drugs, retroviral drugs, and
dietary interventions), there is a substantial body of research on barriers to adherence. This
is beyond the scope of this chapter, but the DCC and the study leadership should be
aware of the literature on adherence related to the participant group and the treatment
under study.
Drugs do not work in participants who do not take them. – C. Everett Koop, M.D.,
US Surgeon General, 1985 (Osterberg and Blaschke 2005).
A variety of methods are available for treatment adherence monitoring, and the
method selected depends on the type of study and the type of information required
(Zullig et al. 2017). Medication electronic monitors, also called MEMS
or “smart bottles,” are expensive but can provide precise information on when a pill
bottle or a particular section of a pillbox is opened, allowing investigators to assess
adherence to the days and times of day at which pills are taken (Schwed et al. 1999).
Methods such as pill counts and weighing medication bottles require participants to remember to return their “empty” bottles and are logistically cumbersome for the site staff, and these methods are easily influenced by participants who know their pills will be counted. Stewart (1987) reported that an interview identified 80% of noncompliers if pill count is used as a gold standard. Stewart’s question was
phrased as “How many doses might you have missed in the 10 days?” and was asked
as follow-up to an affirmative answer to a nonjudgmental question regarding
whether the participant might have missed some doses. It is important that questioning about medication use be carried out in a nonjudgmental way. Kravitz, Hays, and
Sherbourne (1993) reported that when a cohort of 1751 patients with diabetes
mellitus, hypertension, and heart disease was surveyed on their adherence to med-
ications, more than 87% reported they had taken their medications as instructed by
their doctors “all of the time.” A review by Garber et al. (2004) found a wide range of
level of agreement between participant-reported adherence by interview, diary, or
questionnaire and more objective measures of adherence such as pill count, canister
weight, plasma drug concentration, or electronic monitors.
In a pragmatic trial, a participating site may be able to track whether a participant
has picked up their treatment or refilled their prescription.
Direct signs of treatment adherence include laboratory measures of levels of the
drug itself or of one of its metabolites (Osterberg and Blaschke 2005). Biomarkers
of treatment response may also be useful as an indirect sign of treatment adherence.
When monitoring treatment adherence, it is traditional to report adherence on
those for whom the adherence was measured. That is, if 100 participants were
assigned to a treatment and pill counts are available for 50 of these participants,
treatment adherence is reported for the 50 who provided pill count data. The “zero”
pill count adherence for those who did not return their pills to the participating site,
or those whose pills were not counted for some other reason, is discussed in a
nonquantitative way rather than being “averaged in” as pill counts of zero.
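A minimal sketch of this reporting convention, assuming adherence is captured as one optional pill-count percentage per assigned participant: the mean is taken only over measured counts, and participants without a count are tallied separately rather than averaged in as zeros. The data layout is a simplifying assumption.

from statistics import mean

def summarize_pill_count_adherence(pct_by_participant):
    """pct_by_participant: one entry per assigned participant;
    None means no usable pill count (bottle not returned / not counted)."""
    measured = [p for p in pct_by_participant if p is not None]
    return {
        "n_assigned": len(pct_by_participant),
        "n_with_count": len(measured),
        "mean_adherence_pct": round(mean(measured), 1) if measured else None,
        # reported descriptively, never averaged in as 0% adherence:
        "n_without_count": len(pct_by_participant) - len(measured),
    }

print(summarize_pill_count_adherence([95.0, 80.5, None, 100.0, None, 72.0]))
# mean is over the 4 measured counts; the 2 missing counts are only tallied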
Early treatment discontinuation and treatment crossover can dilute the estimate of a true treatment effect. In early discontinuation, a participant stops taking their assigned treatment. With early discontinuation of active treatment, the participant’s observed treatment effect shifts to a different, reduced effect, which may be as low as the effect observed in the study’s placebo group. Treatment crossover is
worrisome when a participant randomized to the placebo group seeks out and begins
taking the treatment being used in the active treatment group. Discontinuations and
treatment crossovers can lead to effective interventions being found ineffective and
should be carefully tracked to allow for sensitivity analyses and to inform investi-
gators planning future trials of similar treatments.
Sometimes, early treatment discontinuation is clearly related to an adverse event
(AE) or a serious adverse event. Examples include AEs related to lab values or
symptoms. Lab Value AEs requiring treatment discontinuation may be specified in
the study protocol. For example, in the AASK Study, participants (who may have been randomized to Lisinopril) discontinued their ACE inhibitor if their serum potassium was over 5.5 mEq/L (Weinberg et al. 2009). In situations such as this, a physician
may choose to stop study treatment if a participant is near the protocol-required cut-
point as well. Either way, the primary reason for such treatment discontinuations should be tracked in the study database, distinguishing treatment stopped due to a protocol-defined lab value from treatment stopped by physician judgment based on an observed lab value. Similarly, treatments stopped due to the appearance
of specific symptoms or potential medication side effects could be categorized as
having been stopped due to symptoms with separate categories for protocol-defined
discontinuation, physician judgment, or participant preference.
Other reasons for treatment discontinuation include a participant becoming burnt
out by the medication requirements of a study in the absence of abnormal lab values
or side effects. Such discontinuations may be flagged as discontinuation due to “pill
burden.” The study database should explicitly track the participating site’s evaluation of the primary and secondary reasons a participant stopped taking study drug.
These data should be captured in real time.
When participants stop taking medications because they have stopped coming to
visits, this should also be tracked. Table 3 shows an example tabulation of the primary reason a study participant was not on study medications at the final
study visit including both cases thought to be related to study medication (stopped
drug due to lab adverse event, stopped drug due to patient-reported side effects,
stopped drug due to patient-reported pill burden) versus cases where a patient
stopped attending visits early but did not withdraw consent (discontinued active
study participation, allowing for passive follow-up only) or declared withdrawal of
consent (would no longer provide study data) or can no longer be located or
contacted.
Regarding the topic of adherence in statistical analyses of clinical trials, the reader
is referred to ▶ Chap. 93, “Adherence Adjusted Estimates in Randomized Clinical Trials,” in the Analysis section of this book, authored by Sreelatha Meleth.
Every member of the clinical trial team has a role in enhancing treatment adherence.
Treatment adherence issues should be discussed on study coordinators’ conference
calls; it is particularly useful for coordinators who have had success in adherence-
related issues to share their experiences with those who have had less success.
Principal investigators (PIs) should become personally involved in providing posi-
tive feedback for high adherence, as well as encouraging and strategizing with
participants who have had difficulties with adherence. PIs should routinely discuss
each participant’s adherence with the team. An adherence committee made up of
study coordinators, physicians, and data-coordinating center staff members may be
able to come up with suggestions as a brainstorming group. Treatment adherence
should be a topic on the agenda of every steering committee meeting. The data
coordinating center should ensure that the trial’s manual of operations includes strategies that will assist with adherence for the treatments being studied, and the DCC is responsible for providing feedback on every aspect of adherence to study leadership and to the participating sites.
Table 3 Primary reason a study participant surviving to final visit (end of study) was not on
randomized study medication at final visit
For each trial participant, the last on-treatment study visit marks the participant’s
closeout, and remaining medications should be collected at the participating site. If
the participant forgets to bring their medications in to their last visit, effort should be
made to collect these medications.
Many trials include an off-treatment observation period after treatment ends. If a lab
test is to be taken at the end of a specified observation period (a final off-treatment
visit) to check for the persistence of a biomarker after treatment ends, as in the BASE
Trial (Raphael et al. 2020), the target date for the final off-treatment visit may depend
on the date the last dose of study medication was taken. For example, a month 13 final
off-treatment visit may have a target date of 4 weeks after the month 12 visit, rather
than 13 months after randomization, to allow for reduction of biomarkers or signs
4 weeks after treatment. This should be considered during protocol development and
incorporated into the participant appointment schedule described in section “Partici-
pant Follow-Up” below.
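The target-date rule just described can be made explicit in scheduling code. This brief sketch anchors the final off-treatment visit to the end-of-treatment visit rather than to randomization; the 4-week offset and the function name are illustrative assumptions, not the BASE Trial’s actual scheduling logic.

from datetime import date, timedelta

def final_off_treatment_target(end_of_treatment_visit: date,
                               washout_weeks: int = 4) -> date:
    """Target date for the final off-treatment visit, anchored to the
    end-of-treatment (e.g., month 12) visit instead of randomization."""
    return end_of_treatment_visit + timedelta(weeks=washout_weeks)

# Month 12 visit held on 2024-06-10 -> final visit targeted 4 weeks later.
print(final_off_treatment_target(date(2024, 6, 10)))  # 2024-07-08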
Participant Follow-Up
Introduction
Every trial aims for complete follow-up of all participants, with complete collection of the primary outcome for the study’s intent-to-treat analysis. In order to
achieve this goal, the participant follow-up visit schedule must be well-defined and
reasonable from both the patient’s and the participating site team’s points of view, and the
team will need to focus on visit attendance and prevention of incomplete follow-up
throughout the course of the trial. For a full discussion of intention to treat analysis, see
this reference book’s Analysis Section’s chapter ▶ Chap. 82, “Intention to Treat and
Alternative Approaches” by J. Goldberg. Incomplete follow-up carries with it a risk of bias in the primary outcome, particularly when the number lost differs substantially between the two treatment groups, raising the question of whether the experimental intervention influenced attrition (Fewtrell et al. 2008). Even when retention rates are
the same in two treatment groups, if retention is not high, study power is reduced and
generalizability can be harmed (Brueton et al. 2014). The Special Topics section of this reference book includes ▶ Chap. 113, “Issues in Generalizing Results from Clinical Trials” by Steve Piantadosi. Statistical methods are available to
handle missing data and are discussed in this book’s Analysis Section’s ▶ Chap. 86,
“Missing Data” by A Allen, F Li, and G Tong, but clearly prevention of missing data is the goal, and some missing data in the middle of a patient’s follow-up is less of an issue than the loss of a patient’s final data. Results from trials with retention
rates of 95% or greater will generally be considered to be valid, particularly when
retention is similar in all treatment groups. Retention rates of 80% or lower call into
question the validity of results (Sackett et al. 1997). During trial design, staff training,
and throughout participant follow-up, retention is key.
The follow-up schedule requires careful consideration during study design. Visits
held prior to randomization include screening visits and baseline visits. Visits held
after randomization (when a randomized treatment is allocated to the patient from
the study’s randomization schedule) are designated as follow-up visits. If the treat-
ment is provided to the patient on the day of randomization, the visit may be referred
to as the “randomization visit.” The number of visits and other contacts included in a
trial’s visit schedule must be frequent enough to allow for treatment administration
and, where necessary, dose adjustment, as well as patient training and collection of
needed adherence, process, safety, and outcome data. Trials sometimes have more
visits early in follow-up as participants learn to follow their assigned intervention
and/or ramp up dosages.
The visits should therefore be often enough but not too often, and in cases where
no hands-on data collection is required, phone or electronic contact can be
substituted for visits. If participant contact or participant visits occur only once a
year, more participants will become difficult to locate or contact because they have
moved or changed phone numbers since their last contact. Infrequent visits also
make it difficult to capture complete and accurate information on adverse events
(AEs) and serious adverse events (SAEs). On the other hand, if participant contact or
participant visits occur frequently (e.g., weekly) throughout a trial, this may be too
much for a participant to bear. This is particularly true if a number of participants
have long travel times due to distance or traffic; it is useful to collect information on
participant travel time to the clinic so that if, during study conduct, visit adherence
becomes an issue, the site can check on the relationship between travel time and visit
attendance and, if necessary, limit enrollment of participants who live in areas that
are problematic for follow-up visits. The schedule of follow-up visits will depend on
the complexity of the study intervention and the disease being studied.
All participants should have the same visit schedule; this will prevent bias in AE
and SAE detection [“The more one looks, the more one sees” (Meinert 2012)] and
will reduce follow-up bias associated with more time and attention spent on inter-
vention group participants. The duration of the visit schedule should be well-defined;
follow-up either continues until a common calendar date for all patients or continues
to a common time point for each patient, e.g., 24 months postrandomization. Follow-
up should continue according to protocol regardless of patient adherence to treat-
ment or patient attainment of a given outcome.
As noted, once a study begins, recruitment, retention, and adherence are key. The
plans for visits and data collection must be safe for the patient and should be kept as easy
as possible for the participant. When multicenter trials are being designed, steering
committees balance their hopes for pragmatism with special research interests, and a
study protocol may be full of visit requirements, including questionnaires, lab tests,
physical function tests, imaging, and clinical measurements.
The desire to capture a large amount of data is sometimes addressed by having shorter routine visits while collecting more data remotely and more data at annual visits. Collecting more data remotely is helpful. Questionnaires can be
completed from home during the week before a visit, online via smartphone, tablet,
or computer, or can be sent and returned by mail. Collecting more data at annual
visits can cause annual visits to be overwhelmingly long. The steering committee
should think outside the box. In a trial with brief quarterly visits, it may be possible
to collect some of the “annual extra data” at months 0, 12, 24, and 36 and other data
at, say, months 0, 9, 21, and 33. If participants decide that their visits are too long,
they will be less likely to attend all of their visits.
Reminders can help with visit attendance. Coordinators can customize pre-visit reminders to the participant; some participants will prefer a reminder via text message rather than a phone call, for example. Reminders are also helpful when a patient must bring along a sample (a 24-hour urine jug, for example) or their pill bottle(s) for counting or weighing.
Consideration should be given to ensure that requirements are convenient and
that participants will be comfortable. The site should consider not only covering
parking expenses but also making sure the participant can park in a convenient area.
Childcare expenses could be covered. Holding evening and weekend visits will be
helpful for working participants. If, for example, a visit includes a test that requires a
12-hour fast, visits should be scheduled in the morning. If a trial requires going to a
distant part of the medical campus, the study coordinator should arrange for a shuttle
ride. Making visits as convenient as possible for participants will pay off in
retention.
Initial site-staff training, and retraining at staff annual meetings, should include a
review of the trial’s visit schedule and the trial’s retention plan as well as training in
methods known to facilitate retention. The expectation should be that each
participant will attend all visits, with a recognition that of course some participants
will miss some visits. When a participant is randomized, the DCC should provide the
site with the participant’s appointment schedule showing both the target date for each
visit (e.g., the target date for the 12-month visit is 12 months postrandomization) and
the study-specified visit window (e.g., plus or minus 1 week of the target date). It is
also helpful to have a master schedule with the start of visit windows and target
appointment dates for each participating site.
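A sketch of how a DCC might generate such an appointment schedule from the randomization date, with protocol-specified visit windows, follows. The visit months, the ±1-week window, and the month-length approximation are illustrative assumptions; a production system would use exact calendar arithmetic from the protocol.

from datetime import date, timedelta

VISIT_MONTHS = [3, 6, 9, 12, 18, 24]   # illustrative follow-up schedule
WINDOW = timedelta(weeks=1)            # e.g., +/- 1 week around each target

def appointment_schedule(randomization_date: date):
    """Target date and visit window for each follow-up visit, anchored
    to randomization (months approximated as 30.44 days here)."""
    schedule = []
    for m in VISIT_MONTHS:
        target = randomization_date + timedelta(days=round(30.44 * m))
        schedule.append((f"Month {m}", target - WINDOW, target, target + WINDOW))
    return schedule

for name, opens, target, closes in appointment_schedule(date(2024, 1, 15)):
    print(f"{name}: target {target}, window {opens} to {closes}")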
Several methods known to facilitate retention should be covered at training; examples follow.
In a long-term study in which patients may become burned out over time, training
in retention should include prioritization of outcome measures. The study will
always accept whatever data a not-fully-adherent participant is willing to provide,
and obtaining the primary outcome measure for each patient will always be top
priority. However, if there are multiple secondary measures, the study leadership
should provide guidance to the sites on the relative importance of each planned
measure or the value of a surrogate for the primary outcome if the primary outcome
is not available. Such prioritization will help the site team negotiate with patients
who reach a point in a long-term study where they will no longer comply with all of
the study requirements.
As an aside, trial leadership should also ensure that study coordinators and other
participating site personnel feel valued and appreciated.
Retention Monitoring
The DCC should design its retention monitoring feedback at the start of the study, provide an example retention table at training, and review how to read it. It may be helpful for the feedback to include missingness for a single visit and a summary of those who have missed their last two (or more) visits; the first is noteworthy but may be easily explained by the circumstances of the participant, e.g., they were on vacation or hospitalized for much of the visit window, but the participating site team expects them back for their next visit. The second, a participant who has missed their past two or more visits, flags someone at risk of becoming lost to follow-up.
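A minimal sketch of this “missed last two visits” flag, assuming each participant’s expected visits (those whose windows have closed) are available in order with a held/missed status; the status codes and data layout are assumptions.

def missed_last_two(visit_status):
    """True if the two most recent expected visits were both missed.
    visit_status: statuses ('held' or 'missed'), in visit order, for
    visits whose windows have already closed."""
    return len(visit_status) >= 2 and visit_status[-2:] == ["missed", "missed"]

participants = {
    "CA-001": ["held", "held", "missed", "missed"],  # flagged: at risk of dropout
    "CA-002": ["held", "missed", "held"],            # resumed attendance, falls off the list
}
at_risk = [pid for pid, v in participants.items() if missed_last_two(v)]
print(at_risk)  # ['CA-001']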
It is helpful to report on missed visits both looking at each visit (month 6, month 12)
and looking overall (across all visits). An example of a table monitoring missed visits
by visit and by site is shown in Table 4. The first column would list the participating
sites. The second column would show how many randomized participants would have
been expected to have that visit, i.e., the number of randomized participants who have
been in follow-up through the end of that visit window, allowing for a data entry lag such as 2 weeks prior to the report being run. The third column shows the number and percent of expected visits held. The fourth column shows the number of visits known to have been missed, based on site reporting; it is useful to have an item at the beginning of a study’s visit form on the status of the visit (held or missed) to document cases where the visit window has passed and the form is not pending, i.e., the site confirms that the visit was not held. A fifth column can be used to document cases where the site has submitted a
form for that visit but the visit was held so far outside of the window that it is unlikely to
be used in analyses, e.g., the visit intended for Month 3 was held in the beginning of the
Month 6 target window. The sixth column would show counts of participants with an
unknown visit status, flagging cases where the visit form has not yet been submitted.
This table is appropriate for the study-wide weekly report; site personnel will also need the details for columns 4–6, so they are reminded of which participants missed the visit (column 4), know the IDs of those whose visit was done so far outside the window that it cannot be used as data for that visit (column 5), and have a pending visit form (column 6). If a study’s recommended visit windows are strict and the report shows a high proportion of visits as having been missed, it is useful for the weekly report to include two versions of these tables, one showing visits missing under the study’s strict visit window limits and one showing visits missing under a broader window indicating that the data are close enough to the target date to be used for some statistical analyses.

Table 4 Example of monitoring missed visits for a single visit, by participating site

Site            Expected   Held (%)      Missed   Outside window   Unknown
1 California    19         16 (84.2%)    1        1                1
2 Colorado      52         49 (94.2%)    2        0                1
3 Connecticut   18         16 (88.9%)    1        0                1
4 Delaware      41         38 (92.7%)    1        0                2
5 Florida       49         45 (91.8%)    1        2                1
6 Georgia       18         16 (88.9%)    1        1                0
7 Illinois      7          7 (100%)      0        0                0
Early detection of patients at risk for becoming drop outs (randomized patients who have stopped attending study visits) is critical. It is helpful to report on those who have missed their last two visits. Table 5 shows an example tallying this by site and identification of the last two visits missed. The IDs of those who have missed the last two visits should be provided to each site. Participants will “fall off” Table 5 and the listing when they resume visit attendance. Participants who have died (or been censored) are reported separately rather than in these tables, which are focused on dropouts.

Table 5 Participants who have missed their last two visits, by site, for successive visit pairs (each pair shows the number in follow-up and the number who missed both visits)

Site            In follow-up / Missed both (pair 1, pair 2, pair 3)
1 California    19 / 0     19 / 0     19 / 0
2 Colorado      47 / 0     44 / 0     37 / 0
3 Connecticut   16 / 0     16 / 0     15 / 1
4 Delaware      33 / 0     32 / 1     22 / 1
5 Florida       46 / 0     42 / 0     41 / 1
6 Georgia       15 / 0     13 / 0     13 / 0
7 Illinois      6 / 0      6 / 0      5 / 0
It is helpful to have participating site personnel investigate and provide an expla-
nation of why a participant has missed the last two visits. The process used to get this
information from the site ensures that the site realizes that the participant has missed
two visits and requires the site team to investigate. This can help detect cases where a
participant has moved, had an extended hospitalization or rehabilitation stay, or died, and will focus the site on retention at the individual participant level.
Sites should publish the efforts they take to enhance retention; such methods are more likely than recruitment and adherence methods to be applicable to other trials in other populations or other disease areas (Fewtrell et al. 2008).
When reporting on retention, if the number of patients available through the end
of a trial for full analysis of study safety and other outcomes (treatment adherence,
quality of life) differs from the number of patients available for the trial’s primary
outcome, both should be reported as was done in the BID Trial (Miskulin et al.
2018); in this study, fewer patients had data for the primary outcome of change in LV mass than were available for other study measurements because of difficulties with
scheduling and measurement of baseline and month 12 MRI. In studies with
mortality outcomes, trialists may be able to capture the primary outcome in more
patients than one can evaluate for other outcomes.
Each member of the study team has a role in promoting retention. At the participating site, study coordinators should engage the patient. As noted, the study coordinators and site investigators could set up a system such that reimbursement for expenses such as parking, as well as payments, gift cards, and other incentives, are provided to the patient. The data coordinating center should provide retention
feedback and facilitate discussion of retention on study coordinator calls and at
Steering Committee meetings, and the Study DSMB should highlight retention
issues and emphasize the importance of retention in its recommendations back to
the steering committee.
The challenges in getting the participant to comply with their study treatment
(section “Administration of Study Treatments”) and getting the participant to attend
study visits and provide data (section “Participant Follow-Up”) are related. These are
directly related on a patient level in that a patient who misses visits may also be
noncompliant with treatment. Difficulties with these two may go hand in hand at the
study or site level as well. A trial or a participating site in a trial that is having
significant trouble with adherence may also have difficulties in retention and vice
versa. Of course, it is hoped that those who discontinue treatment remain available
for follow-up and obtaining the primary result variable, but studies with a higher number of patients who do not comply with treatment may also have more patients who stop attending visits and providing follow-up data. Site personnel should be reminded often that if a patient says they will no longer follow study treatment per protocol, they should be encouraged to follow any level of treatment. If the patient will no longer accept any study treatment, they should be counseled on the importance of continuing to provide follow-up data. As noted, it may be helpful for the study leadership to provide a prioritized list so patients who are reluctant to provide full data can be asked to provide as much as possible, in order of importance.
Every study should follow up on drop outs in any way possible, even if all that can
be done is to check for vital status at the end of the study. Unless a patient has
withdrawn consent and refuses to allow the study to capture any information, it is
likely that at least some data will be available on most patients who drop out, and, as
noted, patients who stop attending visits may be quite willing to allow for passive
follow-up whereby their local medical charts are used to provide information on
blood pressure, lab measures, and hospitalizations, for example. The DCC should take care to report on these two types of protocol nonadherence as separate issues, so the study leadership and DSMB can consider why participants are not following treatment or why they have stopped attending visits, and so it is clear which patients who have discontinued treatment are available for continued follow-up for the primary outcome variable (and have the potential to return to adherence) and which patients are no longer willing to attend visits.
All of the steps in study design and organization leading up to the initiation of a
study are critical, yet once trial randomization begins, achieving the study’s planned
recruitment, adherence, and retention and accurately capturing data on these factors
are key to meaningful study results. Training sessions, meetings, and conference
calls of the Steering Committee, subcommittees, and study coordinators should
include agenda items focused on these areas of trial conduct. Treatment administra-
tion, treatment adherence, and participant follow-up data should be shared using
metrics and should be captured in real time, with the Data Coordinating Center
providing continuous performance monitoring of these data to study leadership and
to the participating sites themselves to optimize performance for valid study conduct
and complete and accurate data capture.
Key Facts
Cross-References
References
Ahmed I, Ahmad NS, Ali S, George A, Saleem-Danish H, Uppal E, Soo J, Mobasheri MH, King D,
Cox B, Darzi A (2018) Medication adherence apps: review and content analysis. JMIR Mhealth
Uhealth 6(3):e62. https://doi.org/10.2196/mhealth.6432
Booker C, Harding S, Benzeval M (2011) A systematic review of the effect of retention methods in
population-based cohort studies. BMC Public Health 11:249. https://doi.org/10.1186/1471-
2458-11-249
Brueton VC, Tierney JF, Stenning S, Meredith S, Harding S, Nazareth I, Rait G (2014) Strategies to
improve retention in randomised trials: a Cochrane systematic review and meta-analysis. BMJ
Open 4(2):e003821. https://doi.org/10.1136/bmjopen-2013-003821
Byar DP, Simon RM, Friedewald WT, Schlesselman JJ, DeMets DL, Ellenberg JH, Gail MH, Ware
JH (1976) Randomized clinical trials – perspectives on some recent ideas. N Engl J Med
295:74–80. https://doi.org/10.1056/NEJM197607082950204
Dayer L, Heldenbrand S, Anderson P, Gubbins PO, Martin BC (2013) Smartphone medication adherence apps: potential benefits to patients and providers. J Am Pharm Assoc (JAPhA) 53(2):172–181. https://doi.org/10.1331/JAPhA.2013.12202
Dunbar-Jacob J, Rohay JM (2016) Predictors of medication adherence: fact or artifact. J Behav Med
39(6):957–968. https://doi.org/10.1007/s10865-016-9752-8
Farmer KC (1999) Methods for measuring and monitoring medication regimen adherence in
clinical trials and clinical practice. Clin Ther 21(6):1074–1090. https://doi.org/10.1016/
S0149-2918(99)80026-5
Ferris M, Norwood V, Radeva M, Gassman JJ, Al-Uzri A, Askenazi D, Matoo T, Pinsk M, Sharma
A, Smoyer W, Stults J, Vyas S, Weiss R, Gipson D, Kaskel F, Friedman A, Moxey-Mims M,
Trachtman H (2013) Patient recruitment into a multicenter randomized clinical trial for kidney
disease: report of the focal segmental glomerulosclerosis clinical trial (FSGS CT). Clin Transl
Sci 6(1):13–20. https://doi.org/10.1111/cts.12003
Fewtrell MS, Kennedy K, Singhal A, Martin RM, Ness A, Hadders-Algra M, Koletzko B, Lucas A
(2008) How much loss to follow-up is acceptable in long-term randomised trials and prospective
studies? Arch Dis Child 93(6):458–461. https://doi.org/10.1136/adc.2007.127316
FHN Trial Group, Chertow GM, Levin NW, Beck GJ, Depner TA, Eggers PW, Gassman JJ,
Gorodetskaya I, Greene T, James S, Larive B, Lindsay RM, Mehta RL, Miller B, Ornt DB,
Rajagopalan S, Rastogi A, Rocco MV, Schiller B, Sergeyeva O, Schulman G, Ting GO, Unruh
ML, Star RA, Kliger AS (2010) In-center hemodialysis six times per week versus three times per
week. N Engl J Med 363:2287–2300. https://doi.org/10.1056/NEJMoa1001593
Garber M, Nau D, Erickson S, Aikens J, Lawrence J (2004) The concordance of self-report with
other measures of medication adherence: a summary of the literature. Med Care 42(7):649–652.
https://doi.org/10.1097/01.mlr.0000129496.05898.02
Gassman J, Agodoa L, Bakris G, Beck G, Douglas J, Greene T, Jamerson K, Kutner M, Lewis J,
Randall OS, Wang S, Wright JT, the AASK Study Group (2003) Design and statistical aspects of
the African American Study of Kidney Disease and Hypertension (AASK). J Am Soc Nephrol
14:S154–S165. https://doi.org/10.1097/01.ASN.0000070080.21680.CB
Gipson DS, Trachtman H, Kaskel FJ, Greene TH, Radeva MK, Gassman JJ, Moxey-Mims MM,
Hogg RJ, Watkins SL, Fine RN, Hogan SL, Middleton JP, Vehaskari VM, Flynn PA, Powell
LM, Vento SM, McMahan JL, Siegel N, D’Agati VD, Friedman AL (2011) Clinical trial of focal
segmental glomerulosclerosis (FSGS) in children and young adults. Kidney Int 80(8):868–878.
https://doi.org/10.1038/ki.2011.195
Isakova T, Ix JH, Sprague SM, Raphael KL, Fried L, Gassman JJ, Raj D, Cheung AK, Kusek JW,
Flessner MF, Wolf M, Block GA (2015) Rationale and approaches to phosphate and fibroblast
growth factor 23 reduction in CKD. J Am Soc Nephrol 26(10):2328–2339. https://doi.org/10.
1681/ASN.2015020117
Ix JH, Isakova T, Larive B, Raphael KL, Raj D, Cheung AK, Sprague SM, Fried L, Gassman JJ,
Middleton J, Flessner MF, Wolf M, Block GA, Wolf M (2019) Effects of nicotinamide and
lanthanum carbonate on serum phosphate and fibroblast growth factor-23 in chronic kidney
disease: The COMBINE trial. J Am Soc Nephrol 30(6):1096–1108. https://doi.org/10.1681/
ASN.2018101058
Kravitz RL, Hays RD, Sherbourne CD (1993) Recall of recommendations and adherence to advice among patients with chronic medical conditions. Arch Intern Med 153(16):1869–1878. https://doi.org/10.1001/archinte.1993.00410160029002
Meinert CL (2012) Clinical trials: design, conduct and analysis, 2nd edn. Oxford University Press,
New York
Miskulin DC, Gassman J, Schrader R, Gul A, Jhamb M, Ploth DW, Negrea L, Kwong RY, Levey AS,
Singh AK, Harford A, Paine S, Kendrick C, Rahman M, Zager P (2018) BP in dialysis: results of a
pilot study. J Am Soc Nephrol 29(1):307–316. https://doi.org/10.1681/ASN.2017020135
Morawski K, Ghazinouri R, Krumme A et al (2018) Association of a smartphone application with
medication adherence and blood pressure control: the MedISAFE-BP randomized clinical trial.
JAMA Intern Med 178(6):802–809. https://doi.org/10.1001/jamainternmed.2018.0447
Osterberg L, Blaschke T (2005) Adherence to medication. N Engl J Med 353:487–497. https://doi.org/10.1056/NEJMra050100
Piantadosi S (2017) Clinical trials: a methodologic perspective. Wiley series in probability and statistics, 3rd edn. Wiley, New York
Raphael KL, Isakova T, Ix JH, Raj DS, Wolf M, Fried LF, Gassman JJ, Kendrick C, Larive B,
Flessner MF, Mendley SR, Hostetter TH, Block GA, Li P, Middleton JP, Sprague SM, Wesson
DE, Cheung AK (2020) A randomized trial comparing the safety, adherence, and pharmacodynamics profiles of two doses of sodium bicarbonate in CKD: the BASE pilot trial. J Am Soc Nephrol 31(1):161–174. https://doi.org/10.1681/ASN.2019030287
Rocco MV, Lockridge RS, Beck GJ, Eggers PW, Gassman JJ, Greene T, Larive B, Chan CT,
Chertow GM, Copland M, Hoy C, Lindsay RM, Levin NW, Ornt DB, Pierratos A, Pipkin M,
Rajagopalan S, Stokes JB, Unruh ML, Star RA, Kliger AS, the FHN Trial Group (2011) The
effects of nocturnal home hemodialysis: the frequent hemodialysis network nocturnal trial.
Kidney Int 80:1080–1091. https://doi.org/10.1038/ki.2011.213
Sackett DL, Richardson WS, Rosenberg W (1997) Evidence-based medicine: how to practice and
teach EBM. Churchill Livingstone, New York
Santo K, Richtering SS, Chalmers J, Thiagalingam A, Chow CK, Redfern J (2016) Mobile phone
apps to improve medication adherence: a systematic stepwise process to identify high-quality
apps. JMIR Mhealth Uhealth 4(4):e132. https://doi.org/10.2196/mhealth.6742
Schwed A, Fallab C-L, Burnier M, Waeber B, Kappenberger L, Burnand B, Darioli R (1999) Electronic monitoring of adherence to lipid-lowering therapy in clinical practice. J Clin Pharmacol 39(4):402–409. https://doi.org/10.1177/00912709922007976
Stewart M (1987) The validity of an interview to assess a patient’s drug taking. Am J Prev Med
3:95–100
Trachtman H, Vento S, Gipson D, Wickman L, Gassman J, Joy M, Savin V, Somers M, Pinsk M,
Greene T (2011) Novel therapies for resistant focal segmental glomerulosclerosis (FONT) phase
II clinical trial: study design. BMC Nephrol 12:8. https://doi.org/10.1186/1471-2369-12-8
Weinberg JM, Appel LJ, Bakris G, Gassman JJ, Greene T, Kendrick CA, Wang X, Lash J, Lewis
JA, Pogue V, Thornley-Brown D, Phillips RA, African American Study of Hypertension and
Kidney Disease Collaborative Research Group (2009) Risk of hyperkalemia in nondiabetic
patients with chronic kidney disease receiving antihypertensive therapy. Arch Intern Med 169
(17):1587–1594. https://doi.org/10.1001/archinternmed.2009.284
Zullig LL, Mendys P, Bosworth HB (2017) Medication adherence: a practical measurement selection guide using case studies. Patient Educ Couns 100(7):1410–1414. https://doi.org/10.1016/j.pec.2017.02.001
17 Data Capture, Data Management, and Quality Control; Single Versus Multicenter Trials
Contents
Introduction
Data Capture
Data Management Life Cycle
Data Capture Methods
Case Report Form (CRF) Development
Data Management and Quality Control
Risk-Based Monitoring in Data Management
Data Quality Control Tools
Data Review
Data Management Plan/Data Validation Plan
CRF Completion Guidelines
System User/Quick Reference Guides
Training Site Staff
Site and Sponsor Communication
Data Management in Single Versus Multicenter Trials
Future Data Management Considerations
Summary and Conclusion
Key Facts
Cross-References
References
Abstract
Data capture, data management, and quality control processes are instrumental to the
conduct of clinical trials. Obtaining quality data requires numerous considerations
throughout the life cycle of the trial. Case report form design and data capture
methodology are crucial components that ensure data are collected in a streamlined
and accurate manner. Robust data quality and validation strategies must be employed
Keywords
Data management · Data collection · Data quality · Multicenter trial · Case report
form (CRF) · Risk-based monitoring (RBM) · Data review · Electronic medical
record (EMR) · Electronic data capture (EDC)
Introduction
One of the key components to a successful clinical trial is a strong foundation of quality
data. Data management and quality control measures ensure the accuracy and reliabil-
ity of the database used in analyses, which are imperative to the outcome of a clinical
trial. The goal of a data management program is to produce a clean dataset containing
no data entry errors or unexplained missing data points and assure that all the necessary
data to analyze the trial endpoints are collected consistently. The consequences of a
poorly designed or improperly implemented data management and quality control
program are manifested in additional burdens of time, resources, trial costs, and,
perhaps most importantly, in a failure to produce an accurate database for analysis.
Some important considerations for designing a data management program
include the method of data capture, design of case report forms and edit checks,
implementation of data management reporting tools to assess data quality, setting
clear expectations based on trial objectives and sponsor/investigator goals, develop-
ment of training and reference tools for trial stakeholders, and strategies for
conducting data review. In addition, the context of the trial must be considered,
such as whether it is conducted in a single center, multicenter, or network setting and
the platform, format, and methods for sharing trial data.
Data Capture
Data Management Life Cycle
Data management activities span the life cycle of a clinical trial. Ideally, data
management teams are integrated into the protocol development phase, during
which case report form (CRF) development and database design may begin. Review
and input from data managers (DMs) during this crucial stage may help identify and
reduce extraneous data collection and anticipate potential data management
challenges.
During implementation, data management documents are developed in conjunc-
tion with other trial management processes, and site training materials should be
developed. Data managers may be involved in the creation of system or CRF user
guides, in addition to establishing data management and data validation plans to
inform the collection and management of data throughout the trial.
At the time of activation and accrual, data management includes implementation
of data collection tools and early monitoring of data to identify potential trends or
issues. During trial maintenance, data management and quality control activities are
ongoing, and data managers typically utilize tools to detect anomalies, resolve
queries, retrieve missing data, and ensure the integrity of the trial database.
In preparation for trial analysis, data quality and cleaning activities may become
more targeted or focused on trial endpoint data to ensure the analysis can be
completed. During trial closure or in preparation for data lock, all remaining data
quality items are resolved or documented.
The method of data capture and data storage should be considered during the design
of the trial and prior to the development of CRFs to ensure that information is
efficiently collected. Potential data capture methods include traditional paper-based
data collection, electronic data capture (EDC), electronic health record (EHR)
integration, and external data transfer/upload or any combination of these methods.
While clinical trials historically used paper CRFs, there has been an increasing trend toward digital integration due to the enhanced quality control and real-time communication features available. EDC offers centralized data storage, which can speed analysis and distribution of results at the end of a trial. Additionally, the availability of tablets and other portable devices has made this a cost-effective and practical option.
An EDC system has become the gold standard for use in clinical trials: site staff or participants enter the data directly into the system, staff collect data on paper CRFs and then enter the data into the system, or data are transferred into the system through an upload. Many EDC systems contain additional tools for managing data quality, including features for real-time front-end data validation, shipment and specimen tracking, transmission of non-form or imaging data (e.g., blood/tissue samples, X-rays), report management and live data tracking, query resolution, adjudication of trial outcomes, scheduling for participant visits, and other trial management tools (Reboussin and Espeland 2005). Having a direct data entry system reduces the potential for transcription errors when the data are recorded in EDC and provides the ability to perform real-time data checks, such as those for values outside the expected ranges (e.g., a weight of 950 pounds). As worldwide availability of internet capabilities and prevalence of EDC systems have expanded over the years, traditional paper CRF data entry has diminished. In circumstances where EDC is the preferred method of data capture but internet access is unreliable or limited, offline data entry may be used to collect data, which are then transmitted once an internet connection is established.
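As a concrete illustration, the sketch below shows the kind of front-end range check an EDC system might apply at the point of data entry. The field names and limits are hypothetical examples, not drawn from any particular EDC product.

```python
from typing import Optional

# Hypothetical expected ranges; a real trial would define these per protocol.
EXPECTED_RANGES = {
    "weight_lb": (50, 500),     # flags implausible entries such as 950 lb
    "systolic_bp": (60, 250),
}

def range_check(field: str, value: float) -> Optional[str]:
    """Return a query message if the entered value is outside the expected range."""
    low, high = EXPECTED_RANGES[field]
    if not low <= value <= high:
        return (f"{field}={value} is outside the expected range [{low}, {high}]; "
                "please verify against the source document.")
    return None  # value passes the front-end check

print(range_check("weight_lb", 950))  # raises a query at the point of entry
```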
The technological enhancements available through EDC systems offer a significant advantage for multicenter trials in that they provide a mechanism for data managers to view the data across all sites in real time and perform quality control checks such as identifying missing values, performing contemporaneous data audits, or issuing queries. Specifically, this allows the DM to review current data across all sites to identify trends or potential process issues earlier in the data collection stream.
A variety of commercial EDC applications are available for use. Some are available as "off-the-shelf" software and are free of charge, although features may be limited. Other proprietary systems are available and may be customizable to the needs of a client. Many factors go into choosing the appropriate application, including the complexity of the trial and the number of participants and sites. It is often more appropriate for a multicenter trial to choose a customizable EDC, as it offers additional flexibility and reliability in data collection, storage, review, retrieval, validation, analysis, and reporting as needed. A small, less complex single-site or limited-resource trial may use a free commercial off-the-shelf (COTS) solution, as it still offers some of the important features, such as front-end validation, but may not have the capability for more complex reporting or customizations.
The design of the database will depend on the specific method of data capture and encompasses a wide range of activities. Regardless of the method, it is important to ensure that the database structure is comprehensive enough to meet all trial objectives while minimizing the amount of extraneous data captured. The
volume and complexity of data collected for a given trial should be weighed against
the relative utility of the information. If the data point being collected is not essential
to the outcome of the trial, consider the cost of the data collection burden to site staff,
in addition to the cost of cleaning the data, before including it.
A CRF, also sometimes known as a data collection form, is designed to collect the
participant data in a clinical trial. The International Conference on Harmonization
Guideline for Good Clinical Practice defines the CRF as “a printed, optical or
electronic document designed to record all of the protocol-required information to
be reported to the sponsor on each trial participant.” (ICH 2018). When implemented
in an EDC system, CRFs may be referred to as data entry screens or electronic CRFs
(eCRFs).
The thoughtful design of CRFs is fundamental to the success of the trial. Many
challenges in data management result from poor CRF design or implementation.
Designing a format for CRFs or data entry screens is important, and the basic
considerations are the same with paper or eCRFs. CRF development is ideally
performed concurrently with protocol development to ensure that trial endpoints are
captured and will yield analyzable data. Ideally, the statistical design section of the
protocol (or statistical analysis plan) will be consulted to map all data points to the
analysis to confirm the data required will be available.
At a minimum, CRFs should be designed to collect data for analysis of primary
and secondary outcomes and safety endpoints and verify or document inclusion and
exclusion criteria. When developing a schedule of assessments (i.e., visit schedule),
the feasibility of data collection time-points should be evaluated. The schedule
should include all critical time-points, while ensuring that the frequency of visits
and anticipated participant burden is considered. Furthermore, the impact of data
collection on site staff should be assessed. When possible, soliciting input on form
content from individuals responsible for entering data may identify problematic
questions and clarify expectations.
While the content is crucial for analysis, the structure and setup of the CRF are vital to collecting quality data. There are no universal best practices for form development,
although the Clinical Data Interchange Standards Consortium (CDISC) has made
significant progress toward creating tools and guidelines (Richesson and Nadkarni
2011). CDISC has also implemented the Clinical Data Acquisition Standards Har-
monization (CDASH) project, which utilizes common data elements and standard
code lists for different therapeutic areas to collect data in a standardized approach
(Gaddale 2015).
The primary objective of CRF design is to gather complete and accurate data. This is achieved by avoiding duplication of data elements and facilitating the transcription of data from source documents onto the CRF. Ideally, a CRF should be well structured, easy to complete without much assistance, and should collect data of the highest quality (Nahm et al. 2011).
Some basic principles in CRF development and design include the following:
• Identify the intended audience and data entry method (e.g., trial staff direct data
entry or electronic participant reported outcomes/ePRO) and style of the CRF
(interview, procedural, or retrospective). This will determine the question format
and reading level of the question.
• Standardize CRF design to address the needs of all users such as investigator, site
coordinator, trial monitor, data entry personnel, medical coder, and statistician
(Nahm et al. 2011). Review by all affected parties before finalization confirms
usability and ensures complete data elements.
• Organize data in a format that facilitates and simplifies data analysis (Nahm et al.
2011).
• Keep questions, prompts, and instructions clear and concise to assure that data
collection is consistent across all participants at various sites.
• Group related fields and questions together.
• Use consistent language across different CRFs in the same protocol and across
protocols. Avoid asking the same question in different ways (e.g., a field for “date
of birth” as well as a field for “participant age”) as data provided for these fields
may be inconsistent and creates additional work for the clinical sites.
Once data collection begins, the management and quality control of data become the
primary focus. However, the volume and pace of data collection may require data
managers to target data quality efforts, particularly when a trial is conducted in a
multicenter or network setting.
In the Food and Drug Administration Guidance for Industry "Oversight of Clinical Investigations – A Risk-Based Approach to Monitoring" (August 2013), the agency acknowledges that some data have more impact on trial results; therefore, a risk-based approach can be used. These critical data points include informed consent, eligibility for the trial, safety assessments, treatment adherence, and maintenance of the blind/masking (FDA 2013).
Risk-based monitoring (RBM) creates a framework for managing risks through identification, classification, and appropriate mitigation to support improved participant safety and data quality. Adopting a targeted RBM approach to data management may be appropriate in some settings and can provide significant advantages, including more efficient use of resources, without compromising the integrity of the clinical trial. In this approach, a range of metrics known as key risk indicators (KRIs) may be used in real time to identify areas of critical importance and are tracked to flag data that may need additional attention (at the participant, site, or trial level). KRIs may include protocol deviations, adverse events, missing values, missing CRFs, or other areas of concern. An example of a report for monitoring KRIs and identifying performance issues is shown in Fig. 1 below. In this table, the values are programmatically compared to predetermined standards and given a color code of green (indicating good performance), yellow (problem areas identified), or red (remedial action required). This allows for continuous monitoring in real time and increases the responsiveness of the clinical team in identifying patterns or trends that may affect the risk assessment of a site or trial, as well as in quickly correcting and preventing further issues.
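The color-coding logic described above can be sketched as a simple threshold comparison. The metric names, thresholds, and directions below are invented for illustration; in practice they would be prespecified in the trial's monitoring plan.

```python
# Hypothetical KRI thresholds: (yellow_limit, red_limit, higher_is_worse).
KRI_THRESHOLDS = {
    "missing_forms_pct": (1.0, 5.0, True),
    "recruitment_pct_of_target": (80.0, 60.0, False),
}

def flag(metric: str, value: float) -> str:
    """Map an observed KRI value to a green/yellow/red flag."""
    yellow, red, higher_is_worse = KRI_THRESHOLDS[metric]
    if higher_is_worse:
        if value >= red:
            return "red"
        return "yellow" if value >= yellow else "green"
    if value <= red:
        return "red"
    return "yellow" if value <= yellow else "green"

print(flag("missing_forms_pct", 1.5))         # -> "yellow": follow-up needed
print(flag("recruitment_pct_of_target", 43))  # -> "red": remedial action
```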
A sample report provides a framework for the general timeline and major
milestones throughout the protocol life cycle, from the time of protocol approval
to the publication of the primary outcome manuscript. This high-level overview
compares data from each trial to the defined expectations of the sponsor to highlight
trial performance (i.e., column displaying initial proposed dates versus column for
actual milestone dates). A column for current projections shifts in relation to current
data such as number of participants enrolled. Significant deviations from this
timeline highlight performance issues and identify the need for additional monitor-
ing (Fig. 2).
While review of all KRIs is important in a risk-based approach, RBM has
significant implications for data management, as data quality metrics may be used
to identify higher-risk sites or data management trends that warrant more frequent
onsite monitoring.
These might include percent of missing CRFs, missing data fields, outstanding
data queries, availability of primary outcome, and number of protocol deviations. If
any of these metrics lie outside the normal range, more frequent review of data
should be performed to determine if there are additional issues.
In an example of how RBM may be used to identify issues, a DM noticed that a site had exceeded the KRI metric for missing CRFs. Upon further investigation of the site's existing CRFs, several other issues were identified. CRFs had been completed at incorrect visits, and the audit history showed that site staff, rather than the participants themselves, had completed ePRO assessments that should have been entered directly into the ePRO system by the participant. The investigation revealed that the site staff had been completing the CRFs on behalf of the participants, which was a protocol violation. By using KRIs to flag a high-risk site, the DM was able to identify and mitigate greater process issues, which could have had a significant impact during analysis.
Fig. 1 Flags and triggers: overall and by site

Site | Recruitment: Overall | Recruitment: Prior 3 Months¹ | Missing Forms | Audits | Regulatory Issues | Primary Outcome: Overall | Primary Outcome: Prior 90 Days² | Treatment Exposure | Follow-up Visit Attendance
Site #1 | 90% | 78% | 0.0% | 0.13% | None | 66% | 73% | 66% | 71%
Site #2 | 100% | 180% | 0.0% | 0.49% | None | 61% | 58% | 70% | 72%
Site #3 | 70% | 63% | 0.1% | 0.60% | None | 65% | 78% | 73% | 61%
Site #4 | 84% | 42% | 0.0% | 0.29% | None | 41% | 50% | 64% | 59%
Site #5 | 100% | 69% | 0.2% | 0.31% | None | 79% | 71% | 71% | 70%
Site #6 | 90% | 115% | 0.0% | 0.50% | None | 73% | 89% | 77% | 82%
Site #7 | 60% | 43% | 1.5% | 0.27% | None | 85% | 91% | 95% | 81%
Site #8 | 150% | 100% | 0.0% | 0.93% | None | 71% | 78% | 60% | 45%
Overall | 88% | 79% | 0.5% | 0.45% | None | 69% | 72% | 71% | 68%

¹ Updated on the 1st of the month to show the percent of expected to actual randomizations over the previous 3 calendar months.
² Primary outcome availability for the prior 90 days, calculated as the percentage of collected to expected UDS in the past 90 days.
Fig. 2 Sample report: Basic Protocol Information and Timeline [Updated Monthly], with columns for the initial proposal, current projection, and actual milestone dates
In addition to using RBM, there are several different types of data management tools
that should be implemented to ensure the validity and integrity of data collected,
including a variety of reports and edit checks.
Reports are often developed to track trial metrics for data quality and assess
progress and can provide both real-time and summary information. These reports
can provide high-level data quality information to both the data management team
and site staff collecting the data. While the reports can be made available for review
at any time (e.g., via a website or EDC system), they should also be discussed or sent
to staff collecting the data points on a set schedule or at pre-determined time-points
so that site staff can address discrepancies identified and the data management team
can provide feedback and/or training for collection of data.
Reports used to track missing CRFs, missing data points, or numeric data entered
outside of an expected range (e.g., an unexpected date or a lab value that is not
compatible with life) should be implemented as standard tools to assist with data
management review (Baigent et al. 2008). These should be integrated within an EDC
system whenever possible to facilitate real-time review.
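A minimal sketch of such a missing-CRF report follows, assuming a hypothetical four-visit schedule and ignoring visit windows (a real report would only count forms that are actually due for each participant).

```python
# Hypothetical visit schedule and received-form data, for illustration only.
EXPECTED_FORMS = ["baseline", "week_4", "week_8", "week_12"]

received = {
    "P001": {"baseline", "week_4", "week_8"},
    "P002": {"baseline"},
}

def missing_crf_report(received_forms: dict) -> dict:
    """Return, per participant, the expected forms not yet received."""
    return {
        pid: [f for f in EXPECTED_FORMS if f not in forms]
        for pid, forms in received_forms.items()
    }

for pid, missing in missing_crf_report(received).items():
    if missing:
        print(f"{pid}: missing {', '.join(missing)}")
```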
Edit checks (also called validation checks or queries) are another tool to look for
data discrepancies. These are employed as a systematic evaluation of the data
entered to flag potential issues and alert the user. Ideally, these checks should be
issued to site staff in real time or on a frequent basis, to facilitate timely resolution.
Data managers are instrumental in writing edit check messages, which should
include a clear description of the issue and the fields (variables) involved in the
check, the reason it was flagged as potentially inconsistent, and indicate the steps
required for resolution. An edit check program can look at a single data point within
a single assessment, multiple data points within a single assessment, or multiple data
points across multiple assessments. These checks should be run frequently (or on a
set schedule) to identify inconsistencies in the data or data that is in violation of the
protocol (Krishnankutty et al. 2012; Baigent et al. 2008).
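The three scopes of edit check described above might be sketched as follows; the assessments, field names, and rules are hypothetical examples rather than any standard system's built-in checks.

```python
def single_field_check(vitals: dict) -> list:
    """Single data point within one assessment: value out of expected range."""
    issues = []
    if not 50 <= vitals.get("weight_lb", 0) <= 500:
        issues.append("Weight outside expected range; please verify.")
    return issues

def within_assessment_check(vitals: dict) -> list:
    """Multiple data points within one assessment: internal consistency."""
    issues = []
    if vitals.get("assessment_date", "") < vitals.get("visit_date", ""):
        issues.append("Assessment date precedes visit date; please verify.")
    return issues

def cross_assessment_check(baseline: dict, follow_up: dict) -> list:
    """Multiple data points across assessments: longitudinal consistency."""
    issues = []
    if follow_up.get("visit_date", "") <= baseline.get("visit_date", ""):
        issues.append("Follow-up visit dated on or before baseline; please verify.")
    return issues

# ISO date strings compare correctly as text.
visit = {"weight_lb": 950, "visit_date": "2023-04-01", "assessment_date": "2023-03-28"}
print(single_field_check(visit) + within_assessment_check(visit))
```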
Edit checks are typically conceptualized during CRF design, as it is important to
identify potential areas for discrepant data and minimize duplicate or potentially
conflicting data collection. Implementing edit checks early in the active data collection phase will allow detection of trends that may warrant changes to data entry or
retraining of site staff. A high priority early in the trial is to develop edit checks to
query baseline assessments and enrollment data. These discrepancies may uncover
problems with CRFs, misunderstandings at clinical sites, or problems with the trial
protocol that are critical to address. During the conduct of any trial, new checks are
often identified and programmed due to protocol changes, CRF changes, findings
during a clinical monitoring visit, or anomalies noted in trial-related reports.
Data Review
Throughout the conduct of a trial, it is expected that there is ongoing review of data.
This may be conducted by different stakeholders, with the goal of monitoring and
evaluating data collected in a contemporaneous fashion to identify potential
concerns.
Initiating communication with site staff following the enrollment of the first
participant is a simple review strategy to increase the likelihood of accurate and timely
data entry. It is during the first participant enrollment that sites first enact the written
protocol and trial procedures at their institution. Despite discussions regarding imple-
mentation and training prior to activation, the first participant enrolled is often when
sites first experience any challenges with the integration of trial procedures into their
standard practice. This is a critical time for the data manager to be engaged with the
rest of the site team to be sure that any issues are resolved and that all necessary data
are captured. Follow-up contact with site staff in this timeframe provides an opportu-
nity to communicate parts of the enrollment process that went well and those that
would be helpful to adjust. Immediately following the first participant enrollment, site staff are more likely to recall issues and any missing or hard-to-obtain information. This
feedback is extremely valuable to the data management team, particularly in multi-
center trials. The challenges encountered at one institution may be shared across
multiple sites. Discussing with sites allows for the trend to be identified and potentially
adjusted in real time so that other sites may avoid the same problems. Additionally,
touching base with the site at this early time-point provides the opportunity to
communicate reminders for upcoming assessments and trial requirements.
Another strategy for data review includes performing a data audit after a mile-
stone has been met, such as a percentage of accrual completed or a certain number of
participants reaching an endpoint. In this type of review, a subset of participants is
identified, and data cleaning procedures are performed to ensure all data submitted
through the desired time-point are complete and accurate. The subset of data may be
run through statistical programs or checks to verify the validity of the data collected
thus far. The goal of this type of review is to identify any systematic errors that may
be present. If any errors are identified, there is an opportunity to review the potential
impact and determine whether any changes to the CRFs or system are required.
Endpoint or data review committees may also be convened to provide an inde-
pendent assessment of trial endpoints or critical clinical or safety data. For some
trials where endpoints are particularly complex or subject to potential bias, an
independent review committee may provide additional assurance that trial results
are accurate and reliable. In the FDA’s Guidance for Industry: Clinical Trial End-
points for the Approval of Cancer Drugs and Biologics, it is noted that an indepen-
dent endpoint review committee (IRC) can minimize bias in the interpretation of
certain endpoints. If an endpoint committee is determined to be necessary for a trial,
a charter or guidance document should be in place prior to the start of the trial to
outline the data points that will be adjudicated by the committee, how the data will be
distributed for review, and when the data review will occur. In addition, the charter
should specify how “differences in interpretation and incorporation of clinical data in
the final interpretation of data and audit procedures” will be resolved (FDA 2018).
Depending on the duration of the trial, endpoint adjudication may occur on an
ongoing basis (e.g., as participants reach an endpoint that will be adjudicated) or may
be conducted at the end of the trial (e.g., once a predetermined number of partici-
pants reach an endpoint). The scope of the review is typically limited to the primary
or secondary endpoints of a trial but may include other clinically relevant data
points. A risk-based approach can also be taken for the endpoint review, with a
subset of cases reviewed and the committee adjourned if a certain concordance with
the reported data is met. For example, if independent review of an endpoint dem-
onstrates that committee review of data agrees with site-reported assessment in 95%
of cases, it may not be necessary to review data for every participant in the trial.
When this approach is used, the proposed concordance rate should be included in the
charter. To provide data for independent review, a data listing or similar format is
typically used to incorporate relevant information. Source documents (e.g., imaging,
clinical records) may be included as part of the review but must be appropriately de-
identified to protect participant information.
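A minimal sketch of the concordance calculation behind this stopping rule, assuming the 95% rate from the example above has been written into the charter and using invented outcome labels:

```python
def concordance(site_reported: list, adjudicated: list) -> float:
    """Proportion of reviewed cases where the committee agreed with the site."""
    agree = sum(s == a for s, a in zip(site_reported, adjudicated))
    return agree / len(site_reported)

# Hypothetical reviewed subset of cases.
site = ["progression", "response", "response", "progression", "response"]
committee = ["progression", "response", "response", "progression", "progression"]

rate = concordance(site, committee)  # 4/5 = 0.80
if rate >= 0.95:  # charter-specified concordance rate
    print(f"Concordance {rate:.0%}: review of every case may not be required.")
else:
    print(f"Concordance {rate:.0%}: continue review of all cases.")
```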
Several resource documents should be created at the start of the clinical trial to provide guidance to trial staff on expectations of the data capture system, collection of data within each of the forms/assessments, and general guidelines not captured in
the trial protocol. These documents should be available to trial staff prior to the start
of the trial and should be maintained and updated throughout the trial using version
control to address frequently asked questions and guidance or decisions on how to
enter specific data as needed (Nahm et al. 2011; McFadden 2007).
Ideally, the CRF completion guidelines should contain a table reflecting the
expected list of assessments per the schedule specified in the protocol (Nahm et al.
2011). There should be a section that clarifies when each assessment is expected, the
source data for the assessment, and how data entry will be completed. These sections
should also describe the intricacies of the CRF that are not immediately obvious by
reviewing the questions. This includes explaining any fill/skip patterns or logic that
may be used. For example, if a date is expected for an event, but several dates could
be applicable, the CRF completion guidelines should identify clearly how the correct
date should be determined and reported (McFadden 2007).
Having a central resource for all trial staff will help ensure that data are collected
in a consistent manner across participants and sites.
When electronic systems are utilized in the conduct of a trial, a system user guide
should be provided to assist staff with guidance on how to access and navigate
through the system. The guide should include a high-level overview of any data
collection system or other tools that will be used and provide more in-depth detail
about specific aspects of the data collection system (e.g., enrolling a participant using
an EDC system) or how to administer a specific assessment tool.
Depending on the complexity of the system being used, or for multicenter trials
which may have many participating institutions, it may be necessary to provide
additional resources or “quick reference guides” to facilitate data submission. This is
typically a short document that provides specific guidance on one or a few specific
tools, assessments, or systems, for example, a quick reference guide on how to
upload files to an electronic data capture system. The guide should provide specific
details and supplements more in-depth documents such as a user's guide or a CRF
completion guide. The goal is to ensure that any system user can quickly understand
key system features and expedite the training process.
Another area of data management support includes training of staff and system users.
Data management training may include instruction on CRF completion, data entry or
system navigation, query resolution, and trial-specific guidance. Training is an
ongoing activity; initial training is typically conducted at site initiation visits,
investigator meetings, or through group training or recorded module/webcast train-
ing modalities. There are many aspects to data collection, including regulatory
documents, safety, the method of data collection, biological and other validated
assessments, and possibly trial drug/intervention. At the beginning of the trial,
stakeholders involved in the creation of the protocol, assessments, system, and
overall trial guidelines should set up a detailed training for all site staff that will be
collecting data and administering the assessments.
Providing a training module for each area can provide a structured training to
staff before the trial start (Williams 2006). It is recommended that comprehension evaluations be completed for each module and that question-and-answer sessions be provided to allow trial staff time to review training and ask questions. Providing
trial staff with certification of completion on each module they are trained on
for their records helps ensure that trial staff are prepared for data collection
(Williams 2006).
As the trial progresses, it will be necessary to train additional site personnel and
perhaps provide retraining as data quality issues are identified or there are changes to
the trial due to protocol amendments or other updates. All initial training modules
and evaluations should be recorded and readily available for new staff or as a
refresher to existing staff throughout the protocol. Any supplemental training pro-
vided should also be recorded and readily available.
Training documentation includes the management of system access and mainte-
nance of user credentials, when electronic systems are used. It is important to ensure
that all users have appropriate access for their role and departing staff can no longer
access systems.
Traditionally, clinical research associates (CRAs) are involved in training of site
staff and perform on-site reviews to compare source documents to entered data,
monitor data quality, and verify training and regulatory documentation. They serve
as a front-line resource to sites, helping address any concerns early on and preventing them from recurring through the life of the protocol. Errors in data collection or sample
storage are often identified and corrected during monitoring visits and should be
communicated to data management staff to determine whether any changes to CRFs
or guidance documents are warranted. Repeated issues may also lead to updates to
training or user materials. Data managers and CRAs provide ongoing training support
throughout the trial and work closely together to manage the overall integrity of data
collection at sites.
The role of data management in clinical trials is essential. Setting up a trial to obtain quality data begins early, with the protocol, statistical analysis plan (SAP), and data capture method selection and design. Errors or poor judgment at this stage have a significant impact on the
process and may result in systematic errors that can compromise analysis and trial
results. Thoughtful case report form design is one of the most important aspects of
data management. Following commonly accepted principles of CRF creation and
utilizing standard CRFs whenever possible ensures streamlined data capture and
minimizes negative downstream effects.
Once the trial begins, the implementation of appropriate data quality tools such as
risk-based monitoring, reports, data validation checks, and data review are crucial
components to the data management program. These tools aid in identifying prob-
lematic data, provide information to support sites, ensure the consistency of data
collected, and provide an unbiased assessment of trial endpoints. Data quality reports
and checks should be updated frequently and implemented early. Additionally,
periodic data review is recommended as a method to ensure the integrity and validity
of data collected.
Setting clear expectations for all stakeholders is an important part of data
management. Data management and validation plans, as well as CRF and system
guidance documents, help provide valuable information regarding the flow of
data for the trial and how data should be entered correctly. These documents
should be supplemented by comprehensive training and a clear communication
plan to ensure understanding and agreement on data collection and quality
control measures.
Open and ongoing communication with sponsors and sites is necessary to
establish rapport and encourage collaboration for the duration of the trial. Scheduled
and ad hoc calls and meetings are helpful to build trust and facilitate discussion
regarding data quality issues. CRF and system training must be prioritized at the start
of a trial and is expected to continue throughout as new staff join or as there are
changes to CRFs or system features.
Clinical trials may be conducted in a variety of settings, depending on the nature
of the protocol and the trial objectives. There has been an increase in the number of
trials that are conducted through a multicenter approach to capitalize on centralized
resources and a larger participant population. Although the goal of a data manage-
ment program may remain the same, there are different considerations for single
center versus multicenter or network-conducted trials.
With advancements in technologies such as EMR-EDC integration, artificial
intelligence, and big data analytics, the landscape of data management is changing
rapidly. Utilizing principles of data quality assurance to manage new data sources
and applying understanding of data to large volumes of information will be imper-
ative to future data management programs.
Key Facts
Cross-References
References
Baigent C, Harrell F, Buyse M, Emberson J, Altman D (2008) Ensuring trial validity by data quality assurance and diversification of monitoring methods. Clin Trials 5:49–55
Chen Y, Argentinis JD, Weber G (2016) IBM Watson: how cognitive computing can be applied to
big data challenges in life sciences research. Clin Ther 38(4):688
Food and Drug Administration (2013) FDA guidance oversight of clinical investigations – a risk-based
approach to monitoring. https://www.fda.gov/downloads/Drugs/Guidances/UCM269919.pdf
Food and Drug Administration (2018) FDA guidance clinical trial endpoints for the approval of
cancer drugs and biologics: guidance for industry. https://www.fda.gov/downloads/Drugs/Guid
ances/ucm071590.pdf
Gaddale JR (2015) Clinical data acquisition standards harmonization importance and benefits in
clinical data management. Perspect Clin Res 6(4):179–183
Goodman K, Krueger J, Crowley J (2012) The automatic clinical trial: leveraging the electronic
medical record in multi-site cancer clinical trials. Curr Oncol Rep 14(6):502–508
International Conference on Harmonisation (2018) Guideline for good clinical practice E6(R2)
good clinical practice: integrated addendum to ICH E6(R1) guidance for industry. https://www.
fda.gov/downloads/Drugs/Guidances/UCM464506.pdf
Johnson K, Soto JT, Glicksberg BS, Shameer K, Miotto R, Ali M, Ashley E, Dudley JT (2018)
Artificial intelligence in cardiology. J Am Coll Cardiol 71:2668–2679
Khaloufi H, Abouelmehdi K, Beni-Hssane A, Saadi M (2018) Security model for big healthcare
data lifecycle. Procedia Comput Sci 141:294–301
Krishnankutty B, Bellary S, Kumar N, Moodahadu L (2012) Data management in clinical trial: an overview. Indian J Pharmacol 44(2):168–172
McFadden E (2007) Management of data in clinical trials, 2nd edn. Wiley-Interscience, Hoboken
Meinert CL, Tonascia S (1986) Clinical trials: design, conduct, and analysis. Oxford University Press, New York
Nahm M, Shepherd J, Buzenberg A, Rostami R, Corcoran A, McCall J et al (2011) Design and
implementation of an institutional case report form library. Clin Trials 8:94–102
Reboussin D, Espeland MA (2005) The science of web-based clinical trial management. Clin Trials
2:1–2
Richesson RL, Nadkarni P (2011) Data standards for clinical research data collection forms: current
status and challenges. J Am Med Inform Assoc 18:341–346
Williams G (2006) The other side of clinical trial monitoring; assuring data quality and procedural
adherence. Clin Trials 3:530–537
End of Trial and Close Out of Data Collection
18
Gillian Booth
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
Planning for Trial Closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
Stage 1: End of Recruitment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
Stage 2: End of Trial Intervention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
Stage 3: End of Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
Stage 4: Trial Reporting and Publishing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
Stage 5: Archiving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
Trial Closure Activities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
Developing a Trial Closure Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
Stage 1 End of Recruitment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
Stage 2 End of Trial Intervention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
Stage 3 End of Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
Stage 4 Trial Reporting and Publishing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
Stage 5 Archiving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
Early Trial Closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
Planning for Early Trial Closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
Communicating with Trial Participants Following Early Trial Closure . . . . . . . . . . . . . . . . . . . . 342
Individual Site Closure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
Abstract
Trial closure refers to the activities that take place in preparation for the cessation of
trial recruitment through to archiving of the trial. Trial closure can be notionally
divided into five stages: End of Recruitment; End of Trial Intervention; End of
G. Booth (*)
Leeds Institute of Clinical Trials Research, University of Leeds, Leeds, UK
e-mail: [email protected]
Trial; Trial Reporting and Publishing; and Archiving. The length and scheduling of each stage of trial closure are determined by the trial design and operating model. As trial closure approaches there is an increased emphasis on monitoring and controls to ensure that the correct number of participants is recruited and that data collection and cleaning in preparation for final database lock and analysis are complete. The End of
Trial is a key ethical and regulatory milestone defined in the approved trial protocol
and has associated time-dependent notification and reporting requirements to the
independent ethics committee and, for regulated trials, the regulator(s). Key steps of
trial closure, reporting, publishing, and data sharing are important mechanisms to
support transparency in clinical trials. A Trial Closure Plan can be used to support
the activities that ensure appropriate control over the final stages of the trial.
Keywords
Analysis · Archiving · Close out · Database lock · Data sharing · End of Trial ·
Public registry · Publishing · Reporting · Transparency · Trial Closure Plan
Introduction
Trial closure refers to the activities that take place in preparation for the cessation of
trial recruitment through to archiving of the trial. During this period a range of
different tasks take place at each of the physical locations (institutions) where the
trial is being conducted in order to bring the trial to an orderly and compliant close. The specific trial closure activities at each institution will depend upon the trial design and the role of each institution; however, the overarching aims remain the same.
Clinical trials each have different designs, participant pathways, and risk profiles; therefore, the trial closure activities undertaken can be adapted to ensure a risk-proportionate approach appropriate to the trial design and operating model (MRC et al. 2012). Where risk-proportionate approaches are utilized, these should be documented with the associated decisions in the trial risk assessment. It is also possible for many typical trial closure activities to be undertaken remotely rather than "on site," thereby permitting the most appropriate method to be utilized and ensuring the most efficient and effective use of resources.
Planning for Trial Closure

Planning in the lead up to trial closure and the development of a Trial Closure Plan are critical to ensure appropriate control over the final stages of the trial. A Trial Closure Plan will
include key milestones and deadlines relating to the responsibilities of each institu-
tion, with details of data items for on-site or central monitoring which can inform the
trial closure and analysis timelines.
However, it is not uncommon for trial timelines to change; for example, as a result
of slower than anticipated recruitment resulting in a delay to the recruitment closure
date. Some trial designs also include predefined stopping rules which allow the trial
to be stopped early; although the outcome of these prespecified reviews cannot be
predicted, planning for the various outcomes can and should still take place. It is less
common, but not unknown, for trials to be closed for unplanned reasons where the
opportunity for preplanning can be greatly reduced.
Trial closure activities (for all institutions involved in the trial) take place between
the cessation of trial recruitment to archiving and can be broadly divided into five
stages of activity (Fig. 1). While it is helpful to think about the stages of trial closure
in a linear, predictable way for planning purposes, in practice the design of the trial
may influence the length of each stage and whether the stages overlap. The period of
time between each stage will vary depending upon the length of time the trial is open
to recruitment and also the length of the individual participant intervention, data
collection, and follow-up periods. The design of the trial will also impact the overlap
between the different stages of trial closure; for example, in the case of an adaptive
platform multi-arm-multi-stage (MAMS) trial. A MAMS trial is a platform clinical
trial with a single master protocol where multiple interventions are evaluated at the
same time. Adaptive features enable one or more interventions to be “dropped” (e.g.,
due to futility) or added during the course of the trial. Different “arms” of the trial
will remain open to recruitment and intervention while others are closed, and there may be protocol defined analyses performed and reported prior to the End of Trial, thereby resulting in overlap of the different stages of trial closure.

Fig. 1 The five stages of trial closure

Stage 1: End of Recruitment. Completed: enrolment of trial participants. Continuing after this stage: protocol specified intervention and ordering of trial supplies; clinical assessments; safety monitoring; data collection; data cleaning; investigator site monitoring; statistical programming.

Stage 2: End of Trial Intervention. Completed: all trial participants have completed the protocol specified intervention(s). Continuing after this stage: clinical assessments; safety monitoring; data collection; data cleaning; investigator site monitoring; statistical programming.

Stage 3: End of Trial. Completed: data collection and clinical assessments; End of Trial Notification to regulator and ethics committee; substantial amendments no longer permitted. Continuing after this stage: data cleaning; statistical programming; statistical analyses.

Stage 4: Trial Reporting and Publishing. Completed: trial analyses; End of Trial Report to ethics committee, regulator, funder; results published in peer reviewed scientific journal; results made available to participants; results reported on public registry. Continuing after this stage: publications and presentations arising from the trial; hypothesis generation; data sharing for further research.

Stage 5: Archiving. Completed: essential documents and data prepared for archive; archive period agreed with Sponsor; investigator sites notified of end of archive period. Continuing after this stage: publications and presentations arising from the trial.
Stage 1: End of Recruitment

The end of recruitment is the point at which all trial sites are no longer permitted to enroll participants into the trial. The trial protocol will specify the sample size and the number of participants to be enrolled to achieve it, and will describe the recruitment pathway and related processes. In most trials there will still be participants receiving the intervention and undergoing clinical assessments, safety monitoring, and data collection after the trial has closed to recruitment.
Stage 2: End of Trial Intervention

The End of Trial Intervention is the point at which all participants enrolled into the trial have completed the trial intervention as specified by the approved trial protocol. Depending upon the trial design, it is likely that trial participants will still be undergoing clinical assessments, safety monitoring, and data collection after this time.
Stage 3: End of Trial

The End of Trial is a key ethical and regulatory milestone with associated time-
dependent notification requirements to the independent ethics committee and, for
regulated trials, the regulator(s). There may also be specific contractual requirements
for notification to other bodies, such as funders, at this time.
The End of Trial will be defined in the approved trial protocol; typically this will
be the date of the last “visit” of the last participant or at the time the last data item is
collected for the trial, that is, the point at which all clinical assessments, safety
monitoring, and data collection stops, although there may be different regulatory
requirements in different regions or countries. Preparatory activities for the End of
Trial will therefore focus on monitoring key data associated with the countdown
toward the End of Trial in addition to completing the data collection and cleaning
required for final database lock and analysis.
Stage 4: Trial Reporting and Publishing

Trial reporting and publishing of the trial results follow completion of the protocol specified trial analyses. These are two discrete activities:

– Reporting refers to the provision of a final report on the trial to the bodies responsible for its oversight, such as the independent ethics committee and, for regulated trials, the regulator(s), typically within 12 months of the End of Trial. Where a trial is intended to support a
regulatory submission (e.g., in support of a manufacturer’s license for a drug or
medical device) the final report will take the form of a Clinical Study Report with
supporting documentation and detailed datasets as required by the regional/country
regulator.
– Publishing refers to publishing trial results, irrespective of the trial outcome, in a
peer-reviewed scientific journal and tends to be an activity primarily, although not
exclusively, associated with academic-led research.
Trial reporting and publishing of the trial results are two of the four key mechanisms
to support transparency in clinical trials (Box 1). The overall aim of transparency in
clinical trials is to ensure that the participants of trials, doctors, the scientific community,
and the public have access to information about which trials have been conducted, how
they have been conducted, and the outcomes of those trials. This builds trust with
patients and the public, informs clinical practice by allowing access to all of the
available evidence about a particular treatment, and minimizes research waste by
ensuring the same trials are not repeated. Transparency is fundamental in meeting the
expectations of research participants, regulators, and the wider scientific community.
Transparency in clinical trials is typically understood to mean registering,
reporting, publishing, and making data from the trial available for further analyses
or for the purpose of undertaking an independent analysis of the trial results (Box 1).
Transparency measures are a regulatory requirement for some trials and a pre-
requisite for publishing in many high-profile scientific journals, for example, to
publish in some high impact journals (ICMJE 2019) the trial must be registered in
a Primary Public Registry prior to the start of recruitment.
– Making data from the trial available for further research purposes
(Data Sharing)
The quality control and curation of clinical trial datasets typically means
these are valuable resources which can be used for further research, such as
meta-analyses. Any further use of clinical trial datasets must be in line with
participant expectations and always legally compliant; this typically means
taking steps to anonymize a dataset prior to releasing to a third party.
For trial integrity reasons, data is not usually made available for further
research purposes until the protocol specified analyses have been completed
and reported/published.
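As an illustration of the anonymization step, the sketch below drops hypothetical direct identifiers and replaces the participant ID with an unlinkable random code. Real anonymization would follow a documented plan and also consider indirect identifiers (e.g., rare combinations of characteristics).

```python
import secrets

# Hypothetical direct identifiers to strip before release to a third party.
DIRECT_IDENTIFIERS = {"participant_id", "name", "date_of_birth", "postcode"}

def anonymize(records: list) -> list:
    """Remove direct identifiers and assign new random participant codes."""
    shared = []
    for rec in records:
        out = {k: v for k, v in rec.items() if k not in DIRECT_IDENTIFIERS}
        out["anon_id"] = secrets.token_hex(4)  # unlinkable without a key map
        shared.append(out)
    return shared

data = [{"participant_id": "P001", "name": "A. Smith",
         "date_of_birth": "1960-01-01", "postcode": "LS2 9JT",
         "treatment_arm": "A", "outcome": "responder"}]
print(anonymize(data))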
Stage 5: Archiving
Archiving is the storage and retention of the trial essential documents (ICH 2016) and
data produced in the trial. Retention periods may vary and are dictated by regulators,
sponsors, funders, or institute policies. For regulated trials, the purpose of archiving is
to ensure the records which demonstrate how the trial was conducted and the compli-
ance of all individuals and institutions involved in the trial with good clinical practice
(ICH 2016) and all relevant laws are available for audit or inspection purposes.
Trial Closure Activities

There can be many different types of institutions, groups, and individuals involved in
the day-to-day conduct of a trial, for example, pharmaceutical companies, clinical
trials units (CTUs), contract research organizations (CROs), laboratories, investiga-
tors, investigator sites, suppliers/vendors, and independent oversight committees.
Each institution, group, or individual will have agreed role(s) and an agreed set of
responsibilities which are usually defined in contracts, the trial protocol, and other
working documents. As trial closure approaches the lead institution responsible for
trial conduct will instigate the development of a detailed Trial Closure Plan (Box 2)
to ensure appropriate planning and control over the final stages of the trial.
When developing a Trial Closure Plan it is important to consider:
At Stage 5: Archiving
• Completion of the Trial Master File and Investigator Site File essential
documents and preparation of these and any data files (paper and/or elec-
tronic) for archive.
Communication
Good communication between the different institutions involved in the trial is key to
a successful trial closure; as such it is typical to see the frequency and format of
communications between the different institutions, groups, and individuals increase
and change in the run up to trial closure (Box 3).
The organization primarily responsible for trial conduct writes to all rele-
vant organizations/groups/individuals involved in the trial including the
funder, Sponsor, Independent Oversight Committees, Investigator Sites, and
suppliers to inform them that the trial has closed to recruitment. The letter
includes information such as:
– The date the trial closed to recruitment and the reason for the end of
recruitment
– A summary of overall recruitment for the trial and, where writing to an
individual investigator site, the recruitment summary for that individual site
– Key dates, such as the planned end of intervention period and the End of
Trial date
– A reminder of ongoing activities/obligations such as the management of
trial supplies and data collection
Fig. 2 Typical communication to institutions, groups, and individuals involved in the day-to-day conduct, funding, and independent oversight of the trial

Stage 1: End of Recruitment
• Date the trial closed to recruitment and the reason for the end of recruitment, in particular if the trial has closed early
• Summary of overall recruitment for the trial and, where writing to an individual investigator site, the recruitment summary for that investigator site
• Key future dates, such as the planned end of the intervention period and the End of Trial date
• Reminder of ongoing activities/obligations, such as the management of trial supplies and data collection
• Where the trial has closed early due to safety concerns, detailed instructions about how the treatment of participants should be stopped or changed, and how and when action should be taken and communicated to participants

Stage 2: End of Trial Intervention
• Key dates, such as the date the trial intervention delivery period ended, the End of Trial date, planned analyses, and final investigator site monitoring visits
• Reminder of ongoing activities/obligations, such as ongoing follow-up data collection; the reconciliation, return, or destruction of trial supplies/equipment; and maintenance of the Investigator Site File
• Chase for any outstanding essential documents required for the Trial Master File and to satisfy protocol and contractual requirements, such as trial logs
• Making provisions to destroy, or seeking appropriate authorisation to store, trial samples beyond the End of Trial
• Provision of information to trial participants about next steps and options once their trial treatment ends

Stages 4 & 5: Trial Reporting, Publishing, and Archiving
• Notification of the trial results to the regulator/independent ethics committee/contract partners
• Publication of the trial results
• Updating the relevant Public Registry
• Provision of information to trial participants about the trial results
• Providing permission to archive trial documents and notification of the end of archive date
Good communication will ensure that the institutions, groups, and individuals involved in the trial, including investigator sites, suppliers, independent oversight committees, and funders, are aware of the planned and final timing of each stage in the days, weeks, and months leading up to the event. It is good practice,
and indeed an essential part of the audit trail for regulated trials, to write to the
institutions, groups, and individuals involved in the day-to-day running, funding,
and independent oversight of the trial at each stage of trial closure to keep them
appraised of key dates, decisions, and as a reminder of any ongoing activities or
obligations (Fig. 2).
In trials involving supplies, such as drugs and devices, planning during trial setup
will ensure there are sufficient trial supplies for each participant to receive the
intervention as specified in the protocol. Indeed, any risks to the trial supplies
could be classed as a reportable event to the independent ethics committee or
regulator particularly where participant safety or the trial integrity are compromised
as a result of poor trial supplies management (in the European Union and United
Kingdom, such events occurring in regulated drug trials are called Serious Breaches and require expedited reporting to the regulator (MHRA 2018)).
Although careful monitoring and management of each individual participant and of trial supplies are important during trial recruitment, they become even more important toward the end of the intervention period to ensure efficient use of the trial supplies and to avoid over-ordering and waste, particularly where trial supplies are limited.
Typical activities at this stage include:

– Once all participants have completed the protocol defined intervention schedule, access to the systems for ordering new trial supplies will be revoked.
– Ring fence and retain the remaining trial supplies until remote or on-site moni-
toring activities have been completed to confirm correct use and accounting of the
trial supplies (Box 4). After any monitoring activities have been successfully
completed and any arising issues resolved, the Sponsor (or delegate) will give
permission to the investigator site(s) to either destroy or return unused trial
supplies to the Sponsor or supplier for destruction.
– In the case of low-risk trials (i.e., where the intervention was of no higher risk
than standard of care) there may be no accountability logs to monitor in which
case the investigator site(s) will be instructed to destroy surplus supplies or return
them to the Sponsor or supplier for destruction.
– The return of any specialist equipment to the Sponsor or supplier.
– Checking storage locations and temperature logs to verify that trial supplies
were stored and handled according to the manufacturers’ recommendations
and any deviations were notified to the Sponsor.
– Checking the records which detail the traceability and accountability of the
trial supplies, ensuring that trial supplies were not used for participants who
were not enrolled onto the trial.
– Checking logs and other records which verify that any equipment used was
appropriately calibrated and maintained.
Such activities may take place at the Investigator site, or written confirmation
or evidence in the form of logs or other paperwork from the Investigator Site
Pharmacist may be requested for remote review by the Sponsor or delegate.
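As an illustration of the kind of accountability check described above, a minimal sketch (with invented participant IDs and unit counts) might verify that units received balance against units dispensed plus units remaining, and that supplies went only to enrolled participants:

```python
def reconcile(received: int, dispensed_log: list, remaining: int) -> list:
    """Return discrepancy messages; an empty list means the log balances."""
    issues = []
    dispensed = sum(entry["units"] for entry in dispensed_log)
    if received != dispensed + remaining:
        issues.append(f"Units do not balance: {received} received vs "
                      f"{dispensed} dispensed + {remaining} remaining.")
    enrolled = {"P001", "P002"}  # hypothetical enrolled participants
    for entry in dispensed_log:
        if entry["participant"] not in enrolled:
            issues.append(f"Dispensed to {entry['participant']}, who is not "
                          "enrolled in the trial.")
    return issues

log = [{"participant": "P001", "units": 6}, {"participant": "P003", "units": 6}]
print(reconcile(received=20, dispensed_log=log, remaining=8))
```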
Trial participants may want to know what will happen to them once the trial intervention has closed, or how their treatment options may change as a result of participation in the trial. Depending upon the nature of the intervention and the length of the intervention period, it can be a long time between the point of consent and the end of treatment for an individual participant; therefore, it is good practice to prepare
information at the end of intervention for individual participants to serve as a
reminder of what will happen to them in terms of changes to their treatment, ongoing
clinical monitoring and how they can find out about or opt out of receiving
information about the trial results. This is also a good point in time to thank
participants for their contribution to the research.
– In trials where the timing of the analysis is linked to the event rate, the frequency of data collection or monitoring at the investigator sites may need to increase as the target approaches, with a greater focus on individual research sites' compliance in completing the relevant data collection forms (a simple scheduling rule of this kind is sketched below).
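As an illustration only, the sketch below (Python; the fixed base interval and the floor are hypothetical parameters, not from any guideline) shrinks the monitoring interval in proportion to the fraction of target events still outstanding.

```python
def monitoring_interval_days(events_observed: int, events_target: int,
                             base_interval: int = 28, min_interval: int = 7) -> int:
    """Shorten the data-collection/monitoring interval as the event target nears."""
    remaining = max(events_target - events_observed, 0) / events_target
    return max(round(base_interval * remaining), min_interval)

print(monitoring_interval_days(50, 200))   # early in accrual -> 21
print(monitoring_interval_days(190, 200))  # close to target  -> 7
```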
The mechanisms for reporting also differ by oversight body, region, and country;
this may be via a dedicated reporting system such as the European Union Portal/
EudraCT System, a registry or database such as ClinicalTrials.gov or simply a
standard form completed and emailed to the oversight body.
In practice the End of Trial notification does not mean that ethical and regulatory oversight ends immediately at this point; there is an obligation for the Sponsor to provide a written report within a fixed period of time to the independent ethics committee and, for regulated trials, the regulator. In addition, active regulatory inspection or contractual audit periods may extend many years after the End of Trial.
The End of Trial, regardless of definition, only occurs once; thus it follows that
for multicenter trials the End of Trial notification is made once the End of Trial has
occurred in all participating research sites and for multinational trials the notification
is made once the End of Trial has occurred in all participating countries.
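In data terms, this reduces to taking the latest site-level completion date across all sites and countries; a minimal sketch with hypothetical dates:

```python
from datetime import date

# Hypothetical last-participant-last-visit dates, keyed by (country, site)
site_completion = {
    ("GB", "site-01"): date(2021, 3, 14),
    ("GB", "site-02"): date(2021, 5, 2),
    ("DE", "site-11"): date(2021, 4, 20),
}

# The End of Trial occurs once: when it has occurred at every site in every country
end_of_trial = max(site_completion.values())
print(end_of_trial)  # 2021-05-02
```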
Once the official End of Trial notification has been made, substantial amendments are no longer permitted; therefore any amendments to the protocol or other authorized documents must be completed prior to the End of Trial being reached.
The End of Trial will be communicated in writing by the Sponsor or delegate to
all of the institutions and individuals which have been involved in the conduct of the
trial. There may also be specific contractual requirements for notification to other
bodies such as funders at this time.
Research Samples
Many trials include the collection of research samples, such as tissue, blood, or urine
samples which will be used for protocol defined analyses. Trial participants may also
be asked to consent to any samples which are collected being held in a tissue bank
and used for further research projects. The plan for the research samples after the End
of Trial will have been originally approved by the independent research ethics
committee and this approval must be adhered to. Where the plans have changed,
an amendment to the original ethical approval or alternative authorization by the
appropriate Authority/Regulatory Body will be needed in order to continue holding
the samples after the End of Trial. Depending upon the specific authorizations in
place, samples may need to be physically moved within or between institutions, for
example, to an authorized tissue bank.
– In trials where there are multiple different analyses taking place, or the analyses
only involve subgroups of participants, developing specific data management
plans directed to each trial analysis may be necessary. This could involve
identifying the specific data items and case report forms required for each analysis
and directing the investigator sites to prioritize certain case report forms for
completion or responding to certain data queries.
• Data items have been received and where they have not, there is a
documented reason why.
• Discrepant data have been queried with the investigator site and the queries
have been resolved.
• All on-site and remote monitoring activities have been completed as per the
Trial Monitoring Plan and any outstanding issues have been resolved.
• That the essential documents held at the investigator site are complete in
case of future audit or inspection.
• The site investigator(s) have confirmed the accuracy and completeness of
key data from their site.
• Any data coding, for example, of free-text fields and adverse events has
been completed.
• The linkage, cleaning, and reconciliation of datasets generated by other collaborators or parties, for example, laboratories or routine data providers, have been completed.
• Where required by the trial protocol, independent verification/adjudication
of outcome measures, for example, interpretation of clinical results has
been completed.
In common with all previous stages of trial closure, clear responsibilities and communication are critical in achieving a successful database lock, and the institution responsible for data management for the trial will take responsibility for this. For trials with large and complex datasets it may be necessary to implement a step-wise approach to final database lock (a small sketch of such a lock appears below), such as:
– Halting further data collection at investigator sites and focusing only on data
query responses
– Where remote data entry systems are in use, locking individual data collection forms, participants, or investigator sites to prevent further data entry or cleaning activities at the investigator site level, whilst continuing to permit time-limited cleaning activities by the organization or individual responsible for data cleaning (e.g., a Data Manager).
For data integrity reasons, it is important to ensure that, when locking the database in a remote data entry system, each individual research site retains read-only access to the database for its own data (MHRA 2018).
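The step-wise lock can be viewed as a small state machine over permissions. The sketch below is illustrative only (the four levels and their names are not a standard); it encodes the stages above, including the read-only access sites retain after the final lock.

```python
from enum import Enum

class LockLevel(Enum):
    OPEN = 1         # normal data entry and cleaning at the site
    QUERY_ONLY = 2   # no new data collection; sites only answer queries
    DM_CLEANING = 3  # site entry locked; central data management still cleaning
    LOCKED = 4       # final database lock

def permissions(level: LockLevel) -> dict:
    return {
        "site_enters_data":     level is LockLevel.OPEN,
        "site_answers_queries": level in (LockLevel.OPEN, LockLevel.QUERY_ONLY),
        "central_dm_cleans":    level is not LockLevel.LOCKED,
        "site_reads_own_data":  True,  # read-only access is retained after lock (MHRA 2018)
    }

print(permissions(LockLevel.LOCKED))
```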
Trial Reporting
The End of Trial starts the clock for reporting the summary results of the trial; the typical expectation is that these are reported onto the relevant public registry or regulator portal within 12 months of the End of Trial date in the United Kingdom and European Union, and within 12 months of the Primary Completion Date (onto ClinicalTrials.gov) in the United States (ClinicalTrials.gov 2017). There may be exemptions to the requirement or timeframe for reporting certain trials, for example, to protect commercial interests. In some countries there are also financial penalties for delayed reporting or failure to submit the report.
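As a worked example of this reporting clock (a sketch only; the actual deadline is defined by the applicable regulation), naive calendar-month arithmetic with a clamp for shorter months:

```python
import calendar
from datetime import date

def report_deadline(trigger: date, months: int = 12) -> date:
    """Date `months` calendar months after the trigger (End of Trial or
    Primary Completion Date), clamping the day for shorter months."""
    month0 = trigger.month - 1 + months
    year, month = trigger.year + month0 // 12, month0 % 12 + 1
    day = min(trigger.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

print(report_deadline(date(2020, 2, 29)))  # -> 2021-02-28
```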
The content of the End of Trial Report (sometimes called the Clinical Trial Summary Report) will take the form dictated by the relevant regional or national regulator or independent ethics committee and will typically include information such as:
Reporting summary results via an End of Trial report onto a public registry is an
important mechanism to support clinical trials transparency (Fig. 1). Within the United
Kingdom and European Union emphasis is also placed on the importance of providing
trial results in an appropriate format to the participants of the trial and the wider general
public via a lay-summary (European Commission 2017; HRA 2014).
In most cases, data will only be released in an anonymized form and, where released to another organization, will be further protected by a legally binding data release agreement.
Making clinical trials data available for further research and hypothesis genera-
tion is an important mechanism to support clinical trials transparency (Box 1). It is a
regulatory requirement in some countries for certain types of trials and an approach
Stage 5 Archiving
The documents collected throughout the life of a clinical trial which individually and
collectively permit the evaluation of the clinical trial and the quality of the data
produced are defined as essential documents (ICH 2016). These essential documents
serve to demonstrate the compliance of the Chief Investigator, Investigators, Spon-
sor, and other organizations, groups, and individuals involved in the conduct of the
trial with the standards of Good Clinical Practice (GCP) and with all applicable
regulatory requirements. They are therefore required to be archived for a period
defined by law (for regulated trials) or by the Sponsor (for all other trials) once the
trial has ended.
Essential documents are typically held across multiple institutions, which may be in different countries, and the documents may be held in either paper or electronic form. In a well-managed trial, how and where essential documents are stored will have been planned and documented up front and organized in line with standard operating procedures dictating paper or electronic file structures, or by using an online document management system with a standard file structure. A standard approach, particularly when used across all contributing institutions in the clinical trial, mitigates the risk of documents being lost, being unavailable for audit or inspection during or after the trial, or being archived or destroyed too early. This approach also makes planning for archiving significantly easier because documents can be easily located, collated, and organized prior to being put into the archive.
The Sponsor, or other organization responsible for trial conduct, will manage the overall planning for archiving, which will typically involve consideration of:
– The high costs associated with storing significant quantities of paper documents
over long periods of time
– Storage life, compatibility and accessibility of data, software, and information
technology in the future
– Factors outside of the Sponsor's control which might lead to an increase in the archive period, such as the risk of future litigation arising from the trial, or where the results are contentious or controversial or may inform national policy and so will be likely to undergo greater independent scrutiny over an extended period of time
Many trial protocols and designs will include provisions for monitoring the safety,
efficacy, or feasibility of the trial at pre-scheduled time points to inform whether it is
safe, ethical, and feasible to continue the trial beyond that point. Examples include:
The Sponsor may decide to close a trial early for various reasons, including poor
recruitment, withdrawal of the intervention, or safety/ethical concerns. The decision
to close a trial will usually include some form of independent oversight such as that
provided by an Independent Data Monitoring Committee or even in some cases the
regulator, depending upon the reason for closing the trial early.
Although it is not always possible to predict exactly if and when a trial will close early, it is usually possible to undertake some planning activities in the run-up to protocol-defined stop/go points, such as dose escalations or interim analyses, in case the decision is to temporarily or permanently halt the trial at that time. Early planning allows each likely scenario to be worked through in advance and can be particularly helpful where a plan of action will require rapid operationalization to protect the safety of participants on the trial (Box 7). Where early trial closure arises from interim safety or efficacy monitoring, or from new adverse safety information which alters the risk–benefit balance of the trial and the safety of current or future trial participants, the short timeframes involved mean careful preplanning cannot always take place; however, the typical planning activities detailed in Box 7 would still be applicable.
The most important consideration following early trial closure is to assess the impact on previous, current, and future participants and to understand how, and how quickly, their further participation in the trial will be affected by the decision.
Where a trial closes early as a result of a safety concern, communication to site investigators and participants would be expedited, with careful consideration of the most appropriate mechanism for communicating any new information to participants in a clear and sensitive way. Depending upon the risk involved, immediate action may need to be taken to withdraw the trial intervention; changes to the intervention, and communications to site investigators and participants, may need to be made immediately, that is, before seeking ethical or regulatory approval for the changes. In the European Union and United Kingdom such actions are reportable after the event as an Urgent Safety Measure, with ethical or regulatory approval for the actions obtained within a defined period of time after the action has been taken.
In all cases where a trial is closed early, participants should be provided with clear
information about what is happening, why, further treatment options (where appro-
priate), and ongoing follow-up data collection.
Trials usually include more than one investigator site; these trials are called multicenter trials. Prior to the End of Trial, an individual investigator site may choose to close, or may be closed following a decision by the Sponsor, independent ethics committee, or regulator; the reasons for this are varied. This is called individual site closure.
Site closure is achieved at an individual investigator site when (at that trial site):
Key Facts
– Trial closure refers to the activities which take place in preparation for the
cessation of trial recruitment through to archiving of the trial.
– Trial closure can be notionally divided into five stages: End of Recruitment; End of Trial Intervention; End of Trial; Trial Reporting and Publishing; and Archiving.
– The End of Trial is a key ethical and regulatory milestone defined in the approved
trial protocol and has associated time-dependent notification and reporting
requirements to the independent ethics committee and, for regulated trials, the
regulator(s).
Cross-References
References
ClinicalTrials.gov (2017) FDA 42 CFR Part 11 Final Rule for Clinical Trials Registration and Results Information Submission. Available via https://prsinfo.clinicaltrials.gov/ Accessed 17 October 2020
European Commission (2017) Summaries of Clinical Trial Results for Laypersons. Available via
https://ec.europa.eu/health/sites/health/files/files/eudralex/vol-10/2017_01_26_summaries_of_
ct_results_for_laypersons.pdf Accessed 17 October 2020
European Medicines Agency (2016) Clinical data publication. Available via https://www.ema.
europa.eu/en/human-regulatory/marketing-authorisation/clinical-data-publication Accessed 17
October 2020
International Committee of Medical Journal Editors (2019) Clinical Trials Registration and Data
Sharing Policies. Available via http://www.icmje.org/recommendations/browse/publishing-and-
editorial-issues/clinical-trial-registration.html Accessed 17 October 2020
International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human
Use (ICH) (2016) Guideline for Good Clinical Practice. Available via https://www.ich.org/page/
efficacy-guidelines Accessed 17 October 2020
Medical Research Council/Department of Health/Medicines and Healthcare products Regulatory
Agency (2012) Risk-adapted approaches to the management of clinical trials of investigational
medicinal products. Available via https://assets.publishing.service.gov.uk/government/uploads/
system/uploads/attachment_data/file/343677/Risk-adapted_approaches_to_the_management_
of_clinical_trials_of_investigational_medicinal_products.pdf Accessed 17 October 2020
Medicines and Healthcare products Regulatory Agency (2018) ‘GXP’ Data Integrity Guidance and
Definitions. Available via https://mhrainspectorate.blog.gov.uk/2018/03/09/mhras-gxp-data-
integrity-guide-published/ Accessed 17 October 2020
MRC Hubs for Trials Methodology Research (2015) Good Practice Principles for Sharing Individual Participant Data from Publicly Funded Clinical Trials. Available via https://www.methodologyhubs.mrc.ac.uk/files/7114/3682/3831/Datasharingguidance2015.pdf Accessed 17 October 2020
Official Journal of the European Union (2010) Detailed guidance for the request for authorisation of
a clinical trial on a medicinal product for human use to the competent authorities, notification of
substantial amendments and declaration of the end of the trial. Available via https://ec.europa.
eu/health/documents/eudralex/vol-10_en. Accessed 17 October 2020
UK Health Research Authority (2014) Information for participants at the end of a study: Guidance
for Researchers. Available via https://www.hra.nhs.uk/media/documents/information-partici
pants-end-study-guidance-researchers.pdf Accessed 17 October 2020.
World Health Organisation (2012) International Standards for Clinical Trial Registries. Available via https://apps.who.int/iris/bitstream/handle/10665/76705/9789241504294_eng.pdf?sequence=1 Accessed 17 October 2020
19 International Trials
Lynette Blacher and Linda Marillo
Contents
Introduction
Background
Challenges of Conducting Trials Internationally
Trial Coordination
Procedural Differences
Regulatory Approval
Investigational Medicinal Product Supply
Bio-materials
Monitoring/Auditing
Data Management
Enrollment
Trial Designs and Populations
Data Collection Strategies
Mitigation of Issues
European Union General Data Protection Regulation
Summary and Conclusion
Key Facts
Cross-References
References
Abstract
The number of international clinical trials being activated has increased greatly
over recent years. There are valid reasons for this global expansion, including a
need for greater numbers of subjects enrolled in as short a time as possible,
application to diverse populations, and the potential for cost reduction. However,
conducting trials internationally involves its own set of challenges related to
every aspect of trial conduct, from site activation to data management.
L. Blacher (*) · L. Marillo
Frontier Science Amherst, Amherst, NY, USA
e-mail: [email protected]
Keywords
International clinical trials · General Data Protection Regulation (GDPR) ·
Regulatory approval · ClinicalTrials.gov
Introduction
Background
The conduct of clinical trials in the international setting has increased greatly over recent years. ClinicalTrials.gov, a database of privately and publicly funded clinical trials conducted globally, lists 298,104 registered clinical trials conducted in 209 countries, with almost 16,000 trials including both US and non-US participants, and over 143,000 non-US trials (clinicaltrials.gov) (Fig. 1).
There are several benefits to conducting trials globally. Conducting trials in
multiple countries allows access to a greater number of potential trial participants.
This is particularly important now that research has become more targeted – in other
words, tailored to a population with specific characteristics (such as a certain gene
combination). Identifying participants with these characteristics may be challenging,
and expanding the potential pool to multiple countries helps to alleviate this issue.
Having access to a larger participant pool should speed up recruitment, which in
turn should lead to quicker realization of trial results, and ultimately benefit to the
greater population. Having participants from multiple countries allows greater diver-
sity in terms of ethnicity and disease characteristics and susceptibilities, allowing the
results to apply to a broader population. In some cases, participants who may not have had access to a certain treatment in their country can benefit from trial participation (Minisman et al. 2013).
Additionally, in theory, the cost of conducting the trial should decrease with quicker recruitment. Bringing a new drug to market can cost between $161 million and $2 billion (Sertkaya et al. 2014). A reduction in these expenses would make resources available for additional or new research.
Challenges can present themselves in all areas and stages of the trial. This chapter
will focus on the areas of Trial Coordination and Data Management.
Trial Coordination
Trial coordination refers to the oversight and coordination of the logistics of trial
activities. Challenges can arise in several areas.
Procedural Differences
Earlier in this chapter it was discussed that conducting trials internationally could potentially reduce costs. However, costs can vary between countries, making it much more expensive to include certain countries than others. This can be attributed to many factors, including research staff salaries, equipment expenses, fees for submission to Ethics Committees, and many more.
Infrastructure can also vary between countries. Areas that can pose challenges
should be evaluated when considering sites from a country for participation (Garg
2016). For example, do they have access to the proper equipment; is the equipment
in good condition and of the needed standard? Do they have appropriate storage procedures and facilities for the study drug, including a secured area, a reliable freezer, and a tracking process for receipt, distribution, and destruction or return of drug?
Quality standards must also be evaluated, as there can be variance between
countries. Do the normal SOPs and procedures in place meet the standard expected
for the trial? Do staff receive adequate training and guidance during trial conduct?
Are staff fully qualified and skilled to perform the necessary procedures and docu-
ment the research?
And most importantly, does the standard of care in the country lend itself to the trial requirements (Bogin 2016)? If certain procedures dictated by the protocol are
not standard, will the principal investigator and research staff be able to perform
them? Will enough participants be willing to undergo nonstandard procedures?
Regulatory Approval
The area that presents one of the more time-consuming challenges is “activating” a
site to be able to participate in a trial. Activation involves many steps, including
obtaining approval of the protocol and its related documents from regulatory bodies
for each site. Various regulatory bodies may be involved including Ethics Commit-
tees (ECs)/Institutional Review Boards (IRBs), which are independent bodies that
review and approve/disapprove research proposals for human participants. For
certain countries, the trial protocol may also need to be submitted to/receive approval
from Competent Authorities (CA)/Health Authorities (HA) (authorities that review
submitted clinical data and those that conduct inspections), Data Protection Agen-
cies (authorities responsible for upholding the right of data privacy), and/or individ-
ual Hospital Management. Additional review may be needed depending on the trial
treatment or procedures (e.g., if the trial involves radioactive substances or trans-
plant) (campus.ecrin.org).
There may be multiple ECs/IRBs involved as well. Some countries (or states/
regions) have a Central or Lead EC/IRB that performs complete review of the
protocol, and the local ECs can adopt the decision of the Central/Lead EC. In
other cases, local EC approval may be required in addition to the approval of the
Lead EC, though this is usually a simplified review involving site-specific aspects.
Submission to ECs and CAs may occur in parallel or sequentially, depending on
the country. The timeline for review and the fees associated with submission also
vary. Even the submission platforms are different, from paper to CD to entry in an
official database. Table 1 demonstrates these variances between a subset of European
countries (campus.ecrin.org):
To illustrate the process in more detail, we will take Switzerland as an example.
The CA for Switzerland is Swissmedic. (An additional CA, Bundesamt für Gesund-
heit (BAG)/Federal Office of Public Health (FOPH), is involved for trials with
radioactive substances or transplant products.) The EC, Swissethics, is an associa-
tion of the 9 cantonal (i.e., regional) ECs within Switzerland. Some cantonal ECs are
Investigational Medicinal Product Supply
The process to provide an Investigational Medicinal Product (IMP) to the site, and
ultimately the participant, involves a series of steps, as well as several parties. This
process is referred to as the clinical supply chain and follows the IMP from the manufacturer through distribution centers and local depots to sites and participants. It is estimated that supply chain logistics account for 25% of pharmaceutical research and development costs, in part due to the globalization of trials (Fisher Clinical Services).
Adding an international component can increase the complexity of the process, and
issues can develop at or between any of these locations (Arnum 2011).
Appropriate logistics are essential to ensure the timely delivery of the IMP and
any comparator product (current standard-of-care therapy). All parties involved must have the knowledge, experience, and footprint to meet the needs of a global setting.
Any delay or shortage can cause delay of the start of a trial, or potentially even halt
an ongoing trial. This, in turn, can affect the well-being and safety of participants.
It is important to start planning early, but not too early. Logistics should
be discussed while the protocol is under development (Fisher Clinical Services).
However, implementation should begin only once it is relatively certain there will not be changes to the protocol or to contracts with Research Groups or sites. For example, if a country decides not to participate, or a new country is added, after the labels, distribution routes, and depots have been planned, these areas will need to be re-planned. Labels that include the IMP dose would also need to be changed if the IMP dosage was changed in the protocol.
Logistical hurdles are many. Differing regulations between countries impact all
areas of the process and considerations must be given to a variety of factors:
Availability of IMP – Sometimes a drug may have approval for the indication in
certain countries and not in others. This could limit the number of countries that
participate in the trial, as the patients already have access to the drug. This may result
in slower recruitment, or not enough potential patients to conduct the trial. In other
cases, the IMP may receive indication approval in one or more countries during the
conduct of the trial, and these sites may cease recruitment.
There can also be the special case when a trial has concluded and the participant is
still doing well, but drug is no longer supplied by the trial. In countries where it has
been approved for the indication, the participant will be able to receive drug through
standard mechanisms. In countries where the IMP does not have indication approval,
an avenue of compassionate use may have to be pursued for these participants.
Compassionate use allows individuals who are seriously ill and have no standard
treatment options available to be treated with an IMP.
Forecasting – Underestimating the need for drug (e.g., in the case of faster recruitment or higher retention than expected) leaves participants without supply, whereas overestimating (in the case of slower recruitment or lower retention) leads to unused material that is ultimately wasted, which again is a cost issue. The differing infrastructure and working patterns within the countries can impact forecasting as well (Fisher Clinical Services). If proper procedures are not in place, the sites may not relay the necessary supply information to the sponsor and/or supplier to accurately calculate the IMP need.
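A toy sensitivity check (all figures hypothetical) makes the forecasting risk concrete: the same overage fraction yields very different kit counts under slow versus fast recruitment, and the gap between scenarios is the waste or shortage exposure.

```python
import math

def forecast_kits(recruits_per_month: float, months: int,
                  mean_kits_per_participant: float, overage: float = 0.15) -> int:
    """Expected kit demand plus an overage buffer against shortfall."""
    expected = recruits_per_month * months * mean_kits_per_participant
    return math.ceil(expected * (1 + overage))

for label, rate in [("slow", 8), ("planned", 10), ("fast", 13)]:
    print(label, forecast_kits(rate, months=18, mean_kits_per_participant=6))
# slow 994, planned 1242, fast 1615 -- the spread is the waste/shortage exposure
```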
Package Labeling – When planning labeling, participating countries must be
selected early enough to ensure they are included on the IMP label in time for
printing. The proper languages for each country must be determined, and translations
validated. Additionally, authorities in each country may require specific terms to be
used (Miller 2010).
If there are multiple countries, booklet labels to hold the volume of information
may be useful, though they require additional time for production and printing.
“Back-up” countries should also be anticipated in case one or more countries decide
not to participate. Should this happen and no back-up countries have been planned,
Bio-materials
• Obtaining materials: Each trial will have specific requirements for the materials.
However, each country has specific regulations regarding the type of materials
that can be provided to another entity, and specifications may differ even between
hospitals/medical institutions within the same country. For example, a block of
tumor may be required for review, analysis, and/or bio-banking for a trial, but
some countries may be allowed to provide only slides of the tumor material. In
this case the material would not be sufficient to meet the requirements of the trial,
and patients from these countries would be excluded from participation.
• Shipping materials: Materials are required to be sent to a central laboratory or
biobank. As was seen with the IMP supply chain process, issues may arise due to
the shipping regulations for each country, including special requirements; for
example, the Ministry of Health must provide permission in Australia, Russia,
and Brazil (export.gov; Fisher Clinical Services). Some countries, such as China,
do not allow export of bio-materials. There can also be confusion regarding the classification of the material, for example, a courier's misconception that pathology material is hazardous.
• Retaining materials: Length of pathology material storage differs by trial. Mate-
rials may be needed only for central review (in which case the materials would be
returned after a specified period), or for future use (in which case materials would
Monitoring/Auditing
• Compliance with Good Clinical Practice (GCP) and regulatory requirements (ich.
org)
• Compliance with the trial protocol and procedures
• Accurate and timely data collection
• Appropriate facilities, staff qualifications, and investigator oversight
• Communication between stakeholders
• Protection of patient safety and well-being
Although monitoring and auditing have the same goals, there is a key difference
between them, in that monitoring is a quality control function and auditing is a
quality assurance function. Monitoring refers to the performance of ongoing over-
sight and operational checks to verify processes are working as intended and in
accordance with the protocol, standard operating procedures (SOPs), GCP, and the
applicable regulatory requirements. Auditing refers to the systematic and indepen-
dent examination of all trial-related activities and documents, to determine if they
were conducted according to the protocol, SOPs, GCP, and the applicable regulatory
requirements. An audit is designed to improve the effectiveness of processes
(Ruppert 2007).
The international setting presents several similar challenges to both monitoring
and auditing. Both processes rely heavily on communication, and language barriers
can be an issue. If the trial is not conducted using a primary language, there may be
the need for a monitor/auditor to be fluent in multiple languages in order to cover
multiple countries; these qualified staff may be more costly and/or more difficult to
find. If multilingual staff are not available, alternatives include hiring several mon-
itors/auditors, each fluent in a different language, or hiring translators.
Scheduling visits may be problematic due to work patterns, holidays, and
religious observances. Additionally, the need to be sensitive to cultural differences
is even more important in the type of face-to-face interaction that takes place
during an on-site monitoring or audit visit. Monitors and auditors also need to be
aware of country-specific regulations regarding how GCP is interpreted and
implemented.
Site monitoring can account for up to 40% of the cost of a clinical trial (Sprosen
2017). The cost of on-site visits particularly can be exacerbated in the international
arena due to the extensive travel. There has been a shift to a risk-based monitoring approach due to the increased number, complexity, and globalization of clinical trials (Beauregard et al. 2018). This approach involves assessing risk, impact, and mitigation within the monitoring strategy. The goal is to focus on critical areas that relate to patient well-being, safety, and privacy. A risk-based monitoring plan often employs
the use of more centralized monitoring (i.e., remote evaluation) where appropriate, to
reduce the frequency and cost of on-site visits.
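As one illustration of a centralized-monitoring signal (hypothetical data; a real monitoring plan would use more robust statistics and multiple indicators), the sketch below flags sites whose per-participant adverse-event reporting rate deviates markedly from the pool; unusually low rates are a classic trigger for a targeted on-site visit.

```python
import statistics

site_ae_rates = {  # hypothetical AEs reported per enrolled participant
    "site-01": 0.9, "site-02": 1.1, "site-03": 1.0,
    "site-04": 0.2, "site-05": 1.05,
}

def flag_sites(rates: dict, z_threshold: float = 1.5) -> dict:
    """Return z-scores for sites deviating beyond the threshold."""
    mean = statistics.mean(rates.values())
    sd = statistics.stdev(rates.values())
    return {site: round((r - mean) / sd, 2) for site, r in rates.items()
            if abs(r - mean) > z_threshold * sd}

print(flag_sites(site_ae_rates))  # {'site-04': -1.75} -- candidate for on-site review
```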
Although there are several benefits to central monitoring, one potential drawback
can be that site personnel may believe they are not receiving the same level of
support as provided during an on-site visit. It is much easier to develop a rapport with
face-to-face interaction rather than through phone calls and email.
The concept of the remote “visit” has extended to auditing as well (Cobert 2017).
Several considerations need to be given to up-front preparation, including:
Though the cost of travel is reduced with remote monitoring and auditing, there
are still inherent costs in arranging and conducting remote visits.
Data Management
Challenges and costs can arise during data management activities as well.
Enrollment
• Discordant couples
• Index Case/Households
• Index Case/Caregiver
• Perinatal (Mother/Child)
Unique to the IMPAACT Network are trials involving perinatal populations in sub-
Saharan Africa, Brazil, India, and Thailand. To accomplish the rigors of enrollment
and follow-up, the mother and her fetus generally enroll between 28 and 35 weeks
gestation. Both the mother and fetus are assigned a unique participant ID that keeps
personally identifiable information to a minimum. The fetus is automatically assigned
the same race and ethnicity as the mother. The fetus is considered on study at the time
of enrollment, but the clock does not start for the baby until birth, whereas data
collection on the mother begins from the time of enrollment. If the birth outcome is not viable, only the date of miscarriage or stillbirth is collected for the baby; all other information is collected as adverse events for the mother (impaactnetwork.org).
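A minimal data-structure sketch of these enrollment rules (Python; field and function names are illustrative): both mother and fetus receive coded IDs, the fetus inherits the mother's race and ethnicity, and the baby's follow-up clock starts only at birth.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Participant:
    pid: str        # coded participant ID; no direct identifiers
    race: str
    ethnicity: str
    enrolled: date  # both mother and fetus are on study from enrollment
    follow_up_start: Optional[date]  # mother: enrollment; baby: date of birth

def enroll_pair(mother_pid: str, fetus_pid: str, enrolled: date,
                race: str, ethnicity: str):
    mother = Participant(mother_pid, race, ethnicity, enrolled, enrolled)
    fetus = Participant(fetus_pid, race, ethnicity, enrolled, None)  # clock starts at birth
    return mother, fetus

def record_birth(baby: Participant, dob: date) -> None:
    baby.follow_up_start = dob  # data collection on the baby begins here
```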
Study CRFs are designed to be completed by site staff and incorporate only
elements necessary to meet trial design questions. These should be developed in the
primary language of the protocol and the protocol team members, usually English. Care
should be taken to minimize repetitive questions on separate CRFs thus avoiding
potential inconsistencies between responses. The placement of questions should also
be considered to provide a logical flow of responses and grouping of like data elements.
Participant questionnaires should be presented in the language of the enrolled
participant. There are qualified translation services available but mostly for Euro-
pean, Chinese, and Japanese languages. For other ethnic or tribal languages, there
would be reliance on local staff to provide the translation and back translation,
utilizing separate staff to perform these activities. This can be tedious, repetitive, and
time-consuming for both the sites and for the DMC staff required to verify the
translations, but aids in maintaining a consistent method of presenting study con-
cepts and questions, and eliciting responses from participants. One problem to avoid
is asking open-ended questions that would require text replies which have to be
translated back into English before being entered into the study database. This is
particularly important to maintain privacy when collecting sensitive information the
participant would not expect to be shared with site staff.
Questionnaires may be collected on paper and submitted to the data center via
mail or facsimile for data keying, or through an online Internet package that prompts
the user to provide responses to the questions – these responses are saved and
downloaded to the database. Since the completed local language form will be
submitted, the formatting of the questions and responses should align with the
English versions to ensure the data are entered into the study database appropriately.
Laboratory test data CRFs should be designed to capture the units of measurement used in the local laboratory, allowing each site to report results as collected and measured. Upper and lower limits of normal, as well as results, should have data fields large enough to capture abnormally high or low results.
Ultimately, careful thought should go into the design of the data collection
instruments to mitigate confusion about the goals of the study and minimize repet-
itive questions. The database should be developed in conjunction with the designing
of the CRFs. A robust EDC system will have built-in validity and QA/QC checks,
whereas data submitted via paper will be centrally checked further downstream in
the data submission process.
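A built-in edit check of the kind described might look like the following sketch (accepted units and plausibility thresholds are illustrative): local units and normal limits are accepted as reported, and implausible combinations raise data queries rather than being rejected.

```python
def lab_edit_checks(result: float, lln: float, uln: float, unit: str,
                    accepted_units=("g/dL", "g/L")) -> list:
    """Return data queries for a locally reported lab value (does not reject it)."""
    queries = []
    if unit not in accepted_units:
        queries.append(f"unit '{unit}' not on the accepted list -- confirm with site")
    if lln >= uln:
        queries.append("lower limit of normal not below upper limit -- transcription error?")
    elif not (lln / 2 <= result <= uln * 2):
        queries.append("result implausibly far outside the normal range -- check value and unit")
    return queries

print(lab_edit_checks(result=135.0, lln=12.0, uln=16.0, unit="g/dL"))
# -> flags a likely g/L value entered against g/dL limits
```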
Mitigation of Issues
At the beginning of this chapter, we discussed that the potential for reduced cost was
one of the factors leading to the globalization of clinical trials. However, the
numerous challenges of the global arena come with costs of their own. One needs
to employ various mitigation approaches to balance these costs.
It is important to begin with a proactive risk-based strategy for conduct and
monitoring of the trial. This allows any potential hurdles to be identified in advance,
and plans put in place to reduce or eliminate their impact. A risk-based strategy will
also lessen the chance of emergency or crisis situations arising; and if they do arise,
there will already be a plan for addressing them.
Methods to mitigate risk include, wherever possible, creating simplified pro-
tocols, employing user-friendly data collection tools, and developing streamlined
procedures for trial activities. It is extremely important to train – and re-train – the
research team in these areas. This will ensure everyone has the same interpretation of
the protocol and procedures. Requiring a primary language be used for the conduct
of the trial, and mandating that all sites have at least one person on the trial team that
speaks this language, can also reduce the potential for misunderstandings.
The proper selection of partners and collaborators is also important for risk
mitigation. A site feasibility evaluation should be conducted before accepting a
site for the trial, and vendors should be thoroughly researched and vetted. Partnering
with experienced and dedicated collaborators allows for more efficient trial conduct.
A quality management system should be put in place to monitor each area of trial
conduct. The use of technology can greatly aid in this oversight. For example,
tracking systems can be utilized for IMP and bio-materials, and metrics reports
can be created for areas in site performance, such as length of time to activation, data
submission timeliness, query resolution, critical data items, protocol deviations, etc.
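For example, a query-resolution metric of the kind listed above could be computed as in this sketch (data and threshold hypothetical):

```python
from statistics import median

query_resolution_days = {  # hypothetical: days taken to resolve each query, per site
    "site-01": [3, 5, 8, 2],
    "site-02": [15, 22, 30, 9],
}

def site_metrics(data: dict, slow_threshold: int = 14) -> dict:
    return {
        site: {
            "queries": len(days),
            "median_days": median(days),
            "pct_slow": 100 * sum(d > slow_threshold for d in days) / len(days),
        }
        for site, days in data.items()
    }

print(site_metrics(query_resolution_days))
# site-01: median 4.0 days, 0% slow; site-02: median 18.5 days, 75% slow
```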
The most important facet of risk mitigation, which should be started at the very
beginning of the trial, involves communication. It is critical to develop a rapport and
understanding with all stakeholders. Establishing a clear communication pathway is
key to the conduct – and ultimately the success – of the trial.
European Union General Data Protection Regulation
A chapter on international clinical trials would not be complete without some discussion of the General Data Protection Regulation (GDPR). The GDPR came into effect in the European Union (EU) in May 2018, replacing the previous EU Directive 95/46/EC on data protection (eur-lex.europa.eu). The primary purpose of the regulation is to harmonize data protection and privacy laws across EU countries. The regulation covers the protection of natural persons with regard to the processing of personal data and the free movement of such data, and applies to all organizations that process personal data of EU subjects, even if the organization is not in the EU.
The main principles of the regulation emphasize transparency of data processing,
legitimate use of data, minimization of data collected (e.g., minimum required for
legitimate use), accuracy of data, security of data, subject consent to use of data, and
limitation for data retention (e.g., retain data only for the length of time required for
purpose of use) (eugdpr.org).
The GDPR strives to protect subjects by outlining their rights in regard to the processing of their data. Rights of data subjects include:
Right to Information – right to know how their data are being used
Right to be Forgotten – right to have the data erased/destroyed
Right to Restriction – right to restrict data processing
Right to Portability – right to take data from the controller and/or transfer to another entity
Right to Object – right to object to data processing
The GDPR also stresses the accountability of the data controller (person or entity
which determines the purpose and manner of processing personal data, for example,
a sponsor) and the data processor (person or entity that processes data on behalf of
the controller, for example, an organization responsible for quality control or
statistical analysis of the data), including strengthening enforcement and penalties
(Yeomans and Abousahl 2017). Each EU member state must appoint an independent
supervisory authority to enforce GDPR compliance; these authorities cooperate with
each other and report to the European Data Protection Board. The data subject has
the right to lodge complaints against these authorities and/or data controllers and to
receive compensation. Fines may be levied against member states and/or controllers.
Noncompliance with the GDPR can lead to fines of up to 10,000,000 EUR or a percentage of an organization's annual turnover (gdpreu.org).
There are many challenges in interpreting and implementing the regulation. The
sponsor and any other data controllers or processors must determine how to uphold
this regulation in the context of clinical trials. A first step is defining personal data,
which is considered to be data that relate to an identified or identifiable individual
(Advarra Regulatory Team 2018). The regulation applies to the processing of said
data if it is either processed in an automated manner, or processed in a nonautomated
manner such that it becomes part of a filing system (which is considered to be a
system organized by specific criteria).
If there are no identifiers that can link or relate that data to an individual, the data
can then be considered anonymized. Anonymized data are not considered personal
data. On the other hand, pseudonymized data are personal data that can no longer be attributed to a specific individual without the use of additional information, but are still considered personal data (eugdpr.org). In order to determine whether pseudonymized data are personal, it must be determined whether there is information or means available to identify the participant, and whether these means or information are readily available. In terms of a clinical trial, participant data are usually coded (e.g., participant identification number, randomization code, site identification number) and would therefore be considered pseudonymized (Advarra Regulatory Team 2018).
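The distinction can be made concrete in a few lines (a sketch; a real system would keep the key under strict, audited access control): while the code-to-identity map exists anywhere, the coded data are pseudonymized and remain personal data; destroying the only copy of that map is what anonymizes them.

```python
import secrets

identity_map: dict = {}  # the "additional information"; held separately from the dataset

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a random code, retaining the link."""
    code = "PID-" + secrets.token_hex(4)
    identity_map[code] = identifier
    return code

def anonymize_dataset() -> None:
    """Destroy the only link back to individuals; data cease to be personal."""
    identity_map.clear()

pid = pseudonymize("Jane Example")  # data are now pseudonymized (still personal data)
anonymize_dataset()                 # with no means of re-identification: anonymized
```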
The concept of personal data applies not only to participants in clinical trials, but
also employees of the sponsor, site staff, and collaborators as well (Gogates 2018).
The collection of names and contact information from these individuals is necessary
for the conduct of the trial. The GDPR does state that the legitimate interests of the
controller may provide a legal basis for processing data, especially if there is a
relevant relationship between the controller and the subject, provided the rights of
the data subject are still upheld.
According to Article 89 of the regulation, there is also some allowance for
derogation regarding rights of data subjects (e.g., rectification, erasure, right to be
forgotten, restriction of processing, portability, objection) when data are processed
for scientific or historical research or statistical purposes, and Recital 156 refers to
clinical trials as such research. Note derogation is allowed only in the case where
complying with these provisions would make it impossible, or would significantly
hinder, the fulfillment of the purpose of the research. Also, the research must still
comply with GCP and appropriate safeguards must be put in place.
The rights of clinical trial participants can be upheld by ensuring the informed
consent clearly states what data are being collected, why it is being collected, and by
whom it will be processed or used (including whether it will be transferred to a third
country) (Gogates 2018). Internally, sponsors, processors, and controllers should
ensure appropriate security measures (including technology, processes, and training)
are in place to maintain the privacy of the data. Data protection impact assessments
should be conducted for each data process (Gogates 2018), to determine its purpose,
management, and risks to rights of data subjects, as well as whether additional
safeguards need to be established.
A Data Protection Officer (DPO), who will serve as the point person to ensure GDPR compliance, may also need to be appointed (HIPAA Journal 2018). A data
privacy notice should be created and readily available to data subjects and should
include contact information for the data controller (and DPO if applicable), catego-
ries of data that are collected, information regarding data transfer and retention, and
data subject rights as outlined in the GDPR. The means by which requests or
complaints can be made should be indicated (gdpr.eu).
Despite best efforts to ensure data protection, a data breach is still possible.
Should this occur, the controller must inform the authorities within 72 h, unless
the breach is unlikely to result in a risk to the rights and freedoms of natural persons
(data.europa.eu/eli/reg/2016/679/oj). The controller must keep a record of all
breaches and the resulting investigations, regardless of whether they were reported.
If the breach is likely to result in such risk, the controller must communicate the
breach to the subject, including the likely consequences of the breach and steps taken
to mitigate the effects (data.europa.eu/eli/reg/2016/679/oj). Of course in the case of a
clinical trial the controller usually does not have direct contact with the participants
or access to their information, so the communication would be handled by the site
investigator, based on information provided by the controller.
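The obligations just described reduce to a small rule set; a sketch with illustrative names:

```python
from datetime import datetime, timedelta

def breach_obligations(detected: datetime, risky: bool, high_risk: bool) -> dict:
    """GDPR breach duties as described above: always record internally;
    notify the authority within 72 h unless risk to subjects is unlikely;
    inform subjects (via the site investigator) when the risk is high."""
    duties = {"record_internally": True}
    if risky:
        duties["notify_authority_by"] = detected + timedelta(hours=72)
    if high_risk:
        duties["inform_subjects"] = "through the site investigator"
    return duties

print(breach_obligations(datetime(2021, 6, 1, 9, 0), risky=True, high_risk=False))
```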
Upholding the principles of GDPR within the clinical trial arena involves and
impacts many stakeholders, including the sponsor and other controllers, data pro-
cessing organizations, investigators and site research staff, and data subjects. The
measures undertaken to understand the regulation and implement it often involve
more complex processes and more personnel, which is an added cost to the conduct
of the trial.
Summary and Conclusion
Conducting clinical trials internationally requires establishing clear communication pathways with all partners, from site staff to vendors. Risk assessment and mitigation strategies must be put in place. Though implementing these measures may be costly, the hoped-for benefit of conducting clinical trials globally is the quicker realization of trial results, and ultimately benefit to the greater population.
Key Facts
The benefits to conducting trials globally include access to a greater number of trial
participants; greater diversity in terms of ethnicity, disease characteristics, and
susceptibilities; faster recruitment; and quicker realization of trial results.
The challenges involved in conducting international clinical trials are primarily
related to cultural, procedural, and regulatory differences between countries.
Regulatory bodies vary across countries and may include Ethics Committees/
Institutional Review Boards, Competent Authorities, Health Authorities, Data Pro-
tection Agencies, and/or individual Hospital Management.
Logistical variances between countries in the clinical (drug) supply chain include
approval for indication of the investigational medicinal product, impact of infrastruc-
ture on forecasting, languages and terminology on package labels, weather conditions
for cold-chain logistics (packaging), requirements for multiple depots, goods and
services taxes and permits, and local or off-site destruction of unused product.
Country-specific restrictions apply to the use of biomaterials in clinical trials, such as the type of material that can be provided (if provided at all), export regulations, and the length of time material can be retained.
The challenges in auditing and monitoring international sites include language
barriers; scheduling visits due to work patterns, holidays, and religious observances;
and the cost of the visit due to international travel.
Challenges in data management of global clinical trials include time differences,
interpretation of race and ethnicity, language issues, and technical capabilities of the
sites (which affect whether paper-based or electronic data capture is possible).
Several factors must be taken into consideration in Case Report Form design for international trials, including minimal data collection, participant questionnaires in the language of the participant, and laboratory units in local measurements.
The challenges of conducting international clinical trials can be mitigated by
developing a proactive risk-based strategy. Risks can be mitigated by simplified
protocols, user-friendly data collection tools, streamlined procedures, training, and
establishing a clear communication pathway.
The European General Data Protection Regulation (GDPR) strives to protect
subjects by outlining their rights in regard to the processing of their data. Rights of
data subjects include:
Right to Information – right to know how their data are being used
Right to be Forgotten – right to have the data erased/destroyed
Right to Restriction – right to restrict data processing
Right to Portability – right to take data from controller and/or transfer to another
entity
Right to Object – right to object to data processing
Cross-References
▶ ClinicalTrials.gov
▶ Data Capture, Data Management, and Quality Control; Single Versus Multicenter
Trials
▶ Documentation: Essential Documents and Standard Operating Procedures
▶ Implementing the Trial Protocol
▶ Institutional Review Boards and Ethics Committees
▶ Multicenter and Network Trials
▶ Procurement and Distribution of Study Medicines
▶ Qualifications of the Research Staff
▶ Responsibilities and Management of the Clinical Coordinating Center
▶ Selection of Study Centers and Investigators
▶ Training the Investigatorship
References
Advarra Regulatory Team (2018) The GDPR and its impact on the clinical research community (including non-EU researchers). In: Advarra. Available via https://www.advarra.com/the-gdpr-and-its-impact-on-the-clinical-research-community-including-non-eu-researchers/
Arnum P (2011) Managing the global clinical-trial material supply chain. In: Pharmtech. Available
via http://www.pharmtech.com/managing-global-clinical-trial-material-supply-chain
Beauregard A et al (2018) The basics of clinical trial monitoring. In: Applied clinical trials. Available
via http://www.appliedclinicaltrialsonline.com/basics-clinical-trial-centralized-monitoring
Bogin V (2016) Feasibility in the age of international clinical trials. In: Applied clinical trials.
Available via http://www.appliedclinicaltrialsonline.com/feasibility-age-international-clinical-
trials
Cobert B (2017) Remote PV Audits & Inspections. In: C3i Solutions. Available via https://www.
c3isolutions.com/blog/remote-pv-audits-inspections/
European Clinical Research Infrastructure Network. Available via http://campus.ecrin.org/
Export.gov Brazil – Import Requirements and Documentation. In: Brazil country commercial guide. Available via https://www.export.gov/article?id=Brazil-Import-Requirements-and-Documentation
Fisher Clinical Services Managing Complex Global Drug Distribution and Expiry. Available via
http://info.fisherclinicalservices.com/clinical-supply-optimization-global-distribution-case-
study-box
Fisher Clinical Services New Challenges for Global Clinical Trials: Managing Supply Logistics in
an Expanding Clinical Trial Universe. Available via http://info.fisherclinicalservices.com/white-
paper-global-clinical-trial-challenges
Fisher Clinical Services The Challenges of Cold Chain Management. Available via http://www.
fisherclinicalservices.com/content/dam/FisherClinicalServices/Learning%20Centre%20Images/
Latest%20Article%20Images/Latestarticlespdf/CTP012_Fisher%20Clinical_TRIM%20DPS.
PDF
Fisher Clinical Services What Clinical Teams Should Know About Changing Trial Logistics and
How they Will Affect Development. Available via http://info.fisherclinicalservices.com/log
Garg S (2016) An auditor's view of compliance challenges in resource-limited clinical trial sites. In: Applied clinical trials. Available via http://www.appliedclinicaltrialsonline.com/auditor-s-view-compliance-challenges-resource-limited-clinical-trial-sites?pageID=4
Gogates G (2018) How does GDPR affect clinical trials? In: Applied clinical trials. Available via
http://www.appliedclinicaltrialsonline.com/how-does-gdpr-affect-clinical-trials
HIPAA Journal (2018) GDPR: what is the role of the Data Protection Officer. Available via https://www.hipaajournal.com/gdpr-role-of-the-data-protection-officer/
International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human
Use (ICH). Available via https://www.ich.org/home.html
Miller J (2010) Complex clinical trials are posing new challenges across the clinical supply chain. In: BioPharm. Available via http://www.biopharminternational.com/complex-clinical-trials-are-posing-new-challenges-across-clinical-supply-chain
Minisman et al (2013) Implementing clinical trials on an international platform: challenges and perspectives. J Neurol Sci. Available via https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3254780/
Mongan A (2016) Three factors impacting on the destruction of IMP material. In: Clinical trials
arena. Available via https://www.clinicaltrialsarena.com/uncategorized/clinical-trials-arena/
three-factors-impacting-on-the-destruction-of-imp-material-4839806-2/
NIH U.S. Library of Medicine ClinicalTrials.gov. Available via https://clinicaltrials.gov
Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the
protection of natural persons with regard to the processing of personal data and on the free
movement of such data, and repealing Directive 95/46/EC (General Data Protection Regula-
tion). Available via http://data.europa.eu/eli/reg/2016/679/oj
Ruppert M (2007) Defining the meaning of 'auditing' and 'monitoring' & clarifying the appropriate use of the terms. Available via https://ahia.org/assets/Uploads/pdfUpload/WhitePapers/DefiningAuditingAndMonitoring.pdf
Sertkaya A et al (2014) Examination of clinical trial costs and barriers for drug development.
Available via https://aspe.hhs.gov/report/examination-clinical-trial-costs-and-barriers-drug-
development
Sprosen T (2017) News: does cutting trial costs by reducing monitoring visits also reduce quality?
In: MoreTrials. Available via https://moretrials.net/news-cutting-trial-costs-reducing-monitor
ing-visits-also-reduce-quality/
The International Maternal Pediatric Adolescent AIDS Clinical Trials (IMPAACT). Available via
https://impaactnetwork.org/
U.S. Department of Health and Human Services Food and Drug Administration (2016)
Collection of race and ethnicity data in clinical trials. Available via https://www.fda.gov/
regulatory-information/search-fda-guidance-documents/collection-race-and-ethnicity-data-clini
cal-trials
Web learning resources for the EU General Data Protection Regulation; Fines and penalties.
Available via https://www.gdpreu.org/compliance/fines-and-penalties/
Weyermann A (2006) Labelling requirements for IMPs in multinational clinical trials: bureaucratic
cost driver or added value? Available via https://dgra.de/media/pdf/studium/masterthesis/mas
ter_weyermann_a.pdf
Yeomans A, Abousahl I (2017) Preparing for the EU GDPR in clinical and biomedical research.
Available via https://www.viedoc.com/site/assets/files/1323/preparing_for_the_eu_gdpr_in_
clinical_and_biomedical_research.pdf
Further Reading
About NIAID Division of AIDS (DAIDS). Available via https://www.niaid.nih.gov/about/daids
ASPE U.S. Department of Health and Human Services (2014) Examination of clinical trial costs
and barriers for drug development. Available via https://aspe.hhs.gov/report/examination-clini
cal-trial-costs-and-barriers-drug-development
Ayalew K (2015) FDA perspective on international clinical trials. Available via https://www.fda.
gov/downloads/Drugs/NewsEvents/UCM441250.pdf
Bioclinica (2017) Collaboration between clinical operations and the logistics and supply chain
teams is key to trial success. Available via https://www.bioclinica.com/blog/collaboration-
between-clinical-operations-and-logistics-and-supply-chain-teams-key-trial
Clinical Trials Guidance Documents. Available via https://www.fda.gov/RegulatoryInformation/
Guidances/ucm122046.htm
ClinRegs is an online database of country-specific clinical research regulatory information designed to assist in planning and implementing international clinical research. Available via https://clinregs.niaid.nih.gov/index.php
Collection of Race and Ethnicity Data in Clinical Trials. Available via https://www.fda.gov/down
loads/RegulatoryInformation/Guidances/UCM126396.pdf
DAIDS Regulatory Support Center (RSC) provides support for all NIAID/DAIDS-supported and/or sponsored network and non-network clinical trials, both domestic and international. Available via https://rsc.niaid.nih.gov/
Department of Health and Human Services Office of Inspector General (2001) The globalization of clinical trials: a growing challenge in protecting human subjects. Available via https://oig.hhs.gov/oei/reports/oei-01-00-00190.pdf
Division of AIDS Clinical Research Policies and Standard Procedures Documents. Available via
https://www.niaid.nih.gov/research/daids-clinical-research-policies-standard-procedures
European Commission, Enterprise and Industry (2009) EU guidelines to good manufacturing
practice medicinal products for human and veterinary use. Available via http://www.gmp-
compliance.org/guidemgr/files/2009_06_ANNEX13.PDF
Foust M (2014) Strengthening the links in the clinical supply chain: aim for transparency through-
out the process. In: Applied clinical trials. Available via http://www.appliedclinicaltrialsonline.
com/strengthening-links-clinical-supply-chain-aim-transparency-throughout-process
George Clinical (2016) Regulatory timelines in the Asia-Pacific. Available via https://www.
georgeclinical.com/resources/research/regulatory-timelines-asia-pacific
Global Health Trials (2012) Destruction of investigational medical product following trial termi-
nation. Available via https://globalhealthtrials.tghn.org/community/groups/group/regulations-
and-guidelines/topics/172/
Henley P (2016) Monitoring clinical trials: a practical guide. In: Tropical medicine and international
health. Available via https://onlinelibrary.wiley.com/doi/full/10.1111/tmi.12781
Leyland-Jones B et al (2008) Recommendations for collection and handling of specimens from
group breast cancer clinical trials. J Clin Oncol. Available via https://www.ncbi.nlm.nih.gov/
pmc/articles/PMC2651095/
Mattuschka J (2016) Clinical supply chain: a four-dimensional mission. In: BioProcess interna-
tional. Available via https://bioprocessintl.com/manufacturing/supply-chain/clinical-supply-
chain-a-four-dimensional-mission/
Muts V (2018) International patient recruitment: the grass is not always greener abroad. In: Applied
clinical trials. Available via http://www.appliedclinicaltrialsonline.com/international-patient-
recruitment-grass-not-always-greener-abroad
National Cancer Institute Division of Cancer Treatment and Diagnosis (2018) Biorepositories and
Biospecimen Research branch best practices. Available via https://biospecimens.cancer.gov/
bestpractices/2016-NCIBestPractices.pdf
Pharmaceutical Engineering (2016) Clinical labeling of medicinal products: EU clinical trial regulation.
Available via http://www.pharmtech.com/managing-global-clinical-trial-material-supply-chain
Research Conducted in NIAID Labs. Available via https://www.niaid.nih.gov/research/research-
conducted-niaid
The Clinical Data Interchange Standards Consortium (CDISC) is an open, multidisciplinary,
neutral, 501(c)(3) non-profit standards developing organization. Available via https://www.
cdisc.org/
U.S. Department of Health and Human Services Food and Drug Administration (2013) Guidance
for industry oversight of clinical investigations – a risk-based approach to monitoring. Available
via https://www.fda.gov/downloads/Drugs/Guidances/UCM269919.pdf
World Courier (2015) Managing the myths. Available via https://www.worldcourier.com/insights/
managing-the-myths
World Health Organization (WHO). Available via https://www.who.int/topics/epidemiology/en/
20 Documentation: Essential Documents and Standard Operating Procedures
Eleanor McFadden, Julie Jackson, and Jane Forrest
Contents
Introduction ... 370
Terminology ... 371
Background ... 371
ICH Guidelines on Documentation ... 372
Sponsor ... 373
Participating Sites ... 375
Coordinating Center ... 377
Data Collection ... 377
Edit Checks ... 377
Trial Management ... 378
Statistics ... 378
Independent Statistical Center ... 378
Trial Monitors ... 379
Site Binder ... 379
Source Data Verification ... 379
Monitoring Reports ... 380
Standard Operating Procedures ... 380
Sponsor ... 382
Participating Site ... 382
Coordinating Center ... 382
Independent Statistical Center ... 382
Monitors ... 382
Document Management Systems ... 382
Document Creation ... 383
Maintenance and Storage ... 383
E. McFadden (*)
Frontier Science (Scotland) Ltd., Kincraig, Scotland, UK
e-mail: [email protected]
J. Jackson · J. Forrest
Frontier Science (Scotland) Ltd, Grampian View, Kincraig, UK
e-mail: [email protected];
[email protected]
Abstract
Documentation is a critical component of clinical trials. There are requirements not only to be able to verify that the data being analyzed is accurate but also that it was collected and processed in a consistent way. Everyone involved in a trial has to recognize the documentation requirements and ensure that they are met. The International Conference on Harmonization (ICH) Guideline for Good Clinical Practice (E6) provides details of the standards to be met, along with relevant definitions. This chapter provides guidance on identifying essential documents for a trial and on how to develop and maintain systems for standard operating procedures.
Keywords
Documentation · Standard operating procedures · Trial master file · Essential
documents
Introduction
Documentation is now a fact of life for everyone involved in the conduct of clinical
trials. The Sponsor, the Funder, the Trials Unit coordinating the trial, the investigator
at the site, and often the trial subjects themselves have a responsibility to ensure that
complete and accurate documentation is kept relating to their role in the trial. The
conduct of clinical trials is now a highly regulated industry, and there are many
people who are employed to maintain and oversee the quality of the trial documen-
tation. The basic rule of regulators and other auditors is that if it is not written down,
then it didn’t happen, and they expect to be able to reconstruct the exact conduct of
the trial from the documentation, including source documentation.
As well as creating and revising the documentation, there is a requirement to keep
all documentation and archive it securely for lengthy time periods, which can vary
depending on the type of trial and where it is being conducted. This chapter will
outline requirements for essential documentation for sites, for coordinating centers,
and for the Sponsor. It also includes some guidance for monitors and independent
statistical centers (ISC) if relevant to the study.
The chapter also addresses the need for standard operating procedures (SOPs), which ensure that routine procedures are always carried out in the same way. We discuss the types of procedural documentation that are needed and give suggestions for systems for the preparation and maintenance of trial documentation. The chapter closes with an overview of the use of document management systems.
Terminology
As there are several different models possible for running a clinical trial, we describe
the model which we will use in this text for explanations. The hypothetical trial is a
multicenter trial with several hospitals/clinics entering patients. The trial has over-
sight by a study Sponsor. Data is submitted to a coordinating center (on either paper
case report forms (CRFs) or electronically via a remote data capture system), and the
coordinating center is responsible for randomization/registration of patients, quality
control of the data, queries to sites, statistical design and statistical analysis, and the
management of those trial-related services. There is a separate independent statistical
group responsible for preparing reports for the independent data monitoring com-
mittee. Site monitoring, including source data verification (SDV), is done by a
separate organization contracted by the Sponsor. The Sponsor is responsible for
provision and distribution of study medications/devices, oversight of the trial and all
trial documentation. The collection of essential documentation required for this trial
will be referred to as the trial master file (TMF). Where relevant, we will describe
how other trial models would address some of the documentation requirements.
Background
In the early days of “modern” clinical trials, there was very little documentation
maintained. Case report forms (CRFs) were relatively short and were all paper-
based. Investigators and their staff at the sites completed them manually, with the source data being the patient's medical record. Sometimes they were signed by the investigator, sometimes by a member of the investigator's staff or with a rubber stamp signature, and sometimes not signed at all; it really didn't matter, because the data was entered into the computer and used in analysis. This is just one example of how things have changed over the last 30–40 years, and there are many others.
Why have things changed so substantially?
One reason was the detection of cases of fraudulent data on cancer trials submitted to a central trials office in the late 1970s. An investigation into the submission of fraudulent data by one of the US National Cancer Institute-funded cancer trials groups, the Eastern Cooperative Oncology Group (ECOG), led to the establishment of an ECOG audit process in which all sites were visited at regular intervals and CRFs were compared against source data for a random selection of cases.
The US cancer cooperative group program implemented this approach across the
board with involvement from the National Cancer Institute (NCI), and, while it has
been refined and strengthened over the years, this program is still very much in place
for NCI-sponsored clinical trials, and source data verification is now a routine
practice in clinical trials (Ben-Yehuda and Oliver-Lumerman 2017; Weiss 1998).
Another completely different dynamic was the difficulty of doing clinical trials
across borders because of the different regulations and practices around the world.
Sponsors of international trials in the 1970s and 1980s were finding it challenging to
deal with these variations, yet the urge to complete large trials more quickly meant an
increase in companies and researchers wanting to use this cross-border model and to
be able to use the same standards when submitting new drug applications in separate
countries.
The biggest influence in addressing this has come from the International Confer-
ence on Harmonization, starting as a meeting in 1989 in Brussels of representatives
from the pharmaceutical companies and regulatory authorities from Europe, Japan,
and the USA. This followed early work to harmonize procedures in the European
Union and strengthening of regulatory requirements by the US Food and Drug
Administration (FDA). This conference generated a set of guidelines for clinical
trials which have been widely adopted throughout the world as standards for the
conduct of clinical trials (Good Clinical Practice Guidelines, E6 (R2) 2016). The
ICH organization is still in place and constantly working to update and improve their
guidelines.
In parallel, we saw the growth of an industry of contract research organizations
(CROs) providing support services to the pharmaceutical industry in the conduct of
clinical trials, including on-site monitoring and source data verification. The rapid
and constant development of relevant technology has also had a huge impact as it is
now feasible to collect and manage data and documentation electronically rather
than on paper.
ICH Guidelines on Documentation

One of the GCP principles, in ICH E6, Sect. 2, is that systems should be implemented with procedures that assure the quality of every aspect of the trial. This principle is expanded in ICH E6, Sect. 5.1, where it is stated that it is the responsibility of the Sponsor to implement and maintain quality assurance and quality control systems with written standard operating procedures (SOPs). We will explore SOPs later in this chapter.
Another principle to follow has become known as ALCOA. Documents should be:
• Attributable: can the data be traced to the person responsible for recording a patient visit/event, along with the relevant date and time?
• Legible: can the data/information be easily read?
• Contemporaneous: was the data recorded at the time (or close to the time) that it happened, and not recorded long after?
• Original: is the source or first-captured data available for review?
• Accurate: are the details recorded complete and correct?
The list is often extended to "ALCOA+" by adding Complete, Consistent, Enduring, and Available.
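To make these attributes concrete, here is a minimal sketch of a single captured data point carrying ALCOA metadata. It is an illustration only, assuming invented field names and a 24-hour contemporaneity tolerance; it is not part of any regulation or EDC product, and legibility, being a property of the physical record, is not modeled.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class DataPoint:
    """One recorded observation plus the metadata ALCOA calls for."""
    value: str                # Accurate: the observation itself
    recorded_by: str          # Attributable: who recorded it
    recorded_at: datetime     # when it was written down
    observed_at: datetime     # when the visit/event actually happened
    source: str               # Original: pointer to the first capture

    def is_contemporaneous(self, tolerance_hours: float = 24.0) -> bool:
        """Contemporaneous: recorded at (or close to) the time it happened?"""
        delta = self.recorded_at - self.observed_at
        return 0 <= delta.total_seconds() <= tolerance_hours * 3600

entry = DataPoint(
    value="BP 120/80 mmHg",
    recorded_by="jsmith",
    recorded_at=datetime(2024, 5, 2, 10, 15, tzinfo=timezone.utc),
    observed_at=datetime(2024, 5, 2, 10, 0, tzinfo=timezone.utc),
    source="medical record, visit 3",
)
assert entry.is_contemporaneous()
```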
Sponsor
The Sponsor of a clinical trial has ultimate responsibility for ensuring that all required documentation is available at the end of the trial. In addition, during the trial conduct period, the Sponsor usually has primary responsibility for the trial protocol, the Informed Consent Form and any patient information sheets, the Investigator's Brochure (if relevant), and some oversight plans for the conduct of the trial.
1. Training records have to be maintained and updated over time to show that all
staff involved in the trial have the appropriate training and qualifications to fulfil
their trial-related responsibilities. This is required to meet one of the ICH GCP
key principles. Training records for former staff involved in the trial should be
maintained.
2. Retention of trial-related records is critical, and guidance should be sought from the Sponsor about the retention period. In many instances, this can be a minimum of 25 years, or until 2 years after the "last" regulatory submission involving trial data. As it is very difficult to assess whether a submission will be the "last" one, this effectively means that the records should be retained until the Sponsor has said they can be destroyed. Records must also remain legible and accessible: ink must not fade on handwritten documents, and electronic media must remain readable. If copies are made of original paper records, the copies must be certified as exact copies of the original.
3. There must always be a complete audit trail of any changes made to clinical trial CRF data. With electronic remote data capture systems, this type of audit trail is usually built in, and a record will be kept of the original value, the new value, the name of the person making the change, and the date and time stamp of the change (a sketch of such a record follows this list). Any eCRF system which does not meet these criteria should probably not be used for any trial where regulators may eventually review and adjudicate the data and the trial conduct. With paper records, the same level of information should be recorded on the paper record to ensure transparency: a single line through the original value so that it is not obliterated; the new value clearly written beside the original, along with the initials/name of the person making the change; and the date and time of the change. The change must be supported by source data.
4. Electronic trial data handling systems used in support of clinical trial activities must be validated, including those handling essential documents. A validation document set confirming the system's fitness for purpose should be created; ICH E6 GCP Sect. 5.5 provides detail on the documentation and SOPs required.
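The sketch below illustrates the audit-trail record described in point 3, assuming a simple append-only log with invented field names; production EDC systems build this into the database layer rather than into application code.

```python
from datetime import datetime, timezone

# Append-only log: the original value is never overwritten; every change
# keeps the old value, the new value, who changed it, when, and why.
audit_trail: list[dict] = []

def correct_value(field: str, old: str, new: str,
                  changed_by: str, reason: str) -> None:
    audit_trail.append({
        "field": field,
        "old_value": old,
        "new_value": new,
        "changed_by": changed_by,
        "changed_at": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
    })

correct_value("weight_kg", "812", "81.2", "jsmith",
              "decimal point omitted; corrected per source document")
```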
The following sections describe the responsibilities of each component of the trial
management team, but it is the Sponsor who defines these study-specific responsi-
bilities and also the Sponsor who should have systems to ensure that the required
documentation is created and available for review and inspection.
Participating Sites
1. Most “formal” medical records do not contain sufficient information to allow the
complete reconstruction of a patient’s journey through a clinical trial. In such a
situation, the site should maintain a “research record” containing documentation
for the trial which is not recorded in the medical record. Information recorded
should follow the ALCOA+ rules defined above.
2. The principal study investigator at the site is responsible for all the trial activities
at that site. However, responsibilities can be delegated to other staff at the site. It
is essential that all such delegations be written down in a log, showing the name
of the person to whom a responsibility is delegated, the delegated responsibility
and the relevant dates. Lack of a delegation log is a common finding at site
monitoring visits/audits. Figure 1 shows an example of a site delegation log.
3. As well as CRF data, the site will be responsible for ensuring that all required
documentation is available for the specific trial. This could include some or all of
the following:
(a) Approved informed consent forms and original signed patient consent forms
(b) Ethics/Institutional Review Board (IRB) approvals
(c) Contracts, agreements, indemnity, insurance, financial aspects of the trial
Very often, the Sponsor will provide a site binder for the participating sites to use
for documentation, with instructions on the documentation to be maintained and
stored in the binder. If none is provided, it would be beneficial for the site to create
their own prior to the start of the trial. There are also now electronic site binders
being used so that all documents are organized and stored electronically rather than
on paper.
Procedural documentation (or SOPs) is relevant to all parties involved in a trial
and is covered in a separate section in this chapter.
Coordinating Center
Our hypothetical trial has a coordinating center which is responsible for data
collection, quality control, randomization/registration, trial management, and statis-
tical design and analysis. As well as development of SOPs (see section in this
chapter), what other documents will be needed for the TMF? The following is a
summary of typical documentation needs by function.
Data Collection
To collect data, a case report form (CRF) must be designed, tested, and implemented. If the CRF is electronic, validation documentation of its design and implementation should be maintained. Key correspondence relating to the design and content of the CRF should be kept in an organized way such that it is easily retrievable at any time. If a CRF is updated after data collection begins, the CRF should be version controlled, with dates of implementation and a record kept of all changes made, as this could be important when using data for reporting and analysis. The statisticians need to know which version was in use when the data was collected. It is also important to document whether changes were to be applied to all patients (including those already entered) or prospectively to new data only. There will also be a need for procedural documentation to provide guidance on CRF completion. Much of this can be covered in a data management plan for a study.
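As a sketch of the version-control record just described (all fields invented, dates in ISO format), a coordinating center might keep something like the following, which lets statisticians determine which CRF version was in force on any collection date:

```python
# Hypothetical CRF version log: what changed, when it took effect, and
# whether the change applies retrospectively or to new data only.
crf_versions = [
    {"version": "1.0", "effective": "2023-01-10",
     "changes": "initial release", "applies_to": "all patients"},
    {"version": "1.1", "effective": "2023-06-01",
     "changes": "added concomitant medication question",
     "applies_to": "prospective data only"},
]

def version_in_effect(collection_date: str) -> str:
    """Return the CRF version in force on an ISO-format collection date."""
    live = [v for v in crf_versions if v["effective"] <= collection_date]
    return live[-1]["version"] if live else "pre-release"

assert version_in_effect("2023-03-15") == "1.0"
assert version_in_effect("2023-07-01") == "1.1"
```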
Edit Checks
Whether data is collected via an electronic data capture (EDC) system or on paper, the coordinating center will develop edit checks on the data as part of its quality control process. The edit checks can be electronic (programmed to run automatically), manual checks implemented on review of the data, or a combination of both. The suite of edit checks will almost certainly evolve over time, and again, it is essential to maintain documentation on the edit checks which are implemented (including the testing and validation process), when they were implemented, and to which subjects they apply (all, or only those prospectively entered).
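As an illustration, a programmed edit check might look like the sketch below; the field name and the plausible range are invented for the example. The check itself would be documented, tested, and version controlled exactly as described above.

```python
def check_systolic_bp(record: dict) -> list[str]:
    """Return query messages for a missing or out-of-range value."""
    queries = []
    value = record.get("systolic_bp")
    if value is None:
        queries.append("systolic_bp: value missing")
    elif not 60 <= value <= 250:
        queries.append(f"systolic_bp: {value} outside plausible range 60-250")
    return queries

# A failing record generates a query to be sent back to the site.
assert check_systolic_bp({"systolic_bp": 300}) == \
    ["systolic_bp: 300 outside plausible range 60-250"]
assert check_systolic_bp({"systolic_bp": 120}) == []
```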
Trial Management
Much of the documentation around trial management will revolve around project and quality process documents and can include topics such as risk, communication, training, and TMF management; these are covered later in the chapter. As with the sites, the coordinating center will maintain contractual documents, including documentation of responsibilities delegated by the Sponsor.
Statistics
The documentation required from the statistical team includes the following:
Independent Statistical Center

Most Phase III trials have an independent data monitoring committee (IDMC), also referred to as a data and safety monitoring committee, to review study data and progress periodically. This committee is independent from the Sponsor. The role and responsibilities of the IDMC are well described in other publications (Ellenberg et al. 2003). The IDMC meets periodically and reviews reports prepared by one or more statisticians who are not involved in the conduct of the study. The reports for the IDMC are confidential, are not seen by the Sponsor, and are usually prepared by statistician(s) who are independent of the Sponsor and of the protocol statisticians. We refer to these independent statisticians as the independent statistical center (ISC) in this chapter.
The IDMC has a very important role in the trial and can make recommendations
to modify or to stop a trial based on the information with which they are presented.
They make these decisions based on the data prepared for them by the ISC. The ISC
therefore has to maintain documentation of all the reports prepared for the IDMC
along with corresponding statistical programs and datasets. Quality control records
and validation reports on the programs should also be maintained for each version.
Minutes of the closed and open sessions of the IDMC meetings would also need to
be saved and all documentation available for inclusion in the TMF. The role and
responsibilities of the ISC and the IDMC will often be documented in an IDMC
Charter. The documents maintained by the ISC are usually kept confidential from the
Sponsor until the end of the trial.
Trial Monitors
In our hypothetical trial, monitors visit sites to complete source data verification (or
carry out remote monitoring visits, depending on the model in place for the trial) and
check essential documents. Monitors may also visit sites during the site selection
process for a new trial to see whether a site is suitable for trial participation, for a site
initiation visit to ensure that all necessary documentation and processes are in place,
and for a site close-out visit to close down the site at the end of trial participation.
The role of the monitor and the monitoring activities in a trial are usually defined in a study-specific monitoring plan.
Site Binder
As mentioned, it is common practice in drug trials for the trial Sponsor to create a site
binder for each participating site to assist them in maintaining and organizing the
necessary trial documentation which is required for the TMF (see above under
Participating Sites). On a visit to the site, the monitors will routinely check the site
binders to ensure that they are complete. The site binder would usually contain
copies of all versions of the protocol used at the site, all relevant Ethics and
Regulatory approvals, plus any other required local approvals, site staff curriculum
vitae, training records, delegation logs, and study procedure documentation. It may
also contain copies of signed patient consent forms, depending on the local pro-
cedures. Monitors will use this documentation prior to a site activation to verify that
a site can be activated to accrue patients and then, during the trial, to ensure that the
site is compliant with all study requirements.
Source Data Verification

One of the primary responsibilities of the monitors is to verify that the data entered
into the CRF is accurate and complete. This is done by comparing the CRF entries to
the data at its source. This could be the patient’s medical or clinic record, a separate
file with original documents relating to the patient’s participation in the trial but not
part of the official medical record, and ancillary records such as pharmacy inventory
and dispensation records. A log of each monitor visit (on-site or remote) should be
maintained along with details of what was checked and any findings. Monitors
would also follow up to ensure that any deficiencies were appropriately corrected.
Monitoring Reports
Monitoring reports are written after each monitoring visit and submitted to the Sponsor or the coordinating center. The reports are also maintained as part of the TMF.
Standard Operating Procedures

From the previous sections about documentation requirements, it is clear that all parties maintain trial-specific plans and associated records/reports. In this section of the chapter, we consider instructional/process documents in the form of SOPs. As mentioned previously, there is a requirement to document not only what was done but how it was done, and to show that a task was done in a consistent way throughout the trial, no matter who was doing it.
All of the entities involved in trials will follow SOPs. Procedures form the foundation of a good quality system and describe how to perform a repetitive trial activity. Procedures are often based on a standard policy, which is a high-level statement of intention to satisfy requirements; operating procedures provide the details of the standard process for complying with that policy, defining what needs to be done and why.
SOPs are essential documents and, like those covered in previous sections, should
include records to demonstrate that compliance with the procedure has been mea-
sured and the procedure has been followed. The protocol document itself can be
considered a procedure document for the selection, entry, and treatment of a patient
entered on the trial. It may also contain instructions on trial-related activities such as
ordering trial medication, submitting materials for central review, randomization of
patients, and other essential trial activities.
The philosophy behind SOPs is that they should be general enough to apply to all
clinical trials being done by an entity and not be trial-specific. Depending on the
entity organization, SOPs may be at a global or local level. If there are certain
activities that are specific only to one trial, then a separate document should be
created to describe that procedure. A common approach for this type of document is
to call it a trial-specific work instruction rather than an SOP.
Clinical trial staff should be trained in applicable SOPs. Sometimes an entity will
follow Sponsor or other collaborator procedures, and there is a need to ensure at the
start of a trial that all parties are aware of the SOPs that are being followed. If
external/Sponsor SOPs are used, it is key to ensure good communication between
parties about procedure distribution and training. The entity that owns the SOPs
should provide training on the procedures. As procedures set out the standard to
work to, it is advisable that they are subject to periodic or scheduled review to ensure
they reflect the current practice. The quality system should include a process on how
to handle deviations from procedures.
How do you decide what procedure documentation is needed? The UK Clinical
Research Collaboration has developed a process for assessing competency of clinical
trials units (CTUs) as part of a CTU registration process. They have developed a list
of areas of expertise which they consider essential for a CTU (McFadden et al. 2015)
and recommend that SOPs be developed to cover these areas. The names of the SOPs
can vary, but this list describes the basic topics which should be covered in SOPs in a
coordinating center.
We have updated this list to reflect recent changes in legislation and present it as a
table showing recommendations for documentation for each of our entities involved
in a trial. Table 1 shows recommended topics for procedure documentation by entity.
Sponsor
The Sponsor has ultimate responsibility for ensuring that all necessary procedures and documentation are in place and maintained throughout the course of the trial. The Sponsor can audit sites, coordinating centers, the ISC, and monitors to obtain that assurance. The Sponsor should ensure that each participating party or entity understands its responsibilities for the preparation and maintenance of trial documents and the development and implementation of study procedures. It is also important to establish which SOPs are to be used for the conduct of the study: are they the Sponsor's SOPs regardless of who is doing the specific task, or the SOPs of the entity to which the task has been delegated?
Participating Site
The site should have their own hospital/clinic SOPs for trial-related activities or
follow those provided by the Sponsor (or a combination of both).
Coordinating Center
Independent Statistical Center

The ISC will require procedure documentation for the preparation of reports, interactions with the IDMC members, secure transfer of reports, minute taking, and archiving. There is usually also a study-specific charter for the IDMC activities, developed as part of the overall study governance documents.
Monitors
Document Management Systems

As the reader can see, the requirements for documents are substantial, and it is beneficial to develop a system for the creation, maintenance, and storage of these documents.
The Sponsor and all other parties involved in the conduct of the trial (e.g., site, coordinating center, ISC, and monitors) should aim to maintain an inspection-ready TMF at all times. In other words, a regulatory inspector should be able to reconstruct
the conduct of the trial using only the documents and metadata present in the TMF.
While the concept of an inspection-ready TMF sounds simple, it is not easily
achieved, and the quality of the TMF is a growing area of risk for the clinical
research industry. There are many aspects to maintenance of the TMF, and some
aspects are more challenging than others, such as managing electronic correspon-
dence including emails in the TMF.
TMF quality is determined by both the quality of the individual records held
therein and the quality of the systems and processes in place to maintain the TMF.
The following section gives some suggestions for such a system.
Document Creation
Having a standard template for the development of a plan or an SOP simplifies the process when a new one needs to be developed. When a standard format is used for all documents, it is also easier for users to find something within a document, as they are familiar with the layout of the sections. Figure 2 shows a sample layout for an SOP template. The document is normally prepared by someone with knowledge of the particular process (a subject matter expert). Once drafted, the document should be routed for appropriate review and, once all changes have been incorporated, for final approval. This entire process should be documented. Signature is usually with wet ink, but there is a growing trend toward the use of digital signatures to approve essential documents. The implementation of electronic signatures should comply with international electronic signature requirements. Once approved, the document should be circulated to all individuals who are required to follow the process. Training in the contents should be documented in each individual's training record. A sketch of this document lifecycle follows Fig. 2.
Fig. 2 Sample layout for an SOP template. The template header records Version, Date of Approval, Effective Date, and Security Level; the table of contents lists numbered sections including 1. Purpose, 2. Scope, 4. Pre-requisites, 5. Roles and Responsibilities, 6. Procedures, and 7. References, plus a List of Tables; an approval table captures Name, Role, Signature, and Date for the Approver and each Reviewer, with the template version number and page numbering in the footer.
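The drafting, review, approval, and circulation flow just described behaves like a small state machine. The sketch below shows the idea with invented state names; it is not the workflow of any particular document management product.

```python
from enum import Enum, auto

class DocState(Enum):
    DRAFT = auto()
    IN_REVIEW = auto()
    APPROVED = auto()
    EFFECTIVE = auto()   # circulated; staff trained and training documented
    RETIRED = auto()

# Permitted transitions: review can send a document back for changes,
# but nothing skips approval, and retired documents stay retired.
ALLOWED = {
    DocState.DRAFT: {DocState.IN_REVIEW},
    DocState.IN_REVIEW: {DocState.DRAFT, DocState.APPROVED},
    DocState.APPROVED: {DocState.EFFECTIVE},
    DocState.EFFECTIVE: {DocState.RETIRED},
    DocState.RETIRED: set(),
}

def transition(current: DocState, target: DocState) -> DocState:
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target

state = transition(DocState.DRAFT, DocState.IN_REVIEW)
```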
Maintenance and Storage

confidentiality issues and partly to the requirement that the investigator must maintain control of the source documentation held at site.
TMFs will usually be a combination of both paper and electronic files. It should
be noted that the main requirements for storage and archival are the same for both.
The TMF itself should be well structured with records filed in an organized and
timely manner. This enables ease of identification and retrieval of both the documents
and the associated metadata – a key requirement for regulatory inspections. Metadata
attributes give context and meaning to the records. An audit trail is a form of metadata,
providing information on actions relating to the documents and records. Access to the TMF must be appropriately controlled to ensure no unauthorized disclosure of information, and the TMF itself must be protected from damage, unauthorized changes, and records
loss. TMF documents are subject to review at audit/inspection, and any party involved
in the trial could expect requests for direct access to the TMF during an audit/inspection.
Individual documents within the TMF must be version controlled, with an audit
trail for changes made. Finally, the contents of the TMF must be clearly indexed with
“signposts” to the relevant TMF repositories and systems.
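The sketch below illustrates the kind of per-document metadata and signposting an eTMF might hold so that records and their context can be located at inspection. The zone/section scheme loosely echoes the idea of a standard reference model such as the DIA TMF Reference Model; all identifiers and values are invented for the example.

```python
# Each TMF record carries filing metadata, version, and its own audit trail.
tmf_index = {
    "doc-0042": {
        "title": "Protocol v3.0, signed",
        "zone": "Trial Management",
        "section": "Protocol and Amendments",
        "version": "3.0",
        "status": "final",
        "date_filed": "2024-02-14",
        "audit_trail": [
            {"action": "uploaded", "by": "jsmith", "at": "2024-02-14T09:02Z"},
            {"action": "approved", "by": "alee", "at": "2024-02-15T11:30Z"},
        ],
    },
}

def documents_in_zone(zone: str) -> list[str]:
    """Signpost: list the document IDs filed under a given TMF zone."""
    return [doc_id for doc_id, meta in tmf_index.items()
            if meta["zone"] == zone]

assert documents_in_zone("Trial Management") == ["doc-0042"]
```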
Using a specialized eTMF system can greatly assist with the challenge of meeting
the above requirements and streamline the processes of managing the active eTMF
and of archiving the eTMF at study close-out. An eTMF system can also solve some
of the issues faced by the Sponsor when the TMF documentation is being generated
by a number of collaborating partners.
The TMF should be monitored throughout the course of the trial for quality in terms
of completeness, timeliness, document quality, and ease of retrieval.
Document Archiving
The Sponsor is responsible for archiving the TMF at the end of the study. An archive
is a physical facility or an electronic system designated for the secure long-term
retention and maintenance of archived materials. In the UK, such a system must be
under the control of one or more named archivists, with access limited to those
individuals.
Responsibilities for archiving the TMF should be agreed between the Sponsor,
the sites and all other parties involved in the trial such as the coordinating center,
ISC, and monitors. The Sponsor should ensure that all parties involved are capable
of archiving their TMFs in a manner that meets GCP requirements. If files are
transferred from one entity to another, then the transfer process should be tested
and validated to ensure that all files are transferred completely and correctly.
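One common way to test such a transfer, sketched below under the assumption of file-based archives, is to compare a cryptographic checksum of every file before and after the move; a validated process would also document each verification run.

```python
import hashlib
from pathlib import Path

def checksum(path: Path) -> str:
    """SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {str(p.relative_to(root)): checksum(p)
            for p in sorted(root.rglob("*")) if p.is_file()}

def verify_transfer(source: Path, destination: Path) -> bool:
    """True only if every file arrived, with identical contents."""
    return manifest(source) == manifest(destination)
```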
The required retention period for archiving the TMF will depend on the applica-
ble regulatory requirements. The principles for archiving paper and electronic TMFs
are the same:
(a) All archived materials must be stored in a way that ensures their integrity and
continued access throughout the required period of retention.
(b) Procedures should be established for making archived materials available for
inspection, e.g., by regulatory authorities.
(c) Any alteration to archived records shall be traceable.
(d) Any transfer of materials within the archive from one location, media, or file
format to another must be documented and validated if appropriate.
(e) The Sponsor shall maintain contact with any external organization that archives
its materials throughout the period of retention.
Summary
The requirement to be able to “reproduce” the conduct of a trial from the TMF leads
to an increasing volume of documentation required for each trial, particularly if it is to
be used in a regulatory submission. While there are agreed international standards for
trial conduct defined by the International Conference on Harmonization, there are
national variations on these around the world. It is important for the Sponsor of the
trial to ensure that all parties involved in the trial are fully aware of their responsibilities
in terms of documentation and that all national requirements are met.
Use of standard templates and document management systems and following
ALCOA+ principles will make it easier to develop, maintain, and implement
essential documents and SOPs.
Key Facts
Cross-References
▶ Data Capture, Data Management, and Quality Control; Single Versus Multicenter
Trials
▶ Design and Development of the Study Data System
▶ Good Clinical Practice
▶ International Trials
▶ Responsibilities and Management of the Clinical Coordinating Center
▶ Trial Organization and Governance
References
Ben-Yehuda N, Oliver-Lumerman A (2017) Fraud and misconduct in clinical research: detection, investigation and organizational response. University of Michigan Press. ISBN 9780472130559
DIA (2018) TMF Reference Model. Retrieved from: https://tmfrefmodel.com/resources/
Ellenberg S, Fleming T, DeMets D (2003) Data monitoring committees in clinical trials. Wiley, New York
Good Clinical Practice Guidelines, E6 (R2) (2016). Retrieved from: http://www.ich.org/products/guidelines/efficacy/efficacy-single/article/integrated-addendum-good-clinical-practice.html
McFadden E et al (2015) The impact of registration of clinical trials units: the UK experience. Clin
Trials 12(2):166–173
Weiss RB (1998) Systems of protocol review, quality assurance and data audit. Cancer Chemother Pharmacol 42(Suppl 1):S88
21 Consent Forms and Procedures
Ann-Margret Ervin and Joan B. Cobb Pettit
Contents
Introduction ... 390
Who May Obtain Informed Consent? ... 391
Who May Provide Informed Consent or Assent? ... 391
Lack of Capacity ... 392
Other Vulnerabilities ... 392
Nonnative Speaker ... 392
Parental Permission ... 392
Assent ... 393
Community Approval ... 393
What Must the Documentation of Informed Consent Include? ... 394
Consent Materials ... 401
Consent Discussion ... 401
Understandable Language ... 402
Context ... 402
Assessing Comprehension ... 403
Re-consent ... 403
Termination of Consent ... 404
Regulatory Requirements for Informed Consent in Canada and the United Kingdom ... 404
Canada ... 404
United Kingdom ... 406
Abstract
Obtaining the informed consent of a participant is a prerequisite for enrollment in a clinical trial. In the United States, federal regulations provide the framework for establishing informed consent, with additional protections for persons considered vulnerable due to incarceration, illiteracy, or another condition. Investigators are tasked with providing sufficient information about the research to satisfy the ethical and regulatory requirements while communicating it in a manner that maximizes the participant's ability to make an informed decision regarding study enrollment. Certain clinical trial design features are essential to include in the consent form, with care taken to describe topics such as randomization, allocation ratio, and masking in a manner understood by the lay public. The informed consent discussion should continue throughout the course of the trial: informally reaffirming the participant's willingness to continue participation, and re-consenting participants when there are significant changes to the study protocol, are important considerations for providing truly informed consent.
Keywords
Informed consent · Assent · Consent forms · Institutional review board
Introduction
The guidance in this chapter primarily pertains to clinical trials conducted in the
United States, but the general principles may apply more broadly to trials conducted
elsewhere.
Before enrolling a potential participant in a clinical trial, investigators must obtain
the individual’s informed consent. While people may think they know what “informed
consent” is, there is no set formula as to how to best achieve it. Informed consent is a
conversation between the participant and investigator that begins at recruitment and
ends with study exit. Presenting information about the trial to the target population,
taking into consideration the common characteristics of that population, e.g., age, sex,
common disease, or condition, can be challenging. The ethical, legal, and procedural
components of informed consent are intertwined; and while the ethical objective is
static, regulatory/legal authorities will modify requirements over time, causing
changes to procedural mechanisms. Investigators must have operational systems in
place to ensure compliance with regulatory changes that may affect their studies.
The Office of Human Research Protections (OHRP), the US government entity
that oversees federally funded human subjects research for the Department of Health
and Human Services (DHHS), describes the investigator’s obligation in the Code of
Federal Regulations, Title 45, Part 46 (45 CFR 46.116):
Except as provided elsewhere in this policy, before involving a human subject in research
covered by this policy, an investigator shall obtain the legally effective informed consent
of the subject or the subject’s legally authorized representative. An investigator shall seek
informed consent only under circumstances that provide the prospective subject or the
legally authorized representative sufficient opportunity to discuss and consider whether or
not to participate and that minimize the possibility of coercion or undue influence.
The information that is given to the subject or the legally authorized representative shall be in language understandable to the subject or the legally authorized representative. The prospective subject or the legally authorized representative must be provided with the information that a reasonable person would want to have in order to make an informed decision about whether to participate, and an opportunity to discuss that information. (Department of Health and Human Services, Office for Human Research Protections n.d.)
Who May Obtain Informed Consent?

Under guidance from the OHRP, obtaining informed consent is a study procedure that, alone, is enough to establish that a research institution is engaged in human subjects research and must have oversight by an Institutional Review Board (IRB) (also known as an Ethics Committee or Research Ethics Committee (REC) in many parts of the world) (Department of Health and Human Services, Office for Human Research Protections 2008). The IRB has the obligation of ensuring that the investigators performing research activities are qualified and trained to do so. Thus, the consent designees for a trial must be identified, and their qualifications and training submitted to the IRB for review and approval. The investigator must identify consent designees who are appropriately credentialed to obtain consent for clinical procedures. Qualified study team members may also work with the investigators and consent designees to support the informed consent process.
If the person who obtains informed consent has a clinical care relationship with
the potential participant, the issue of “therapeutic misconception” arises (Sisk and
Kodish 2018). Thus, when the clinician introduces a research study, the patient may
interpret that introduction as a recommendation from the clinician, and may improp-
erly attribute the possibility of direct personal benefit to study participation. The
clinician must clearly explain that the trial is separate from clinical care and the
patient’s decision about participation will not affect the care that the clinician
provides.
Who May Provide Informed Consent or Assent?

Participants, or their legal agents, must provide their voluntary informed consent in order to enroll in a research study. Adults, as defined by the locale's law on the age of majority, may provide informed consent for themselves unless they lack the capacity to do so.
Lack of Capacity
Other Vulnerabilities
Some adults may not lack capacity to consent, but may be otherwise vulnerable due
to incarceration, immigrant status, pregnancy, illiteracy, or another similar situation.
Each participant must be considered as a unique individual, and investigators must
consider and address any personal characteristics that may obstruct the possibility of
obtaining legally effective informed consent.
Nonnative Speaker
Parental Permission
Parents (biological or adoptive parents) and legal guardians must provide parental
permission for minors (under the age of majority) to participate unless the study falls
under a specific exception to this rule. The exceptions are:
• The study addresses a topic that is protected by a statute that allows minors to
provide consent for themselves.
• The study meets the standards for waiver of informed consent and the IRB agrees
to waive parental permission.
• The study includes a population of children for whom requiring parental/guardian
permission is not appropriate given the study topic or the special characteristics of
their relationship, and the regulations governing the study permit the IRB to
approve a substituted mechanism to protect the minor participants. The DHHS
provides this exception at 45 CFR 46.408(c), with the example being a population
of neglected or abused children.
Depending upon the risk associated with the trial and the prospect of direct
personal benefit, permission may be required from one or both parents. In the
Randomized Trial of Peanut Consumption in Infants at Risk for Peanut Allergy
(LEAP Study), infants as young as 4 months old at high risk for a peanut allergy
were randomly assigned to avoid or consume peanuts in their diet until 5 years of
age. Because the IRB determined that the study offered the possibility of direct
personal benefit to the minor participants, consent was obtained from one parent or
guardian (Du Toit et al. 2015).
Assent
Minors must have the opportunity to provide assent to study participation. Assent
means “a child’s affirmative agreement to participate in research. Mere failure to
object should not, absent affirmative agreement, be construed as assent” (45 CFR
46.402(b)). The IRB must determine whether children have the capacity to assent to
trial participation. The IRB must consider the “ages, maturity, and psychological
state of the children involved” (21 CFR 50.55(b), 45 CFR 46.408(a)). This deter-
mination may be for all children participating, for subgroups, or for individual
minors. Assent may be waived for certain minimal risk studies, and for an FDA-
regulated study which “holds out a prospect of direct benefit that is important to the
health or well-being of the children and is available only in the context of the clinical
investigation” (21 CFR 50.55(c)(2)).
Community Approval
For clinical trials that occur in community settings and in some international
locations, investigators may obtain the consent or approval from local leaders.
Community acceptance of the proposed trial may be important to successful recruit-
ment and enrollment of study participants. It is also important to establish a com-
munication mechanism with the community to facilitate the dissemination of results
once the trial is completed. The Surveillance and Azithromycin Treatment for
Newcomers and Travelers Evaluation (ASANTE) Trial recruited 52 communities
in the Kongwa district, Tanzania, to receive annual mass drug administration (MDA)
or annual MDA plus a surveillance and treatment program for newcomers and
travelers to determine if the surveillance program would reduce infection with
Chlamydia trachomatis (Ervin et al. 2016). In the ASANTE Trial, community
leaders provided verbal consent for the participation of the community. Guardians
provided consent to enroll children, and children aged 7 years and older provided
assent to participate.
The consent form for a clinical trial must include specific regulatory elements
that may change over time, as well as provisions required by the institution where
the research will take place. The US regulations governing federally funded
human subjects research from 1991 until the 2018 revisions required eight
basic elements and provided six “additional elements” to be added to consent
documents when appropriate (Department of Health and Human Services, Office
for Human Research Protections n.d.). The revised regulations, effective January
21, 2019, include new basic and additional elements and new format require-
ments. Some of these changes increase the focus on the collection and use of
identifiable data and biospecimens.
While the consent form is the primary documentary evidence that a study
interaction with a participant occurred, other notes that the study team records
contemporaneously about that interaction may help record the context of the
discussion. The consent form may also reference other IRB-approved tools, such
as brochures, videos, and patient information sheets that the study team may use
to help explain the study. Table 1 presents the current US regulatory requirements
for informed consent. HIPAA Authorization is included as well, since many clinical
trials in the United States must also comply with the HIPAA mandate.
Consent forms for clinical trials will also include a description of the study arms,
including the use of placebos and sham procedures. The method of assigning
participants to different study arms should be discussed, along with a lay description
of the allocation ratio (e.g., explaining that with 2:1 allocation, two of every three
participants are expected to receive the intervention). Additional important elements include:
Table 1 (continued)

Description (continued from the preceding page):
. . . result from participation in the research;
4. The consequences of a subject’s decision to withdraw from the research and procedures for orderly termination of participation by the subject;
5. A statement that significant new findings developed during the course of the research which may relate to the subject’s willingness to continue participation will be provided to the subject; and
6. The approximate number of subjects involved in the study.

Requirement: January 2018 revisions to 45 CFR 46.116; effective January 21, 2019
Description: Expanded general requirements; 46.116(a):
(1) Before involving a human subject in research covered by this policy, an investigator shall obtain the legally effective informed consent of the subject or the subject’s legally authorized representative.
(2) An investigator shall seek informed consent only under circumstances that provide the prospective subject or the legally authorized representative sufficient opportunity to discuss and consider whether or not to participate and that minimize the possibility of coercion or undue influence.
(3) The information that is given to the subject or the legally authorized representative shall be in language understandable to the subject or the legally authorized representative.
(4) The prospective subject or the legally authorized representative must be provided with the information that a reasonable person would want to have in order to make an informed decision about whether to participate and an opportunity to discuss that information.
(5) Except for broad consent obtained in accordance with paragraph (d) of this section:
(i) Informed consent must begin with a concise and focused presentation of the key information that is most likely to assist a prospective subject or legally authorized representative in understanding the reasons why one might or might not want to participate in the research. This part of the informed consent must be organized and presented in a way that facilitates comprehension.
(ii) Informed consent as a whole must present information in sufficient detail relating to the research, and must be organized and presented in a way that does not merely provide lists of isolated facts, but rather facilitates the prospective subject’s or legally authorized representative’s understanding of the reasons why one might or might not want to participate.
(6) No informed consent may include any exculpatory language through which the subject or the legally authorized representative is made to waive or appear to waive any of the subject’s legal rights, or releases or appears to release the investigator, the sponsor, the institution, or its agents from liability for negligence.

Requirement: New basic elements: 46.116(b)
Description: (b)(9):
(i) A statement that identifiers might be removed from the identifiable private information or identifiable biospecimens and that, after such removal, the information or biospecimens could be used for future research studies or distributed to another investigator for future research studies without additional informed consent from the subject or the legally authorized representative, if this might be a possibility; or
(ii) A statement that the subject’s information or biospecimens collected as part of the research, even if identifiers are removed, will not be used or distributed for future research studies.

Requirement: New additional elements: 46.116(c)
Description: These require that a subject be informed of the following, when appropriate:
(7) That the subject’s biospecimens (even if identifiers are removed) may be used for commercial profit and whether the subject will or will not share in this commercial profit;
(8) Whether clinically relevant research results, including individual research results, will be disclosed to subjects, and if so, under what conditions;
(9) For research involving biospecimens, whether the research will (if known) or might include whole genome sequencing (i.e., sequencing of a human germline or somatic specimen with the intent to generate the genome or exome sequence of that specimen).
For the purposes of this chapter, the concept of “. . . broad consent for the storage, maintenance, and secondary research use of identifiable private information or identifiable biospecimens,” introduced by the DHHS in the 2018 revisions to the Common Rule, will not be addressed.

Requirement: HIPAA authorization requirements, if the study involves using or disclosing protected health information (PHI) from a U.S. covered entity
Description: HIPAA authorization core elements (see Privacy Rule, 45 C.F.R. § 164.508(c)(1)):
A description of the protected health information (PHI) to be used or disclosed (identifying the information in a specific and meaningful manner).
The name(s) or other specific identification of the person(s) or class of persons authorized to make the requested use or disclosure.
The name(s) or other specific identification of the person(s) or class of persons who may use the PHI or to whom the covered entity may make the requested disclosure.
A description of each purpose of the requested use or disclosure. Researchers should note that this element must be research study specific, not for future unspecified research.
The authorization expiration date or an event that relates to the individual or to the purpose of the use or disclosure (the terms “end of the research study” or “none” may be used for research, including for the creation and maintenance of a research database or repository).
The signature of the individual and the date. If the Authorization is signed by an individual’s personal representative, a description of the representative’s authority to act for the individual.
HIPAA authorization required statements (see Privacy Rule, 45 C.F.R. § 164.508(c)(2)):
The individual’s right to revoke his/her Authorization in writing and either (1) the exceptions to the right to revoke and a description of how the individual may revoke Authorization or (2) reference to the corresponding section(s) of the covered entity’s Notice of Privacy Practices.
Notice of the covered entity’s ability or inability to condition treatment, payment, enrollment, or eligibility for benefits on the Authorization, including research-related treatment, and, if applicable, consequences of refusing to sign the Authorization.
The potential for the PHI to be re-disclosed by the recipient and no longer protected by the Privacy Rule. This statement does not require an analysis of risk for re-disclosure but may be a general statement that the Privacy Rule may no longer protect health information.

Requirement: NIH Certificate of Confidentiality (suggested language)
Description: “This research is covered by a Certificate of Confidentiality from the National Institutes of Health. The researchers with this Certificate may not disclose or use information, documents, or biospecimens that may identify you in any federal, state, or local civil, criminal, administrative, legislative, or other action, suit, or proceeding, or be used as evidence, for example, if there is a court subpoena, unless you have consented for this use. Information, documents, or biospecimens protected by this Certificate cannot be disclosed to anyone else who is not connected with the research except, if there is a federal, state, or local law that requires disclosure (such as to report child abuse or communicable diseases but not for federal, state, or local civil, criminal, administrative, legislative, or other proceedings, see below); if you have consented to the disclosure, including for your medical treatment; or if it is used for other scientific research, as allowed by federal regulations protecting research subjects.”

Requirement: NIH guidance on consent for future research use and broad sharing of human genomic and phenotypic data subject to the NIH Genomic Data Sharing Policy (2015)
Description: In order to meet the expectations for future research use and broad sharing under the GDS Policy, the consent should capture and convey, in language understandable to prospective participants, information along the following lines:
Genomic and phenotypic data, and any other data relevant for the study (such as exposure or disease status), will be generated and may be used for future research on any topic and shared broadly in a manner consistent with the consent and all applicable federal and state laws and regulations.
Prior to submitting the data to an NIH-designated data repository, data will be stripped of identifiers such as name, address, account, and other identification numbers and will be de-identified by standards consistent with the Common Rule. Safeguards to protect the data according to Federal standards for information protection will be implemented.
Access to de-identified participant data will be controlled, unless participants explicitly consent to allow unrestricted access to and use of their data for any purpose.
Because it may be possible to re-identify de-identified genomic data, even if access to data is controlled and data security standards are met, confidentiality cannot be guaranteed, and re-identified data could potentially be used to discriminate against or stigmatize participants, their families, or groups. In addition, there may be unknown risks.
No direct benefits to participants are expected from any secondary research that may be conducted.
Participants may withdraw consent for research use of genomic or phenotypic data at any time without penalty or loss of benefits to which the participant is otherwise entitled. In this event, data will be withdrawn from any repository, if possible, but data already distributed for research use will not be retrieved.
The name and contact information of an individual who is affiliated with the institution, is familiar with the research, and will be available to address participant questions.

Requirement: GINA (if appropriate)
Description: “A Federal law, called the Genetic Information Nondiscrimination Act (GINA), generally makes it illegal for health insurance companies, group health plans, and most employers to discriminate against you based on your genetic information. This law generally will protect you in the following ways:
Health insurance companies and group health plans may not request your genetic information that we get from this research.
Health insurance companies and group health plans may not use your genetic information when making decisions regarding your eligibility or premiums.
Employers with 15 or more employees may not use your genetic information that we get from this research when making a decision to hire, promote, or fire you or when setting the terms of your employment.”

Requirement: Conflict of interest
Description: A statement that one or more investigators have a financial or other conflict of interest with the study and how it has been managed.

Requirement: FDA “applicable clinical trials”
Description: Under 21 CFR 50.25(c), the following statement must be reproduced word-for-word in informed consent documents for applicable clinical trials:
“A description of this clinical trial will be available on http://www.ClinicalTrials.gov, as required by U.S. Law. This Web site will not include information that can identify you. At most, the Web site will include a summary of the results. You can search this Web site at any time.”
Consent Materials
Consent Discussion
The consent process must precede any study-related activities, including screening
for eligibility, whether the discussion takes place in-person, by phone, or other
remote method. Typically, a study team member approved to obtain informed
consent reviews the consent form with the potential participant and answers any
questions. The study team member must be cognizant of anything that might
interfere with a participant’s ability to make an informed decision (e.g., illiteracy,
language barriers, or hearing, visual, or cognitive impairment). The study team member
must allow time for the prospective participant to consider whether to participate and
to ask questions about the study. In certain circumstances, the initial consent
discussion may extend over time to allow the prospective participant to consult
with their physician and/or family members. Study team members should not ask for
consent when the potential participant feels exposed or vulnerable, for example,
when lying on a gurney approaching the operating theater, or when their
deliberative faculties may be compromised by severe pain, anxiety, or the influence of
medication. When the participant is satisfied with the discussion and agrees to
participate, the participant, and when applicable, the person obtaining consent, sign
and date the consent document. If the study involves clinical procedures for which
only credentialed clinicians may obtain consent, the process may be bifurcated such
that a trained study team member discusses the consent form with the participant,
and then the clinician reviews the consent form with the participant and answers
questions. Then, the participant, research staff member, and clinical research staff
member all sign and date the consent document. Some trials may include more than
one consent form, as in the Randomized Trial of Achieving Healthy
Lifestyles in Psychiatric Rehabilitation (ACHIEVE). The aim of ACHIEVE was to
assess the efficacy of a behavioral weight loss intervention among persons diagnosed
with a serious mental illness who participate in a psychiatric rehabilitation
program (Casagrande et al. 2010). Persons at participating rehabilitation centers
provided oral consent prior to screening for ACHIEVE, which included measurement
of their weight and height. Persons expressing an interest in ACHIEVE were asked to sign a written
consent form for procedures related to eligibility screening and a second consent
form before randomization.
Understandable Language
Context
The consent discussion cannot take place under circumstances that introduce a threat
that might make prospective participants feel that they must participate (coercion), or
that impose undue influence over the decision such that participants join, or remain
in, a study in which they otherwise would not elect to participate. These conditions
could undermine the voluntary nature of the decision to participate. Investigators
must consider the participant’s situation and respect their privacy.
Assessing Comprehension
Re-consent
Termination of Consent
A study participant retains the right to leave a study at any time; the consent process
must explicitly communicate that right to participants, and if appropriate, the
consequences of that decision. It should be clear to the participant whether follow-
up is necessary for their own safety and well-being or if there are any other pro-
cedures that should occur as a result of that decision.
Canada
may be written and a signed consent is required for research regulated under the
Health Canada Food and Drugs Act. The TCPS 2 also acknowledges that oral
consent, field notes, exchange of gifts, and other methods may be warranted for
documenting consent, as cultural norms and research settings vary.
The TCPS 2 addresses the accommodations provided when persons lack the
capacity to consent. In this instance, the investigator must ensure that the research
has a direct benefit to the participant or to persons similar to the participant. If
the investigator is unable to show a direct benefit, then the research must pose no
more than minimal risk and low burden to the participant. A third party, who is not the
investigator or a member of the research staff, will be asked to provide consent
on behalf of the participant. If during the course of the trial, the participant regains
the capacity to consent, informed consent will be obtained. Assent may be obtained
if the participant has some capacity to comprehend the aims of the research. The
TCPS 2 further advises investigators and persons who may be asked to provide
consent on behalf of a participant to review research directives for guidance on the
participant’s preference regarding participation in research activities. A research
directive does not, however, modify the Tri-Council’s requirements for informed
consent.
United Kingdom
Guidance on informed consent for clinical trials in the United Kingdom (UK) is
provided in the Medicines for Human Use Clinical Trials Regulations (MHCTR)
and Guidelines for Good Clinical Practice (European Medicines Agency Interna-
tional Conference (n.d.); The Medicines for Human Use (Clinical Trials) Amend-
ment (No. 2) Regulations 2006). The underlying ethical principles and general
content requirements do not differ from those of the US and Canada. A partic-
ipant information sheet (PIS) is prepared to support the consent process. The PIS
provides a summary of the trial, including the background and objectives, the
expectations for volunteers participating in the trial, risks and benefits, what data
will be used and who has access to these data, information on withdrawing from
the study, and how the results of the trial will be disseminated while maintaining
participant confidentiality. The style and length of a PIS are often tailored to
inform the persons providing consent or advice on study participation, including
children, legal representatives, and relatives.
There are special protections for vulnerable populations, including adults who
lack the capacity to consent, children, pregnant women, and patients participating in
emergency research. The requirements for consent for vulnerable populations
may depend on the location of the research in the UK (England and Wales, Scotland,
or Northern Ireland) and the study type. For clinical trials of investigational drugs or
devices, a legal representative may provide consent for adults who are unable to
consent for themselves in England, Wales, Scotland, and Northern Ireland. For all
UK nations, the representative may be a person who has a relationship with the adult
but is not involved in trial conduct (personal representative) or a professional
representative, such as a treating physician who is not involved in the study.
Scottish regulations further specify that a personal legal representative could be a
welfare guardian or attorney; if one has not been appointed for the adult, then the
closest relative may act. For greater than minimal risk research in England, Wales, and
Northern Ireland that does not include investigational products, a person who
cares for or has an interest in the adult’s well-being (a personal consultee) or a
nominated consultee (a person independent of the study) can provide an opinion
on whether the adult would be willing to participate in the study. This opinion is
recorded on a Consultee Declaration Form. In Scotland, a legal representative is
asked to provide consent for research that does not include investigational products.
Specific requirements for children and emergency research in the UK are outlined in
Tables 3 and 4.
While there are regulatory and institutional requirements for obtaining consent for
clinical trial participation, investigators must also take steps to ensure that the
process maximizes the potential participant’s ability to make an informed decision.
Additional protections are necessary for vulnerable populations. Discussions should
occur in the appropriate context and supplemental materials may be important to
illustrate specific procedures and expected contacts during the course of the trial.
Assessing the potential participant’s comprehension of specific elements of the trial
should be considered particularly when the methods are complex and participation is
expected over an extended period. Informed consent discussions should be
continuous, and written or other communications should be distributed to update
participants during the course of the trial. Re-consent should be considered when
modifications may affect the participant’s willingness to participate in the trial.
Key Facts
Table 4 (continued)

Children lacking capacity to give consent (the requirements are identical in England, Wales, and Northern Ireland and in Scotland): May be included without consent if (1) the research has potential benefits to the child, (2) the research has been approved by the National Health Service’s research ethics committee, (3) the research cannot be addressed in a nonemergent setting, (4) a parent (or guardian) is notified as soon as possible, (5) consent and assent, when appropriate, are obtained as soon as possible, and (6) the child and/or the parent or guardian are informed that the child can withdraw at any time.
Cross-References
References
Alzheimer’s Association (2004) Research consent for cognitively impaired adults: recommenda-
tions for institutional review boards and investigators. Alzheimer Dis Assoc Disord 18
(3):171–175. https://doi.org/10.1097/01.wad.0000137520.23370.56
Casagrande SS, Jerome GJ, Dalcin AT, Dickerson FB, Anderson CA, Appel LJ, Charleston J,
Crum RM, Young DR, Guallar E, Frick KD, Goldberg RW, Oefinger M, Finkelstein J,
Gennusa JV, Fred-Omojole O, Campbell LM, Wang N-Y, Daumit GL (2010) Randomized
trial of achieving healthy lifestyles in psychiatric rehabilitation: the ACHIEVE trial.
BMC Psychiatry 10:108. https://doi.org/10.1186/1471-244X-10-108
Department of Health and Human Services, Office for Human Research Protections (2008)
Engagement of institutions in human subjects research. Available at https://www.hhs.gov/
ohrp/regulations-and-policy/guidance/guidance-on-engagement-of-institutions/index.html.
Accessed 23 June 2020
Department of Health and Human Services, Office for Human Research Protections (n.d.) Protec-
tion of Human Subjects 45 CFR §46.116 (a) and (b). Available at https://www.hhs.gov/ohrp/
regulations-and-policy/regulations/45-cfr-46/index.html. Accessed 23 June 2020
Department of Health and Human Services, US Food and Drug Administration (2014) Informed
consent information sheet: guidance for IRBs, clinical investigators, and sponsors. Available at
https://www.fda.gov/RegulatoryInformation/Guidances/ucm404975.htm#genrequirments.
Accessed 23 June 2020
Du Toit G, Roberts G, Sayre PH, Bahnson HT, Radulovic S, Santos AF, Brough HA, Phippard D,
Basting M, Feeney M, Turcanu V, Sever ML, Lorenzo MG, Plaut M, Lack G for the LEAP
Study Team (2015) Randomized trial of peanut consumption in infants at risk for peanut allergy.
N Engl J Med 372:803–813
Ervin AM, Mkocha H, Munoz B, Dreger K, Dize L, Gaydos C, Quinn TC, West SK (2016)
Surveillance and azithromycin treatment for newcomers and travelers evaluation (ASANTE)
trial: design and baseline characteristics. Ophthalmic Epidemiol 23(6):347–353. https://doi.org/
10.1080/09286586.2016.1238947
European Medicines Agency International Conference on Harmonisation Guideline for Good
Clinical Practice ICH GCP E6(R2) Step 5. Available at https://www.ema.europa.eu/en/docu
ments/scientific-guideline/ich-e-6-r2-guideline-good-clinical-practice-step-5_en.pdf. Accessed
23 June 2020
Grady C, Touloumi G, Walker AS, Smolskis M, Sharma S, Babiker AG, Pantazis N, Tavel J,
Florence E, Sanchez A, Hudson F, Papadopoulos A, Emanuel E, Clewett M, Munroe D,
Denning E, The INSIGHT START Informed Consent Substudy Group (2017) A randomized
trial comparing concise and standard consent forms in the START trial. PLoS One 12(4):
e0172607
Kao CY, Aranda S, Krishnasamy M, Hamilton B (2017) Interventions to improve patient
understanding of cancer clinical trial participation: a systematic review. Eur J Cancer Care 26:
e12424. https://doi.org/10.1111/ecc.12424
National Institutes of Health, National Heart, Lung, and Blood Institute (2011) Questions and
answers: PANTHER-IPF study. Available at https://www.nhlbi.nih.gov/node-general/questions-
and-answers-panther-ipf-study. Accessed 23 June 2020
Sisk BA, Kodish E (2018) Therapeutic misperceptions in early-phase cancer trials: from categorical
to continuous. IRB Ethics Hum Res 40(4):13–20
The Idiopathic Pulmonary Fibrosis Clinical Research Network (2012) Prednisone, azathioprine,
and N-acetylcysteine for pulmonary fibrosis. N Engl J Med 366:1968–1977
The Medicines for Human Use (Clinical Trials) Amendment (No. 2) Regulations (2006).
Available at http://www.legislation.gov.uk/uksi/2006/2984/pdfs/uksi_20062984_en.pdf.
Accessed 23 June 2020
Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans – TCPS2 (2018).
Available at http://www.ethics.gc.ca/eng/policy-politique_tcps2-eptc2_2018.html. Accessed 23
June 2020
22 Contracts and Budgets

Eric Riley and Eleanor McFadden
Contents
Introduction 412
Funding Landscape 413
Types of Clinical Research 413
Key Funding Sources 414
Key Differences in Funding Models 415
Distribution of Funds 415
Request for Proposal 416
Preparation of Proposal and Budget 417
Budget Considerations 417
Budget Format 418
Preparing the Response to a Request for Proposal 419
Selection of Relevant Partners 419
Negotiation of Contract Terms 419
Contract Content 420
Clinical Trial Agreement Guidance 421
Scope of Work 421
Budget Evaluation 422
Signature of Contract 423
Activation of Trial 423
Other Legal Documents/Contracts/Contract Amendments 423
Contract Amendments and Budget Review 424
Summary and Conclusion 424
Key Facts 424
Cross-References 425
References 425
Abstract
The clinical research landscape in the twenty-first century continues to evolve.
Over the last three decades, the clinical trials landscape has changed dramatically
with increased regulations, worldwide standards (ICH E6 Guidelines), and intense
scrutiny. Most recently, there has been substantial impact from changes to legisla-
tion on data protection (US Health Insurance Portability and Accountability Act,
EU General Data Protection Regulations) rather than legislation directed specifi-
cally at clinical trials. There is an increase in large multicenter clinical trials, many
of them international in scope. Trials are increasingly becoming more automated
and complex in their design, management, and implementation.
The complexity of the clinical trials environment and the increase in regulations
require all those involved, whatever their level of contribution, to adapt to
the changes and to be very clear on the costs associated with carrying out research
(Mashatole, Conducting clinical trials in the 21st century- adapting to new ways
and new methods. Retrieved from https://www.clinicaltrialsarena.com/news/
conducting-clinical-trials-in-the-21st-century-adapting-to-new-wasy-and-new-
methods-4835722-2/, 2016). Equally important, the terms of agreement and
division of responsibility between relevant parties involved in a trial, including
the Sponsor and/or funder, must be clearly specified in advance in the form of a
legally binding agreement.
This chapter provides guidelines for preparing Clinical Trial Agreements/
Contracts and for developing budgets/funding requests for the work involved,
both essential activities during the start-up phase of a trial.
Keywords
Contract · Clinical Trial Agreement · Data Transfer Agreement · Budget ·
Sponsor
Introduction
In this chapter, the focus is primarily on steps 2 and 4 in the above list, but the
other steps will be touched on briefly. First, it is important to provide an overview
of the funding landscape of clinical research and a background to the key issues,
which ultimately inform the budget development process and more specifically the
steps above.
Funding Landscape
There have been many changes in the approaches, regulations, and funding models
of clinical research in the past 30 years, ultimately affecting the source and level of
investment. In 1991, 80% of US clinical trials were funded by government or
philanthropic organizations, including one of the largest sponsors of biomedical
research in the world, the US federal government. However, this share has been
in steady decline as the pharmaceutical industry’s contributions continue to grow.
By 2005, industry funded an estimated 70% of US clinical trials. Because
commercial research now assumes a larger share of the market, academic
research that was once mainly funded through public grants is having to rely more
on industry funding. As a result, these academic research groups need to
adjust how they operate in order to align with the goal of industry, which is to bring
drugs to market in a timely and efficient manner (Pfeiffer and Russo 2016).
Much of the reason for the increased cost of doing research is the need for
compliance with legal and regulatory requirements. For example, the initial EU
Directive on Clinical Trials (Directive 2001/20/EC) stipulated that the same standard
of conduct had to be maintained for all drug trials, regardless of whether an
investigational agent was involved. These regulations have been relaxed slightly,
but there are still extensive standards to follow.
In several countries, private nonprofit organizations and state/local governments
have been increasing their stake in research and now compete with the traditional
key stakeholders. As all parties battle the increasing costs and complexities of
clinical trials, which have been accelerated by global recessions and fluctuations in
funding, research organizations in every sector have to strategically position them-
selves and find ways to work with each other for sustainability and advancement of
their missions (Hind et al. 2017).
Types of Clinical Research

The contract and budget process is connected to the funding model for a particular
trial, which from a high-level perspective depends on the type of
research. Understanding the type of research being conducted makes it easier to
appreciate the key elements that define these processes. Regardless of
type, however, the information generated by any clinical research ultimately leads to
the commercialization of a new drug or device, advancement of scientific knowledge,
and/or changes in health policy and legislation (Camps et al. 2017).
Broadly speaking, there are two types of clinical research, noncommercial and
commercial. In noncommercial research, the aim of the research project is to
generate knowledge that benefits the good of the wider public. It usually involves
government or nonprofit organizations such as academic institutions or foundations,
but it can also involve financial support from commercial entities such as a pharma-
ceutical or a biotech company, often in the form of an educational grant.
Commercial research is mainly sponsored by private industry. While there is a
genuine interest in advancing the development of scientific knowledge, there are
inevitably financial interests in this type of research specifically related to bringing a
new product to market or expanding the use of an existing product. This type of
research typically involves one organization which funds, designs, and carries out a
clinical trial either entirely under one roof or with delegated tasks outsourced to other
organizations which provide research services, such as a Clinical Research Organisation
(CRO). CROs can provide a range of services, from project management,
database design and build, clinical trial data management, and statistical analysis to
administration of Independent Data Monitoring Committees (IDMC)/Data Safety
Monitoring Boards (DSMB). Due to this diversity and scope, they are becoming a
major force in drug development and clinical trial recruitment (Carroll 2005). Other
organizations such as AROs (Academic Research Organisations) and CCOs (Con-
tract Commercial Organisations) can also be involved in providing a range of
specialist services to support clinical research such as recruitment support, clinical
knowledge and expertise, marketing and regulatory filing support, and drug launch.
Key Funding Sources

Within each type of research, there are various funding sources and mechanisms. As
part of their drug development programs, private industry is naturally a major funder
of clinical research. Another significant source of investment is the US federal
government, namely, through the National Institutes of Health (NIH). The NIH is
considered the largest funder among the world’s top public and philanthropic
organizations, investing $26.1 billion in health research. The European
Commission and the UK Medical Research Council follow in second and third
places for research investment, with recent investments of $3.7 billion
and $1.3 billion, respectively (Viergever and Hendriks 2016).
Charities like Cancer Research UK and the Wellcome Trust make significant
investments in research and development in the UK and worldwide (Cooksey
2006). Private donors or foundations such as the Bill and Melinda Gates Founda-
tion and The Michael J. Fox Foundation for Parkinson’s Research also fund
clinical research. Additionally, there are global philanthropic groups such as The
European Organisation for Research and Treatment of Cancer (EORTC) or the
22 Contracts and Budgets 415
Breast International Group (BIG) who support and fund research in specific
disease areas. An organization such as BIG has expansive global reach and
influence. As a key stakeholder in breast cancer research, BIG represents a network
of collaborative groups connecting over 59 academic research groups on over 30
clinical trials or research programs at a given time and affecting over 95,000
patients since its inception (BIG 2018).
Key Differences in Funding Models

Overall, the motivation and incentives for each funding model vary. Often the
motivation for noncommercial research does not align with the current needs of
the pharmaceutical industry. In these cases where industry is sponsoring the
research, there is a need to ensure independence and impartiality to address the
scientific hypothesis. This has implications for budgeting given that there could be
extra costs involved in ensuring the scientific integrity of the trial, such as the need
for Data Safety Monitoring Boards (DSMBs) or in other ways like ensuring the
proper firewalls are in place between the relevant partners.
There are other potential differences between commercial and noncommercial
models, which relate to timelines and trial procedures. In academic settings, time-
lines are typically more relaxed than for industry trials. The rigor of trial procedures
can also vary. In commercial research, or any trial involving investigational drugs,
the infrastructure is usually more resource intensive and held to a very
high standard. This rigor is necessary to satisfy regulatory bodies, as results of these
trials are used to ensure products can make it to the consumer market safely and
effectively.
Distribution of Funds
The distribution of funds will vary depending on the individual trial, the main
stakeholder(s), and how the contracts are written. As stated, there may be several
parties involved in ensuring a trial is carried out successfully (e.g., research sites,
CRO, data management services, drug distribution services, sample processing
labs), and the main funder (which may be the Sponsor) is responsible for paying
the partners either directly or indirectly.
In academic trials the Sponsor/funder, which could be a government body,
nonprofit organization, or a commercial partner, typically pays a participating
university or affiliated medical center directly. In another model, the Sponsor pays
the Coordinating Center, which subcontracts to each participating site. Usually at
major academic centers, research groups are able to access the necessary support
resources through the institution’s research infrastructure, sometimes called a CTU or
clinical trial unit. This may include central personnel, laboratory resources,
additional medical services like imaging or equipment, biosample storage, and
Institutional Review Boards (IRBs)/Ethics Committees (ECs).
Request for Proposal

A clinical trial originates with a scientific concept for testing one or more treatments
for a particular condition. This concept can then follow one of many different
pathways. It could, for example, be an investigator-initiated concept, a concept
from a pharmaceutical company, or a concept from a funding agency. At a high
level, the idea is built into a draft protocol, and a trial Sponsor and funder
are agreed. The Sponsor and the funder can be the same entity or two separate entities.
The next step is to decide how the trial will be conducted and by whom. Quite often,
a formal request for proposal (RFP) is developed and circulated to any interested
party to allow them to submit a proposal outlining their plan for the specific role that
they would play in the trial. This RFP is usually particularly relevant for the Clinical
Trials Coordinating Center, and both this model and the Coordinating Center role are
the primary focus of this chapter.
Sometimes it is predetermined that a specific Coordinating Center will be respon-
sible for the conduct of the trial, perhaps because of a direct association with an
investigator, expertise in the specific condition being tested, or existing contractual
relationships. For other trials, there is a competitive process where any interested
parties (or sometimes selected parties) are invited to participate.
Regardless of the model, this step is when certain details of the trial conduct are
first defined so that those responding (or those preselected) can develop a proposal
for their role in the conduct of the trial. The proposal will include logistical details
about proposed procedures and scope of work but will also include a preliminary
budget. Examples of things, which may be defined, and impact on required resources
(and therefore the budget) are:
1. Required accrual
2. Number of participating sites
3. Number of countries
4. Volume of data
5. Method of data collection (paper/electronic)
6. Any specific software requirements
The level of detail included in the request for proposal can vary. In some
instances, all of the above will be well defined, making it more straightforward to
develop a budget and proposal. In other instances, the specifications can be vague
and poorly defined, and some interaction will be required with the party requesting
proposals so that a reasonable estimate can be made.
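To illustrate how such specifications feed a resource estimate, the sketch below scales coordinating-center effort with the number of sites and the volume of data. It is a minimal sketch under invented assumptions: the per-site and per-form effort factors, and the function itself, are hypothetical placeholders, not published costing benchmarks.

```python
# Hypothetical back-of-envelope resource estimate driven by RFP specifications.
# The per-site and per-form effort factors are invented placeholders; real
# estimates come from an organization's own experience and metrics.

def estimate_coordination_hours(
    accrual: int,                    # required number of participants
    n_sites: int,                    # number of participating sites
    forms_per_participant: int,      # volume of data collected
    hours_per_site: float = 20.0,    # annual site setup/monitoring effort
    hours_per_form: float = 0.5,     # data review/query effort per form
) -> float:
    """Rough annual coordinating-center effort implied by the RFP specs."""
    site_effort = n_sites * hours_per_site
    data_effort = accrual * forms_per_participant * hours_per_form
    return site_effort + data_effort

# Example: a 40-site trial enrolling 800 participants, 12 forms each:
# 40 * 20 + 800 * 12 * 0.5 = 800 + 4800 = 5600 hours per year.
print(estimate_coordination_hours(accrual=800, n_sites=40, forms_per_participant=12))
```

The point of the sketch is only that each item in the list above (accrual, sites, data volume, collection method) enters the arithmetic somewhere, so a vague specification makes the estimate correspondingly uncertain.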
Preparation of Proposal and Budget

This step is the first opportunity for a potential applicant to draft a budget to submit
as part of the proposal. It is a critical step, as there needs to be a balance between
being competitive with a proposal and ensuring that the budget is realistic and would
cover the actual costs of doing the trial.
The escalating and complex costs of clinical research are forcing the main funders,
including the pharmaceutical and biotechnology companies, CROs, and government,
to tightly manage and control the financial particulars of their trials. This is having a
large impact on organizations and is requiring them to better understand the
sponsor’s position and the wider funding landscape in order to create a proposal that will
stand out. Rising costs are also forcing them to adapt their budget models so that they can
remain competitive while still delivering high-quality, on-target work.
All clinical research studies require a budget, regardless of the funding sources,
size of research project, or parties involved in the research activities (Fine and
Albertson 2006). The budget can be seen as a planning document, which covers the
financial life of the study and supports the functions necessary for its success
(Floore 2019). Along with other key trial documentation such as the protocol,
schedule of events, or informed consent, the budget is important and should
represent the best attempt at evaluating and planning for the resources and
costs needed to implement the study and achieve its scientific goals.
Budget Considerations
There are several things to consider when developing a research budget. The primary
financial consideration is ensuring that the research can be effectively carried out
with the available funds. Planning inefficiency and insufficient budget forecasting
are notable areas where many organizations are failing, which often jeopardizes the
financial sustainability of a clinical trial (Grygiel 2016).
Other challenges for the bottom line of a clinical trial are the therapeutic
area being studied and the protocol design. There is evidence to suggest that the
complexity of the trial protocol is associated with higher study costs, lower levels of
data quality, and longer study durations (Friedman et al. 2010). The protocol defines
much of how the study will be implemented and what components are required.
Additional specifications as defined above will also factor into budget calculations,
such as how many research sites are needed, estimated duration of the trial, relevant
regulations, and scope of responsibilities. All of these areas have a significant impact
on a research budget and should be carefully examined to ensure the costs are
properly considered.
The costs of increasing and complex regulations and compliance are becoming a
more common part of research budgets (Matula 2012). This refers both to the
personnel responsible for fulfilling these obligations and to the costs associated with
fulfilling IRB/ethics requirements and local, national, or even international regulations,
including the General Data Protection Regulation (GDPR), a European Union
regulation that has a global reach and impact on data protection. There are also costs
attached to ensuring research personnel are adequately trained in the appropriate
areas of clinical research, such as Good Clinical Practice (GCP). While these costs
may not be directly listed as budget line items, they may be reflected in travel or
training costs or by specific compliance roles such as a Quality Assurance Officer.
Rising costs in running clinical trials also stem from the fact that per-patient
costs are increasing at astronomical rates. In the USA, the average cost per
patient in a clinical trial increased 88% between 2008 and 2011 (Hargreaves 2016).
Reasons include poor patient recruitment and retention, which result in
massive cost overruns, missed deadlines, and in some cases premature closure of the
study.
Budget Format
Any party involved in a trial will have to develop a budget relevant to its roles and
responsibilities and will have different formats to use. The parties involved could
include a Coordinating Center, participating sites, Contract Research Organisations,
central laboratories, and drug distribution centers.

The format of a budget submission for a specific proposal is usually predefined by
the Sponsor/funder for the trial and can take many different forms. It is important to
prepare and submit the budget proposal in the required format. Some budget requests
are based on estimates of effort over time for relevant personnel, some on
hourly rates and an estimated total number of hours per position, and some on a
fixed rate per task. It is important to ensure that any relevant overhead (add-on/
indirect) costs are incorporated into the budget proposal.
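To make the arithmetic concrete, the sketch below rolls up a personnel-based budget with an indirect cost rate. It is a minimal illustration only: the roles, rates, hours, overhead rate, and trial duration are all hypothetical, and a real submission must follow the Sponsor/funder's prescribed format.

```python
# Hypothetical sketch of a personnel-based trial budget roll-up.
# All roles, rates, hours, and the indirect (overhead) rate are invented
# for illustration; real figures come from the institution and from the
# Sponsor/funder's required budget format.

ROLES = {
    # role: (hourly_rate_usd, estimated_hours_per_year)
    "project_manager": (85.0, 600),
    "data_manager": (70.0, 800),
    "statistician": (95.0, 300),
}

INDIRECT_RATE = 0.25  # institutional overhead as a fraction of direct costs
TRIAL_YEARS = 3

def annual_direct_cost(roles):
    """Sum of rate * hours across all budgeted roles for one year."""
    return sum(rate * hours for rate, hours in roles.values())

direct = annual_direct_cost(ROLES) * TRIAL_YEARS
indirect = direct * INDIRECT_RATE
total = direct + indirect

print(f"Direct costs:   ${direct:,.2f}")    # $406,500.00 under these numbers
print(f"Indirect costs: ${indirect:,.2f}")  # $101,625.00
print(f"Total request:  ${total:,.2f}")     # $508,125.00
```

Under these invented numbers, three years of effort produce roughly $406,500 in direct costs and $508,125 once the 25% overhead is applied; the point is that the overhead multiplier materially changes the total request and must not be forgotten.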
The process of developing the research budget can be streamlined by having
standard tools available for the initial costing process. Research institutions could
have a “budget toolbox” in place to use when developing a draft budget, regardless of
the required format for a submission (Appelman-Eszczuk 2016). The diversity of these
tools depends upon the funding portfolios and overall experience with previous
applications or bids. All toolboxes should support understanding and analysis of
several key areas, including the funding model being applied, the funding source, the
therapeutic area being researched, the essential components of an effective trial
budget, and, of course, the tasks which will be the responsibility
of the applicant. An organization with this type of “budget toolbox” in place when
responding to a request for proposal will be able to develop and submit a budget more
quickly and more effectively than those who do not have these tools at hand.
Preparing the Response to a Request for Proposal

The previous sections show the complexity of budget preparation and the importance
of knowing the relevant factors that contribute to the drafting of a budget
proposal. The budget should be prepared in the required format and according to the
specifications provided by those making the proposal request. The proposal also
needs to demonstrate an understanding of the regulatory requirements for the
specific project under consideration. If any assumptions are made in the preparation
of the draft budget, they should be well documented so that if those assumptions
prove incorrect, the budget can be adjusted accordingly. Adequate justification for all
budget line items should also be provided.
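As a sketch of how documented assumptions and justifications might be kept alongside the figures themselves, the structure below records the assumption behind each cost. The fields and example values are invented for illustration and are not a required or standard format.

```python
# Illustrative building block for a "budget toolbox": a line item that
# carries its own justification and underlying assumption, so affected
# items can be found and revised if an assumption turns out to be wrong.
from dataclasses import dataclass

@dataclass
class BudgetLineItem:
    description: str    # what the money is for
    amount: float       # cost in the proposal currency
    justification: str  # why the line item is needed
    assumption: str     # documented assumption behind the estimate

items = [
    BudgetLineItem(
        description="Central data management",
        amount=120_000.0,
        justification="Database build, cleaning, and query management",
        assumption="Electronic data capture at 40 sites over 3 years",
    ),
    BudgetLineItem(
        description="DSMB administration",
        amount=30_000.0,
        justification="Independent safety oversight required by funder",
        assumption="Two DSMB meetings per year",
    ),
]

print(f"Draft budget total: {sum(i.amount for i in items):,.2f}")
for i in items:
    print(f"- {i.description}: {i.amount:,.2f} (assumes: {i.assumption})")
```

Keeping the assumption next to the amount makes it straightforward to locate and renegotiate the affected line items when the scope changes during contract discussions.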
In addition to the budget, there will be text to be added to the proposal, and it is
important to follow all instructions in preparing the proposal. There may be a
questionnaire to complete or free text to write to summarize the plan to meet the
requirements for the trial and to justify the budget request. The written component of
the proposal should be clear and concise and cover all relevant information. It should
be clear from the text which responsibilities are being included in the proposal, and
the budget and text should match. Finally, it is important to submit the proposal by
any stipulated deadline and to include all information that was requested. Late or
incomplete proposals may be rejected.
Selection of Relevant Partners

Once the party who requests submissions has received proposals from all interested
parties, there is a process of selection. There may be a requirement for the applicant
to give a presentation to the requester or to answer some additional questions. A final
selection will be made, and the successful applicant will then move on to negotiating
a legal contract with the Sponsor/funder.
Negotiation of Contract Terms

Once a proposal has been accepted, the two (or more) parties involved have to
negotiate a legal agreement outlining the terms under which the work will be done
and incorporating the final accepted budget, which may differ from the budget in the
proposal as more details of the project are fully defined. It is important to ensure that
Contract Content
There are standard sections that would routinely be incorporated into a contract
for the conduct of a clinical trial, often referred to as a Clinical Trial Agreement or
CTA. These sections include:

Other sections that may be relevant, depending on the trial and the roles and
responsibilities of the contracting parties, could include:
As the contract is a legally binding document and would hold any party to
account, it is essential that this document has detailed legal input by representatives
of each party prior to agreement and signature. Quite often, the legal counsel for
involved parties will negotiate terms among themselves once the assignment of
responsibilities and general structure have been agreed. While legal advice can be
expensive, it is much less costly than the alternative, which is finding out that you are
not covered if something goes wrong.
There are templates available online that provide a starting point for a Clinical Trial
Agreement document. In the UK, the UK Clinical Research Collaboration
(UKCRC) has developed model agreements in several areas. Their website has
links to several model templates, including ones relevant to clinical investigation,
CROs, primary care and commercial trials, and one for site agreements (UKCRC
website – https://www.ukcrc.org/regulation-governance/model-agreements/). These
nationally approved model agreements have been developed and published to help
speed up the trial development process and simplify negotiations. National Health
Service Trusts in England and the devolved nations (Scotland, Wales, and Northern
Ireland) are expected to use them for relevant contracts.
Other guidance can be found in the NIHR Clinical Trials Toolkit, “an interactive
color-coded route map to help navigate through the legal and good practice arrange-
ments surrounding setting up and managing a Clinical Trial of an Investigational
Medicinal Product (CTIMP) (www.ct-toolkit.ac.uk).”
A clinical trial podcast (Kunal 2017) details nine essential components of a CTA
and provides insight into pitfalls in their formulation.
Scope of Work
As mentioned above, one of the key components of the contract should be a detailed
summary of roles and responsibilities for each party. It is essential that this is well
documented and understood so that there are no misunderstandings or omissions
once the trial gets under way. The list of tasks can be extensive and may be best
included as a detailed Appendix to the contract.
Table 1 shows a sample list of high-level topics for a scope of work to be
considered in a contract between a Sponsor and a Coordinating Center. Each of
these high-level topics can be broken down into activities that are more detailed. For
example, under the Statistics header, it can be documented which party is responsible
for preparing the statistical analysis plan development; under Interactions with
Authorities and IRBs/ECs, it can be documented which party interacts with regula-
tory authorities and which is responsible for ensuring materials are prepared for,
submitted to, and approved by Institution Review Boards and Ethics Committees;
under Clinical Data Management, it can be defined which party is to hold the clinical
database, which does quality control and interacts with sites. These are just exam-
ples, but it is recommended that each high-level category be broken down into this level of detail.
Budget Evaluation
At this stage in the start-up process and before the contract is signed, there should
be a thorough evaluation of the initial budget proposal which was submitted as part
of the response to the proposal request. It is highly likely that additional detail about
the trial and its conduct has become evident during the intervening period between
the initial submission and the contract signature. There may be additional responsi-
bilities that have been added to the scope of work since the RFP was issued, and any
additional tasks or increase in responsibility can impact the initial budget.
All assumptions made in preparing the initial budget should be reexamined to see
if they are still relevant, and revised budget calculations made and negotiated with
the funder. It is also advisable to add language to the contract saying that there will be
new negotiations if the scope of the contract changes and that no such changes can be
made without agreement of both parties.
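As a purely illustrative sketch of such a recalculation (every task name and figure below is invented for the example, not taken from this chapter), the budget impact of a changed accrual goal can be estimated as follows:

```python
# Illustrative re-costing of a trial budget after a scope change.
# All task names and amounts are hypothetical examples.

PER_PATIENT_COSTS = {"data_management": 400.0, "central_monitoring": 250.0}
FIXED_COSTS = {"statistical_analysis": 60000.0, "project_management": 90000.0}

def total_budget(n_patients: int) -> float:
    """Fixed costs plus the per-patient costs scaled by accrual."""
    return sum(FIXED_COSTS.values()) + n_patients * sum(PER_PATIENT_COSTS.values())

original = total_budget(200)  # accrual goal assumed in the proposal
revised = total_budget(300)   # accrual goal after the scope change

print(f"Original budget: {original:,.0f}")
print(f"Revised budget:  {revised:,.0f} (increase {revised - original:,.0f})")
```

Even a simple calculation of this kind makes explicit which assumptions (here, accrual) have changed, and by how much the budget must move as a result.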
Signature of Contract
Once the terms of the contract, budget, and scope of work have been agreed, legal
representatives of each party should sign the document. Someone senior within an
organization would normally do this. The primary researcher would not normally be
authorized to sign such documents on behalf of an organization. Signatures can be
wet-ink, with a document being circulated to all parties to add their signature(s).
Sometimes multiple copies are signed so that each party receives a fully signed/
executed copy with original wet-ink signatures, and sometimes each party retains
their own wet-ink signature on site, and a scanned copy is sent to other parties. More
recently, electronic signatures have become more common with document signature
software that is compliant with relevant regulations. Often the method is dependent
on the laws within the relevant countries involved.
Activation of Trial
Once the contract is signed, the study can be activated and work can commence. It is not
advisable to start work on the trial until such a contract is in place, as an organization
would have no legal basis for doing work before the document is signed.
There are other legal agreements, which may be needed for a specific trial. Some
examples of these are as follows.
Vendor Agreements
If specific software/services are contracted by a party involved in the trial and used
for fulfilling their responsibilities in the trial, there should be agreements signed with
each vendor. Examples would be agreements with a software provider, for software support services, or with a database/electronic data capture (EDC) hosting service.
If, during the course of the trial, changes are made to the scope of work or the operation of the trial, it is essential that these changes be reflected in an updated contract and a budget amendment. Examples of such changes are:
1. Modified accrual goals
2. Change in study design
3. Additional recruitment sites
4. Changes to scope of monitoring requirements
These are some examples, but any of these changes would impact the work scope
and the budget, and an updated contract should be negotiated.
Key Facts
Cross-References
References
Appelman-Eszczuk S (2016) Clinical research site budgeting for clinical trials. J Clin Res Excell
87:15–21
BIG (2018) Annual report 2018: spreading hope, advancing breast cancer research. Belgium. Last accessed 24 Nov 2020. Available at: https://www.bigagainstbreastcancer.org/news/annual-report-2018
Camps I, Rodriguez A, Agusti A (2017) Non-commercial vs. commercial clinical trials: a retro-
spective study of the applications submitted to a research ethics committee. Br J Clin Pharmacol
84:1384–1388
Carroll J (2005) CRO crowing about their growth. Biotechnol Healthc 2(6):46–50. https://www.
ncbi.nlm.nih.gov/pmc/articles/PMC3571008
Cooksey D (2006) A review of UK health research funding. HM Treasury, Norwich
Fine and Albertson P.C (2006) Budget development and staffing. In: Penson DF, Wei JT (eds)
Clinical research methods for surgeons. Humana Press, Totowa
Floore T (2019) Balancing the clinical trial budget. J Clin Res Excell 101:16–21
Friedman L, Furberg C, DeMets D (2010) Data collection and quality control. In: Fundamentals of clinical trials, chap 11. Springer Science and Business Media, Dordrecht, pp 199–214
Grygiel A (2016) The struggles with clinical study budgeting. Contract Pharma. http://www.
contractpharma.com/issues/2011-10/view_features/the-struggle-with-clinical-study-
budgeting/
Hargreaves B (2016) Clinical trials and their patients: the rising costs and how to stem the loss.
Pharmafile (Online). Available at: http://www.pharmafile.com/news/511225/clinical-trials-and-
their-patients-rising-costs-and-how-stem-loss
Hind D et al (2017) Comparative costs and activity from a sample of UK clinical trials units. Trials 18:203
International Council for Harmonisation (2016) ICH E6(R2) Good clinical practice. Last accessed 24 Nov 2020. Available from https://www.ema.europa.eu/en/ich-e6-r2-good-clinical-practice
Kunal S (2017) 9 essential components of a clinical trial agreement. Clinical Trials Arena. https://www.clinicaltrialsarena.com/news/9-essential-components-of-a-clinical-trial-agreement-5885280-2/
Matula M (2012) Evaluating a protocol budget. In: Gallin J, Ognibene F (eds) Principles and
practices of clinical research, 3rd edn. Elsevier/Academic, Amsterdam/Boston, pp 491–500
Pfeiffer J, Russo H (2016) Academic institutions and industry funding: is there hope? J Clin Res
Excell 90:23–27
Viergever R, Hendriks T (2016) The 10 largest public and philanthropic funders of health research in the world: what they fund and how they distribute their funds. Health Res Policy Syst 14:1
23 Long-Term Management of Data and Secondary Use
Steve Canham
European Clinical Research Infrastructure Network (ECRIN), Paris, France
e-mail: [email protected]
Contents
Introduction
Regulatory Obligations for Data Retention
Regulatory Obligations and Long-Term Management
Data for Secondary Use
The Push for Secondary Data Use
Barriers and Issues with Secondary Use of Data
Appropriate Preparation of Data for Re-use
Maximizing Scientific Value with Data Standards
Managing Data and Data Repositories
Trials Unit Systems for Managing Secondary Re-use of Individual Participant Data
Conclusion
Key Facts
Cross-References
References
Abstract
The reasons for retaining data after a study is finished are reviewed. The nature
and implications of the legal obligations to keep data are explored, with a brief
discussion around each of the main questions that need to be considered. The
increased pressure to make individual-level data available to others is then
examined. Some of the barriers to such secondary use, or “data sharing,” are
described as well as some of the ways data re-use can be anticipated and thus
facilitated. Practical issues such as data de-identification and data use agreements
are discussed. The importance of promoting data inter-operability using standards
and common vocabularies is stressed, followed by a brief discussion about data
repositories and the selection of a suitable long-term home for data. Processes and
systems to support the secondary re-use of data, from the point of view of a trials
unit, are suggested. A recurrent theme is the need to consider and plan the long-
term management of data from the very beginning of the study, because plans to
store and, especially, to share data may have profound implications for data
design and study costs.
Keywords
Data retention · Good clinical practice · Metadata · Secondary use · Data sharing ·
HIPAA · Data standards · Data repositories · Data use agreements
Introduction
Trials eventually reach a point when all data entry is complete, all the analyses have
been performed, and all the associated papers and result summaries are written. Direct
access to the trial data for its primary research purpose is either no longer required or
limited to occasional read-only access. The data cannot, however, be destroyed – there
is a regulatory and legal obligation to retain it, at least for a defined minimum period.
In addition, there is increasing recognition that the data has potential scientific value to
others and that – suitably de-identified and usually with controlled access – it could
and should be made available for possible re-use. For both of these reasons, therefore,
the data will require management in the long term.
The regulatory requirement for data retention stems from the possible need to re-
examine data, in the context of assessing, or re-assessing, either the safety of a
product or the general conduct and regulatory compliance of the study. There may be
a suspicion that the data has been interpreted wrongly, or that a particular safety-
related signal was missed, or even deliberately suppressed or mis-classified in the
original trial summaries. There may be a need – fortunately rare – to investigate
alleged fraud by individual investigators, or there may be actions for compensation
from individual participants. For all these reasons, the sponsor is responsible for
ensuring that the data is retained, enabling it, if necessary, to be examined within
legal or regulatory processes or institutional or professional disciplinary procedures.
Data retention also allows the completion of analyses originally abandoned at an
early stage and thus never published, as well as the re-analysis of results where
misreporting is suspected. Promoting such analyses was the aim of the RIAT (restor-
ing invisible and abandoned trials) initiative (Doshi et al. 2013; RIAT Support
Center 2020). GSK’s “study 329,” which looked at the effects of paroxetine (Paxil
or Seroxat) and imipramine (Tofranil) in the treatment of depression in adolescence,
is an example of a high-profile trial that was re-published as part of RIAT.
This study had been originally published in 2001 (Keller et al. 2001) and had
claimed that the drugs were “generally well tolerated and effective” in the target
population. During litigation, brought by New York State against GSK in 2004 after
it appeared that paroxetine in fact increased suicidal behavior among adolescents, it
emerged that the underlying data had never really supported the 2001 assertion. The
study had never been republished with a corrected analysis, however, and the
original paper had never been retracted (in 2020 it is still not retracted) so it was
re-examined within RIAT. The re-analysis (Le Noury et al. 2015) confirmed that the
drugs were not “statistically or clinically significantly different from placebo for any
prespecified primary or secondary efficacy outcome” but that there were “clinically
significant increases in harms, including suicidal ideation and behavior and other
serious adverse events in the paroxetine group and cardiovascular problems in the
imipramine group.”
Le Noury and colleagues had used not only the clinical study report (CSR) for
their analysis but also individual patient data (as SAS datasets) and about 77,000
pages of de-identified individual CRFs. Importantly, the authors noted that “Our
analysis indicates that although CSRs are useful, and in this case all that was needed
to reanalyze efficacy, analysis of adverse events requires access to individual patient
level data in case report forms.”
The principle of data retention is set out in the Good Clinical Practice Regulations
or GCP (ICH 2016). Although strictly speaking these only apply to investigations
involving medicinal products, the principles of GCP are usually seen as applicable
to, and are followed by, all types of trials. GCP uses the concept of “Essential
Documents” which are defined (section 8.1) as “those documents which individually
and collectively permit evaluation of the conduct of a trial and the quality of the data
produced.” The “Essential Documents,” often collectively referred to as the “Trial
Master File” or TMF, include:
8.3.14 SIGNED, DATED AND COMPLETED CASE REPORT FORMS (CRF): to document that the investigator or authorized member of the investigator’s staff confirms the observations recorded (located in the files of the Investigator/Institution).
8.3.15 DOCUMENTATION OF CRF CORRECTIONS: to document all changes/additions or corrections made to the CRF after initial data were recorded.
Essential documents therefore include the data, specifically in the form in which it
was collected and including amendments to that data, reflecting the fact that the
purpose of data retention is essentially to provide an audit trail for possible later
inspection. The retention period is given by section 5.5.11 of the GCP guidance:
The sponsor specific essential documents should be retained until at least 2 years after the
last approval of a marketing application in an ICH region and until there are no pending or
contemplated marketing applications in an ICH region or at least 2 years have elapsed since
the formal discontinuation of clinical development of the investigational product. These
documents should be retained for a longer period however if required by the applicable
regulatory requirement(s) or if needed by the sponsor.
The final sentence is important. The relatively short retention period demanded by
GCP may be extended by other regulations, applied at international, national, state,
or institutional level, and those regulations are subject to change. In the USA, the
retention period specified in CFR 21 (1) broadly follows the GCP requirement (US
Code of Federal Regulations 2020), but in Europe the period has been considerably
longer. The Clinical Trials Directive amendment of 2003 already required an extended retention period, and the Clinical Trials Regulation of 2014 extended this period to 25 years:
Unless other Union law requires archiving for a longer period, the sponsor and the inves-
tigator shall archive the content of the clinical trial master file for at least 25 years after the
end of the clinical trial. However, the medical files of subjects shall be archived in
accordance with national law. (European Commission 2014, Article 58)
To complicate things further, the required retention period may also depend on
the type of trial and the population under study. For pediatric studies (because any
statute of limitation on claims may not come into effect until the child is 18, and
then last for a further period, e.g., 4 years), retention may be necessary until the
youngest participant reaches a certain age, e.g., 23 or 25. Similar considerations
may need to apply to studies that allow pregnant women, or the partners of
pregnant women, to participate. In such cases the retention period will depend
on the relevant law within the applicable legal jurisdiction. Some treatment types –
especially if they are relatively new and untested – may also demand longer
retention periods.
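As a sketch of the kind of calculation involved for a pediatric study (the age of majority and limitation period below are example values only; the applicable law in each jurisdiction governs):

```python
from datetime import date

def earliest_destruction_date(youngest_dob: date,
                              age_of_majority: int = 18,
                              limitation_years: int = 5) -> date:
    """Retention must last at least until the youngest participant
    reaches majority plus the limitation period (example values:
    18 + 5, i.e., until the participant is 23)."""
    return date(youngest_dob.year + age_of_majority + limitation_years,
                youngest_dob.month, youngest_dob.day)

# Youngest participant born 1 June 2015 -> retain until 1 June 2038.
print(earliest_destruction_date(date(2015, 6, 1)))
```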
A further aspect of the GCP guidance on data retention is that it requires data to be
kept both centrally by the sponsor, and/or the trial’s operational managers (i.e., a
trials unit or CRO) on the sponsor’s behalf, and at each clinical site. Useful detailed
guidance on what records should be kept where, for the TMF as a whole, is provided
by the European GCP Inspectors Working Group (EMA 2018). Although originally
written for a European context, the points made in this document should be relevant
to most other environments.
For data, GCP makes it clear that the sponsor should retain the original CRFs
while each clinical site should keep a copy of the data that they generated. In the days
of three-part carbon paper CRFs, this was automatic; the sites simply retained a copy
of whatever data they sent to the sponsor’s central facility. Nowadays, with almost
universal use of electronic remote data capture, this means a copy of the site’s data,
as collected by a clinical data management system (CDMS) and usually incorporat-
ing the investigator’s signature in electronic form, is returned to the site in a
human readable form at the end of the trial.
A potential complication was introduced into this process with the advent of the
latest version of GCP (E6(R2)). Here an addendum to section 8.1 makes the point
explicitly that:
The sponsor should ensure that the investigator has control of and continuous access to the
CRF data reported to the sponsor. The sponsor should not have exclusive control of those
data. . . . . . .
The investigator/institution should have control of all essential documents and records
generated by the investigator/institution before, during, and after the trial.
This seems reasonable – if a sponsor has exclusive control of the data, it could, in
theory, make unaudited changes to the data before it was returned to the site. Unless
the investigator had the time and inclination to check the returned data (e.g., against
the source documents), he or she would likely be unaware of any changes made. But
what this addendum means in practice is unclear. Does the use of a CRO, as a third
party managing the data, ensure that the sponsor does not have exclusive control?
Probably, though some have suggested a CRO, paid by the sponsor, may not be
independent enough. Academic- or hospital-based trials units rarely use CROs,
though increasingly they use hosted CDMS solutions. But if they directly control
and can access the CDMS, and thus the data, and are also the sponsor, how can they
show that the site, rather than themselves, has “control of all essential documents and
records generated by the investigator/institution”? If they cannot, are they then in
breach of this addendum to GCP? Further discussion would seem to be necessary to
clarify exactly how this addendum should be interpreted.
The increasingly common additional question, of how data could also be made
available for possible re-use by others, is discussed in a later section. The final
responsibility for resolving the questions listed above rests with the sponsor, but they
will normally be discussed with investigators and the trial’s operational managers, i.
e., a CRO or trials unit. That discussion will usually encompass all the essential
documents, i.e., the whole of the TMF, although here only the data is considered.
It is clearly better to consider these questions as part of the initial study planning,
so that everyone is clear about what will happen to the data and essential documents
at the end of the study, and what their roles and responsibilities will be, from the very beginning. One practical point to note is that data may later need to be restored onto an infrastructure from a backup – in such a case, data marked as deleted will need to be re-deleted.
What format(s) should be used? The proprietary structures used by many clinical
data management systems, both for database storage and for a “data export,” do not
lend themselves to long-term storage. Even if systems remain in existence, they will
evolve, and a file created by one version may soon become unusable by later
versions. Twenty-five years is a very long time in technology. On the other hand,
such files often provide the most complete picture of the data as collected, with
previous values and audit trails included. Data is much easier to re-access if it is
stored in simpler non-proprietary formats, e.g., CSV (comma separated values), or
using a global XML schema, although such schemas are also likely to evolve over
time. But it may take additional work to ensure that the full set of data required,
including previous values, is included when using such formats.
The best answer is probably to use both format types. The lifetime of proprietary
files can be extended by using a virtual machine (VM) or, increasingly these days, a
server “Container” to preserve not just the data but also the context, e.g., the CDMS,
database, and operating system, in which it was housed. When a CDMS is updated
or replaced, old studies could be transferred to the new system, but it may be simpler
to split off the old system and the studies completed on it as a separate VM or
Container and “freeze” that in long-term storage. This has resource implications,
however, hence the need to consider this option at an early stage.
Complementing the data retained in this “native” format are the datasets in a non-
proprietary flat file format, i.e., the data as extracted and the data as analyzed, if
different. These should already exist because they have been required for the
analysis process. Organizing and retaining them should therefore be a relatively
straightforward exercise.
What metadata is also required? Data in any format quickly becomes useless
unless its meaning is clear, which means that metadata, for all types of retained data,
is essential. This should include, for each data item, its code, name, type, description,
and possible values. Most CDMSs can generate metadata for each study they
support, so including the metadata for data in its original format should be straight-
forward. Flat files created from the data may require additional, specific data
dictionaries, however, although again these should have already been created to
support the analysis process. Care needs to be taken that this descriptive metadata is
present or generated for all the data files, and included in the final data package.
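As a minimal sketch of what such a data dictionary might look like when generated as a flat file alongside the datasets (the items and fields shown are invented examples, following the elements suggested above: code, name, type, description, permitted values):

```python
import csv

# Minimal, illustrative data dictionary: one row per data item.
ITEMS = [
    {"code": "SUBJID", "name": "Subject identifier", "type": "string",
     "description": "Pseudonymous participant identifier", "values": ""},
    {"code": "SEX", "name": "Sex", "type": "coded",
     "description": "Sex recorded at baseline", "values": "M | F | U"},
    {"code": "WEIGHT", "name": "Weight", "type": "float",
     "description": "Body weight in kilograms", "values": "20-300"},
]

with open("data_dictionary.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(ITEMS[0].keys()))
    writer.writeheader()
    writer.writerows(ITEMS)
```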
As well as the descriptive metadata, the read me or “contents” file should include –
along with a general listing of the files included and the provenance and purpose of
each – any technical details about the files, e.g., the versions of systems used to
generate them. Text-based files like CSVs can all look the same to humans, but
machines can get confused by the different coding schemes used to generate them
(e.g., UTF-8 versus UTF-16) or the presence of technical marks in the file (e.g., byte
order marks). These may be unimportant at the time the data is generated because all
systems are set up to work with a specific configuration, but a few years later they
could cause problems if restoration is attempted. These details should therefore be
documented.
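As a small illustration of why the encoding details matter (a sketch in Python; "utf-8-sig" is the codec that writes, and transparently strips, a UTF-8 byte order mark):

```python
# Writing and re-reading a CSV with an explicit, documented encoding.
text = "subjid,weight_kg\n001,72.5\n"

with open("vitals.csv", "w", encoding="utf-8-sig", newline="") as f:
    f.write(text)  # writes a BOM before the data

# Reading with the matching codec strips the BOM; reading the same file
# as plain "utf-8" would leave "\ufeff" attached to the first field
# name and could confuse downstream tools years later.
with open("vitals.csv", encoding="utf-8-sig") as f:
    print(f.read().splitlines()[0])  # -> "subjid,weight_kg"
```

Recording the codec used ("utf-8-sig" versus "utf-8", for instance) in the contents file is exactly the kind of technical detail that prevents problems at restoration time.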
How long should the data be retained for? As indicated in the previous section,
there may not always be a simple answer to this question. Most sponsors, trials units,
and CROs become familiar with a particular regulatory regime and its data retention
guidelines, but it is worth checking to see if any exceptions apply or if the regulations
seem likely to change soon. For relatively new trialists (e.g., a new biotech startup or
an inexperienced investigator), it may be worth obtaining professional advice from a
CRO or trials unit, to ensure the regulations are well understood.
Some sponsors will stipulate a longer period for data retention than the strict
minimum (e.g., 30 years), often making it a blanket rule for all types of studies,
interventional and observational, to keep everything relatively simple. Exceptions
may still occur, but they should be less common. Some sponsors may even decide to
simply “keep everything” indefinitely. This is easy to say for a relatively new
sponsor with little if any data in long-term management, but decisions about how
and where the data needs to be stored still need to be resolved. It also raises an ethical
issue – if data is relatively identifiable, the longer it sits in an IT infrastructure, the
longer it risks being lost or hacked. Once the period required by regulation is over
therefore (admittedly a long time in Europe), there is a good argument that says such
data should be destroyed (or anonymized) and not simply left indefinitely in its
original state.
What data should the sites retain? The sites should end up with a copy of the data
they provided as input to the study. They are unlikely to have the systems to read the
data in the way it is stored centrally, so a database file is not appropriate. The data
will therefore need conversion to some more readable format – e.g., pdf, csv files,
and spreadsheets. This may need to be negotiated at the beginning of the study, and it
certainly needs to be planned as – especially for a large study with many sites – it
could be a time-consuming and costly exercise. Different CDMS have different
capabilities in this respect – some have an ‘archive’ function which will generate the
required data in a suitable format, while for others it will be more of a manual
exercise. The use of optical storage is a common way of transferring the data, i.e.,
sending the site a CD-ROM. If not copied to local systems, however, the disk may
not be accessible after several years – CD-ROMs have a finite lifetime – so
arrangements should be in place to ensure that the copying takes place, even though
this is the site’s responsibility.
How should final data destruction (if it ever occurs) be managed? If and when
some or all of the data is to be destroyed, then this should be explicitly authorized
and then documented. This can be quite difficult if the infrastructure where the data
sits is not under the direct control of the sponsor or the sponsor’s agents, but some
form of certification or assurance should be sought to show that the sponsor has done
their best to ensure full destruction.
A related issue is that elements of the underlying infrastructure will be periodi-
cally replaced. Machines and storage devices have a finite lifetime – indeed the
physics of solid-state storage devices (SSDs) means that they can only be written to
so many (million) times. There will therefore inevitably come a time when these
devices need replacing, with their data being transferred to new systems. It is
important that the infrastructure’s users are aware of this and are satisfied that device
removal also renders the data on that device completely inaccessible, usually
through physical destruction of the device. If the IT infrastructure is “in-house,”
this is relatively straightforward – procedures can be established to ensure it hap-
pens. When the infrastructure is external, it becomes more difficult, requiring
explicit recognition of this issue with suitable assurances sought and provided.
To ensure that all the questions listed above are considered, even if only in the
form of a checklist that needs to be worked through, it is important that both the
sponsor and the operational managers of a trial, the CRO or trials unit, have a
standard operating procedure (SOP) in place covering long-term data management.
It should be integrated with the other SOPs covering trial setup and design, and the
decisions taken as a result of working through it should be documented in the trial’s
data management plan (DMP). Much later, when the study ends, the same DMP can
be used to document the actions taken as a result of the plan.
Data for Secondary Use

Over and beyond simply keeping the data because it is a regulatory requirement,
which at base is a rather passive and defensive exercise, there is a growing recog-
nition that the data from a clinical trial has potential scientific value to other
researchers and through them to society as a whole. Over recent decades therefore,
there has been a steadily growing acceptance that a study’s individual participant
data (IPD) should be actively prepared for possible secondary use (so called because
it is outside the primary use of the original research study) and then be openly
advertised as available, albeit usually under controlled access.
The Push for Secondary Data Use

Making data available in this way has been driven by the convergence of a number of
different arguments and trends. Among the arguments advanced in favor of data re-
use are:
• The re-analysis of IPD can confirm, refine, or challenge the original findings, as with the re-analyses of the FEAST trial looking at fluid resuscitation in African children with shock and severe infection (Maitland et al. 2011; Levin et al. 2019; Maitland et al. 2019).
• In times of global pandemics like Ebola or COVID-19, the availability of IPD can
be critical in allowing investigators to properly evaluate the often hastily prepared
reports, as well as allowing the possible pooling of data from different sources. In
this context, calls for data sharing have been issued by the WHO (2015), the
Wellcome Trust (2020), and the Research Data Alliance (2020).
• Data availability makes it possible to compare or combine the data from different
studies. An example is the cross-study “data platforms” that have been
established in some specialist disease areas, for example, the Ebola Data Platform
(IDDO 2020). It also allows data aggregation for participant-level meta-analysis,
where, despite the potential advantages of such analyses, data has often been
difficult to obtain (Riley et al. 2010).
• Secondary use can reduce unnecessary duplication of work and make it easier to
build upon a trial with additional ancillary studies. An early example was the
Diabetes Control and Complications Trial (DCCT) of 1993 that made their data
available to other investigators. By 2015 over 220 ancillary studies had been
carried out using or building upon DCCT data, i.e., with the same cohort of
participants (Henry and Fitzpatrick 2015; EDIC 2020).
• Secondary use can lead to novel analyses and/or tool generation. In an experiment
in 2016, the New England Journal of Medicine hosted the SPRINT data analysis
challenge. People were invited to analyze the IPD from the NIH-sponsored
SPRINT trial (SPRINT Research Group 2015), “to identify a novel or scientific
or clinical finding that advances medical science.” A total of 143 different
applications were received, each representing a new application of the data
(NEJM 2016).
• Economically, because data sharing can increase the quality and efficiency of
clinical trials through the mechanisms described above, it can help to reduce the
wastage in research (Chan et al. 2014). Not surprisingly, funders are often strong
supporters of data sharing and mandate it in the studies they support. The
Wellcome Trust, the UK’s Medical Research Council, Cancer Research UK,
and the Bill and Melinda Gates Foundation all require that data be made available
for re-use. In a joint declaration, they concluded “It is simply unacceptable that
the data from published clinical trials are not made available to researchers and
used to their fullest potential to improve health” (Kiley et al. 2017).
• Ethically, IPD sharing has been framed as a way of better respecting the gener-
osity of clinical trial participants, as it increases the utility of the data they provide
and thus the value of their contribution. It has also been argued that, if access to
health and healthcare is a basic human right, access to data that can improve
health is similarly a fundamental right (Lemmens 2013). Those involved in
research therefore have an obligation to respect and promote that right by making
their data available (Lemmens and Telfer 2012).
• Socially, the substantial public investment in science, including clinical research,
demands a similarly public response: “publicly funded research data are a public
good, produced in the public interest, which should be made openly available with as few restrictions as possible.”
For all of the reasons listed above, the idea of data sharing in clinical research has
become much more acceptable, to the extent that Vickers was able to claim a
“tectonic shift in attitudes” over 10 years (Vickers 2016). In addition, many trial
registries now include sections for trialists to describe their plans for data sharing,
and there is strong encouragement from many journals for authors to indicate how
they will make the underlying data for a paper available to others. BMJ, for example, stipulates the following for most of its major titles (BMJ 2020):
• “We strongly encourage that data generated by your research that supports your article be
made available as soon as possible, wherever legally and ethically possible.
• We require data from clinical trials to be made available upon reasonable request.
• We require that a data sharing plan must be included with trial registration for clinical
trials that begin enrolling participants on or after 1 January 2019. . . .”
The last requirement is in line with the data sharing recommendations of the
influential International Committee of Medical Journal Editors, which also stipulate
that clinical trials must include a data sharing plan in the trial’s registration, from 2019
onward. The ICMJE currently stop short, however, of requiring the availability of data
for secondary use. Not making data or documents available is still listed as an example
of a valid data sharing plan (ICMJE 2020). This may be a recognition that, despite the
various “top-down” pressures to make data available, e.g., from publishers and
funders, and the growing cultural acceptance of data sharing, from a “bottom-up”
perspective there are several potential risks that can make it less appealing.
Barriers and Issues with Secondary Use of Data

Some investigators fear that others could “mine” their data for insights and results
that would otherwise be available only to them. The critical value of published
papers for career progression can make this concern, that others might pre-empt
“their” papers using “their” data, a critical factor in deciding when data should be
made more generally available. It also influences the debate about whether the whole
dataset produced by a study should be made available, or just the data used to
support the conclusions of published papers, which may be subsets of the whole.
There have also been claims that researchers in low- and middle-income countries
could be particularly disadvantaged if their data is made available to those with more
developed capacities for analysis (Tangcharoensathien et al. 2010), a case of FAIR
being potentially unfair. The call has therefore been made for data sharing in such
contexts to be considered as a partnership, and more of a mutual learning exercise,
than the simple appropriation of one group’s data by another.
There is also a reticence of some authors to allow their data and analyses to be
examined and possibly misunderstood, misused, or simply criticized, with a conse-
quent need to enter into a public debate that could endanger their reputation, as well
as demanding time and effort. This was one of the major reasons quoted by
researchers as a barrier to data sharing in a survey conducted by the Wellcome
Trust (Van den Eyndon et al. 2016) across a range of researchers in the biological,
medical, and social sciences. The same survey, however, also found that in reality –
despite some of the high-profile cases reported above – very few researchers reported
these types of negative experiences from data sharing. In fact, the side effects of sharing, when they did occur, were almost always positive, including more collaboration opportunities
and an increase in citations.
The lack of conventional academic reward for “simply” sharing data has also
been recognized as a barrier to data sharing. At the technical level, there needs to be
consensus on the best ways in which re-used data should be cited to ensure that the
original data generators are properly recognized (including tackling the issue of
different versions of data being available), but more importantly those citations then
need to be included in the evaluation of a researcher’s work, for example, when
considering grant applications or career progression. One scheme (Bierer et al. 2017)
proposes the use of the term “data author” in the literature, to clearly distinguish the
contribution of the data generators to the research, in not only initially collecting but
also managing, cleaning, curating, and preparing the data for re-use.
It seems likely that over time experience will show that there is less to fear from
data sharing than some researchers believe. Greater clarity should emerge from
publishers and others about their expectations for making data available, and the
time periods when that should occur; and systems will evolve to give greater
recognition to the researchers who provided the data. There will remain, however,
some very practical issues that contribute to the complexities and costs of supporting
secondary data use, which are often a cause of concern to researchers. An overview
of some of the main issues is provided below.
Appropriate Preparation of Data for Re-use

Preparing data for re-use means, first of all, complying with the applicable privacy regulations. The difficulty is that those regulations will vary across both
space and time. For example, currently (mid-2020) there appears to be a contrast
between a relatively pragmatic approach to secondary use of IPD in the USA, based
on de-identification of the data, and the more complex situation in the EU, where the
General Data Protection Regulation (GDPR) has brought several contentious issues
to the fore, including the exact characterization of pseudonymous data and the
potential role of consent (Peloquin et al. 2020). In addition, while the GDPR was
supposed to harmonize regulations relating to personal data use across the EU, it
returned some of the powers to regulate personal research data back to the member
states, so that a simple, single European regime has not been realized. The result is
that the regulations around secondary use of IPD in the EU continue to evolve.
No attempt is made here to try and survey the different and developing privacy
and data protection requirements that apply around the world. It will always be
necessary for sponsors and investigators to familiarize themselves with those
requirements and comply with them, which for multi-national trials may involve
multiple jurisdictions. There are, however, some common components to data
preparation that will need to be considered. In the USA, for example, the HIPAA “Safe Harbor” method defines a list of identifiers that must be removed before data can be treated as de-identified, including:
• Names
• Geographic locations smaller than a state, including street address, city, county,
post code
• All elements of dates (except year) for dates that are directly related to an
individual, including birth date, admission date, discharge date, and death date
• All ages over 89 and all elements of dates (including year) indicative of such age, aggregated into a single category of age ≥90
• Fax numbers, telephone numbers
In the USA at least, following these fairly straightforward and public rules should
normally allow secondary use. The loss of dates (apart from the year element) is
something that could seriously impact the scientific usefulness of a study dataset,
given the central importance of the timing of events. One way around this is to
“rebase” dates to numbers of days after a fixed point (e.g., randomization), so that
they become integers with no relationship to the calendar.
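A minimal sketch of such rebasing (the variable names are illustrative):

```python
from datetime import date

def rebase(event_date: date, anchor: date) -> int:
    """Express an event date as days relative to a fixed anchor
    (e.g., randomization), removing all calendar information."""
    return (event_date - anchor).days

randomization = date(2019, 3, 14)
print(rebase(date(2019, 3, 21), randomization))  # -> 7, i.e., study day 7
```

The interval structure of the data, which is usually what the analysis needs, is preserved, while the identifying calendar dates are not.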
Other de-identification techniques take the obfuscation of data further. One
approach ensures that no collection of data values is unique to a single individual,
instead being shared by at least k individuals (“k-anonymization”). This can include
techniques such as generalization (coarsening values, for example replacing exact ages with age bands) and suppression (removing values, or whole records, that would otherwise remain unique); a minimal sketch of a k-anonymity check is given below.
Consent is not always the only possible legal basis for such processing. Some jurisdictions permit the processing of individual health and research data in the interests of public health, with suitable safeguards but irrespective of the presence or not of explicit consent. Secondary re-use of data may therefore be allowed under such regulations.
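To illustrate the k-anonymization idea mentioned above, the following minimal check (a sketch assuming the pandas library and a purely hypothetical set of quasi-identifier columns) computes the smallest number of records sharing any one combination of quasi-identifier values:

```python
import pandas as pd

# Hypothetical quasi-identifiers; the real set depends on the dataset
# and the threat model being considered.
QUASI_IDENTIFIERS = ["age_band", "sex", "region"]

def min_group_size(df: pd.DataFrame) -> int:
    """Smallest group sharing one combination of quasi-identifier
    values; the dataset is k-anonymous for any k up to this number."""
    return int(df.groupby(QUASI_IDENTIFIERS).size().min())

df = pd.DataFrame({
    "age_band": ["40-49", "40-49", "50-59", "50-59"],
    "sex":      ["F",     "F",     "M",     "M"],
    "region":   ["North", "North", "South", "South"],
})
print(min_group_size(df))  # -> 2: 2-anonymous, but not 3-anonymous
```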
Even when consent is the basis of processing pseudonymized data, as it often is
with the primary study, the difficulty with consent for secondary use is that it cannot
be fully informed. By definition, at the time of the primary study, the nature of any
secondary usage is unknown. Attempts have been made to promote the use of a
“general consent” for secondary use for research purposes, linked to assurances
about data de-identification and data location (as has been proposed for bio-bank
materials), but it is not clear if such a consent would be acceptable in all jurisdictions.
Consent therefore remains a tricky issue whose relevance urgently needs clarifi-
cation in circumstances where data remains classified as pseudonymized. Having
said that, whether consent is deemed relevant from a legal standpoint or not, there
remains an ethical imperative to inform the study participant of any plans to make
the data they provide available for data sharing. Such information should be pro-
vided as part of the information sheets given to participants when they enter a study,
and include relevant details such as the de-identification measures that will be
applied, the location of data storage, and restrictions that will be placed on any sec-
ondary users (in turn meaning that establishing these details is a necessary part of
initial study planning).
The question then arises as to whether a participant should be able to object to
their data being re-used beyond the primary study, and be able to withdraw their data
from such re-use, either at the beginning of the study or at any time afterwards, even
after data collection has ceased. Again, different legal jurisdictions may have
something to say about whether this is possible or desirable, and what mechanisms
might be necessary to put in place to support it.
Requiring secondary users to sign a data use agreement is evidence of good intent and can help to protect sponsors from
reputational damage. It is noteworthy that two major data repositories managing
secondary re-use of data from the pharmaceutical industry – ClinicalStudyDa-
taRequest.com and the Yale University Open Data Access project – both insist on
data use agreements being in place before data is released (CSDR 2020; Yoda 2020).
Maximizing Scientific Value with Data Standards

Traditionally, clinical studies have been designed in relative isolation, and study
datasets have therefore also tended to be idiosyncratic, each with a distinct set of
differently defined and coded data points, often categorized in different ways.
Unfortunately this can be a huge problem when trying to compare and/or aggregate
data from different studies, making those processes error prone, time-consuming,
and costly, as well as constraining what comparisons are possible.
The use of standards and conventions for data definitions, however, allows data to
be compared and/or aggregated much more easily across studies (and also, as
described in the chapter on the study data system, allows studies to be designed
more quickly and efficiently). As secondary re-use of data increases, it is vital that
investigators and study designers maximize the inter-operability, and thus the poten-
tial scientific value, of the data they generate, by increasing the use of data standards.
Data standardization operates at various levels: the choice of the data points themselves, the detailed definition of each data point, the categorization schemes and controlled vocabularies used to code values, and the descriptive metadata that documents all of these.
The data points selected will depend upon the outcomes and safety signals to be
measured, as specified in the protocol. The nature of clinical trials means that
sometimes a study will have novel end points and safety signals, but a high
proportion of the data points collected will be very similar across studies. This
applies not just to the common variables found in most studies (e.g., demographics,
medical history, vital signs, adverse events, concomitant medications) but often also
to more disease-specific measures. One aid to selecting outcome measures is the
COMET (Core Outcome Measures in Effectiveness Trials) initiative, designed to
identify and support the generation of core outcome sets (COS) in trials, with a core
outcome set being defined as “an agreed standardised set of outcomes that should be
measured and reported, as a minimum, in all clinical trials in specific areas of health
or health care.” COMET maintains a database of papers that describe the generation
and content of over 400 core outcome sets (Comet 2020).
Professional, national, and international bodies may also develop outcome
measures and measuring schemes for particular disease areas, such as cancer staging
(e.g., TNM) and tumor measurement (e.g., RECIST) or, as an example of a response
to a specific disease threat, the set of data points developed by ISARIC for COVID-
related trials (ISARIC 2020). A further source of standardized data items is
published and validated questionnaires, e.g., those dealing with aspects of the quality
of life of participants or for assessing cognition or mood. Care must be taken that the
instrument is valid for the population under study (and translations in a multi-
national/multi-lingual context must also be validated), but in general using a pre-
existing questionnaire or rating scale will be more useful, less expensive, and more
generalizable than trying to develop a study-specific instrument.
Ensuring detailed definition of data points is critical for meaningful comparison,
within as well as between studies, and underlines the need for good descriptive
metadata that makes such definitions explicit. Although many data points are
relatively unambiguous (date of surgery, weight, blood biochemistry, etc.), some
are not. A notorious example is “blood pressure,” which for consistency should be
further characterized by position (lying, sitting, standing) but rarely is, or perhaps for
timing (e.g., before or after a procedure). Time points for events, despite their
importance for analysis, can also be ill-defined. Is a recurrence of a tumor dated
from the date the patient first reported the associated symptoms, the date of the scan
that confirmed disease progression (possibly one of several scans and tests), or the
date of the multi-disciplinary meeting that formally confirmed the recurrence diag-
nosis? There may be several weeks between these dates, so some consistent rules
need to be developed, applied, and described.
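Such a rule can be made explicit, and testable, by encoding it in the data preparation scripts; the convention below is purely illustrative:

```python
from datetime import date
from typing import Optional

def recurrence_date(symptom_report: Optional[date],
                    confirming_scan: Optional[date],
                    mdt_confirmation: Optional[date]) -> Optional[date]:
    """One possible (illustrative) convention: date the recurrence from
    the scan that confirmed progression, falling back to the formal
    multi-disciplinary team confirmation if no scan date was recorded.
    Whatever rule is chosen must be documented in the metadata."""
    return confirming_scan or mdt_confirmation

print(recurrence_date(date(2020, 1, 5), date(2020, 1, 20), date(2020, 2, 3)))
# -> 2020-01-20
```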
The use of different categorization schemes can also be a headache when com-
paring data, though it is sometimes possible to find mapping schemes between the
major controlled vocabularies. MedDRA is widely used for adverse event reporting
(and is mandated for such use within the EU) and is a de facto international standard
for that purpose. But medical history, for example, might be gathered using
MedDRA, ICD, SNOMED CT, or MeSH terminology, among others. Unfortunately
there are few formal requirements or guidelines regarding the use of one controlled
vocabulary scheme over another – the choice may come down to such factors as
previous training and/or exposure to different systems, the existence or cost of
licenses, the practicalities of use, and the required levels of granularity and compat-
ibility with other systems. Whatever controlled vocabularies are selected, however,
they will almost certainly be more informative and easier to analyze than free text.
The most comprehensive and established framework for using data standards in
clinical research, one that is both global in scope and internationally recognized, is
that provided by CDISC, the Clinical Data Interchange Standards Consortium.
Beginning in 1997, CDISC has provided a broad range of standards and related
tools, covering all phases of the clinical study life cycle, including schema for
structuring pre-clinical data, for study protocols, for data collection, for data trans-
port, and for data submission and analysis. It also provides lists of questionnaires and
controlled vocabularies (CDISC 2020).
The key CDISC resources, in terms of study design and data re-use, are the
Clinical Data Acquisition Standards Harmonization (CDASH) standards and the
Therapeutic Area (TA) user guides. Taken together, these provide a means of
structuring, coding, and defining the data in a consistent fashion, especially those
relating to the data domains commonly found across studies – demographics,
adverse events, subject characteristics, vital signs, treatment exposure, etc.
CDASH and the TA guides are currently used much more within the pharmaceu-
tical industry than the non-commercial sector. The FDA, in the USA, and the
PMDA, in Japan (though not yet the EMA in the EU) have stipulated that data
submitted in pursuance of a marketing authorization must use CDISC’s Study Data
Tabulation Model (SDTM), a standard designed to provide a consistent structure to
submission datasets. Creating SDTM structured data is far easier if the original data
has been collected using CDASH, which is designed to support and map across to
the submission standard.
The CDASH system is relatively simple conceptually, but it is comprehensive,
and it does require an initial investment of time to appreciate the full breadth of data
items that are available and how they can be used. It provides standardized terms for
common study and demographic variables (e.g., SUBJID, SITEID, BRTHDAT,
AGE) and then uses a prefix-suffix system to define further variables in various
domains – 24 such domains are listed in CDASH 2.0. The prefix is a two-letter code
for the domain (AE = adverse events, MH = medical history, CM = concomitant medication, EX = exposure to the drug under investigation, PR = procedures, etc.).
The suffix indicates the type of data value: for instance, AETERM holds the reported adverse event term, AESTDAT the start date of that event, and CMYN a yes/no indicator of whether any concomitant medication was taken.
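As a toy illustration of this naming convention (the domain codes are those mentioned above; the suffix meanings shown are simplified examples, not the full CDASH specification):

```python
# Toy parser for CDASH-style prefix-suffix variable names.
DOMAINS = {"AE": "adverse events", "MH": "medical history",
           "CM": "concomitant medication", "EX": "exposure",
           "PR": "procedures"}
SUFFIXES = {"TERM": "reported term", "STDAT": "start date",
            "YN": "yes/no flag"}

def describe(variable: str) -> str:
    for prefix, domain in DOMAINS.items():
        suffix = variable[len(prefix):]
        if variable.startswith(prefix) and suffix in SUFFIXES:
            return f"{variable}: {SUFFIXES[suffix]} in {domain}"
    return f"{variable}: not recognized by this toy parser"

print(describe("AESTDAT"))  # -> AESTDAT: start date in adverse events
print(describe("CMYN"))     # -> CMYN: yes/no flag in concomitant medication
```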
The Therapeutic Area (TA) user guides supplement both the CDASH and the
SDTM standards, providing a steadily growing list of therapeutic or disease area
specific terminology and detailed explanations of how SDTM and CDASH defini-
tions can be applied. The list of therapeutic area standards already developed, or
being developed, is available on the CDISC website. In August 2020 there were 44
areas listed (from acute kidney injury, Alzheimer’s, and asthma through to tubercu-
losis, vaccines, and virology).
An evaluation of CDASH and any relevant TA standards is highly recommended
because they represent a relatively complete system for standardizing data collec-
tion. Readers are referred to the CDISC website (CDISC 2020), which provides
comprehensive implementation guides for each standard. While trials units in the
non-commercial sector have not been forced into preparing CDASH and SDTM
datasets, many have already experimented with using parts of the system. Ultimately,
wide use of the CDISC standards could enable an SDTM-based archiving and data
sharing model that could be used across all sectors of clinical research, allowing a
huge pool of more inter-operable data to become available.
One caveat is that using CDASH can have consequences for the structure of the
data that is exported from a CDMS, i.e., the nature of the analysis datasets, and it is
therefore important that statisticians are happy about the data being presented to
them in this way. Traditionally, CDMS systems generate a series of tables, each
corresponding to an eCRF, with the data arranged as one row per subject visit within that table. Because the CDASH approach tends to make greater use of small repeating
groups of questions, it creates many “ribbon”-shaped tables instead, each focused on
a particular domain, with relatively few data fields in each but often with many rows.
These tables are organized as one row per event, where the “event” is represented by a
cluster of related data points. Some statisticians may be wary of accepting data in this
form, preferring to transform it to a more traditional structure, or face modifying
their normal approach to analysis and the library of tools they have established. In
other words, making use of the CDISC standards requires the full understanding and
cooperation of the statisticians tasked with analyzing the exported data.
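To make the difference concrete, a brief sketch (using pandas; the table and variable names are invented for the example) of pivoting a ribbon-shaped table back into the traditional one-row-per-subject-visit layout:

```python
import pandas as pd

# A "ribbon"-shaped table: one row per measurement event.
long = pd.DataFrame({
    "subjid": ["001", "001", "002", "002"],
    "visit":  [1, 2, 1, 2],
    "test":   ["SYSBP", "SYSBP", "SYSBP", "SYSBP"],
    "value":  [132, 128, 141, 139],
})

# Pivot to one row per subject visit, with one column per test,
# the shape many existing analysis programs expect.
wide = long.pivot_table(index=["subjid", "visit"], columns="test",
                        values="value").reset_index()
print(wide)
```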
The final aspect of standardization to consider is the generation of descriptive
metadata for the study data – the characterization of the data points: their codes,
names, descriptions, types, ranges, possible values, etc. This has traditionally been
done using a variety of methods, from simple “data dictionaries” in spreadsheets
through to XML schemas, for example, using the CDISC Operational Data Model
(ODM). To make this metadata more useful, and in particular more easily searched
and processed by machines, it would be useful to have such metadata in a standard
format, the most appropriate – because it exists specifically for this purpose – being
CDISC’s “Define.xml” standard.
Unfortunately, current use of Define.xml seems very limited outside of the
pharmaceutical industry. There is a need for CDMS developers to incorporate
Define.xml exports in their systems, for tools to help statisticians and others read
or search Define.xml files more easily, for tools to allow machines to search and/or
describe Define.xml content, and for funders, trials units, sponsors, and investigators
to push for greater consistency in generating metadata using this single standard
rather than the variety of approaches that currently exist. The first part of re-using
data is to understand what is in it, and without a consistent approach to metadata, that
is going to be more time-consuming and costly than it should be.
Managing Data and Data Repositories

Currently, many of the clinical research data objects made available for sharing are simply retained by the research team that produced them, somewhere on the disk
storage allocated to their department. The alternative, and other things being equal the
preferred option in the longer term, would be to make a conscious decision to move the
whole data package (i.e., datasets and related documents) to a dedicated data repos-
itory. This might be the institution’s or the company’s own data repository, specifically
set up for storing the research outputs of its staff, or it might be a third-party repository
– perhaps a general one storing all types of scientific data, or one specializing in
clinical research data, or even one specializing in a particular disease area.
Note that putting a copy of the data in a repository does not mean granting public
access to it; it simply means preparing the data for possible sharing and then
advertising that it is available. Those wishing to use it would still have to meet
any conditions that the researchers stipulated, e.g., provide a rationale for their usage
and/or adhere to a data use agreement. The data in the repository could be pseudon-
ymous (i.e., could be linked if required to the pseudonymous data held securely by
the researchers) or anonymous (could not be practically linked to that data)
depending on legal requirements.
The advantages of using a separate, dedicated data repository (including one set
up by the researchers’ own institution) include:
• Long-term data management. The original research team (or collaboration) will
change its composition, or may even cease to exist, and it may then become
difficult or impossible for data to be managed and requests for it to be properly
considered.
• Transfer of data to a repository helps to ensure that preparation of the data for
sharing (e.g., de-identification, provision of metadata) occurs and that the data
and related documents are properly described.
• Advertising the data and metadata in a repository’s catalog can help to make that
data and related documents more easily discoverable.
• It can, depending on the arrangements made with the repository, relieve the
original research team/sponsor of the need to review requests and even of the
need to make the decisions about agreeing to such requests.
• Anticipating transfer to a repository aids in explicitly identifying data preparation
and sharing costs at an early stage of the trial.
The problem is that most existing data repositories are not, yet, well organized to
manage the sensitive personal data generated by clinical research and have only
limited facilities for controlled access. The default for data repositories in most
scientific domains is open public access, with the only control a possible embargo
period on data release, so controlled access to sensitive data presents a challenge.
A recent study (Banzi et al. 2019) looked at data repositories that were potentially
available to non-commercial researchers for clinical research data. Twenty-five such
repositories were identified and assessed against eight key criteria (filtered down
from an initial list of 34) seen as particularly relevant to clinical researchers and their
data storage needs.
None of the repositories fully demonstrated all of the eight items included in the
indicator set, although three were judged as demonstrating or partially demonstrating
all of them. Other repositories appeared less suitable in a variety of ways, although
this may have been because in many cases the relevant information was not available
publicly on the repository’s website – many repositories do not do a good job of
advertising their services.
This situation may improve, but at the moment the full potential of re-use of clinical
research data is hampered by the lack of suitable places to store that data in the long
term. This problem also underscores the need for robust
and public assessments of data repositories, so that potential users can make an
informed decision. Various general schemes have been proposed for this (e.g., see
CoreTrustSeal 2020), but they have not yet been expanded to include specialist
certification schemes for groups with particular requirements. Such a development
will be necessary, however, in order that clinical researchers can make informed
decisions about the storage of their data.
Managing the secondary use of clinical study IPD is complex, with a range of
technical, resource, and legal issues to consider. In many cases decisions will be
required as part of study planning – for example, deciding how to integrate data
standards in the study’s database design and what to include about potential data re-
use in the information sheets prepared for participants. Even when decisions and
activities could be postponed until the end of the study (e.g., deciding where data
should be stored in the long term, de-identifying the data), they should be anticipated
at the beginning of the study in order to estimate the resources required for those
activities and include the associated costs in bids for funds (not least because the
impetus for data sharing often comes from funders).
Managing the potential re-use of data is also a relatively new activity – one more
aspect of running a trial to add to all the other responsibilities and requirements faced
by investigators and operational managers. So how should a trials unit (using that
term in the most general sense, i.e., the trial management department in a pharma-
ceutical company, CRO, university, or hospital) integrate managing data re-use with
the other services it offers to investigators and sponsors? It seems clear that two
broad types of activity are necessary:
• A general preparation, of systems and staff, to understand and be prepared for the
various aspects of data re-use
• Study-specific activity, to manage the details of data re-use in the context of a
particular study. The latter will be split into two time points:
– That required during study planning, and
– That required at study end
As a brief practical guide to supporting data re-use, but also to summarize many
of the points made earlier in this chapter, suggestions for the main elements of each
of these activities are listed below:
General Preparation
• Clarify the legal regulations and requirements for data sharing in the relevant
legal jurisdiction(s). “Relevant” usually means those in which any study partic-
ipants live, and not just the jurisdiction of the trials unit itself. Among the things
to be clarified are the legal basis, or bases, under which data re-use is justified, the
role of consent (if any), the definitions and relevance of anonymized and pseudo-
nymized data, the need for data de-identification, and the need to demonstrate risk
assessment and/or risk management.
• Clarify any existing policies and procedures relevant to data sharing of the parent
organization (if there is one, e.g., a hospital, university, or company), and
incorporate them as necessary into the trials unit’s own procedures.
• For external sponsors or funders involved with a lot of studies, clarify their
policies and procedures relevant to data sharing (and data retention), with a
view to incorporating them, as necessary, into the trials unit’s own procedures.
• Ensure sufficient staff are familiar with regulations and policies relating to data re-
use, as described above, for study-specific work in this area to be carried out
effectively. Consider giving one or two roles operational oversight of the prepa-
ration for data re-use.
• Ensure sufficient staff are familiar with data standards and their application in
local systems, so that standards such as CDASH can be applied – perhaps at relatively
low levels initially but increasing over time. (Application of data standards may
require separate SOPs.)
• Explore the options available for long-term data storage and data management,
within the department, within the larger parent organization, and within external
third-party repositories.
• Develop an SOP for preparing data for re-use, to be applied in the context of any
particular trial (unless all the decisions relating to data re-use in that trial are taken
entirely by an external sponsor). Integrate it into more general SOPs on study
preparation so that the data re-use is considered, planned, and costed at the outset
of the trial’s setup. The elements of the SOP are described in more detail in the
study-specific section below.
• Develop an SOP on responding to requests for data, to be applied in the context of
any particular trial (unless all such requests in that trial are managed entirely by an
external sponsor).
Conclusion
Study data management does not end with the end of the study. Data must be
retained for a set period to allow re-analysis if necessary, and, increasingly, data –
or a de-identified subset of it – is expected to be made accessible to others, if they can
justify the reasons for that access.
The simple retention of data is not difficult and has long been a requirement, but it
does require clear planning and resourcing, and it needs to be comprehensive – all
forms of the data need to be considered and archived or destroyed as necessary.
Because interest in the data may have waned at study end, it is important that all
decisions relating to data retention are taken at the beginning of the study, by the
sponsor but usually in collaboration with the study’s operational managers, and
that all activities are properly resourced.
Making data available for secondary re-use is a more complex, active process that
is relatively new for most trialists and trials units, but it is increasingly becoming an
expectation. The details of the processing required will inevitably depend on the
legal framework that applies, but as a minimum data will need to be de-identified. To
make the shared data more useful, it will also be important to ensure that the data is
as inter-operable as possible, by incorporating data standards into the study design
from the outset (applying such standards retrospectively can be done in theory, but in
practice is a very difficult and costly process). Finally, to make the data and
associated documents available in the long term, it will often be necessary to transfer
them to a dedicated data repository.
Again, the only way this end-of-study activity can be delivered efficiently is by
planning and resourcing it from the beginning of the study planning process. That
means setting up systems (including adequately trained staff as well as relevant
SOPs) that allow preparation for data re-use to be integrated into the rest of study
management.
Key Facts
1. Clinical trial data must be retained, for potential re-analysis and investigation,
for periods determined by the relevant legal jurisdiction(s).
2. Data retention is also required for compliance with Good Clinical Practice
(GCP).
3. Data should be retained both centrally and at each clinical site.
4. The sponsor makes the final decisions with regard to the details (files, format,
location, etc.) of retained data.
5. Arrangements for retaining data in the long term should be established by the
sponsor, in collaboration with the trial’s operational managers (e.g., CRO or
trials unit) as part of study planning.
6. In recent years there has been increasing pressure on sponsors and investigators
to make de-identified individual participant data and study documents available
to others, for secondary research purposes.
7. Funders and publishers in particular have been keen to encourage secondary re-
use, as a mechanism for raising both the cost-effectiveness and the quality of
research.
8. The regulations governing secondary re-use will vary from one legal jurisdiction
to another and over time.
9. Almost all data will need to undergo de-identification before it is suitable for
secondary re-use. A variety of techniques have been published, but the stronger
the de-identification applied, the greater the risk to the scientific utility of the data.
10. Data use agreements offer an additional level of risk management around
secondary use and are an important reason why access to the data should
often be controlled rather than freely available.
11. To maximize the value of secondary use, it is important to make data as inter-
operable as possible, by the use of data standards – e.g., with common outcome
sets, consistent data definitions and categorizations, standardized data structures
and codes, and a standardized metadata scheme.
12. In the longer term, data is best transferred to a dedicated data repository.
Unfortunately, at the moment, few existing repositories are well adapted to
managing sensitive personal data available under controlled access.
13. Sponsors and study operational managers need to develop systems and pro-
cesses to support the preparation of data for re-use. This includes adequate
training of staff as well as SOPs and other quality documents.
14. Although much of the activity related to data re-use occurs at the end of the
study, much of the planning for it needs to take place at the beginning, as part of
the general study planning process.
Cross-References
References
Banzi R, Canham S, Kuchinke W et al (2019) Evaluation of repositories for sharing individual-
participant data from clinical studies. Trials 20:169. https://doi.org/10.1186/s13063-019-3253-3.
Available at https://trialsjournal.biomedcentral.com/articles/10.1186/s13063-019-3253-3.
Accessed 13 Aug 2020
Bierer B, Crosas M, Pierce H (2017) Data authorship as an incentive to data sharing. N Engl J Med
376:1684–1687. https://doi.org/10.1056/NEJMsb1616595. Available at https://www.nejm.org/
doi/10.1056/NEJMsb1616595. Accessed 14 June 2020
BMJ (2020) BMJ author Hub: data sharing. Available at https://authors.bmj.com/policies/data-
sharing/. Accessed 8 June 2020
CDISC (2020) CDISC standards in the clinical research process. Available at https://www.cdisc.
org/standards. Accessed 13 Aug 2020
Chan A, Song F, Vickers A et al (2014) Increasing value and reducing waste: addressing inacces-
sible research. Lancet 383:257–266. https://doi.org/10.1016/S0140-6736(13)62296-5
COMET (2020) Core outcome measures in effectiveness trials. Available at http://www.comet-
initiative.org/. Accessed 13 Aug 2020
CoreTrustSeal (2020) CoreTrustSeal certification. Available at https://www.coretrustseal.org/.
Accessed 13 Aug 2020
CSDR (2020) ClinicalStudyDataRequest.com: data sharing agreement. Available at https://
clinicalstudydatarequest.com/Help/Help-Data-Sharing-Agreement.aspx. Accessed 13 Aug
2020
Davey C, Aiken A, Hayes R, Hargreaves J (2015) Re-analysis of health and educational impacts of
a school-based deworming programme in western Kenya: a statistical replication of a cluster
quasi-randomized stepped-wedge trial. Int J Epidemiol 44:1581–1592. https://doi.org/10.1093/
ije/dyv128
Doshi P, Dickersin K, Healy D et al (2013) Restoring invisible and abandoned trials: a call for
people to publish the findings. BMJ 346. https://doi.org/10.1136/bmj.f2865. Available at https://
www.bmj.com/content/346/bmj.f2865. Accessed 5 June 2020
EDIC (2020) The epidemiology of diabetes interventions and complications. Available at https://
edic.bsc.gwu.edu/. Accessed 7 June 2020
EMA (2018) Guideline on the content, management and archiving of the clinical trial master file
(paper and/or electronic). EMA/INS/GCP/856758/2018 Good Clinical Practice Inspectors
Working Group (GCP IWG). Available at https://www.ema.europa.eu/en/documents/scien
tific-guideline/guideline-content-management-archiving-clinical-trial-master-file-paper/elec
tronic_en.pdf. Accessed 5 June 2020
Maitland K, Kiguli S, Opoka R et al (2011) Mortality after fluid bolus in African children with
severe infection. N Engl J Med 364(26):2483–2495. https://doi.org/10.1056/NEJMoa1101549.
Epub 2011 May 26. Available at https://www.nejm.org/doi/10.1056/NEJMoa1101549.
Accessed 8 June 2020
Maitland K, Gibb D, Babiker A et al (2019) Secondary re-analysis of the FEAST trial (correspon-
dence). Lancet Respir Med 7(10):E29. https://doi.org/10.1016/S2213-2600(19)30272-3
Miguel E, Kremer M (2004) Worms: identifying impacts on education and health in the presence of
treatment externalities. Econometrica 72(1):159–217. https://doi.org/10.1111/j.1468-0262.2004.
00481.x.
MRC (2015) Good practice principles for sharing individual participant data from publicly funded
clinical trials. MRC Hub for Trials methodology research, UKCRC, Cancer Research UK,
Wellcome Trust. Available at https://www.methodologyhubs.mrc.ac.uk/files/7114/3682/3831/
Datasharingguidance2015.pdf. Accessed 12 Aug 2020
NEJM (2016) The SPRINT data analysis challenge. Available at https://challenge.nejm.org/pages/
about. Accessed 8 June 2020
Ohmann C, Banzi R, Canham S et al (2017) Sharing and reuse of individual participant data from
clinical trials: principles and recommendations. BMJ Open 7:e018647. https://doi.org/10.1136/
bmjopen-2017-018647. Available at https://bmjopen.bmj.com/content/bmjopen/7/12/e018647.
full.pdf. Accessed 12 Aug 2020
Özler B (2015) Worm wars: a review of the reanalysis of Miguel and Kremer’s deworming study.
World Bank Blogs. Available at https://blogs.worldbank.org/impactevaluations/worm-wars-
review-reanalysis-miguel-and-kremer-s-deworming-study. Accessed 8 June 2020
Peloquin D, DiMalo M, Bierer B, Barnes M (2020) Disruptive and avoidable: GDPR challenges to
secondary research uses of data. Eur J Hum Genet 28:697–705. https://doi.org/10.1038/s41431-020-
0596-x. Available at https://www.nature.com/articles/s41431-020-0596-x. Accessed 12 Aug 2020
Ramokapane K, Rashid A, Such J (2016) Assured deletion in the cloud: requirements, challenges
and future directions. Conference paper at ACM, October 2016. https://doi.org/10.1145/
2996429.2996434. Available at http://eprints.lancs.ac.uk/81611/1/Assured_deletion_Final_
version.pdf. Accessed 7 June 2020
Reichman J (2009) Rethinking the role of clinical trial data in international intellectual property law:
the case for a public goods approach. Marquette Intellect Prop Law Rev 13(1):1–68
Research Data Alliance (2020) RDA COVID-19 recommendations and guidelines (5th Release).
Available at https://www.rd-alliance.org/system/files/RDA%20COVID-19%3B%20recommen
dations%20and%20guidelines%2C%205th%20release%20%28final%20draft%29%2028%
20May%202020.pdf. Accessed 7 June 2020
RIAT Support Center (2020) Restoring invisible and abandoned trials. Available at https://
restoringtrials.org/. Accessed 5 June 2020
Riley R, Lambert P, Abo-Zaid G (2010) Meta-analysis of individual participant data: rationale,
conduct, and reporting. BMJ 340:c221. https://doi.org/10.1136/bmj.c221. Available at https://www.
bmj.com/content/340/bmj.c221. Accessed 7 June 2020
SPRINT Research Group (2015) A randomized trial of intensive versus standard blood-pressure
control. N Engl J Med 373:2103–2116. https://doi.org/10.1056/NEJMoa1511939. Available at
https://www.nejm.org/doi/full/10.1056/NEJMoa1511939. Accessed 8 June 2020
Tangcharoensathien V, Boonperm, Jongudomsuk P (2010) Sharing health data: developing country
perspectives. Bull World Health Organ 88(6):468–469. https://doi.org/10.2471/BLT.10.079129.
Available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2878166/. Accessed 14 June 2020
Torjesen I (2018) Pressure grows on Lancet to review “flawed” PACE trial (News article). BMJ 362:
k3621. https://doi.org/10.1136/bmj.k3621
UKRI (2020) Common principles on data policy. Available at https://www.ukri.org/funding/
information-for-award-holders/data-policy/common-principles-on-data-policy/. Accessed 7
June 2020
US Code of Federal Regulations (2020) Title 21, Chapter I, Part 312, Subpart D – responsibilities of
sponsors and investigators, Section 312.62 investigator recordkeeping and record retention. Available at
https://www.govregs.com/regulations/expand/title21_chapterI_part312_subpartD_section312.
61#title21_chapterI_part312_subpartD_section312.62. Accessed 5 June 2020
Van den Eynden V, Knight G, Vlad A et al (2016) Towards open research: practices, experiences,
barriers and opportunities. Wellcome Trust. https://doi.org/10.6084/m9.figshare.4055448
Vickers A (2016) Sharing raw data from clinical trials: what progress since we first asked, “whose
data set is it anyway?”. Trials 17:227. Available at https://www.ncbi.nlm.nih.gov/pmc/articles/
PMC4855346/. Accessed 8 June 2020
Wellcome Trust (2020) Sharing research data and findings relevant to the novel coronavirus
(COVID-19) outbreak. Available at https://wellcome.ac.uk/coronavirus-covid-19/open-data.
Accessed 7 June 2020
White P, Goldsmith K et al (2011) Comparison of adaptive pacing therapy, cognitive behaviour
therapy, graded exercise therapy, and specialist medical care for chronic fatigue syndrome
(PACE): a randomised trial. Lancet 377(9768):823–836. https://doi.org/10.1016/S0140-
6736(11)60096-2. Available at https://www.sciencedirect.com/science/article/pii/
S0140673611600962. Accessed 8 June 2020
WHO (2015) Developing global norms for sharing data and results during public health emergen-
cies. Available at https://www.who.int/medicines/ebola-treatment/data-sharing_phe/en/.
Accessed 7 June 2020
Wilkinson MD, Dumontier M et al (2016) The FAIR guiding principles for scientific data manage-
ment and stewardship. Sci Data 3:160018. https://doi.org/10.1038/sdata.2016.18
Yoda (2020) Yale open data access project: data use agreement training. Available at https://
yalesurvey.ca1.qualtrics.com/jfe/form/SV_0P7Kl30x4aAZDRX?Q_JFE=qdg. Accessed 13
Aug 2020
Part III
Regulation and Oversight
Regulatory Requirements in Clinical Trials
24
Michelle Pernice and Alan Colley
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
Fundamentals of Global Regulatory Affairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
Regulatory Strategy: “Black and White” vs. “Gray Zone” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
The “Black and White” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
The “Gray Zone” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464
Hypothetical Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 466
Regulatory Affairs Considerations for Clinical Trials in the USA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
Submitting an IND to FDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
Maintaining the IND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 469
Regulatory Affairs Considerations for Clinical Trials in the European Union . . . . . . . . . . . . . . . . 470
Evolution of EU Clinical Trials Legislation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
EudraLex “The Rules Governing Medicinal Products in the European Union” . . . . . . . . . . . 471
Clinical Trials Facilitation and Coordination Group (CTFG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
EU Regulatory Agency Advice on Clinical Development and Clinical Trials . . . . . . . . . . . . . 472
Submitting a CTA in the EU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
Voluntary Harmonization Procedure (VHP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
Maintaining the Clinical Trial Authorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
Regulatory Affairs Considerations for Clinical Trials in Other Countries . . . . . . . . . . . . . . . . . . . . . 476
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
M. Pernice (*)
Dynavax Technologies Corporation, Emeryville, CA, USA
e-mail: [email protected]
A. Colley
Amgen, Ltd, Cambridge, UK
e-mail: [email protected]
Abstract
Understanding the regulatory requirements for initiating and conducting clinical
trials is a crucial starting point and success factor in any plan to advance drug
development in humans. Regulatory requirements go beyond what is considered
compliance with good clinical practice (GCP) and other standards. Regulations
provide the guardrails and opportunities for safe, efficient, and purposeful drug
development. While regulations provide the groundwork of what is to be con-
sidered “right” and “wrong” in drug development, there is a level of uncertainty
that is intentionally left for sponsor interpretation in order to provide flexibility.
Further, regulators usually represent the views of the country or region within
their specific purview, a structure which lends itself to dissonance between
different regulations/guidelines, furthering the need for sponsor interpretation.
Such interpretation is conveyed in the finished clinical trial application (CTA) or
investigational new drug (IND) application and then subject to the regulators’
review and approval. As important as it is to understand the written requirements,
it’s equally important to understand how and when to engage with the regulators
to expedite drug development. If done well, the combination of understanding the
regulations, implementing sponsor interpretation, and utilizing opportunities for
engagement with regulatory agencies can ultimately lead to the delivery of useful
treatments to patients.
In this chapter, global, regional, and national clinical trial regulatory consid-
erations will be described to enable the reader to understand the principles and
practice of conceptualizing, submitting, initiating, and completing clinical trials
in the regulated environment of drug development.
Keywords
Regulatory · FDA · EMA · Marketing authorization · BLA/NDA · MAA · IND ·
CTA · Approval · Sponsor
Introduction
When a patient goes to see a doctor and walks out with a prescription, the ability for
that treatment to be prescribed is the result of regulatory approval or “marketing
authorization.” Marketing authorization is only granted once sufficient clinical trial
data are generated to prove that the benefits of the treatment outweigh the risks. As
explained in the US Code of Federal Regulations (CFR), the purpose of conducting
clinical trials is to distinguish the effect of a drug from other influences (e.g., rule out
“placebo effect”). The Food and Drug Administration (FDA) considers adequate and
well-controlled studies to be the primary basis for determining whether there is
“substantial evidence” to support the claims of effectiveness for new drugs. Sub-
stantial evidence is defined in Section 505(d) of the Food, Drug, and Cosmetic
(FD&C) Act as “evidence consisting of adequate and well-controlled investigations,
including clinical investigations, by experts qualified by scientific training and
experience to evaluate the effectiveness of the drug involved, on the basis of which it
could fairly and responsibly be concluded by such experts that the drug will have the
effect it purports or is represented to have under the conditions of use prescribed,
recommended, or suggested in the labeling or proposed labeling thereof.”
Once a product has received marketing authorization, further clinical trials of it
may be conducted (e.g., a product approved for the treatment of adult patients with
melanoma may then be included in a new Phase 1 trial to assess the product’s safety
in pediatric patients with a different malignancy).
Clinical trials are often conducted in more than one country. This is partly due to
the intent for the product to ultimately be approved in multiple countries, and
therefore data that is representative of each country’s population and local medical
practice is likely to be required by that country’s regulator. This is also due to the
need to expeditiously accrue patients to a trial, necessitating the ability to recruit
study volunteers from a larger population than what would be feasible in a single
country. While conducting clinical trials globally should lead to data that are more
representative of the real-world patient population, this also opens the sponsor up to
inconsistencies between requirements and advice received from different national
and regional regulators. As an example, the approval to conduct a clinical trial in the
USA is the subject of FDA review of the Investigational New Drug (IND) applica-
tion, which includes a multitude of detailed documents (e.g., information on the
manufacturing of the product, the preclinical data on the product). Thereafter, when a
subsequent trial is proposed (e.g., after the completion of the Phase 1 trial, a Phase 2
trial will be proposed), that individual trial’s clinical protocol will be submitted “to
the IND.” The clinical trial protocol is just one document, usually less than 200
pages, whereas there are typically 30 or more documents amounting to thousands of
pages in the original IND. To conduct the same Phase 2 trial in the EU, a new clinical
trial application (CTA) must be submitted to the regulators of those countries within
the EU where the clinical trial will be conducted, even though a CTA was submitted
for the original Phase 1 trial. There are some documents that may be prepared to
support both an IND and CTA filing, whereas there are a number of other documents
that only serve to support one or the other. Unlike in the USA, where after the FDA
review of the original IND subsequent studies require less documentation, in the EU,
CTAs are submitted each time a new study is proposed. While certain initial CTA
documents can be referenced or resubmitted for subsequent CTAs, the submission
package still tends to be much larger than what is required for such subsequent
studies in the USA.
As exemplified above, in order to be successful in developing a new treatment for
patients, it is important for the sponsor to have capabilities to understand not only
basic global regulatory requirements but also the details of individual country
regulations.
Unlike the IND and CTA differences described, some regulations are consistent
globally, or have been harmonized between regions, and comprise the bedrock of
initial clinical development inception and planning. Developed by regulators around
the world, good manufacturing practice (GMP) and good clinical practice (GCP)
form the fundamental basis of what is required throughout global clinical develop-
ment. Adhering to the standards set forth in these practices is a requirement for
clinical trials to be conducted safely and ethically. Various bodies globally have
published the principles of GMP and GCP, including the World
Health Organization (WHO) and the International Council for Harmonisation
(ICH). ICH was founded in 1990 with the mission to achieve greater harmonization
worldwide to ensure that safe, effective, and high-quality medicines are developed
and registered in the most resource-efficient manner. Since then, ICH has developed
guidelines across the various pillars of a drug development program, which include
the development of the experimental product’s quality attributes (referred to as the
“Q” category; e.g., stability and shelf life), the generation of preclinical toxicology
data (“S” category) and clinical efficacy and safety (“E” category) data, and multi-
disciplinary topics which apply across categories (“M” category; e.g., standardized
medical terminology). Such foundational, globally relevant guidance was developed
based on core principles often reflected in individual country regulations and in turn
also influences the future evolution of those individual country regulations.
Regulatory Strategy: “Black and White” vs. “Gray Zone”
The “Black and White”
Within the practice of regulatory affairs, there are clear-cut regulations that need to
be understood and adhered to, provided through the rules and regulations written in
“black and white,” meaning there is intentionally little room for interpretation
considering the criticality of the concept (e.g., regulations that govern patient safety
and adverse event reporting). Because regulators and sponsors consider adherence to
these rules to be more than just “regulatory affairs,” this practice is more often termed
“compliance.”
In the USA, written regulations are laid out in the CFR. The CFR documents all
actions that are required under the applicable federal law which, in this case, is the
Federal Food, Drug, and Cosmetic Act (FD&C Act, codified into Title 21 Chapter 9
of the US Code). (The CFR is organized as a hierarchy, exemplified here with the
selection most relevant to this chapter. The hierarchy begins with Titles, and Title 21
of the CFR contains the “Food and Drugs” regulations. Titles are divided into
Chapters classified by regulatory entity; Chapter I covers the Food and Drug
Administration, Department of Health and Human Services. Chapters are broken
down into Subchapters, where Subchapter D covers “Drugs for Human Use.” Within
Subchapter D there are a number of Parts, including Part 312, which describes
“Investigational New Drug Applications.” Collectively, this example selection within
the CFR is referred to as “21 CFR Part 312”; “Chapter I” and “Subchapter D” are
implied by “Part 312,” since Chapter I, Subchapter D is the only component of 21
CFR that contains a Part 312.) Among other topics, the CFR contains the central
principles of safely and ethically conducting a clinical trial. For a regulatory affairs
professional with a purview that includes the USA, the CFR is often the first pillar
of regulatory wherewithal.
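Purely as a toy illustration of the hierarchy just described (not a representation of the real CFR, which has many more components at every level), one might model the example selection as a nested structure and resolve a citation like this:

```python
# Illustrative only: the CFR hierarchy described above, modeled as a
# nested mapping so a citation such as "21 CFR Part 312" can be resolved.
CFR = {
    21: {                # Title 21: Food and Drugs
        "I": {           # Chapter I: FDA, Dept. of Health and Human Services
            "D": {       # Subchapter D: Drugs for Human Use
                312: "Investigational New Drug Applications",
            }
        }
    }
}

def resolve(title: int, part: int) -> str:
    """Find a Part anywhere under a Title; chapter and subchapter are
    implied, since part numbers are unique within a title."""
    for chapter in CFR[title].values():
        for subchapter in chapter.values():
            if part in subchapter:
                return subchapter[part]
    raise KeyError(f"{title} CFR Part {part} not found")

print(resolve(21, 312))  # -> "Investigational New Drug Applications"
```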
The “Gray Zone”
The flexibility deliberately left in the regulations is often evident in the guidance text
itself being inconclusive or situation-dependent, leaving the “door open” for the sponsor to consider what is the
most appropriate proposal for the particular investigational drug, specific patient
population, and disease landscape. Diseases that are life-threatening or otherwise
remain a high unmet medical need are of particular relevance for further consider-
ation, discussion, and even negotiation with the regulator. With the betterment of
public health as a shared common goal among all stakeholders (regulators, sponsors,
investigators, and patients alike), engagement with regulators is encouraged; as FDA’s
regulations on meetings (21 CFR 312.47) put it:
Meetings between a sponsor and the agency [FDA] are frequently useful in resolving
questions and issues raised during the course of a clinical investigation. FDA encourages
such meetings to the extent that they aid in the evaluation of the drug and in the solution of
scientific problems concerning the drug, to the extent that FDA's resources permit. The
general principle underlying the conduct of such meetings is that there should be free, full,
and open communication about any scientific or medical question that may arise during the
clinical investigation.
Such engagement constitutes the third pillar of regulatory wherewithal to guide drug
development in an ethical, safe, and productive manner. The ability to combine
information and learning from each of the pillars outlined into a plan that suits the
aim of all stakeholders is what constitutes a regulatory strategy. Regulatory strategies
are developed by regulatory affairs professionals for each critical decision within a
drug candidate’s development plan. Such critical decisions span across the life of the
drug’s development and touch on topics both with immediate need and with long-
term impact, such as the optimal timing for an initial IND submission, whether or not
to develop in pediatric populations, and choosing a marketing authorization pathway
to aim toward throughout the drug’s development.
Hypothetical Case Study
A hypothetical case study can be found within the requirements surrounding clinical
trial endpoint selection in potentially registrational clinical trials for patients with a
disease of high unmet need, using certain cancers as an example of such disease.
When designing a clinical trial, the regulatory affairs professional is often tasked
with selecting and confirming the appropriate endpoint of a given clinical trial. In
this case study, consider that the team has asked the regulatory affairs professional
to advise on whether there are any other endpoints that would be acceptable to
regulators for a marketing authorization, as a mortality endpoint may take too long to
reach in the clinical trial setting, considering the present-day unmet need of this
population.
Knowing that the clinical trial being designed is intended to study a patient
population of high unmet need, and aims to pursue a pathway that is as expedited
as possible toward marketing authorization, the regulatory affairs professional could
start by reviewing the “black and white” around the clinical trial results and
data requirements to support such an application in the USA. Accordingly, in
searching first through the CFR, the regulatory affairs professional will find 21
CFR part 314, subpart H entitled, “Accelerated Approval of New Drugs for Serious
or Life-Threatening Illnesses,” including 21 CFR part 314.510, “Approval based on
a surrogate endpoint or on an effect on a clinical endpoint other than survival or
irreversible morbidity,” which states:
FDA may grant marketing approval for a new drug product on the basis of adequate and
well-controlled clinical trials establishing that the drug product has an effect on a surrogate
endpoint that is reasonably likely, based on epidemiologic, therapeutic, pathophysiologic, or
other evidence, to predict clinical benefit or on the basis of an effect on a clinical endpoint
other than survival or irreversible morbidity. Approval under this section will be subject to
the requirement that the applicant study the drug further, to verify and describe its clinical
benefit, where there is uncertainty as to the relation of the surrogate endpoint to clinical
benefit, or of the observed clinical benefit to ultimate outcome. Postmarketing studies would
usually be studies already underway. When required to be conducted, such studies must also
be adequate and well-controlled. The applicant shall carry out any such studies with due
diligence.
Based on this, the regulatory affairs professional can advise the team that a clinical
trial should be proposed to FDA containing an endpoint that acts as a “surrogate”
that is “reasonably likely” to predict what a traditional, clear clinical benefit endpoint
would assess (e.g., mortality). Acknowledging that there are currently no approved
therapies in the malignancy being studied, and therefore nothing to design a com-
parative, head-to-head trial against, the regulatory affairs professional may consider
whether a Phase 2 study would be sufficient for initial marketing authorization.
Along with guiding the team in planning such a clinical trial, the team will also need
to be advised to plan for a post-marketing study including a certain clinical benefit
endpoint. As this is written in “black and white,” it is pertinent information for the
drug development team. Next, the regulatory affairs professional knows to seek
information beyond the CFR and identifies written FDA guidance titled, “Clinical
Trial Endpoints for the Approval of Cancer Drugs and Biologics” which advises,
among other content:
Surrogate endpoints for accelerated approval must be reasonably likely to predict clinical
benefit (FD&C Act § 506(c)(1)(A); 21 CFR part 314, subpart H; and 21 CFR part 601,
subpart E). While durable objective response rate (ORR) has been used as a traditional
approval endpoint in some circumstances, ORR has also been the most commonly used
surrogate endpoint in support of accelerated approval. Tumor response is widely accepted by
oncologists in guiding cancer treatments. Because ORR is directly attributable to drug effect,
single-arm trials conducted in patients with refractory tumors where no available therapy
exists provide an accurate assessment of ORR. Whether tumor measures such as ORR or
PFS are used as an accelerated approval or traditional approval endpoint will depend on the
disease context and the magnitude of the effect, among other factors.
With this, the regulatory affairs professional is empowered to advise the team of
regulation- and guidance-supported suggestions of possible endpoints for the team to
consider within the context of this particular malignancy and patient population.
Further, the regulatory affairs professional perceives an element of regulatory pre-
cedents that plays heavily into the FDA written guidance. In searching through the
US Prescribing Information documents (USPIs) and Summary Basis of Approval
documents (SBAs) of approved treatments for other forms of cancer that previously
represented a high unmet need, a number of surrogate endpoints can be identified
and noted as successful in serving as pivotal evidence to support the marketing
authorization of that particular product.
Drawing together the rules, regulations, guidance, and precedents, the team will
generate a proposed clinical trial design, including a selected surrogate endpoint,
to seek advice from major regulators. Such advice will generate a collaboration
between the regulator and sponsor to meet the common goal: the betterment of
public health.
With the refined clinical trial design, the sponsor will seek approval from the
regulators to conduct the study. This will entail multiple country- and region-specific
processes. In order to maximize the efficiency of the sponsor’s preparation, the
regulator’s review, and the applicability across countries, regulators worldwide
(FDA, EMA, and Japan’s Ministry of Health, Labour and Welfare (MHLW)) devel-
oped a set of specifications for applications to be submitted to regulators, entitled the
Common Technical Document (CTD), which is broken into five parts or “modules.”
Module 1 is a region-specific part that contains documents required by the regulator
in that specific country or region. Module 2 through Module 5 are constant interna-
tionally, as shown in Fig. 1.
The electronic version of the CTD (eCTD) was developed by ICH, enabling
electronic submissions in lieu of the previous paper-based submissions. This struc-
ture is employed for marketing authorization applications (MAA, BLA, NDA) in all
participating countries (referred to as “ICH countries”), including but not limited to
the USA, EU, Japan, Canada, Switzerland, and Australia. It is also employed for
IND submissions in the USA to FDA but is not uniformly employed for CTA
submissions to regulators within the EU.
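As a rough illustration, the snippet below checks that the top-level module folders of an eCTD submission are present. The m1–m5 folder names follow the common eCTD convention, but exact required contents vary by region and eCTD version, and the submission path shown is hypothetical.

```python
from pathlib import Path

# Top-level eCTD module folders: m1 is the region-specific module, while
# m2-m5 follow the common CTD structure. This is an illustrative
# completeness check only; real eCTD validation is far more extensive.
EXPECTED = ["m1", "m2", "m3", "m4", "m5"]

def check_ectd_skeleton(root: str) -> list[str]:
    """Return the list of expected module folders missing under root."""
    base = Path(root)
    return [m for m in EXPECTED if not (base / m).is_dir()]

missing = check_ectd_skeleton("submission-0001")  # hypothetical path
print("missing modules:", missing or "none")
```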
[Fig. 1 The CTD “triangle.” Module 1 (regional administrative information) is not
part of the CTD itself. Module 2 contains the quality overall summary, the
nonclinical overview and summary, and the clinical overview and summary;
Modules 3–5 hold the full quality, nonclinical, and clinical documentation]
Regulatory Affairs Considerations for Clinical Trials in the USA
Submitting an IND to FDA
The submission of an IND to FDA requires that the details of the investigational
product’s development be explained across many documents. These documents are
organized within the eCTD structure when an IND is being submitted to FDA.
Module 1 contains a cover letter, administrative forms, the table of IND contents, the
investigator’s brochure, and an introductory statement including a brief summary of
what the clinical development plan is foreseen to include. Module 2 contains the
summaries of all subsequent modules (Modules 3–5).
Module 3 will describe the drug substance and the drug product in terms of
ingredients, manufacturing process, name and address of the manufacturer, limits
imposed to maintain the manufactured products’ integrity, and testing results to
show that the products remain stable over time. The drug substance is the
manufactured active ingredient before it is prepared in the useable form, the drug
product. The drug product is the finished manufactured product prepared in a form
(“dosage form”) that is able to be used for immediate administration (e.g., a finished
tablet or capsule) or for preparation of administration (e.g., a vial containing the drug
in a solution to be diluted in a bag of inactive ingredients, like “normal saline,” for
intravenous infusion).
Module 4 includes the reports containing the results from preclinical studies,
meaning testing done in animals and “in vitro.” Such testing provides data on safety
and toxicology as well as what to expect of the drug in humans in terms of how it will
be absorbed, distributed, metabolized, and excreted (ADME).
Module 5 includes the clinical proposals in terms of how the investigational drug
will be handled in treating the people who have volunteered to participate in the
clinical trial. The clinical trial protocol and the informed consent document are among
the most important documents in the IND submission, as they contain details of how
assessments will be made (e.g., how often the doctor will check the patients’
bloodwork, or when an x-ray will be done) and how the investigational drug should
be administered (e.g., route of administration, dose, frequency).
Once submitted to FDA, the IND will be reviewed by manufacturing, preclinical,
and clinical experts employed by FDA in the “review division.” The typical review
timeline for a new IND is 30 days, during which time the FDA may ask the sponsor
questions, termed “requests for information.” Upon successful review of the IND,
the sponsor will receive a letter from FDA entitled “Study May Proceed” which
details the FDA acceptance of the proposal set forth by the sponsor in the IND.
Alternatively, if the FDA is not comfortable with the sponsor’s proposal in the IND,
the FDA can issue a “clinical hold” letter detailing what additional information is
needed prior to the sponsor being able to initiate the study in the USA. Such a
“clinical hold” can also be issued by the FDA during the ongoing conduct of the
study, if new information arises that leads the FDA to consider that study participants
are at undue risk under the current trial protocol.
Maintaining the IND
After the initial IND review has successfully completed, future studies intended
for the same or similar patient population can be introduced to the FDA under this
now-approved IND, simply with the new protocol and additional new documents
supporting the new trial (e.g., toxicology studies for a new patient population). The
FDA will review the new documents and will issue questions to the sponsor if
necessary. Unlike the original IND, there is no defined formal review timeline.
Technically, sponsors can initiate a study in less than 30 days after submitting the
new protocol to the IND. However, it is fairly common practice for sponsors to wait
an “informal” 30-day period before initiating the study. This practice can be helpful
to reduce the risk that the FDA will ask questions or place the study on clinical hold
after the study has been initiated; however, the FDA is able to issue “requests for
information” at any time, including after the initial 30-day period.
Amendments
As the IND receives initial approval and then stands as the core source of information
for an investigational product throughout the product’s development, it is common
practice for the contents of the IND to change over time. As an example, the IND
may have been initially submitted with the drug product in a vial formulation, and over
time, the sponsor developed a pre-filled syringe to facilitate ease of preparation and
use. Once the sponsor is ready to introduce the drug product utilizing the pre-filled
syringe formulation into clinical trials, an amendment to the IND will need to be filed.
Such changes can occur in areas that impact the other modules within the IND, also (e.
g., additional animal toxicology data become available requiring an amendment to
Module 4 of the IND; the clinical trial is changed to also enroll patients with an earlier
stage of the disease, requiring an amendment to the protocol in Module 5).
Safety Reporting
One of the most critical aspects of maintaining the IND as the core source of
information for the investigational product is safety reporting. Over the course of a
clinical trial, the sponsor and the regulator will learn more about the safety of the
investigational product. Most of this learning will come from “adverse event
reporting.” “Adverse event” is defined as any untoward medical occurrence associ-
ated with the use of a drug in humans, whether or not considered drug related.
Throughout the conduct of the trial, adverse events will be reported based on the
study volunteers’ experiences while enrolled in the study. These reported events
must be promptly reviewed by the sponsor. This is one of the most important
regulations stipulated in “black and white” within the CFR and is also a unique
exception to regulations presented in the CFR which are typically restricted to
conduct in the USA. Safety reporting regulations in the CFR apply to safety reports
received from “foreign or domestic sources.” Accordingly, sponsors must review
safety reports received from every source and assess the reports for their potential
reportability to the FDA and impact on continued treatment within the ongoing
clinical trials. The CFR is also more prescriptive than usual with regard to the
required timeline of safety reporting, outlining which types of reports must be
submitted no later than 7 calendar days and which no later than 15 calendar days
after the sponsor first receives the information.
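The calendar arithmetic involved is simple, as the illustrative sketch below shows; deciding which expedited clock (if any) applies to a given event is a medical and regulatory judgment that no snippet can capture, and the dates used here are invented.

```python
from datetime import date, timedelta

# Calendar arithmetic only: the IND safety reporting rules (21 CFR 312.32)
# set 7- and 15-calendar-day clocks for expedited reports, counted from
# the sponsor's initial receipt of the information. Classifying which
# clock applies to a given event is not modeled here.
def report_due(awareness: date, days: int) -> date:
    return awareness + timedelta(days=days)

awareness = date(2020, 6, 1)  # hypothetical receipt date
print("7-day report due: ", report_due(awareness, 7))
print("15-day report due:", report_due(awareness, 15))
```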
Annual Reporting
Further to the submission types that occur if and when a qualifying event occurs, as
described above, maintaining the IND also comes with a requirement to report certain
information annually, within 60 days of the anniversary date of when the FDA review
of the original IND came to a successful close. This routine report includes a summary
of what occurred over the reporting period, spanning all topics encompassed in
an IND (e.g., manufacturing changes, the status of ongoing preclinical studies, clinical
and safety updates, status of the investigational product worldwide).
Regulatory Affairs Considerations for Clinical Trials in the European Union
Evolution of EU Clinical Trials Legislation
A major change to the legislation governing clinical trials in the European Union
(EU), the Clinical Trial Regulation (CTReg) EU No. 536/2014, currently awaits
implementation and could revolutionize the way clinical trials are run in the EU in
the next few years. The goal of the CTReg is to create a more favorable environment
for conducting clinical trials in the EU by addressing many of the criticisms leveled
at the current procedures implemented by the Clinical Trials Directive (CTDir),
Directive 2001/20/EC, in 2004. It has been widely acknowledged that implemen-
tation of the CTDir led to a complexity and lack of harmonization that had direct
effects on the cost and feasibility of conducting clinical trials in the EU.
To understand the current procedures for clinical trial authorizations (CTAs) in
the EU, and the evolving regulation of clinical trials, it is useful to consider some
aspects of the European legislative process in general and the history of clinical trials
legislation. Prior to 2004, clinical trials were regulated by the national legislation of
each individual member state (MS) of the EU, and significant differences existed
between the requirements and procedures in each country. In an attempt to harmo-
nize clinical trial conduct, and the CTA processes, the CTDir was implemented into
national legislation of each MS from 2004 onward. The failure of the CTDir to fully
harmonize the CTA processes stems largely from the fact that EU directives are legal
acts which require each MS to achieve a result without dictating the means of
achieving that result in national legislation. This has led to each of the 28 national
regulatory agencies (national competent authorities (NCAs)) having differing sub-
mission package requirements and/or procedures. In contrast to directives, regula-
tions are legal acts that apply automatically and uniformly to all EU countries as
soon as they enter into force, without needing to be transposed into national law. EU
regulations are binding in their entirety, on all EU countries, and therefore imple-
mentation of the CTReg can be expected to overcome the current lack of harmoni-
zation in European CTA procedures.
EudraLex “The Rules Governing Medicinal Products in the European Union”
The body of EU legislation in the pharmaceutical sector is compiled into the ten
volumes of EudraLex. The basic legislation for medicinal products for human use is
contained in Volume 1 and includes the various directives and regulations pertinent
to both marketing authorization applications (MAAs) and CTAs. The basic legisla-
tion is supported by various guidelines, and Volume 10, “Guidelines for clinical
trials,” includes guidelines for CTAs approved under the current CTDir and guide-
lines intended to support the CTReg once it’s implemented. Volume 10 includes a
chapter on the CTA application, safety reporting, quality of investigational medicinal
products (IMPs), inspections, various additional guidelines, and finally links to
relevant basic legislation.
Submitting a CTA in the EU
A sponsor planning to conduct a clinical trial in the EU will select one or more
European countries for participation by conducting a feasibility assessment that
evaluates a wide range of factors including identification of suitable investigators
and sites and availability of patients meeting the planned inclusion and exclusion
criteria for the trial. Since the access to medicines and standard of care can vary
between countries, this can sometimes influence the feasibility assessment.
Once the countries have been selected, a first step is to evaluate the specific
document requirements and procedures of the NCA and Ethics Committees (EC) in
each MS to plan the CTA submission. As discussed earlier, the exact requirements
will vary by MS because the CTDir has not been implemented in a harmonized
fashion. In addition to various administrative documents and country-specific
required documents, the core package submitted to all NCAs includes the protocol,
investigator’s brochure (IB), and the Investigational Medicinal Product Dossier
(IMPD). The IMPD includes information on the quality of any IMP in the trial as
well as relevant nonclinical and clinical data that is available. An overall benefit-risk
assessment for the trial should also be included unless already included in the
protocol. The possibility also exists to cross-refer to the nonclinical and clinical
data summarized in the IB. CTA submissions are not made in eCTD format, but
sections of the IMPD typically follow the CTD headings, with the quality information
presented like Module 3 of the CTD.
The CTDir states that assessment of a valid request for clinical trial authorization
by the NCA be carried out as rapidly as possible and may not exceed 60 calendar
days. The procedure will involve a validation phase to check all the necessary
documentation has been provided and is clear. This is followed by the review
phase, and usually the NCA will issue a list of questions (“Grounds for Non-
Acceptance”) requiring adequate responses prior to approval. The exact timelines
and procedures vary by MS and are also influenced by the potential for “clock stops”
when a sponsor is given additional time to respond to questions.
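As a toy illustration of how clock stops stretch the calendar, consider the sketch below; the dates are invented, and as noted above the exact timelines and procedures vary by MS.

```python
from datetime import date, timedelta

# Illustrative only: under the CTDir, NCA assessment of a valid CTA may
# not exceed 60 calendar days, but "clock stops" (time the sponsor spends
# answering questions) do not count toward that limit, so the elapsed
# calendar time can be longer.
def projected_decision(valid_submission: date, clock_stop_days: int) -> date:
    return valid_submission + timedelta(days=60 + clock_stop_days)

# Hypothetical: a CTA validated on 2 March with a 14-day clock stop.
print(projected_decision(date(2020, 3, 2), clock_stop_days=14))
```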
Typically, the regulatory and ethics procedures run in parallel, and a sponsor may
not start the trial in a MS until a favorable opinion is received from both the NCA
and the EC.
Voluntary Harmonization Procedure (VHP)
For a multinational clinical trial (MN-CT), the sponsor has a choice of regulatory pathway, either submitting
separate CTAs in each MS via the relevant national procedures, as previously
described, or requesting assessment via a Voluntary Harmonization Procedure
(VHP). The decision to use VHP versus multiple national procedures should be
made on a case-by-case basis. The benefits of a single, harmonized procedure may
be attractive especially for trials involving many countries where there may be
significant operational benefits and the possibility of achieving a harmonized
assessment outcome across the participating MSs.
Maintaining the Clinical Trial Authorization
Amendments
The CTDir allows a clinical trial to be amended after it has started, and amendments
can be classified as non-substantial or substantial. An amendment to a trial is
considered substantial if the changes are likely to have a significant impact on the
safety or physical or mental integrity of trial participants or on the scientific value of
the trial. Guidance on what is typically considered substantial or not is found in the
European Commission communication 2010/C 82/01 (CT-1), and it is the sponsor’s
responsibility to assess any planned amendment on a case-by-case basis. A substan-
tial amendment must be submitted for review and can only be implemented once the
necessary NCA and/or EC approvals have been received. Non-substantial amend-
ments should be documented internally within the sponsor’s records and submitted
with the next substantial amendment.
In addition to the detail discussed pertaining to the USA and EU, there are many
regulatory considerations regarding the conduct of clinical trials in other countries.
Some countries (e.g., China, South Korea, India, Russia) require that, in order to achieve marketing authorization, patients from that country must be included in clinical trials submitted within the marketing authorization application. This requirement can be at least partly due to concerns about ethnic differences in how a drug may be metabolized, or to notable differences in overall patient care between the studied countries and the country with local data requirements. While each country has its own review process and procedures to assess the safety and appropriateness of a proposed new clinical trial, most follow the same basic structure. This basic structure typically starts with the sponsor submitting a clinical trial application containing information spanning manufacturing, preclinical, and clinical domains. Next, the local regulator reviews the submitted application and may issue questions to be answered by the sponsor within a defined period of time. Finally, if successful, the regulator will approve the proposed clinical trial to be conducted in that country.
There are additions and exceptions to this basic structure, which a global sponsor needs to develop the capabilities to understand and anticipate. An example can be found in Japan, where the Pharmaceuticals and Medical Devices Agency (PMDA) is the regulator with purview over clinical trial applications and marketing authorization applications. In Japan, the sponsor typically plans to meet with the PMDA prior to submitting the clinical trial application, for a consultation in which the PMDA advises on the overall acceptability of the basic proposal for the new clinical trial (e.g., checks whether a proposed clinical trial complies with the requirements for regulatory submission).
During the clinical trial planning, the regulations for all countries selected by the
sponsor to be included in the recruitment for study volunteers must be taken into
account to enable timely approval and initiation of the trial in the given country. In particular, if a country with local data requirements for marketing authorization is not included in the clinical development of the product, additional, dedicated studies will likely need to be conducted if the sponsor aims to obtain marketing authorization in that country. Unfortunately, it is not uncommon that the conduct of
these additional, dedicated studies can lead to years-long delays in access to the new
treatment in that country. Therefore, up-front planning leveraging regulatory acumen
and guidance is critical to the success of enabling global approval and access to new
medicines.
Summary
Cross-References
▶ ClinicalTrials.gov
▶ Cluster Randomized Trials
▶ Consent forms and Procedures
▶ Data and Safety Monitoring and Reporting
▶ Data Capture, Data Management, and Quality Control; Single Versus Multicenter
Trials
▶ End of Trial and Close Out of Data Collection
▶ Evolution of Clinical Trials Science
▶ Good Clinical Practice
▶ Implementing the Trial Protocol
▶ International Trials
▶ Investigator Responsibilities
▶ Multicenter and Network Trials
▶ Participant Recruitment, Screening, and Enrollment
▶ Post-approval Regulatory Requirements
▶ Reporting Biases
References
Clinical Trial Regulation (CTReg) EU No. 536/2014. Available via European Commission. https://ec.europa.eu/health/sites/health/files/files/eudralex/vol-1/reg_2014_536/reg_2014_536_en.pdf. Accessed 02 Sept 2019
Clinical Trials Directive (CTDir), Directive 2001/20/EC. Available via European Commission. https://ec.europa.eu/health/sites/health/files/files/eudralex/vol-1/dir_2001_20/dir_2001_20_en.pdf. Accessed 02 Sept 2019
Code of Federal Regulations. Available via Electronic Code of Federal Regulations (e-CFR). https://www.ecfr.gov/cgi-bin/ECFR?page=browse. Accessed 02 Sept 2019
European Commission communication 2010/C 82/01 (CT-1). Available via European Commission. https://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:C:2010:082:0001:0019:EN:PDF. Accessed 02 Sept 2019
Sect. 505(d) of the Food, Drug and Cosmetic (FD&C) Act. Available via FDA webpage on FD&C Act Chap. V: Drugs and Devices. https://www.fda.gov/regulatory-information/federal-food-drug-and-cosmetic-act-fdc-act/fdc-act-chapter-v-drugs-and-devices#Part_A. Accessed 02 Sept 2019
ClinicalTrials.gov
25
Gillian Gresham
Contents
ClinicalTrials.gov: History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
ClinicalTrials.gov: Content and Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
Characteristics of Trials Registered in ClinicalTrials.gov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
ClinicalTrials.gov Website Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 484
ClinicalTrials.gov: Registration and Results Reporting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
Registering a Trial in ClinicalTrials.gov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 486
Reporting Results in ClinicalTrials.gov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
Quality Control Review of ClinicalTrials.gov Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
Downloading and Analyzing Content from ClinicalTrials.gov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
Downloading Content for Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
Limitations of Analyzing Data from ClinicalTrials.gov . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 493
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 494
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
Abstract
ClinicalTrials.gov is a federally supported, web-based clinical trials registry
maintained by the United States (US) National Library of Medicine (NLM) at
the National Institutes of Health (NIH). It is available to health care professionals,
researchers, patients, and the public. Since its launch in 2000, over 325,000 clinical
research studies have been registered in ClinicalTrials.gov. Unlike other clinical
trial registries and databases, clinical trials registration for certain types of clinical
trials is mandated by law under Section 801 of the US Food and Drug Adminis-
tration Amendments Act (FDAAA 801). There are several components that make
up the ClinicalTrials.gov registration process, including trial registration itself,
results reporting, and the download and analysis of the ClinicalTrials.gov content.
G. Gresham (*)
Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center, Los Angeles, CA,
USA
e-mail: [email protected]
While the previous chapter focuses on clinical trials registration in general, this
chapter pertains to clinical trials registered in ClinicalTrials.gov. This chapter
provides an overview of the history of ClinicalTrials.gov, a description of the trials
currently registered in ClinicalTrials.gov, and a review of the Federal Requirements
for Registration in the United States. A summary of the registration process, trial
reporting, and data analysis procedures follows. The chapter concludes with an
overview of the limitations associated with the analysis and reporting of
ClinicalTrials.gov registration data.
Keywords
Clinical trials registration · ClinicalTrials.gov · Clinical trial · Interventional
study · Clinical trials database · Results reporting
ClinicalTrials.gov: History
Trial registration and its regulation in the United States, as we know it today, has
evolved and expanded over the last 30 years. Key events and policies related to ClinicalTrials.gov are illustrated in the historical timeline (Fig. 1). ClinicalTrials.gov definitions are consistent with those provided by the NIH and are listed in an online glossary on the ClinicalTrials.gov website: https://clinicaltrials.gov/ct2/about-studies/glossary. Some key definitions from the glossary are transcribed in Table 1.
International calls for trial registration first emerged in the late 1980s in response
to increasing awareness of publication and reporting biases (Dickersin 1990; Simes
1986). In 1986, Simes demonstrated the value of an international registry for clinical
trials using two case examples in ovarian cancer and multiple myeloma (Simes
1986). Simultaneous calls for registration were published at the turn of the twenty-
first century, providing additional examples of reporting biases and arguments for the
need for a comprehensive, prospective trial registry (Dickersin 1990; Dickersin and
Rennie 2003; Piantadosi 2017).
In 1997, the first Federal law to require trial registration was passed under Section 113 of the Food and Drug Administration Modernization Act (FDAMA), which called for a data bank containing information on privately or federally funded trials being conducted under investigational new drug applications for serious or life-threatening diseases and conditions.
Fig. 1 Timeline of key events and policies related to trial registration and ClinicalTrials.gov:
1997: First US law requiring trial registration passes: the Food and Drug Administration Modernization Act (FDAMA)
2000: The National Institutes of Health National Library of Medicine (NIH NLM) releases the online clinical trials registry ClinicalTrials.gov
2005: The International Committee of Medical Journal Editors (ICMJE) requires trial registration
2007: Congress passes the Food and Drug Administration Amendments Act (FDAAA) to expand ClinicalTrials.gov submission requirements
2008: Declaration of Helsinki revision promotes trial registration
2014: Notice of Proposed Rulemaking for FDAAA 801 released for public comment
2016: Final rule for FDAAA 801 and NIH Policy issued
2017: Revised Common Rule (45 CFR 46) issued
Table 1 (continued)

Term: ClinicalTrials.gov definition

Collaborator: An organization other than the sponsor that provides support for a clinical study. This support may include activities related to funding, design, implementation, data analysis, or reporting.

Phase: The stage of a clinical trial studying a drug or biological product, based on definitions developed by the US Food and Drug Administration (FDA). The phase is based on the study's objective, the number of participants, and other characteristics. There are five phases: early phase 1 (formerly listed as phase 0), phase 1, phase 2, phase 3, and phase 4. Not applicable is used to describe trials without FDA-defined phases, including trials of devices or behavioral interventions.

Phase 1: A phase of research to describe clinical trials that focus on the safety of a drug. They are usually conducted with healthy volunteers, and the goal is to determine the drug's most frequent and serious adverse events and, often, how the drug is broken down and excreted by the body. These trials usually involve a small number of participants.

Phase 2: A phase of research to describe clinical trials that gather preliminary data on whether a drug works in people who have a certain condition/disease (i.e., the drug's effectiveness). For example, participants receiving the drug may be compared to similar participants receiving a different treatment, usually an inactive substance (called a placebo) or a different drug. Safety continues to be evaluated, and short-term adverse events are studied.

Phase 3: A phase of research to describe clinical trials that gather more information about a drug's safety and effectiveness by studying different populations and different dosages and by using the drug in combination with other drugs. These studies typically involve more participants.

Phase 4: A phase of research to describe clinical trials occurring after FDA has approved a drug for marketing. They include postmarket requirement and commitment studies that are required of or agreed to by the study sponsor. These trials gather additional information about a drug's safety, efficacy, or optimal use.

Phase not applicable: Describes trials without FDA-defined phases, including trials of devices or behavioral interventions.

All definitions transcribed from the ClinicalTrials.gov glossary available at: https://clinicaltrials.gov/ct2/about-studies/glossary
The launch of ClinicalTrials.gov was followed by FDA guidance for Industry, issued in 2002 and withdrawn by the FDA in September 2017 (US guidance for industry 2002).
In 2004, the International Committee of Medical Journal Editors (ICMJE)
implemented a policy that required registration of all clinical trials as a condition
of consideration for publication (De Angelis et al. 2004). The policy applies to any
trial that started enrollment after July 1, 2005, where registration must occur before
patient enrollment and requires registration of trials by September 13, 2005, for those
that began enrollment before July 1, 2005. The ICMJE registration policy represents
an important landmark for trial registration, where a significant increase in trial
registration occurred after its implementation (Zarin et al. 2017a).
The World Health Organization (WHO) established a trial registration policy
shortly after, in 2006, releasing a minimum trial registration dataset of 20 items (Appendix 10.12). Additional information regarding the history of the development of the WHO International Clinical Trials Registry Platform (ICTRP) is described in the previous chapter, ▶ "Trial Registration" (Chap. 3.2). Additional international
efforts by the World Medical Association (WMA) to encourage trial registration
were made in 2008 at the 59th WMA General Assembly in Seoul, Republic of
Korea. At this time, the Declaration of Helsinki was amended to include trial
registration requirements initially outlined in Sections 19 and 30 (World Medical
Association 2013). These principles were again modified and re-ordered in 2013 at
the 64th WMA General Assembly, now corresponding to Sections 35 and 36 (World
Medical Association 2013). Section 35 indicates that every research study involving
human subjects should be registered in a public database, while Section 36 raises the
ethical obligation to publish and disseminate the results of research regardless of
whether the findings are statistically significant or “negative or inconclusive”
(Appendix 10.3). While not legally binding, the Declaration of Helsinki has
increased recognition and awareness of the importance and ethical obligations of
trial registration, especially among physicians conducting research in human
subjects.
The Food and Drug Administration Amendments Act (FDAAA) of 2007 became
one of the most important and influential policies for trial registration in the United
States. The FDAAA Public Law 110-85 was passed by Congress on September 27,
2007, and expanded registration and reporting requirements for ClinicalTrials.gov.
Such requirements, as detailed in Section 801 of FDAAA, included expanding the
clinical trial registration information for applicable clinical trials and adding a results
database (FDAAA 801). The law also mandated submission of results for applicable
clinical trials of drug, biologics, and devices that were approved, cleared, or licensed
by the FDA. The law includes the requirement that the responsible party of an applicable clinical trial must submit results within 1 year of the completion of data collection for the primary outcome, including summary results of the demographic and baseline characteristics, primary and secondary outcomes, points of contact, and agreements. Submission of adverse events, including frequent and serious adverse events, was not required by law until 2009 (FDAAA 801). Finally, the FDAAA law of 2007 introduced civil penalties
of “not more than $10,000 for each day of the violation after such period until the
violation is corrected” (FDAAA 801).
The final rule for FDAAA Section 801 was issued in September 2016, expanding the definition of clinical trial and providing additional requirements regarding trial registration and reporting (Zarin et al. 2016). The NIH simultaneously issued a policy requiring that all NIH-funded trials be registered, regardless of whether they are covered under FDAAA 801 requirements.
A summary of the registration process itself, trial reporting, and data analysis will
then be provided.
ClinicalTrials.gov: Content and Features

Characteristics of Trials Registered in ClinicalTrials.gov

ClinicalTrials.gov includes clinical trials being conducted in 207 countries, with
over a third being conducted in the United States only, half outside of the United
States, and the rest in both the United States and non-US countries. Study locations
were not specified in 12% of the registered trials. ClinicalTrials.gov is a living
database that is constantly being updated with new studies as well as undergoing
modifications and revisions to the study records as well as to the site itself. There-
fore, counts will vary with time, and the following summary of characteristics of
trials registered reflects trial counts completed as of December 31, 2019. Overall,
there were 325,860 studies registered in ClinicalTrials.gov of which 256,924 (79%)
were interventional, 67,486 (19%) were observational, and 601 were expanded
access. Types of interventions include drugs or biologics (59%), behavioral inter-
ventions (31%), surgical procedures (10.5%), and devices (12.5%). Among the
registered trials, 175,691 were completed to date (December 31, 2019). Trials can
also be characterized by lead sponsor, where industry was lead sponsor for 106,775
trials as of December 31, 2019, the US Federal Government including NIH was lead
sponsor for 37,706 trials, and all other funding sources were lead sponsor for
184,040 trials. While industry tends to fund larger, randomized drug intervention
trials, the NIH focuses on smaller, early development studies (Gresham et al. 2018;
Ehrhardt et al. 2015). An increasing number of behavioral trials funded by NIH have
also been observed in the last 10 years, which may include exercise and nutritional
studies.
ClinicalTrials.gov Website Content

Studies can be searched by condition, other terms, location, phase, funder type, and recruitment status. The menu at the top of the
page includes five tabs: “Find Studies,” “About Studies,” “Submit Studies,”
“Resources,” and “About Site.” Within the “Find studies” menu, users can access
a map of the studies and information on how to search for studies as well as how to
use, find, and read a study record. The “About Studies” tab provides information
about the studies, a list of additional websites about studies, and the glossary of
common terms. Resources for administrators (including registration guidelines to help with registering studies), support and training materials, and FAQs can be found under the "Submit Studies" tab. A "Resources" tab includes a list of
selected publications, clinical trials alerts, RSS feeds, the metadata for
ClinicalTrials.gov, and information on downloading ClinicalTrials.gov content
for analysis. Finally, the top menu includes additional information about the site
where readers can learn more about the history of ClinicalTrials.gov; the history,
policy, and laws surrounding trials registration; and the terms and conditions of the
site. An additional link to the Protocol Registration and Results System (PRS) site
is available for administrators and study sponsors/investigators to access and
register their trials. To access the PRS site, users must have a PRS account linked
to their organization name, username, and password. More information on this will
be provided in a later section.
Clinical trials are organized by the ClinicalTrials.gov study identification number
(NCT number) which is unique to each trial registered. Every study record includes
the NCT number, which is listed at the top of the record along with the study title,
key dates (e.g., First posted, Results first posted, Last update), and names of the
sponsors, collaborators, and responsible party. Every record also includes a dis-
claimer that states the following:
The safety and scientific validity of this study is the responsibility of the study sponsor and
investigators. Listing a study does not mean it has been evaluated by the U.S. Federal
Government. Read our disclaimer for details.
Each trial record includes the ICMJE/WHO minimum 20-item Trial Data Set
(Appendix 10.12). Trial information is organized by tabs including “Study Details,”
“Tabular View,” and “Study Results.” Additional links to the disclaimer and
resources for patients on how to read and interpret a study record are also available.
The "Study Details" tab divides trial information by section: study description, study
design, arms and interventions, outcome measures, eligibility criteria, contacts and
locations, and more information (e.g., publications). Related citations are automat-
ically identified from the NLM and added to the publication tab using the study
identification number (NCT number) and updated directly to the publications field.
The tabular view provides the same information as listed in the “Study Details”
page with some additional features and links. Links to study documents (e.g.,
protocol, consent forms) can be accessed and downloaded, if available. A “Change
History” link also exists, where a complete list of historical versions for the specific
study is available and posted to the ClinicalTrials.gov archive site. When applicable, study results are posted under the results tab and organized by baseline, outcome measure, and adverse event tables.
Table 2 Data entry tips for common data elements entered in ClinicalTrials.gov

Study status
Definition: Overall recruitment status: the recruitment status for the clinical study as a whole, based upon the status of the individual sites. Study start date: the estimated date on which the clinical study will be open for recruitment of participants, or the actual date on which the first participant was enrolled. Primary completion date: the date that the final participant was examined or received an intervention for the purposes of final collection of data for the primary outcome, whether the clinical study concluded according to the pre-specified protocol or was terminated. Study completion date: the date the final participant was examined or received an intervention for purposes of final collection of data for the primary and secondary outcome measures and adverse events (e.g., last participant's last visit), whether the clinical study concluded according to the pre-specified protocol or was terminated.
Data entry tips: Study status can alternate between the following: Not yet recruiting (participants are not yet being recruited); Recruiting (participants are currently being recruited, whether or not any participants have yet been enrolled); Enrolling by invitation (participants are being, or will be, selected from a predetermined population); Active, not recruiting (study is continuing, meaning participants are receiving an intervention or being examined, but new participants are not currently being recruited or enrolled); Completed (the study has concluded normally; participants are no longer receiving an intervention or being examined); Suspended (study halted prematurely but potentially will resume); Terminated (study halted prematurely and will not resume; participants are no longer being examined or receiving intervention); Withdrawn (study halted prematurely, prior to enrollment of the first participant). If the trial registered is multisite and one of the individual sites is recruiting, then the overall recruitment status for the study must also be "recruiting." Once the first patient is enrolled, the study start date should be updated to include the actual date. Once the study has reached the study completion date, the study completion date should be updated to reflect the actual study completion date.

Study description
Definition: Brief summary: a short description of the clinical study, including a brief statement of the clinical study's hypothesis, written in language intended for the lay public. Detailed description: extended description of the protocol, including more technical information compared to the brief description.
Data entry tips: The brief summary should be brief and written for a lay audience (limit 5000 characters). The detailed description can include more technical information, but it should not include the entire protocol nor duplicate information that is already recorded in other data elements (limit 32,000 characters).

Study design
Definition: Study design: a description of the manner in which the clinical trial will be conducted, including the following information: primary purpose.
Data entry tips: Primary purpose can be selected from a drop-down menu and includes treatment, prevention, diagnostic, supportive care, screening, health services research, basic science, and device feasibility. The study phase should be selected based on the NIH definitions (Table 1). The interventional study model may include single group, parallel, crossover, factorial, or sequential. All the roles that are masked should be indicated, including participant, care provider, investigator, outcomes assessor, or open-label (no masking). Study allocation can be randomized or non-randomized; note that quasi-randomized is not a true form of randomization. Anticipated enrollment should be specified based on the primary outcome power calculation. Once the study is complete, the actual enrollment should be updated.

Arms and interventions
Definition: Arm: a pre-specified group or subgroup of participants in a clinical trial assigned to receive specific interventions (or no intervention). Intervention: a process or action that is the focus of a clinical study.
Data entry tips: The arm title should be concise but allow for easy distinction from one arm to another. The arm definition is selected from a drop-down menu and includes experimental, active comparator, placebo comparator, sham comparator, no intervention, or other. If the intervention is a drug, the generic name should be used, as well as the dosage form, dose, frequency, and duration. Intervention type is selected from a drop-down menu and can include drug, device, biological/vaccine, procedure/surgery, radiation, behavioral, genetic, dietary supplement, combination product, diagnostic test, or other. If conducting an observational study, intervention name can be used to identify the intervention or exposure of interest.

Eligibility criteria
Definition: The eligibility module specifies the criteria for determining which people are (or are not) eligible to participate in the study.
Data entry tips: Enter age limits, if applicable; otherwise, enter "N/A (no limit)" from a drop-down menu. Sex refers to the classification of male or female based on biological distinctions, with a drop-down of "all," "male only," and "female only." Gender refers to the person's self-representation of gender identity; if applicable, a user can indicate that eligibility is based on gender, in addition to descriptive information about gender criteria. When entering eligibility criteria, include headings for the inclusion and exclusion criteria, followed by a bulleted list under each heading.

Outcome measures
Definition: Primary outcome: the outcome measure(s) of greatest importance specified in the protocol, usually the one(s) used in the power calculation. Most clinical studies have one primary outcome measure, but a clinical study may have more than one.
Data entry tips: When specifying an outcome, include the specific domain, method of aggregation, specific metric, and timepoint. Do not use acronyms. Each outcome measure should be presented separately, regardless of whether they share the same metric. If using a scale or questionnaire, specify the number of items, how they are scored, the minimum and maximum ranges, and how the scores are interpreted.

Definitions and information obtained from: https://register.clinicaltrials.gov/prs/html/definitions.html
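To make the outcome-measure tips concrete, a fully specified primary outcome names the specific domain, method of aggregation, specific metric, and timepoint. The sketch below expresses one hypothetical example as a simple record; the field names are illustrative only and are not the actual PRS data elements.

```python
# Hypothetical example of a fully specified primary outcome, following the
# data entry tips in Table 2 (specific domain, method of aggregation,
# specific metric, and timepoint; no acronyms). Field names are
# illustrative only, not the actual PRS schema.
primary_outcome = {
    "domain": "Systolic blood pressure",                 # what is measured
    "aggregation": "Mean change from baseline",          # how values are combined
    "metric": "Millimeters of mercury, automated cuff",  # specific metric, spelled out
    "time_frame": "Baseline and 12 weeks",               # timepoint(s) of assessment
}

# A reviewer-friendly, single-string rendering of the same outcome:
print(f"{primary_outcome['aggregation']} in {primary_outcome['domain'].lower()} "
      f"({primary_outcome['metric']}) at {primary_outcome['time_frame'].lower()}")
```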
It is the responsibility of the record owner to maintain and update the clinical trial information within the required timeframes in accordance with Section 801 of
FDAAA and 42 CFR 11.64. Records for active studies are required to be updated
at least once a year with some data elements requiring more frequent updates. Once
the record has been reviewed for accuracy and modified as necessary, the verification
date will be updated, and the responsible party/PRS administrator can approve and
release the record.
Reporting Results in ClinicalTrials.gov

Summary results for "applicable clinical trials" (ACTs), as defined, are required by the FDAAA to be submitted within 1 year after the trial's
primary completion date (date that the final subject was examined or received the
intervention for the purposes of final data collection for the primary outcome). Trials
that are not considered “applicable clinical trials” (Non-ACT) are not required to
submit results, such as Phase 1 trials, feasibility, or observational studies. However,
under the NIH Policy, any trial that meets the NIH definition for clinical trial and is
funded in whole or in part by the NIH must provide summary results.
As of December 31, 2019, a search of the ClinicalTrials.gov registry identified
41,074 studies (interventional or observational) with posted results. This has
increased dramatically from the 2,178 records with results identified in September
2010 and 23,000 in 2016, probably as a result of the expanded FDAAA reporting
requirements (Zarin et al. 2011, 2016). It is anticipated that this number will continue
to grow as more applicable trials are completed after the January 18, 2017 imple-
mentation date. Understanding and training in the results submission process will
become more important to ensure timely and accurate data entry.
Quality Control Review of ClinicalTrials.gov Records

All submitted trial records undergo a quality review prior to being released to the
public. Quality control review of ClinicalTrials.gov records includes both automated
validation rules incorporated within each item for entry and review by PRS staff. Once
the responsible party/PRS administrator has released and submitted the initial record,
PRS staff reviews the record for completeness and any additional errors, deficiencies,
or inconsistencies (ClinicalTrials.gov 2019). Implementation of standard quality con-
trol review criteria, standardized review comments, and similar training programs across
PRS staff ensures consistency of the reviews. Reviews are also audited by other review
staff members to ensure proper review of the records. Specific quality control review
criteria and accompanying documents are publicly available and can be found under
"Support Materials" at the PRS User's Guide and Review material link: https://clinicaltrials.gov/ct2/manage-recs/resources#ReviewCriteria. Review criteria are organized by data entry element and categorized within 13 different modules for
describing the study protocol. PRS review is estimated to take between 3 and 5 days for registration records and up to 30 days for results records. Reviewers
provide comments throughout the record that address general issues, formatting, and
specific notes on the completeness and appropriateness of each data element or result.
Comments may be identified as "major," which must be corrected or addressed within 15 calendar days, or "advisory," which are meant to improve the clarity of the record and can be addressed within 25 days from when the PRS staff sent
notification (ClinicalTrials.gov 2019). While they are able to identify errors in the entry
of information, the reviewers are not responsible for ensuring the scientific validity and
merit of the trial and cannot confirm that the information is compliant with policy or
legal requirements (Tse et al. 2018).
The most common problem identified upon quality review of registration infor-
mation is incomplete or insufficient information for the primary and secondary
outcomes. Common issues encountered when reviewing results include invalid or
inconsistent units of measure, insufficient information about scales, internal incon-
sistencies between different sections in the record, the inclusion of written results or
conclusions, and unclear baseline or outcome measures (Tse et al. 2018).
Once PRS comments have been received and addressed, the record owner will
resubmit the information for further review and comment. At this point, reviewers
may respond with additional comments and suggestions or release the record to the
public along with the assigned NCT identification number.
Downloading and Analyzing Content from ClinicalTrials.gov

Downloading Content for Analysis

There are two primary methods for downloading clinical trials content from
ClinicalTrials.gov. The first is directly from the ClinicalTrials.gov database,
where some search results are available for download. For instance, a search of
studies within a particular disease site may be conducted, and the total records or a
selection of records from the search can be exported and downloaded to different
formats (.csv, XML, plain text, tab-separated values, and PDF). The record for an
individual trial can also be downloaded directly from the study record page. The
downloaded content includes 20 fields in long format with trials listed by study ID.
Exported data types include NCT ID, title, study status, whether results are available,
conditions, interventions, outcome measures, phases, sponsors, gender/age, enroll-
ment (sample size), funders, study type (interventional or observational), design, and
key dates (start date, completion date, date last updated, etc.). While downloading
content directly from ClinicalTrials.gov can be a simple and efficient method to
access up-to-date study information, it is limited to 10,000 records at a time and does
not include all registration fields and study results.
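For readers who prefer a scripted download, the sketch below illustrates the idea using Python's standard library. The endpoint, parameter names, field names, and response layout follow the legacy ClinicalTrials.gov query API as it existed around the time of writing and are assumptions to be verified against the current API documentation; the search expression is purely illustrative.

```python
import json
import urllib.parse
import urllib.request

# Legacy ClinicalTrials.gov "study_fields" query endpoint. The endpoint,
# parameter names, field names, and response layout are assumptions based
# on the API as documented around the time of writing; verify against the
# current ClinicalTrials.gov API documentation before use.
BASE = "https://clinicaltrials.gov/api/query/study_fields"

params = {
    "expr": "ovarian cancer",                          # illustrative search expression
    "fields": "NCTId,BriefTitle,OverallStatus,Phase",  # registration fields to return
    "min_rnk": 1,                                      # results are paged by rank
    "max_rnk": 100,
    "fmt": "json",
}

url = BASE + "?" + urllib.parse.urlencode(params)
with urllib.request.urlopen(url) as resp:
    payload = json.load(resp)

# Assumed layout: a StudyFieldsResponse wrapper holding one dict per study,
# whose values are lists of strings.
studies = payload.get("StudyFieldsResponse", {}).get("StudyFields", [])
for study in studies:
    nct_id = (study.get("NCTId") or [""])[0]
    status = (study.get("OverallStatus") or [""])[0]
    title = (study.get("BriefTitle") or [""])[0]
    print(f"{nct_id}\t{status}\t{title[:60]}")
```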
The second method for downloading ClinicalTrials.gov content for analysis is
through the Clinical Trials Transformation Initiative (CTTI) Aggregate Analysis of
ClinicalTrials.Gov (AACT): https://www.ctti-clinicaltrials.org/aact-database. The
AACT database contains restructured and aggregated information on Clinical Trials
registered in ClinicalTrials.gov that is refreshed daily and available in different
formats (e.g., Oracle dmp, Pipe delimited text output, and SAS CPORT transport).
It also includes static versions of the database that are updated monthly and available for download. The cloud-based platform can be accessed upon free registration and download of the required programs: https://aact.ctti-clinicaltrials.org/download. The AACT database is a relational database linked by
NCT ID and organized by trial registration fields and categories. A comprehensive
data dictionary and schema are available on the website at the following link: https://aact.ctti-clinicaltrials.org/schema. Data in the AACT database have been cleaned and sorted, with additional calculated fields generated to facilitate and improve the analysis of trials. The AACT CTTI database has also integrated the
MeSH thesaurus, thus improving search and indexing capabilities. Regardless of the
method used to obtain and analyze clinical trials data, it is important to take the
limitations of the registries into consideration when interpreting and reporting the
results.
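As a sketch of this second method: the AACT cloud database is a PostgreSQL instance, so it can be queried with any PostgreSQL client. The host name follows AACT's public documentation, while the schema, table, and column names (ctgov.studies, ctgov.sponsors, nct_id, lead_or_collaborator, and so on) are assumptions to check against the AACT data dictionary; the credentials are hypothetical placeholders issued on free registration.

```python
import psycopg2  # PostgreSQL client library (pip install psycopg2-binary)

# The AACT cloud database is a PostgreSQL instance. The host follows AACT's
# public documentation; schema, table, and column names are assumptions to
# check against the AACT data dictionary, and the credentials are
# hypothetical placeholders issued on free registration.
conn = psycopg2.connect(
    host="aact-db.ctti-clinicaltrials.org",
    port=5432,
    dbname="aact",
    user="YOUR_AACT_USERNAME",
    password="YOUR_AACT_PASSWORD",
)

# Count interventional studies by phase and lead sponsor class, joining two
# of the relational tables on the NCT identifier.
query = """
    SELECT s.phase, sp.agency_class, COUNT(*) AS n_trials
    FROM ctgov.studies AS s
    JOIN ctgov.sponsors AS sp
      ON sp.nct_id = s.nct_id
     AND sp.lead_or_collaborator = 'lead'
    WHERE s.study_type = 'Interventional'
    GROUP BY s.phase, sp.agency_class
    ORDER BY n_trials DESC;
"""

with conn, conn.cursor() as cur:
    cur.execute(query)
    for phase, agency_class, n_trials in cur.fetchall():
        print(phase, agency_class, n_trials)
conn.close()
```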
Limitations of Analyzing Data from ClinicalTrials.gov

There are several limitations associated with the use and analysis of data from ClinicalTrials.gov. First of all, any analysis is based on the assumption that all trials are registered (Zarin et al. 2017b; Gresham et al. 2018). Although registration has significantly improved over time, especially during the last decade, one cannot assume that the studies registered in ClinicalTrials.gov are an unbiased representation
of the clinical research enterprise (Tse et al. 2018). A recent paper published by Tse
et al. (2018) identifies and describes ten common problems encountered when using
ClinicalTrials.gov for research (Tse et al. 2018). Some of the key issues raised
include the fact that ClinicalTrials.gov includes more than just interventional studies,
with approximately 20% of the registered studies being observational and 450 with
expanded access records (Tse et al. 2018). Thus an understanding of the definitions
and specific registration elements and requirements for each study type is essential.
Trial records may also be incomplete or incorrect, thus leading to potentially inaccurate reports and interpretations of the trial data. For example, missing registration fields, especially for optional data elements, or misclassification of data elements can occur, making it difficult to estimate and compare trends in clinical trials. Incomplete records may also be a result of the changing database elements
over time, where the ClinicalTrials.gov structure has evolved since its establish-
ment in 2000 (Zarin et al. 2017b). Mandatory data elements have also been added
and modified over time, such as the primary outcome measure data elements and
sub-elements (Tse et al. 2018). Data entered in the trial record can also be modified
at any time, making it difficult to determine which information is most appropriate
for analysis. While the change history of modifications can be accessed, it is
difficult to obtain and download previous versions of the trial record for analysis.
Furthermore, while quality review of the trial record and results is performed, it does not include verification of the scientific merit and validity of the information (Zarin et al. 2007).
Finally, an increasing problem includes duplicate registrations of clinical trials,
which can occur within ClinicalTrials.gov or across different trial registries (e.g.,
ICTRP). Duplicates within the ClinicalTrials.gov database are often a result of
follow-on or expansion studies being registered as separate records (Tse et al.
2018). There is currently no automated way to identify duplicates, although searches
of the trial titles and acronyms, summaries, and eligibility can be used to identify
similar records. As a result of a growing number of international trial registries, there
are also duplicate registrations across multiple registries, where almost 45% of
duplicates go unobserved or undetected (van Valkenhoef et al. 2016). There are
currently no methods for identifying identical trials across registries, which distorts and overestimates the count of registered trials. Prevention of duplicates across registries would require coordination and potential linkage using one universal registration number within a single platform, such as the WHO's ICTRP (Zarin et al. 2007).
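To make the title-based screening mentioned above concrete, the sketch below flags pairs of records whose brief titles are highly similar, using Python's standard difflib. The NCT numbers and titles are invented for illustration; in practice they would come from a registry download, and any flagged pair still requires manual review, since no automated method is definitive.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Simple screen for possible duplicate registrations based on brief-title
# similarity. Records are invented for illustration only.
records = {
    "NCT00000001": "A Phase 2 Trial of Drug X in Advanced Ovarian Cancer",
    "NCT00000002": "Phase II Study of Drug X for Advanced Ovarian Cancer",
    "NCT00000003": "Exercise Intervention for Fatigue in Breast Cancer",
}

def title_similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized titles are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.75  # tuning parameter, not a validated cutoff

for (id_a, title_a), (id_b, title_b) in combinations(records.items(), 2):
    score = title_similarity(title_a, title_b)
    if score >= THRESHOLD:
        print(f"Possible duplicate: {id_a} vs {id_b} (title similarity {score:.2f})")
```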
Conclusion
Recent work has demonstrated that meta-analyses can be conducted using automatic extraction of the quantitative data in registered records (Pradhan et al. 2019). While the
use and analysis of ClinicalTrials.gov registration data can provide valuable infor-
mation about a particular intervention, it is complex and requires an in-depth
understanding and knowledge of the registration and reporting requirements. Thus,
it is the responsibility of the lead sponsors as well as study investigators, staff, and
responsible parties to provide complete and accurate registration information in
order to contribute to scientific advancement and improve the clinical trials research
enterprise.
References
De Angelis C, Drazen JM, Frizelle FA, Haug C, Hoey J, Horton R, Kotzin S, Laine C, Marusic A, Overbeke AJPM, Schroeder TV, Sox HC, Van Der Weyden MB, International Committee of Medical Journal Editors (2004) Clinical trial registration: a statement from the International Committee of Medical Journal Editors. CMAJ: Can Med Assoc J 171(6):606–607
Dickersin K (1990) The existence of publication bias and risk factors for its occurrence. JAMA
263(10):1385–9. PubMed PMID: 2406472
Dickersin K, Rennie D (2003) Registering clinical trials. JAMA 290(4):516–23. PubMed PMID:
12876095
Ehrhardt S, Appel LJ, Meinert CL (2015) Trends in National Institutes of Health funding for clinical
trials registered in ClinicalTrials.gov. JAMA 314(23):2566–2567
Gresham GK, Ehrhardt S, Meinert JL, Appel LJ, Meinert CL (2018) Characteristics and trends of
clinical trials funded by the National Institutes of Health between 2005 and 2015. Clin Trials
15(1):65–74
Piantadosi S (2017) Clinical trials: a methodologic perspective. John Wiley & Sons
Pradhan R, Hoaglin DC, Cornell M, Liu W, Wang V, Yu H (2019) Automatic extraction of
quantitative data from ClinicalTrials.gov to conduct meta-analyses. J Clin Epidemiol
105:92–100
Simes RJ (1986) Publication bias: the case for an international registry of clinical trials. J Clin Oncol
4(10):1529–41. PubMed PMID: 3760920
Tse T, Fain KM, Zarin DA (2018) How to avoid common problems when using ClinicalTrials.gov
in research: 10 issues to consider. BMJ (Clinical Research Ed) 361:k1452
van Valkenhoef G, Loane RF, Zarin DA (2016) Previously unidentified duplicate registrations of
clinical trials: an exploratory analysis of registry data worldwide. Syst Rev 5(1):116
World Medical Association (2013) World Medical Association Declaration of Helsinki: ethical principles for medical research involving human subjects. JAMA 310(20):2191–2194
Zarin DA, Ide NC, Tse T, Harlan WR, West JC, Lindberg DA (2007) Issues in the registration of
clinical trials. JAMA 297(19):2112–2120
Zarin DA, Tse T, Williams RJ, Califf RM, Ide NC (2011) The ClinicalTrials.gov results
database–update and key issues. N Engl J Med 364(9):852–860
Zarin DA, Tse T, Williams RJ, Carr S (2016) Trial reporting in ClinicalTrials.gov—the final rule. N
Engl J Med 375(20):1998–2004
Zarin DA, Williams RJ, Tse T, Ide NC (2017a) The role and importance of clinical trial registries
and results databases. In: Gallin JI, Ognibene FP, Johnson LL (eds) Principles and practice of clinical
research. Academic, London, pp 111–125
Zarin DA, Tse T, Williams RJ, Rajakannan T (2017b) Update on trial registration 11 years after the
ICMJE policy was established. N Engl J Med 376(4):383–391
Funding Models and Proposals
26
Matthew Westmore and Katie Meadmore
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
Types of Research Funding Agencies and Their Societal, Political, and Organizational
Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
Political Context for Public Research Funding Agencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 499
Sources of Funding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500
Philosophies and Theories of Change of Funding Agencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
Whose Priority Is It Anyway? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504
Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
Funder Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
The Importance of Remit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
The Impact of Funder Policies on Research Culture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 506
Funding Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
Open Versus Commissioned Calls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
Common Funding Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
Proposal Assessment Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508
Typical Application Route and Decision-Making Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
Who Reviews the Applications? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
Assessment Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515
Success Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
Tips for Success . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518
Abstract
Clinical trials require funding – often a lot. Funders of clinical trials are not just
sources of funding however. They are actors in their wider research systems, have
their own philosophies, values, and objectives, and operate within different political, social, and economic environments.
M. Westmore (*) · K. Meadmore
University of Southampton, Southampton, UK
e-mail: [email protected]; [email protected]
Keywords
Research funding agency · Funders · Funding model · Decision-making ·
Funding decision · Sources of funding · Proposals and applications
Introduction
This chapter summarizes the similarities and differences of clinical trial funding
agencies around the world and the implications for funding models and proposals. It
is primarily aimed at trialists seeking to understand, and ultimately succeed in, applying for funding; it will also be of interest to research funding agencies
(RFAs) and regulators.
Funders of clinical trials are not just sources of funding. They are actors in their
wider research systems, have their own philosophies, values, and objectives, and
operate within different political, social, and economic environments. While there
are commonalities, differences in their context and culture shape their approaches to
funding decisions, what they are looking for from the research community, and
therefore how to successfully engage with them.
Figure 1 shows a hierarchy of factors from the wider environment, through to the
internal organizational setting that ultimately affects how funding schemes are
designed and what is expected of applicants.
Types of Research Funding Agencies and Their Societal, Political, and Organizational Context

Clinical trial funding agencies are not all alike. Broadly, they share the ultimate aim
of improving human health through research but the way they operate and the way
they measure success differ. These differences depend on the societal, political,
economic, research-system, and organizational context in which they operate. With many commonalities, these differences lead to different aims and theories of change of how to achieve those aims. This section outlines some of those differences.

Fig. 1 Hierarchy of factors that influence research funding agency policies and procedures and ultimately what applicants have to address
The information presented in this chapter is solely meant to provide an overview
of the different types of contexts in which funding organizations operate and are
molded by. To do this, we have necessarily caricatured different types of organiza-
tions and generalized their objectives, values, and approaches. We have done this in good faith, to inform readers in a simple way rather than to suggest that any actual funding agency neatly fits the caricature.
Political Context for Public Research Funding Agencies

The allocation of funding for research is not just a technical process but is a
political one as well; this is especially true for publicly funded research. At what
level and how influential politicians should be is a controversial topic and beyond
the scope of this work. What is important to understand is how the political context
flows through the hierarchy of factors, Fig. 1, through to the expectations placed
on researchers.
Politics (and therefore policy makers) can influence the research that is funded across a spectrum of ways: from direct involvement in individual decision-making (for example, in the prioritization of individual calls for research as the primary customer of the eventual evidence) to setting the wider policy context in which research funding agencies interpret their role (for example, in how some countries' national policies for economic growth have been internalized by funding agencies as a desire to demonstrate potential impact at the application stage).
When done well, this connects researchers with policy makers and ensures
research reflects the needs and desires of those that fund it through taxation; when
done badly, this represents an unacceptable imposition on academic freedom and
allows political bias to cast a shadow across research.
Politicians who take little interest in research are perhaps no less worrying than those who take too much: a survey of Canadian members of parliament and/or senior aides found that 32% knew nothing about the role of the Canadian Institutes of Health Research (CIHR), despite it being the primary federal funder of research (Clark et al. 2007).
US President Barack Obama summed up the tension:
Obama pledged in a speech to protect “our rigorous peer-review system” to ensure that
research “does not fall victim to political manoeuvres or agendas” that could damage “the
integrity of the scientific process.” However, he added that it was important that “we only
fund proposals that promise the biggest bang for taxpayer dollars”. (Obama 2013)
Sources of Funding
What funders are trying to achieve, and how they believe they will achieve it, also significantly influences their policies and procedures. These could loosely be called philosophies and theories of change. Some funders will be quite explicit about this, and others will be influenced by a wider set of norms, culture, and tacit knowledge.
There are philosophies and theories that apply very generally to research and those
that are very focused on clinical trials.
Starting generally, the oldest and most influential concept is the Haldane princi-
ple. This is the idea that politicians may set the overarching strategic allocation of
funding (e.g., what to spend on research into the liberal arts, what to spend on
engineering, what to spend on health-related research); these are political questions.
Beyond that, decisions about what to spend research funds on should be made by
researchers rather than politicians; these are technical questions. This principle has
underpinned the entire peer review process since the Haldane report was published
in 1918 (HMSO 1918) and has been influential in subsequent funding policy not just
in the UK but around the world. Haldane in its purest sense gives primacy to
academic freedom in deciding on the direction of research. This has been highly
successful and has led to many modern economies being based on the advances in knowledge, culture, and technology that have resulted from it. Haldane has its
limitations however.
The research community may well be the best experts to decide on highly technical scientific questions, but which research to support and how it should be delivered are often subjective, value-laden issues. This is particularly the case when
the intended purpose of research is more utilitarian than enlightenment or non-
specific advances in knowledge; when the intended user of the research is not
another researcher (but, for example, a policy maker, clinician, or patient). This
requires a wider range of opinions, experiences, and expertise.
The first major challenge to the Haldane principle also originated in the UK. In
1971, the Rothschild Report (HMSO 1971) raised the issue, and proposed solutions
to it, that the research community and commercial funders while undoubtedly
successful in some fields were failing to deliver in others. Most notably in applied
research areas where parts of society needed more immediate answers to more
specific questions; in the context of this work, questions like is treatment A better
than treatment B? Rothschild developed the concept that applied R&D must have a
customer and that customer should be influential in deciding which research should
be carried out. Rothschild was and in some ways remains highly controversial. It has
nonetheless changed the nature of research and research funding.
Rothschild also led to the concept of market-failure research funding, whereby public funders should not support research that would happen anyway – research that would be funded by commercial or philanthropic funders. Doing so is not only unnecessary, and therefore a poor use of limited public funding that could be spent on other areas, but can also lead to crowding out: those that would have funded in that area now do not, either because they no longer need to or because it no longer makes commercial sense (for example, public funding results in public knowledge that cannot be protected for commercial gain). The result is that the additional public funding actually reduces the overall investment in an area rather than increasing it.
Influenced by the work of Mariana Mazzucato in The Entrepreneurial State
(Mazzucato 2018), an alternative view has also developed. In certain circumstances,
public funding can indeed have the opposite effect whereby the injection of funding
in an area causes other private and philanthropic funders to also fund in that area: this
concept is called crowding-in (as opposed to crowding-out). The public funder must
of course choose the area and nature of its investment carefully – this in turn will
again change the ways in which it makes its funding decisions.
Turning more specifically to clinical trials, a common narrative in the development of new treatments is the translational pathway. The US NIH defines translational research as:
Translational research includes two areas of translation. One is the process of applying
discoveries generated during research in the laboratory, and in preclinical studies, to the
development of trials and studies in humans. The second area of translation concerns
research aimed at enhancing the adoption of best practices in the community. Cost-effec-
tiveness of prevention and treatment strategies is also an important part of translational
science. (Rubio et al. 2010)
How strongly the funder subscribes to this model, and sees its role in facilitating the movement of ideas along the pathway, will have a significant impact not only on the methodology expected but also on who sets research priorities and which outcomes would be of greatest interest.
Whose Priority Is It Anyway?

Patient and public involvement means research being carried out "with" or "by" the public rather than "to," "about," or "for" them. This includes, for example, working with research funders to prioritize research, offering advice as members of a project steering group, commenting on and developing research materials, and
undertaking interviews with research participants. More on this topic can be found
in section 3 ▶ Chap. 30, “Advocacy and Patient Involvement in Clinical Trials.”
Impact
Research impact is the effect research has beyond academia. There is no single
definition nor approach to measuring it, but it is often described as research that
has wider benefits and influences on society, culture, and the economy. It remains a
contentious and complex subject and also depends on the individual funder’s
context and where they sit in the translational pathway; one funder’s impact is
another funder's input. What is universal, however, is that every funder wants it. It
speaks to the funder's fundamental purpose, and it forms an important part of how
the funder is held to account by those providing the funding. A public funder has to
justify its overall impact to government, a philanthropic funder to its donors, and a
commercial funder to its shareholders. A discussion of the nature and role of research
impact is beyond the scope of this work, but it is worth underlining how central
impact is to the relationship between research funder, funded researcher, and
wider stakeholders.
Funder Policies
Funders encode all of the above into policies and procedures that guide their own
actions and the expectations placed on the research community. These will cover all
areas relating to the research over which the funder either has responsibility (such as
legislative requirements or financial rules ensuring good use of funds) or wishes to
have influence (such as research integrity or transparency).
Different funders with different contexts will of course have different sets of
rules, policies, and procedures that must be understood and complied with.
All funders will have limitations on the nature of research they will support. This
flows from the fundamental purpose of the organization through the intended
purpose of the scheme being applied to. Remits will operate at different levels –
the whole funder, a funding program, or a specific call. Different funders will specify
their remits differently; some might be methodologically driven (e.g., by clinical trial
phase), others by clinical area or health need. It cannot be overstated how important
it is to understand the remit of a call or program being applied to. Preparing
applications can be an enormous piece of work, yet up to 20% of applications can
be deemed out of remit and never be considered for funding (for example, see the
NIHR Health Technology Assessment success rates: https://www.nihr.ac.uk/
documents/hta-programme-success-rates/23178).
While the majority of the focus of RFAs is on the relevance, quality, and impact of
the research they support, funders are increasingly paying attention to how their
policies and procedures have a wider influence on research delivery and culture.
RFAs sit in a highly influential position and are increasingly using that position to
improve research: for example, the move from a Haldane-dominated view of the
world to Rothschild, the rise of the impact agenda, the insistence on open access
publication, and wider clinical trial transparency.
Of particular importance in the global movement toward quality improvement in
research is the Research Waste and Rewarding Diligence (REWARD) Alliance. The
REWARD Alliance was launched at the REWARD/EQUATOR Conference, 28–30
September 2015, stimulated by the seminal work of Iain Chalmers and Paul Glasziou
on avoidable research waste in 2009 (Chalmers and Glasziou 2009) and a series of
articles in the Lancet in 2014 (Lancet Series Research: increasing value, reducing
waste 2014), detailing expert consensus recommendations for all sectors of the
research ecosystem. The Alliance’s purpose is to facilitate efforts to maximize the
potential for research contributions by addressing five cross-cutting ideals: (1) The
“right” research priorities are set, with input from the users of research, including
patients and clinicians; (2) Studies are appropriately designed by building on what is
already known and are robustly conducted and analyzed using up-to-date
methods to minimize bias; (3) Research regulation and management requirements
are proportionate to risks; (4) All information on research methods and study
findings is accessible; (5) Study reports are complete and usable. Both the
REWARD Alliance and the 2014 Lancet series noted that progress would require
action independently and collaboratively by different stakeholders, namely
researchers, funders, regulators, and publishers, with the inclusion of patients and
the public embedded in the activities of each of these stakeholder groups. An
international group of RFAs called Ensuring Value in Research (http://www.
ensuringvalueinresearch.org) has formed and developed a conceptual model and
ten guiding principles to address these issues. These are now beginning to inform
RFA policies globally (Fig. 2).
A second initiative particularly relevant to this work is the World Health Orga-
nizations’ Joint statement on public disclosure of results from clinical trials (World
Health Organization 2017). This sets out a number of expectations regarding clinical
trial transparency. At the time of writing, 21 RFAs had signed up and are now
implementing policies to deliver on these expectations. Even where RFAs are not
formal signatories, however, the importance of transparency policies is growing, and
such policies are likely to be part of the requirements for funded researchers.
Funding Models
Given the different contexts, environments, aims, and objectives of funders, different
models of funding have been developed. Each will follow a different process of
decision-making and place different expectations on the research community during
the application and delivery phases of research. Fundamentally, however, all are
attempting to achieve the same aim: the delivery of relevant, high-quality, usable,
and accessible answers to specific research questions. Where they differ is in how,
and by whom, the research question is crafted.
The two most common funding models used are open call (also called researcher-led
or response mode) and commissioned call (also called targeted or contract research).
In open call funding models, the RFA sets a high-level remit and researchers drive
the research questions and topics, proposing questions on any topic so long as it falls
within the organization's remit. In contrast, in commissioned call funding models the
RFA, working with stakeholders, fully specifies the research question and the
research community competes to be the best team to deliver it. Typically, in these
cases applicants will be provided with a specific brief or vignette, and the program of
research in the application must address this.
Sitting between open and commissioned calls are thematic calls. Some funding
organizations issue themed calls for research in areas that have been identified as
health challenges or as scientific, clinical, or community priorities. These are
specified more tightly than open calls but more broadly than commissioned calls.
Across all of these models, the importance of remit should be restated. Applicants
must ensure they are not only within the remit of the funder or program but also the
call in question. Deviations from the call specification may be tolerated but this
would be a high-risk strategy and would have to be robustly defended by the
applicant.
Regardless of which RFA is applied to and where the funds are sourced (public,
philanthropic, commercial, etc.), all RFAs have to make decisions regarding which
research applications they should invest in. Good decision-making processes are
seen as integral and essential to the research process (Nurse 2015). This is no easy
challenge as the number of applications received by funding organizations is often
large and the amount requested by the applicants typically outweighs the amount of
resource available (Guthrie et al. 2018). As such, funding organizations have
rigorous processes in place to facilitate decision-making in order to whittle down
the number of competitive applications and ensure that funds are awarded to the best
applications.

Table 1 (continued)

Funding model: Project funding
  Description: Funding of a single project. The project may have multiple subprojects that collectively address a narrow need
  Purpose: Where a single or small number of interconnected subprojects are required
  Typically used by: Public, philanthropic, commercial

Funding model: Block funding
  Description: A research institution is awarded substantial funding with limited direction from the funder on what to use it for
  Purpose: Provides long-term sustaining and capacity-building funding; allows for the highest levels of academic freedom and creativity
  Typically used by: Public, philanthropic, commercial

Funding model: Infrastructure funding
  Description: Funding to provide infrastructure to support research projects funded by other means
  Purpose: Provides long-term sustaining and capacity building to support a wider research community
  Typically used by: Public, philanthropic, commercial

Funding model: Sandpits and other variants
  Description: Prefunding workshops
  Purpose: To bring research communities together to collaborate in ways that would not otherwise happen, e.g., where highly creative or radical approaches or transdisciplinary research is required
  Typically used by: Public, philanthropic

However, "best" does not have just one definition; it depends on the organizational
context and priorities. RFAs also need to balance academic freedom and creativity
with accountability and value for money. This balance will again vary depending on
the nature of the funder and the nature of the research.
There are many types of approaches and processes involved in allocating research
funding (see Table 2).
The overarching processes for allocating funding are largely similar across public
and philanthropic funders (Nurse 2015). Commercial, own account, self-funded, and
crowd-funded research follow a vast heterogeneity of processes that cannot be
usefully summarized here. This section therefore focuses on public and philan-
thropic funders.
In the current landscape, the use of triage, face-to-face committee meetings, and
external peer review comprise a typical approach by funders to decide which
applications to fund (see Fig. 3). This standard route has been developed over
many years to embed the Haldane Principle and principles of openness and fairness.
Typically, once an application has been submitted, it goes through an internal triage
system. Applications which are considered competitive are then sent to peer
reviewers. Different RFAs approach peer review differently; some will rely on
sending applications singly or in small numbers to individual external and
independent experts; others will send all applications in a call to a face-to-face
committee of experts; other funders do both.

Table 2 (continued)

Stage: (row continued from the earlier part of Table 2)
  Brief description: ...environment (e.g., telephone conference or online such as Skype)
  Pros (not exhaustive): ...inclusive than face-to-face as it reduces travel, time, and geographic constraints
  Cons (not exhaustive): ...(and so more likely to get applications they like funded); reliant on technology

Stage: Inclusion of stakeholder perspectives
  Brief description: Applications are reviewed by lay people, patients/carers of people with a specific health condition, or people from a specific population
  Pros (not exhaustive): Funders gain a lay, patient, and/or population perspective
  Cons (not exhaustive): Biases exist (toward own health condition/population); it can be difficult to find PPI contributors for narrow criteria

Stage: Sandpits and other variants
  Brief description: A sandpit model aims to bring together researchers, funders, and reviewers to interactively discuss and revise proposals at a workshop
  Pros (not exhaustive): Provides a forum for brainstorming to foster creativity and generate research proposals quickly; shorter timeframe for proposal review and revision
  Cons (not exhaustive): Relies on appropriate selection of participants; may not be inclusive as it involves 3–5 day workshops

Stage: Random allocation for applications above a certain threshold
  Brief description: There are many different ways that this could be done. One approach sorts applications into three groups through peer review according to a certain threshold (e.g., not fundable, probably fundable, definitely fundable). Not fundable applications are declined and definitely fundable applications are accepted; decisions in the middle tier are made through random allocation up to the amount of resource available
  Pros (not exhaustive): Eliminates biases; transparent
  Cons (not exhaustive): Uncertainty around outcome; may not capture very good applications; still needs reviewers and/or a committee for the initial ranking process. Even if statistically fair, the use of random chance to decide funding is not welcomed by the research community
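The random allocation row in Table 2 describes, in effect, a small algorithm, and a short sketch may make it concrete. The tier labels, costs, and function below are invented for illustration and do not reflect any particular funder's procedure.

import random

def allocate(applications, budget):
    """Tiered random allocation: fund every 'definitely fundable'
    application, decline the 'not fundable' ones, and draw the middle
    tier at random until the remaining budget is exhausted."""
    funded = [a for a in applications if a["tier"] == "definitely fundable"]
    middle = [a for a in applications if a["tier"] == "probably fundable"]
    remaining = budget - sum(a["cost"] for a in funded)
    random.shuffle(middle)  # the random element of the model
    for app in middle:
        if app["cost"] <= remaining:
            funded.append(app)
            remaining -= app["cost"]
    return funded

# Example: peer review has already assigned tiers and costs (in thousands).
apps = [
    {"id": "A", "tier": "definitely fundable", "cost": 300},
    {"id": "B", "tier": "probably fundable", "cost": 250},
    {"id": "C", "tier": "probably fundable", "cost": 250},
    {"id": "D", "tier": "not fundable", "cost": 400},
]
print([a["id"] for a in allocate(apps, budget=600)])  # e.g., ['A', 'B']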
If both external and committee peer review are used, external reviewers' comments
and recommendations on the proposal are considered at a funding committee
meeting (also referred to as panels or boards). Applications considered fundable may
then be ranked and a final list of funded applications is drawn up. Before applicants
are informed, the outcome may first need to have formal sign-off from the RFA's
governance structures and/or an external sponsoring agency (such as a government
department for a public funder).

Fig. 3 Illustration of the common elements involved in an application route and potential areas for
differences
Different funders will operate variations on the process steps included in this
model. For example, the Canadian Institutes for Health Research (CIHR) and the
Australian National Health and Medical Research Council do not send proposals out
to external peer review. Other funders will not hold face-to-face meetings, making
their decisions instead via electronic panels, scoring, and discussion.
Differences in the way different process steps are carried out may also include the
assessment criteria against which applications are judged (see Table 3), whether the
application is in response to a commissioned call or researcher-led, whether appli-
cants can make revisions and rebuttals following feedback from reviewers, the
number of internal triage stages, and the scoring system used for rating the applica-
tion (Guthrie et al. 2018).
Although these processes are widely used, there is also much criticism surround-
ing them. For example, it is suggested that peer review (both external and internal
funding committees) is heavily biased and not reliable (Guthrie et al. 2017). Opin-
ions on what is fundable can be very subjective and vary widely. Largely these
issues, real or perceived, come from the long-term trend of research funding becom-
ing a professionalized, largely technical, and bureaucratic process. These trends are
in turn due to the desire of funders to operate fair, transparent, and efficient
processes.
Over the last decade, funders have begun to explore variations to this typical
approach as well as alternative processes for funding, such as sandpits. However,
these approaches (alternative and more traditional) still have limited evidence on
how efficient and effective they are. Future work needs to explore in which circum-
stances different approaches work best, how, and for whom.
Assessment Criteria
Applicants should check the remit of the organization and the research program/call
they are submitting to before work on writing the application begins.
General or higher level assessment criteria usually consist of a few core values
that are important to a funder. For example, the UK’s NIHR states three general
assessment criteria: (1) need for the evidence; (2) value for money; (3) scientific
rigor. In a review of the UK research councils, Nurse (2015) suggests that there
are three key factors that should be considered when making funding decisions for
scientific research: (1) who the researcher(s) are; (2) the content of the research
program; and (3) the context within which the research is being undertaken. In
practice, more criteria that focus on more specific questions under these headings
are used during review.
Common assessment criteria include scientific rigor, remit, and relevance (fit with
the funder's and research program's objectives and research strategy), potential
research impact, innovation/originality of the proposal, value for money, the need to
generate evidence, and potential societal impact (see Table 3). Funding organizations
may use a combination of these criteria and may weight the criteria according to their
values. For example, a funder of late phase pragmatic trials will weight meaningful
and sufficiently resourced patient and public involvement throughout the research
project more highly than a funder of early phase efficacy studies.
Success Rates
Not all submitted applications will get funded. Given the effort involved in devel-
oping and submitting an application, applicants have careful decisions to make
regarding where to submit their research proposals to enhance their chances of
success. In addition to checking the funding organization's remit and objectives,
applicants may also want to consider the success rates associated with a funding
organization or a particular research program. More interesting than the numeric
value of the success rate may well be the reasons behind it being high or low and
what the applicant can learn from that.
Success rates provide information about the percentage of applications that
receive funding from the total number of applications that are reviewed. Note that
there are also a number of applications (generally about 10–20%) which will not
make it past the first internal triage stage (i.e., remit and relevance). This stage is
sometimes forgotten, and success rates often do not include these applications in the
calculation; i.e., the true success rate could be lower than stated.
At each decision stage of the review process, some applications are rejected.
These figures are fairly consistent across funders internationally and are reported by
the funding organizations (usually found on their website). In general the overall
success rate of funding organizations is between 15% and 25% (for example, see
https://www.timeshighereducation.com/news/uk-research-grant-success-rates-rise-first-
time-five-years for UK examples and https://report.nih.gov/success_rates/ for
NIH data). For those funding organizations that have a two-stage review process,
for example, the UK's NIHR and the Wellcome Trust, about 50% of applications are
rejected at each stage.
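The gap between stated and true success rates is simple arithmetic; the following worked example uses invented round numbers rather than any funder's actual figures.

submitted = 1000                      # all applications received
triaged_out = 150                     # ~15% rejected at internal triage (remit/relevance)
reviewed = submitted - triaged_out    # 850 applications go forward to review
stated_rate = 0.20                    # success rate published over reviewed applications
funded = int(reviewed * stated_rate)  # 170 awards

true_rate = funded / submitted        # success rate over everything submitted
print(f"stated: {stated_rate:.0%}, true: {true_rate:.1%}")  # stated: 20%, true: 17.0%

# A two-stage process rejecting ~50% at each stage compounds the same way:
print(f"two-stage survival: {0.5 * 0.5:.0%}")  # 25% of stage-1 entrants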
Taking all these considerations together, across all the layers of the Fig. 1 hierarchy
of factors, there are a number of practical tips for success that seem simple yet are
not always followed. Doing these things can have a big impact on success rates:
• Choose your funder and program carefully. Squeezing an ill-fitting idea into the
wrong funder or scheme is unlikely to work.
• Make sure it is in remit and has a chance of being competitive. Look at the
funder’s past portfolio to get an idea of the type and quality of projects previously
funded.
• Write your application specifically for the funder and scheme of choice.
• Consider the broader expectations of the funder, scheme, or expert reviewers.
Don’t just focus on one element such as methodology. Different funders will have
different criteria and will weight them differently.
• You need to convince the peer review experts, external or committee, that the
question is important. This will be highly dependent on the nature of the funder
and the makeup of the expert and peer reviewers. Consider who it is important to
and why. Challenge yourself: is it important or just interesting? The question
may be important but the proposed study might not be.
• Cut-and-paste high-level prevalence or incidence figures are not convincing
on their own. Funders will want to know what difference the proposed trial
will actually make to those that use, deliver, or plan health services and
treatments.
• Remember you will need to convince those outside of your specialty. That will
include trialists and clinicians working in other areas, methodologists and statis-
ticians, patients, and the public.
• Make sure there is a real research gap, that the proposed trial will add to
what is already known, and that what you are proposing is plausible given
the existing evidence base. The best way of doing this is to base the new
proposal on a systematic review of the existing evidence. If there isn’t one,
do one.
• You need to convince the RFA that you have the right approach to delivering the
trial. This will include methodology but will also be much broader. How feasible
is it? Who are you partnering with? Do you have the right multidisciplinary team
(clinicians, statisticians, patients, etc.)?
• Make sure your sample size is credible and meaningful. Will it be achievable?
Will it change the meta-analysis?
• Consider wider issues around how you will do the research. Consider issues of
transparency, integrity, and openness.
• Consider value for money. Is the answer to the question worth the investment?
How much is the trial costing per participant? The more expensive studies will be
expected to make a bigger difference to society.
Funders of clinical trials are not just sources of funding; they are actors in their wider
research systems, have their own philosophies, values, and objectives, and operate
within different political, social, and economic environments. These will all affect
their policies and practice and ultimately what applicants need to do to work
successfully with them. Different funding agencies will use a range of funding
models depending on what they are trying to achieve. The decision-making process
will vary by funder and by scheme. It is likely to be based on multiple criteria; all
must be considered (see Table 3). Applications will be reviewed by a range of
experts, often from beyond the field of expertise of the applicant.
Key Facts
• Not all funders of clinical trials are alike; they have their own sources of funding,
stakeholders, philosophies, values, and objectives, and operate within different
political, social, and economic environments. These will all affect their policies and
practice, and ultimately what applicants need to do to work successfully with them.
• Different funding agencies will use a range of funding models depending on what
they are trying to achieve; from open calls for proposals limited only by broad
remit statements through to commissioned calls where the funder specifies the full
research question.
• With nuanced variations, many funders make decisions following a standard
procedure involving internal review, external expert and peer review, and funding
committee review.
• Understanding the political, philosophical, and contextual issues of funding
agencies is important, but there are also some simple practical tips for success
for applicants that are useful across all funders.
Cross-References
References
Chalmers I, Glasziou P (2009) Avoidable waste in the production and reporting of research
evidence. Lancet 374(9683):86–89. https://doi.org/10.1016/S0140-6736(09)60329-9
Clark DR, McGrath PJ, MacDonald N (2007) Members of Parliament's knowledge of and attitudes
toward health research and funding. CMAJ 177(9):1045–1051. https://doi.org/10.1503/
cmaj.070320
Guthrie S, Ghiga I, Wooding S (2017) What do we know about grant peer review in the health
sciences? F1000Res 6:1335. https://doi.org/10.12688/f1000research.11917.2
Guthrie S, Ghiga I, Wooding S (2018) What do we know about grant peer review in the health
sciences? An updated review of the literature and six case studies. RAND Corporation, Santa
Monica. https://www.rand.org/pubs/research_reports/RR1822.html
HMSO (1971) A framework for Government research and development. HMSO, London
HMSO (1918) Report of the Machinery of Government Committee under the chairmanship of
Viscount Haldane of Cloan. HMSO, London. https://www.civilservant.org.uk/library/1918_
Haldane_Report.pdf. Accessed 10 June 2020
Mazzucato M (2018) The entrepreneurial state, 1st edn. Penguin, London
Nurse P (2015) Nurse review of research councils. GOV.UK. Available at: https://www.gov.uk/
government/collections/nurse-review-of-research-councils. Accessed 27 June 2019
Obama B (2013) Public papers of the Presidents of the United States: Barack Obama, Book I, p 345.
https://www.govinfo.gov/app/details/PPP-2013-book1/PPP-2013-book1-doc-pg342. Accessed
10 June 2020
Rubio D, Schoenbaum E, Lee L, Schteingart D, Marantz P, Anderson K, Platt L, Baez A, Esposito
K (2010) Defining translational research: implications for training. Acad Med 85(3):470–475.
https://doi.org/10.1097/ACM.0b013e3181ccd618
Series (2014) Research: increasing value, reducing waste. Lancet 383(9912):156–185, e3–e4.
https://doi.org/10.1016/S0140-6736(13)62229-1
World Health Organization (2017) Joint statement on public disclosure of results from clinical trials.
Available at: https://www.who.int/ictrp/results/jointstatement/en/. Accessed 28 June 2019
Financial Compliance in Clinical Trials
27
Barbara K. Martin
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
CMS Policy Regarding Reimbursement of Costs in Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
Categorization of Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 523
The Clinical Trial Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
Coverage with Evidence Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526
Billing Compliance in Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
Coverage Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
Qualifying for Medicare Coverage Under CTP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
Device Classification and Medicare Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
Qualifying for Medicare Coverage with Evidence Development . . . . . . . . . . . . . . . . . . . . . . . . . . 530
Identifying Research Charges Billed to CMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 530
Medicare Advantage Plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
Issues in Non-compliance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
Subject Remuneration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 531
Waiving of Co-pays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
Reimbursement for Subject Injury . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
Billing Non-compliance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
Summary: Best Practices for Billing Compliance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
B. K. Martin (*)
Administrative Director, Research Institute, Penn Medicine Lancaster General Health,
Lancaster, PA, USA
e-mail: [email protected]
Abstract
Financial compliance considerations are an important aspect of the design and
funding of clinical trials. Such research often involves a mixture of sponsor
funding and insurance billing for the clinical services provided in the trial. In
the United States, what can be billed to insurance and what must be paid by a
sponsor are in general determined by the Centers for Medicare & Medicaid
Services (CMS). Other third-party payer policies largely mimic those of CMS.
Medicare reimbursement for clinical trials is determined by the interagency
agreement between the Food and Drug Administration and CMS regarding
investigational devices, the clinical trials policy, and CMS guidance on coverage
with evidence development. Non-compliance in research billing carries risk of
monetary penalty. To ensure compliance, providers and institutions must conduct
coverage analyses to determine if a trial qualifies for CMS coverage and, if it
does, which clinical items and services can be billed to CMS. Claims with items
and services being billed to CMS must be identified with research codes and
modifiers. While these policies and procedures have brought some clarity to
research billing, there are still murky waters that providers and institutions need
to navigate.
The risk from non-compliance is not theoretical. Several cases of large fines to
major research institutions have been well publicized. The imperative for having
a comprehensive program for billing compliance continues to mount, and the cost
of this necessary infrastructure must be part of the calculation of institutional
overhead for clinical research.
Keywords
Billing compliance · Coverage analysis · Clinical Trial Policy · Coverage with
evidence development · Qualifying clinical trials · Waiving of co-pays · Subject
remuneration · Subject injury
Introduction
The design and conduct of an appropriate, informative, and successful clinical trial
of course is multifaceted. The trial must be based on a scientific question worth
answering. It must be ethically sound. The design must give the trial a reasonable
chance to actually answer the question that it is intended to answer. It must be
conducted with rigor and integrity. Also importantly, it must be adequately and
appropriately funded.
The funding of clinical trials differs from and is more complex than the funding of
other research studies. That is, studies that involve clinical services of any type, from
diagnostic testing or monitoring to surgery to administration of drugs, biologics, or
devices, could be funded entirely by a sponsor, but they are more likely to involve a
mixture of sponsor funding and billing to insurance for some or all of the clinical
services provided as part of the trial. In the United States, what can be billed to
insurance and what must be paid by a sponsor in general are determined by policy of
the Centers for Medicare & Medicaid Services (CMS). Many third-party payers have
policies and practices that largely mimic those of CMS. However, with CMS policy
come consequences for non-compliance that involve monetary and even criminal
penalties. Therefore, the issue of financial – particularly billing – compliance is now
an important concern in the conduct of clinical trials. This chapter explores the
history of current US policy and the resulting financial compliance considerations
that are now necessary. Acronyms frequently used in this chapter are explained in
Table 1.
CMS Policy Regarding Reimbursement of Costs in Clinical Trials
The standard for CMS reimbursement has always been anchored to the phrase
“reasonable and necessary” from the Social Security Act, which established the
Medicare program. More specifically, Medicare is intended to reimburse for clinical
services and products that are “reasonable and necessary for the diagnosis and
treatment of an illness or injury, or to improve the functioning of a malformed
body member” (42 US Code § 1395y). Experimental treatments generally have
not met this standard for reimbursement, as it has been interpreted to mean that the
service or product must be demonstrated to be safe and effective. However, confu-
sion has long existed around the “routine” testing and treatment that individuals
might receive as part of a clinical trial that they would also receive if they were not
enrolled in a clinical trial.
Table 1 (continued)

Acronym: NCD
  Full form: National coverage determination
  Description: A determination by the Centers for Medicare & Medicaid Services as to whether Medicare will pay for an item or service; in the absence of a national coverage determination, an item or service is covered at the discretion of the local Medicare Area Contractor

Acronym: OIG
  Full form: Office of the Inspector General
  Description: Specifically, the Office of the Inspector General of the US Department of Health and Human Services (HHS), dedicated to protecting the integrity of HHS programs, combating fraud, waste, and abuse, and improving program efficiency; the majority of resources go toward oversight of Medicare and Medicaid

Categorization of Devices

Questions about the coverage of investigational devices led to consideration by the
FDA and HCFA of a means to determine whether some devices might legitimately
be covered by Medicare.
In the interagency agreement between FDA and HCFA regarding reimbursement
of investigational devices, FDA agreed to categorize the clinical investigation of
medical devices to aid HCFA in its reimbursement decisions. Specifically, FDA
would label as Category A those investigations of Class III devices (requiring pre-
market approval) that are innovative and for which the safety and effectiveness of the
device has not been established (i.e., they are experimental). Category B investiga-
tions, on the other hand, would be those that involve devices where the incremental
risk of the device is the primary risk in question (i.e., the underlying questions of
safety and effectiveness have already been resolved). Therefore, devices in Category
B investigations were able to meet the criteria of “reasonable and necessary,” and the
devices and the associated hospital and professional charges could qualify for
reimbursement by HCFA.
The Clinical Trial Policy
Reimbursement for care in non-device trials remained unclear until 2000. Early in
that year, the Institute of Medicine of the National Academy of Sciences released the
report, “Extending Medicare Reimbursement in Clinical Trials” (Aaron and Gelband
2000). The report summarized the state of reimbursement at that time, suggesting
that although Medicare did not have a policy to reimburse for care in clinical trials,
and many private insurers had policies that excluded coverage, a significant propor-
tion of costs of patient care in clinical trials were indeed paid for by insurers. This
was because providers bill for the services, and without any obvious identification of
a beneficiary’s participation in a clinical trial, the insurers were none the wiser. That
current state, though not untenable, left patients and providers with uncertainty
regarding whether costs would be covered. The IOM report recommended an
explicit policy that “Medicare should reimburse routine care for patients in clinical
trials in the same way it reimburses routine care for patients not in clinical trials.”
(Aaron and Gelband 2000).
As a result of the IOM report, President Clinton on June 7, 2000, issued an
executive order directing Medicare to create explicit policy and to immediately begin
to reimburse for the costs of routine services provided to participants in clinical trials
(The White House 2000). In response to the executive order, HCFA issued the
national coverage determination (NCD) for Routine Costs in Clinical Trials on
September 19, 2000 (CMS 2000). This NCD is widely referred to as the Clinical
Trial Policy and exists in much the same form today. To state first what the policy
excludes, it does not allow for coverage of investigational items or services them-
selves, unless an item or service is “otherwise covered outside the trial.” Addition-
ally, “items and services provided solely to satisfy data collection and analysis needs
and that are not used in the direct clinical management of the patient” are
not covered. What the policy does require is the coverage of routine costs from
“qualifying” clinical trials. Qualifying clinical trials must have therapeutic intent for
patients with a diagnosed disease, or intent to diagnose a clinical disease. The policy
also outlines seven desirable characteristics of a qualifying clinical trial. HCFA, or
now CMS, has never instituted a method for investigators to certify that their
research meets the seven desirable characteristics. Instead, clinical trials are deemed
as qualifying if they are federally funded, are conducted under an investigational new
drug (IND) application, or meet criteria to be exempt from having an IND. The
policy also provides that if Medicare is billed for a study that is not qualifying, the
providers are liable for the costs and could be investigated for fraud.
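The deeming rules just described reduce to a short predicate. The sketch below is only a paraphrase for illustration; the field names are invented, and an actual determination rests on the full NCD text and the trial's documentation.

def is_deemed_qualifying(trial):
    """Deeming under the Clinical Trial Policy, as described above:
    a trial with therapeutic or diagnostic intent is deemed qualifying
    if it is federally funded, conducted under an IND, or IND-exempt."""
    if not (trial["therapeutic_intent"] or trial["diagnostic_intent"]):
        return False
    return (trial["federally_funded"]
            or trial["conducted_under_ind"]
            or trial["ind_exempt"])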
Coverage with Evidence Development

The IOM in its report, in addition to advocating for coverage of costs of routine care
in clinical trials, urged HCFA “to use its existing authority to support selected trials
and to assist in the development of new trials” (Aaron and Gelband 2000). That is,
the agency should identify research that is of particular significance to the care of
its beneficiaries and provide reimbursement for more than just routine care.
The Committee pointed to the example of the National Emphysema Treatment
Trial, in which HCFA agreed to pay for lung volume reduction surgery (LVRS)
only when the procedure was performed as part of the trial, which was a randomized
comparison between LVRS and standard medical management of emphysema.
Response to this recommendation of the IOM report took longer. On July 12,
2006, CMS published guidance on “National Coverage Determinations with Data
Collection as a Condition of Coverage” (CMS 2006). The so-called coverage with
evidence development (CED) allows for coverage of a service under
specific conditions. One such condition, more specifically labeled “coverage with
study participation” (CSP), provides for reimbursement of the item or service “only
when provided within a setting in which there is a pre-specified process for gathering
additional data, and in which that process provides additional protections and safety
measures for beneficiaries, such as those present in certain clinical trials.”
The intention is to allow coverage of items and services for which CMS does not find
there to be enough evidence to support their use as “reasonable and necessary” but
for which additional data could help clarify the value of the service. The decision to
allow coverage under the CED mechanism is made as part of the CMS coverage
determination process, and as such, it is open to public comment (CMS 2014b).
When CMS revised the Clinical Trial Policy in July of 2007, it added reference to
coverage of items and services “when provided in a clinical trial that meets the
requirements defined in [a specific] national coverage determination” (CMS 2007).
This version of the CTP has remained unrevised for the subsequent decade and more.
Billing Compliance in Clinical Trials

CMS policies that began to be put in place a couple of decades ago have provided much
needed clarity to reimbursement for services provided in the context of clinical trials.
However, with this clarity has come the requirement for compliance, and the
possibility of penalty for non-compliance. This is a big concern to providers and
institutions that bill CMS and/or receive federal grant funds. This section will
discuss what is entailed in billing compliance in clinical trials.
Coverage Analysis
In the 2000 IOM report, the Committee discussed the status quo in reimbursement
and concluded that much of routine care in clinical trials was indeed already billed
to, and paid for, by insurance, mostly without the payers knowing when their
beneficiaries were in trials. However, because of the lack of clarity in policy, the
possibility remained that either providers or patients could be left holding the bill
after a denial of coverage or when sponsor support did not adequately cover services.
Providers and institutions differed in their billing practices. According to the Com-
mittee, only General Clinical Research Centers (GCRCs), typically funded by the
National Institutes of Health and located within major academic hospitals, seemed to
have a rigorous and consistent approach to billing for tests and services provided in
the context of their clinical trials (Aaron and Gelband 2000). These centers reviewed
all the charges for participants in their studies, often early phase research with
intensive treatment and monitoring, and determined which charges were for
unproven therapies or for tests and services that the participants would not have
had outside the clinical trial. These costs were not billed to CMS or other payers,
while routine charges were. At the time of the IOM report, this practice was
uncommon and mostly confined to GCRCs, which were able to devote substantial
resources to their clinical trial billing.
Today, this practice is considered imperative to a comprehensive system for
billing compliance. Such a system starts first with coverage analysis, the term that
has come to refer to the process of (1) determining if a trial qualifies for coverage
under the CTP, under CED, or as a Category A or B IDE study approved by CMS
and (2) analyzing all activities required by the study and whether they will be
supported by study funds or billed to insurance. Then, as was modeled by
GCRCs, all clinical trial charges must be reviewed and triaged for payment
according to the coverage analysis.
Determining coverage for a study is the first challenge in billing compliance, as
discussed below.
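In practice, the output of a coverage analysis is a billing grid that maps each protocol activity to a payer before enrollment begins, and charge review then applies that grid. A minimal sketch, with invented activity names and categories:

# Billing grid produced by the coverage analysis: each protocol
# activity is assigned a payer in advance.
BILLING_GRID = {
    "screening labs": "insurance",          # routine cost under the CTP
    "study drug": "sponsor",                # investigational item
    "research-only imaging": "sponsor",     # data collection only, not clinical management
    "toxicity monitoring labs": "insurance",
}

def triage_charge(activity):
    """Route a clinical trial charge per the coverage analysis;
    anything unmapped is held for manual review."""
    return BILLING_GRID.get(activity, "hold for manual review")

print(triage_charge("study drug"))          # sponsor
print(triage_charge("unscheduled biopsy"))  # hold for manual review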
Qualifying for Medicare Coverage Under CTP

The interagency agreement and the CTP have done much to clear up uncertainty
regarding clinical trial billing. However, a couple of issues remain unclear, and insti-
tutions and providers are still left to make their own determinations of the appropri-
ateness of billing or to consult with their local Medicare area contractors (MACs).
One area that has generated much discussion, particularly in the oncology field, is
that of early phase clinical trials. As mentioned above, to qualify for coverage under
the CTP, a trial must have therapeutic intent for patients with a diagnosed disease or
intent to diagnose a clinical disease. Many phase I trials are designed with safety and
toxicity measures as the primary outcomes of interest. Can such trials be considered
therapeutic in nature, as required under the CTP? Many oncology researchers argue
that these trials do have therapeutic intent; if the drugs under study were not thought
to have the potential to be therapeutic, they would not be under study. Nonetheless, if
the measure of therapeutic intent is the primary outcome of the study for which it is
designed and powered, then phase I studies arguably might not meet this criterion.
A second murky area is that of research on a device, such as a diagnostic tool, that
did not require an IDE, or research on a procedure or technique, such as the use of
surgical intervention instead of medical management. Because these interventional
studies do not involve an IDE, IND, or IND exemption, the rules for covered devices
and qualifying trials do not apply. If the trial has federal sponsorship, it qualifies
under the default of the CTP, but if it does not, the trial can be caught in a dead zone
of no clarity on coverage. Federal sponsorship or CED coverage for a procedure or
technique gaining a foothold in practice may be more likely for later phase studies.
However, non-sponsored early phase development may be hampered by virtue of
the lack of ability to bill for an innovative procedure or technique in the context of
a clinical trial.
Device Classification and Medicare Coverage

In the two decades following the interagency agreement that established the
categorization of devices by the FDA, this process in and of itself was not enough
to establish Medicare coverage. That is, mere classification by the FDA did not
guarantee that Medicare would cover routine costs or that it would cover the costs for
a Category B device. Local MACs had to be consulted and pre-authorization sought,
often for each patient enrolled. CMS, with FDA concurrence, subsequently made
changes to its regulations regarding coverage of devices and routine costs as part
of IDE studies, with the intent to streamline its coverage determinations. Effective on
January 1, 2015, Medicare coverage determinations for IDE studies were centralized
(CMS 2014a). That is, sponsors are now required to submit IDE protocols to CMS
for a central review process. Studies that are approved for coverage of the investi-
gational device (Category B only) and routine costs are published on the CMS
website.
CMS and FDA further collaborated to revise the definitions of Category A and B
devices, to support CMS’s centralized decision-making on coverage (FDA 2017).
Each category has three similar sub-categories of devices, and whether or not there
are data to support the questions of safety and effectiveness determines the classi-
fication of the device as Category A or B. That is, a new device with no marketing
approvals will be considered Category A if “data on the proposed device or similar
devices do not resolve initial questions of safety and effectiveness” and Category B
if there is available information on the proposed device or similar devices that
supports the proposed device’s safety and effectiveness. If an approved device is
being studied for a new indication, it will be classified as Category A if the
information from the proposed or similar devices related to the previous indication
does not resolve questions of safety and effectiveness for the new indication and
Category B if it does. Finally, a proposed device that has “different technological
characteristics compared to a legally marketed device,” such that the information
from the marketed device can’t resolve the questions of safety and effectiveness, will
be considered Category A, while a device that has similar technological character-
istics would be classified as Category B if the information on the approved device
provides applicable data on safety and effectiveness. The amendment to rules on
device categorization further allows for changes in the categorization – most likely
from A to B but in some instances from B to A – as research on the device
progresses. FDA will categorize a device at the start of an IDE study but can consider
changes to the categorization with amendments or supplements to the study or at the
request of the sponsor.
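Distilled to its decision logic, each of the three sub-categories turns on the same question. The sketch below paraphrases the revised definitions; the field names are invented.

def ide_category(device):
    """FDA categorization of an IDE device, paraphrasing the revised
    definitions: Category B whenever existing data on the proposed or
    similar devices resolve the questions of safety and effectiveness
    (S&E), Category A otherwise."""
    if device["kind"] == "new, no marketing approval":
        resolved = device["data_resolve_se"]
    elif device["kind"] == "approved, new indication":
        resolved = device["prior_indication_data_resolve_se"]
    else:  # compared against a legally marketed device
        resolved = device["similar_tech_data_resolve_se"]
    return "B" if resolved else "A"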
As a consequence of device categorization by FDA and centralized review by
CMS, billing in the device research space may seem more focused and defined.
However, questions about coverage may arise from non-significant risk (NSR)
device studies (FDA 2006) or studies that can be conducted without an IDE (21
CFR § 812.2(c)), as these studies are not reviewed by FDA or approved by CMS.
Rather, one or more IRBs serve as the surrogate for the FDA in reviewing and
approving the study. Such studies may involve, for example, investigations related to
the clinical use of a new imaging approach, or optimal clinical use of an approved
monitoring device. Providers and institutions may need to negotiate with sponsors
and local MACs over payment for approved items and services that are being used
outside of common practice. For monitoring devices, it is not just the cost of the
devices themselves that may be in question but also coverage for medical review of
the data being reported by those devices. Providers and institutions may need to
determine their risk tolerance for practices for which the “reasonable and necessary”
standard could perhaps be challenged.
Qualifying for Medicare Coverage with Evidence Development

When CMS issues a national coverage determination for a specific item or service,
the NCD may limit coverage to certain indications, populations, or providers and
centers with demonstrated expertise in delivery of the item or service. It also may
allow for reimbursement only when the item or service is provided in the context of
a research study, under the coverage with evidence development mechanism. For a
trial to qualify for reimbursement of the item or service under the CED mechanism,
CMS must review and approve the protocol. The items and services that may be
covered under CED are listed on the CMS webpage. The CMS webpage typically
references the NCD for the item or service it is covering under CED. The NCD then
lists the research questions that CMS is interested in having answered and the
research studies that are currently approved by CMS.
It is worth noting that the CED mechanism for coverage pertains only to items
and services that are covered under Medicare parts A and B. The CED mechanism of
coverage could not be applied to self-administered drugs that fall under Medicare
part D.
Identifying Research Charges Billed to CMS

Charges that are billed to CMS as routine care provided in the context of clinical
trials are to be labeled as such in the claims. This requirement by CMS also has its
roots in the events surrounding the advent of the CTP. President Clinton’s exec-
utive order directed HCFA to establish a tracking system for the charges billed to
and reimbursed by Medicare that were generated in clinical research (The White
House 2000).
First, an International Classification of Diseases (ICD) code is required.
The ICD-10 code Z00.6 identifies a charge as part of an "encounter for examination
for normal comparison and control in [a] clinical research program.”
Second, the National Clinical Trials (NCT) number assigned by the clinicaltrials.
gov registry is required. Conveniently, this registry had already been established by
the Food and Drug Administration Modernization Act of 1997 (FDAMA), devel-
oped in conjunction with the National Institutes of Health, and made publicly
accessible in 2000 (National Library of Medicine). Initially, federally and privately
funded clinical trials conducted under investigational new drug applications were
required to be registered and to provide this information to the public, healthcare
professionals, and researchers. In 2005, the International Committee of Medical
Journal Editors (ICMJE) began to require that authors seeking to publish clinical trial
results provide evidence of registration of the clinical trial. The ICMJE’s interest was
to promote clinical trial registration as a means of addressing the well-documented
bias in the publication and reporting of clinical trial results. The requirement for
registration of clinical trials has since been expanded by the FDA Amendments Act
of 2007 (FDAAA). As a result, this registry provides a comprehensive mechanism
for identifying clinical trials when billing for items and services provided in these
trials.
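The two coding requirements lend themselves to an automated check at claim submission. In the sketch below, the ICD-10 code Z00.6 and the NCT number format come from the text above, while the claim structure and function name are invented:

import re

NCT_PATTERN = re.compile(r"^NCT\d{8}$")  # clinicaltrials.gov identifier format

def research_claim_errors(claim):
    """Flag a research claim that is missing the identifiers described
    above: the ICD-10 code Z00.6 and the trial's NCT registry number."""
    errors = []
    if "Z00.6" not in claim["icd10_codes"]:
        errors.append("missing ICD-10 Z00.6 research-encounter code")
    if not NCT_PATTERN.match(claim.get("nct_number", "")):
        errors.append("missing or malformed NCT number")
    return errors

claim = {"icd10_codes": ["C50.911"], "nct_number": "NCT01234567"}
print(research_claim_errors(claim))  # ['missing ICD-10 Z00.6 research-encounter code']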
Medicare Advantage Plans

Medicare Advantage plans pose another challenge to billing in clinical trials. When
Medicare Advantage plans came into being, the interagency agreement between FDA
and CMS regarding coverage of investigational devices was already in place, so the
cost of this benefit was calculated into the capitated payments made by CMS to the
Medicare Advantage plans. Therefore, Medicare Advantage plans are required to
cover costs from device trials that have been approved by CMS for Medicare billing.
Drug studies, on the other hand, get complicated. After the CTP went into effect in
2000, “CMS determined that the cost of covering these new benefits was not included”
in the capitated payments to Advantage plans (CMS 2014c). Therefore, it was decided
that CMS should pay for the covered clinical trial services outside of the capitated
payment rate. This means that, for a Medicare Advantage beneficiary, routine costs in
trials subject to the CTP need to be billed to Medicare Fee-for-Service rather than to the
beneficiary’s Advantage plan. Providers need to have a mechanism to redirect claims
appropriately. Then, the Medicare Advantage plan is required to cover the difference
between its beneficiary’s out-of-pocket costs and those incurred under Medicare Fee-
for-Service (CMS 2013). The capitated payment rates have yet to be adjusted, so this
band-aid solution has remained in place for a couple of decades.
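The routing rule for Medicare Advantage beneficiaries can be summarized in a few lines. This is a sketch of the arrangement described above, not billing guidance; the labels are illustrative.

def payer_for_routine_trial_service(plan, trial_type):
    """Route a routine clinical trial charge for a Medicare
    beneficiary, per the arrangement described above."""
    if plan != "Medicare Advantage":
        return "bill the plan as usual"
    if trial_type == "CMS-approved IDE device trial":
        return "Medicare Advantage plan"   # cost built into capitated payments
    if trial_type == "qualifying trial under the CTP":
        return "Medicare Fee-for-Service"  # billed outside the capitated rate
    return "determine coverage before billing"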
Issues in Non-compliance
Subject Remuneration
Research participants are often offered remuneration for taking part in clinical trials,
and institutional review boards review the reason for the remuneration, the amount,
and the schedule of payment.
This chapter will not cover the larger discussion on the appropriateness of subject
remuneration but will only address the billing compliance issues that are raised in
some circumstances.
Subject remuneration in clinical trials, like any form of gift or payment, is
viewed by the OIG as an inducement to Medicare beneficiaries that could influence
their selection of a particular provider of healthcare services. The OIG, in its
Special Advisory Bulletin dated August 2002, stated that a person who offers
remuneration to Medicare beneficiaries “could be liable for civil money penalties
(CMPs) for up to $10,000 for each wrongful act” (HHS OIG 2002). However, the
Advisory Bulletin allows that providers may offer gifts and remuneration that fit
within five statutory exceptions; subject remuneration in clinical trials is poten-
tially applicable to only one of these, namely the practices allowed in the "safe
harbor" provisions of the federal anti-kickback statute. Payments related to clinical
trials (i.e., payments of industry sponsors to providers, payments of providers to
research subjects) are typically judged against the requirements of the “personal
services and management contracts” safe harbor of the anti-kickback statute (42
CFR § 1001.952(d)). This safe harbor category requires that the payments occur
under a written and signed agreement that details the services to be provided and
the schedule and term of the agreement (which cannot be for less than a year). It
also requires that the compensation for the clinical trial activities is set in advance,
is consistent with fair market value, and doesn’t exceed what is reasonably
necessary for the performance of the activities. Finally, the activities performed
under the agreement cannot involve business promotion, and the compensation for
the activities cannot be determined by the volume of referrals for services paid by
federal healthcare programs. If a clinical trial agreement is so constituted and
executed, providers should not incur liability for penalty for subject remuneration
set out in such an agreement. That said, it is prudent, when determining subject
remuneration, to have standard practices in place for the amount of compensation
for specific things such as extra time, travel, parking, or other expenses incurred by
subjects by virtue of being in the study. Stated in the converse, it is prudent to
avoid paying subjects when their research participation does not require much in
the way of time and expenses over and above what they would be investing for
standard care. For example, when a study is merely abstracting data on the results
of routine services, subjects are not incurring additional expenses by virtue of
being in the study, and providing them with remuneration could arguably consti-
tute inducement to receive those routine services as a research subject, and from
the research provider.
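Read as a checklist, the safe harbor conditions described above look roughly like the following sketch. It is a paraphrase for illustration only; an actual determination under 42 CFR § 1001.952(d) is a legal judgment, and the field names are invented.

def meets_personal_services_safe_harbor(agreement):
    """Checklist paraphrasing the 'personal services and management
    contracts' safe harbor conditions discussed above."""
    return all([
        agreement["written_and_signed"],
        agreement["services_and_schedule_specified"],
        agreement["term_months"] >= 12,               # term of at least a year
        agreement["compensation_set_in_advance"],
        agreement["fair_market_value"],
        not agreement["involves_business_promotion"],
        not agreement["compensation_tied_to_federal_referral_volume"],
    ])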
Waiving of Co-pays
As providers are likely well aware, CMS considers the waiving of co-pays by
providers to be in violation of the beneficiary inducements statute and the anti-
kickback statute. That is, such waivers are seen as inducements to use services paid
for by federal healthcare programs.

Reimbursement for Subject Injury

Subject injury is a somewhat sticky issue in clinical research. Let us first consider
clinical trials covered under the CTP. The CTP allows for billing of services related
to monitoring for, preventing, and treating adverse effects related to the investiga-
tional therapy. So, for example, in an oncology chemotherapy trial, treatment of
nausea and laboratory tests to monitor blood counts are billable. This seems quite
reasonable when medical monitoring and management of adverse effects are to
be expected with most therapies. It also is quite reasonable when it may be difficult
to separate the management of adverse effects from the routine care of the underlying condition.
Billing Non-compliance
By far the issue that has gotten institutions in the most trouble has been the
inappropriate or double billing of Medicare and Medicaid for services provided in
the context of clinical research. There are a few well-known cases of penalties
imposed on major medical centers for research billing found to be in violation of
the False Claims Act (31 US Code § 3729). An early case is that of the University of
Alabama at Birmingham. The US Department of Justice announced in April of 2005
that it had reached an agreement with the university to pay $3.39 million to settle
allegations related to its research billing practices (DOJ 2005). The investigation of
the institution resulted from two lawsuits brought by whistleblowers – a former
physician and a former compliance officer at the medical school. It was alleged that
the university “unlawfully billed Medicare for clinical research trials that were also
billed to the sponsor of research grants.” In December of that same year, Rush
University Medical Center announced that it had voluntarily disclosed billing errors
to the federal government (RUMC 2005). Under the False Claims Act, the
government can impose fines up to three times the amount of the false claims.
Rush, because of its self-disclosure and cooperation with the investigation, was
only fined 50% of the amount of the false claims. With the penalty and restitution of
the claims combined, the amount paid by Rush reportedly totaled about $1 million,
substantially less than that paid by the University of Alabama at Birmingham.
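To see how such totals arise, consider the arithmetic with purely hypothetical figures; the actual claim amounts underlying the Rush settlement are not given in the sources cited here.

# Purely illustrative figures; not actual case data.
false_claims = 650_000                 # hypothetical amount improperly billed

restitution = false_claims             # repayment of the claims themselves
max_penalty = 3 * false_claims         # statutory ceiling: treble the claims
reduced_penalty = 0.5 * false_claims   # a 50% multiplier, as reported for Rush

print(f"Maximum exposure:  ${restitution + max_penalty:,.0f}")      # $2,600,000
print(f"With reduced fine: ${restitution + reduced_penalty:,.0f}")  # $975,000

With these invented numbers, self-disclosure cuts the total owed from $2.6 million to just under $1 million, the order of magnitude reported for Rush.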
Five years later, the Tenet HealthSystem/USC Norris Cancer Center agreed to pay
$1.9 million (HHS OIG 2010). The health system was already operating under a 5-
year corporate integrity agreement with the OIG as part of the resolution of a wide
range of investigated fraudulent activities. Under the disclosure requirements of the
agreement, the health system revealed that it submitted improper claims for “(1)
items or services that were paid for by clinical research sponsors or grants under
which the clinical research was conducted; (2) items or services intended to be free
of charge in the research informed consent; (3) items or services that were for
research purposes only and not for the clinical management of the patient; and/or
(4) items or services that were otherwise not covered under the Centers for Medicare
& Medicaid Services (CMS) Clinical Trial Policy.” The fine for its research billing
practices paled in comparison to the more than $900 million paid by the health
system in 2006 to settle its other billing liabilities.
In 2013, Emory University admitted to overbilling Medicare and Medicaid in
clinical trials conducted at its Winship Cancer Institute (DOJ 2013). The case was
settled for $1.5 million.
In summary, this chapter has discussed the evolution of the rules regarding financial
compliance in clinical trials. The Institute of Medicine report set the stage for the
clinical trials policy, and the GCRCs referenced in the report were leaders in
establishing best billing compliance practices.
Today, a comprehensive program for financial compliance must include the
following elements.
• Charge review: beyond the review itself, documentation that a charge was
reviewed is necessary for quality control and auditing of the process.
• Auditing: The loop is closed by audit of at least a subset of study subjects to
ensure that all expected research charges were identified and billed as appropriate
to the sponsor or the subject’s insurer. It can lead to further auditing if errors are
detected that could be more widespread or systemic.
Key Facts
• In the United States, what can be billed to insurance and what must be paid by a
sponsor are largely determined by CMS.
• Medicare reimbursement for device trials is governed by an interagency agree-
ment between FDA and CMS, which has been in place since 1995.
• Medicare reimbursement for drug trials is governed by the Clinical Trial Policy,
which went into effect in 2000.
• CMS uses the mechanism of Coverage with Evidence Development to allow
coverage of items and services for which CMS does not find there to be enough
evidence to support their use as “reasonable and necessary” but for which
additional data could help clarify the value of the service.
• Research institutions need to have processes for determining at the start of a trial
if it qualifies for CMS coverage and for conducting coverage analyses to determine
which clinical items and services are to be paid by the research study and
which can be billed to CMS.
• Charges that are billed to CMS as routine care provided in the context of clinical
trials are to be labeled as such through the use of modifiers and codes on the
claim (illustrated in the sketch following this list).
• Subject remuneration, payment of co-pays, and reimbursement for subject injury
are three complicated issues for which special care must be taken to avoid
noncompliance.
• Noncompliance in research billing carries risk of monetary penalty, as set forth in
the False Claims Act.
• There have been four cases of large fines to research institutions for billing
noncompliance.
• A comprehensive program to ensure financial compliance in clinical trials
requires investment in research infrastructure by institutions.
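As a rough illustration of the coverage analysis and claim-labeling steps summarized above, the following Python sketch assigns each protocol item to a payer and attaches a claim modifier to routine-care items billed to Medicare. The item names and the rule set are invented for illustration and deliberately mirror the first three categories of improper claims from the Tenet disclosure quoted earlier; the modifier value follows the HCPCS clinical research modifiers referenced below (effective 2008), though the exact code is incidental to the sketch. A real coverage analysis works line by line through the protocol, the informed consent, the sponsor contract, and CMS rules.

# Simplified coverage analysis. Item names and rules are illustrative
# assumptions, not CMS guidance.
ROUTINE_MODIFIER = "Q1"  # clinical-trial claim modifier for routine services

protocol_items = [
    # (item, paid_by_sponsor, promised_free_in_consent, research_only)
    ("study drug infusion", True, False, False),
    ("CBC to monitor toxicity", False, False, False),
    ("pharmacokinetic sampling", False, False, True),
    ("baseline echocardiogram", False, True, False),
]

def payer_and_modifier(item, sponsor_paid, free_in_consent, research_only):
    # The three disqualifying tests correspond to categories (1)-(3) of the
    # Tenet disclosure; any disqualified item is charged to the study.
    if sponsor_paid or free_in_consent or research_only:
        return ("sponsor/study", None)       # never billed to Medicare
    return ("Medicare", ROUTINE_MODIFIER)    # routine care within the trial

for row in protocol_items:
    payer, modifier = payer_and_modifier(*row)
    label = f" (modifier {modifier})" if modifier else ""
    print(f"{row[0]}: bill to {payer}{label}")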
References
Aaron HJ, Gelband H (eds) (2000) Committee on routine patient care costs in clinical trials for
medicare beneficiaries, Institute of medicine. Extending medicare reimbursement in clinical
trials. National Academy Press, Washington, DC
Centers for Medicare and Medicaid Services. National coverage determination (NCD) for routine
costs in clinical trials (310.1). Publication 100-3, Version 1. Effective date September 19, 2000
Centers for Medicare and Medicaid Services. Guidance for the public, industry, and CMS staff:
national coverage determinations with data collection as a condition of coverage: coverage with
evidence development. Issued July 12, 2006
Centers for Medicare and Medicaid Services. National coverage determination (NCD) for routine
costs in clinical trials (310.1). Publication 100-3, Version 2. Effective date July 9, 2007
Centers for Medicare and Medicaid Services. CMS manual system: medicare claims processing.
New HCPCS Modifiers when Billing for Patient Care in Clinical Research Studies Publication
100-04. Effective date January 1, 2008
Centers for Medicare and Medicaid Services. CMS manual system: medicare managed care.
Chapter 4, Benefits and beneficiary protections. Publication 100-16. Effective date August 23,
2013
Centers for Medicare and Medicaid Services. CMS manual system: medicare benefit policy.
Publication 100-02. November 6, 2014a
Centers for Medicare and Medicaid Services. Guidance for the public, industry, and CMS staff:
coverage with evidence development. Issued November 20, 2014b
Centers for Medicare and Medicaid Services. Medicare managed care manual. Chapter 8, Payments
to medicare advantage organizations. Revision 118. September 19, 2014c
Department of Health and Human Services, Food and Drug Administration, Center for Devices and
Radiological Health. Information sheet guidance for IRBs, clinical investigators, and sponsors:
significant risk and nonsignificant risk medical device studies. January 2006
Department of Health and Human Services, Food and Drug Administration, Center for Devices and
Radiological Health. FDA categorization of investigational device exemption (IDE) devices to
assist the centers for medicare and medicaid services (CMS) with coverage decisions: guidance
for sponsors, clinical investigators, industry, institutional review boards, and Food and Drug
Administration staff. December 5, 2017
Department of Health and Human Services, Food and Drug Administration, Office of Device
Evaluation. Implementation of the FDA/HCFA interagency agreement regarding reimbursement
categorization of investigational devices. IDE guidance memorandum #95-2. September
15, 1995
Department of Health and Human Services, Office of the Inspector General. Special advisory
bulletin. Offering gifts and other inducements to beneficiaries. August 2002
Department of Health and Human Services, Office of the Inspector General. OIG Advisory Opinion
08-11. September 17, 2008
Department of Health and Human Services, Office of the Inspector General. Semiannual report to
Congress, Part III: legal and investigative activities related to medicare and medicaid. Fall 2010
Department of Health and Human Services, Office of the Inspector General. OIG Advisory Opinion
15-07. May 28, 2015
Department of Health and Human Services, Office of the Inspector General. OIG Advisory Opinion
16-13. December 13, 2016
Department of Justice. Press release: University of Alabama-Birmingham will pay U.S. $3.39
Million to resolve false billing allegations. April 14, 2005
Department of Justice, Office of Public Affairs. Press release: justice department recovers over $2.8
billion from false claims act cases in fiscal year 2018. December 21, 2018
Department of Justice, U.S. Attorney’s Office, Northern District of Georgia. Press release: Emory
University to pay $1.5 million to settle false claims act investigation. August 28, 2013
Contents
Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 542
Risks of Financial Conflicts of Interest: Reality and Appearance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
Regulations and Other Important Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
Developing and Implementing COI Policies for Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
Disclosure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
Review for FCOI in Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
An Initial Consideration: Thresholds for Participation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550
Study Design and Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 551
Study Conduct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
Publication/Reporting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 553
Documenting and Communicating FCOI Decisions: The Management Plan . . . . . . . . . . . . . . . . . 554
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
Abstract
Because clinical trials are the gold standard for evaluating the safety and
efficacy of drugs and medical devices, they should be conducted as safely and
objectively as possible. Objectivity can be affected by study design and conduct,
but additional risks to objectivity – and possible risks to safety – may arise from
financial conflicts of interest (FCOIs). The explosion in recent decades of
financial relationships between medical researchers and the pharmaceutical and
medical device industry means it is very likely that at least some members of most
clinical trial study teams will have a financial tie with the sponsor or manufacturer
of the study drug or device. Payments of various types are ubiquitous.

J. D. Gottlieb (*)
Johns Hopkins University School of Medicine, Baltimore, MD, USA
e-mail: [email protected]
Keywords
COI committee · Data safety monitoring board · Disclosures · Financial conflicts
of interest · Institutional conflicts of interest · Management plan · Technology
licensing
Definition
Introduction
The primary goals of those who conduct clinical trials must be to carry out safe and
objective research that has the potential to advance the science of human health. If
the safety of research participants (and future patients) and the objectivity of research
are of paramount importance, the potential risks of conflicts of interest must be
addressed.
Conflicts of interest (COIs) are ubiquitous in personal and professional life. While
there have been calls to consider the risks of intellectual COIs (Ioannidis and
Trepanowski 2018), such as strongly held personal beliefs, FCOIs in medical
research have received the most attention in large part because of the widespread
practice of industry payments to researchers and the incentives associated with
inventing, patenting, and licensing new technologies and starting new companies
in the biomedical field. And financial interests, unlike potentially competing intel-
lectual interests, can be measured.
Funders and consumers of academic research and the academic research com-
munity itself have intensified their focus on FCOIs because of the potential they have
to affect research, education, and clinical care. In the wake of various exposés and
investigative reporting in the media (Stolberg 2019), scrutiny of the impact of the
financial interests of physicians and biomedical researchers increased throughout the
early 2000s. National associations have issued recommendations and guidelines for
addressing the risks that FCOIs pose in education, clinical care, and to some extent
basic and animal research. In academia, most of the attention, including regulation
and association standards, has centered on FCOIs in clinical research because the
welfare of human research participants is at stake and because research results
directly impact medical care and treatment.
This chapter will address FCOIs in clinical trials, including the financial interests
of investigators and the institutions where research is often conducted. Not all
financial interests create the potential for conflict of interest or the appearance of
conflict of interest. If a physician who conducts clinical trials in interventional
cardiology owns stock in an energy company, for example, the financial interest is
unlikely to create the potential for conscious or unconscious bias related to the value
of a particular intervention. However, if the researcher owns stock in a company that
sells cardiac stents, that is likely to create an FCOI with her research. So one must
define the types of financial interests that have the potential to affect the objectivity
and safety of clinical research. Under Public Health Service (PHS) regulation on
FCOI, when a grantee institution is conducting research, an investigator must
disclose to the institution any financial interest “that reasonably appears to be related
to the Investigator’s institutional responsibilities.” Institutions must review the
disclosed interests to identify any “[F]inancial conflict of interest (FCOI),” which
is defined as “a significant financial interest that could directly and significantly
affect the design, conduct, or reporting of PHS-funded research” (eCFR – Code of
Federal Regulations 2019). FCOIs must be addressed with specific management
steps. The key roles and responsibilities of the parties involved in clinical research –
investigators, institutions, the committees that evaluate potential FCOIs, sponsors,
and journals – are set forth in Table 1.
Those conducting or administering clinical trials at hospitals, medical schools, or
research organizations that themselves may have financial interests in biomedical
research also must deal with institutional COI (Cigarroa et al. 2018). Although there
are no US regulations governing institutional COIs in research, many research
organizations include institutional COI in their FCOI policies. When the institution
where the research is being conducted has a financial interest in the outcome of
research, or its senior leaders have ties to companies with financial interests in a
study, there may be an actual or apparent institutional COI. For example, an
institution that is conducting research on a novel bone harvesting device that it
licensed to a start-up company has an institutional COI with the trial testing the
device. Of course, it is an institution’s agents (deans, department chairs, etc.) who act
on behalf of the institution. The risks arise from a concern that in a conscious or
unconscious effort to maximize the value of the product or manufacturer in which
the institution has a stake, institutional agents may make decisions that conflict with
the safety and objectivity of the research project. Another source of concern arises
from the personal financial interests of institutional officials when those interests are
related to the research they oversee. For instance, even if a research dean or hospital
president is not directly involved in a particular study but has stock in the manufacturer of the study drug or device, she may – consciously or unconsciously – make decisions affecting the safety and objectivity of the study. Even decisions that are not intended to impact a study may be viewed as biased if the decision maker has a related financial interest and that interest is not disclosed or steps are not taken to protect the study.

Table 1 Roles and responsibilities of investigators, institutions, IRB or COI committees, sponsors, and journals in the disclosure, review, and management of financial conflicts of interest

Investigator: Disclose personal financial interests to the institution and/or IRB as required; comply with the FCOI management plan as required; follow the disclosure requirements of the institution (e.g., to patients, study team members, sponsor) and of journals.

Institution: Establish, publicize, implement, and enforce an FCOI policy; disclose institutional COIs for review; monitor compliance with management plans; report FCOIs to regulatory bodies as required; make FCOI information related to PHS-sponsored studies publicly available per PHS regulations.

IRB and/or COI committee: Review disclosures associated with specific clinical trials; develop and communicate the management plan; identify any failures to comply with the management plan.

Sponsor: For FDA-regulated trials, collect disclosures from investigators, manage FCOIs, and report FCOIs to FDA; if a covered entity, report payments to physicians to the Centers for Medicare and Medicaid Services under the Physician Payments Sunshine Act.

Journals: Set policies regarding permissible COIs; require disclosure to the journal and in publications.
Regulations and Other Important Standards
Regulations on conflict of interest in research are varied and inconsistent with one
another. There are different standards and recommendations issued by accrediting
bodies, journals, professional societies, and national associations (Gottlieb 2015).
Institutional officials should be familiar with the array of relevant regulations and
standards when developing conflict of interest policies. A brief overview of these
standards follows, and additional detail appears elsewhere in this chapter.
Clinical research is subject to the Public Health Service (PHS) regulations on
objectivity in research if PHS support is involved. The Food and Drug
Administration (FDA) regulation applies (CFR – Code of Federal Regulations
Title 21 2019) if the trial data are to be used in marketing applications for FDA
approval of a drug, device, or biologic product. Separate standards are maintained by
the Association for the Accreditation of Human Research Protection Programs
(AAHRPP), which accredits Institutional Review Boards (IRBs), national
associations such as the Association of American Medical Colleges (AAMC) and
the Association of American Universities (AAU), journals, including those that
adhere to the standards set by the International Committee of Medical Journal
Editors (ICMJE), and professional societies such as the American Society of Clinical
Oncology (ASCO).
Public Health Service. The 1995 PHS regulation on FCOI (titled “Promoting
Objectivity in Research”) was substantially revised in 2011, and the revised
version went into effect in 2012. The regulation covers research supported by PHS
agencies (including, among others, the National Institutes of Health, Centers for
Disease Control and Prevention, FDA, and the Centers for Medicare and Medicaid
Services). It outlines the types of financial interests that investigators must report
to a recipient institution that applies for or receives federal research support; how
the disclosed interests must be reviewed for potential FCOI with federally funded
research projects; and the range of possible approaches to managing FCOIs. While
the regulation does not distinguish among different types of research and its focus
is on protecting research objectivity rather than research participant safety, it
acknowledges that FCOIs in research involving human participants carry the
greatest potential risk. Some institutions responded to the 2012 revisions by
applying the federal standards to all research regardless of funding source in
order to have a single, consistent standard for COI review. Others opted to apply
the regulation only to research with federal support, potentially limiting their
administrative burden but creating dual standards for federally funded research
and research with other sources of support.

The 2012 revision lowered the financial “floor” for annual income that must be
reported from $10,000 to $5,000 and expanded the requirements for reporting of
equity ownership. Other reportable interests include royalties from intellectual
property, honoraria, consulting fees, and equity in publicly traded and privately
held companies. Exceptions to
reporting requirements include income from US institutions of higher education
and service on certain US federal, state, and local government advisory panels.
However, income from non-excluded nonprofit organizations such as foundations
and foreign institutions of higher education must be disclosed. There also is a
requirement that institutions solicit disclosures of payment or reimbursement for
travel. Specified details about FCOIs that institutions have identified and managed
must be reported to the awarding agency and must be publicly disclosed on a
regularly updated website or upon request.
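To make the reporting floor concrete, the Python sketch below triages hypothetical disclosed interests against the $5,000 threshold and the exclusions described above. The category names and the rule set are simplified assumptions (the private-equity rule reflects how the 2012 revision is commonly implemented); an institution's actual significance determination involves more than an income test.

# Simplified triage of disclosed interests against the post-2012 reporting
# floor. Categories and rules are illustrative assumptions only.
THRESHOLD = 5_000  # annual income floor (lowered from $10,000 in 2012)

EXCLUDED_SOURCES = {
    "us_higher_education",   # income from US institutions of higher education
    "us_government_panel",   # certain federal/state/local advisory panels
}

def reportable(interest):
    """interest: dict with 'source_type', 'income', 'equity_private'."""
    if interest["source_type"] in EXCLUDED_SOURCES:
        return False
    if interest["equity_private"]:
        return True  # equity in a privately held company, assumed reportable
    return interest["income"] > THRESHOLD

disclosures = [
    {"source_type": "industry", "income": 7_500, "equity_private": False},
    {"source_type": "us_higher_education", "income": 12_000, "equity_private": False},
    {"source_type": "startup", "income": 0, "equity_private": True},
]
for d in disclosures:
    print(d["source_type"], "->", "report" if reportable(d) else "no report")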
The PHS FCOI regulation does not require that research support from industry to
the recipient institution be disclosed and reviewed for potential FCOI. FDA COI
disclosure requirements do not include industry support for the “covered” study
(although they do include funds the sponsor may provide the institution that are not
directly supporting the covered study). However, there are reasons for institutional
policies to consider the role of industry research support, whether financial or in-
kind, in the course of their FCOI reviews. Some federally supported research pro-
jects also involve support from industry. Many institutions apply their FCOI polices
to research that is not federally funded but may be supported by industry (or
foundations that are closely tied to a biomedical company with an interest in the
research). Journals and professional societies typically require disclosure of research
support from industry. Institutions whose FCOI policies do not include research
supported by industry take the position that while grants or sponsored research
funds come from companies with interests in the research, they are paid to the
institution rather than to the investigator personally.

Institutions seeking the AAHRPP credential for their human research protection
programs need to address COI and institutional COI policies as they prepare for the
accreditation process.
The Association of American Medical Colleges (AAMC) issued guidance docu-
ments for dealing with COIs in human subject research in the early 2000s. The most
notable recommendation is that individual FCOIs in human subject research that
exceed certain thresholds should be subject to a “rebuttable presumption”:
investigators with those interests should not be permitted to participate in the
relevant human research project unless compelling circumstances justify an
exception. The AAMC’s recommendations challenge institu-
tions to set robust FCOI standards for their human subject research programs.
The International Committee of Medical Journal Editors (ICMJE) has issued a
series of recommendations, including recommendations for addressing conflicts of
interest (ICMJE | Recommendations | Author Responsibilities – Conflicts of Interest
2019) involving authors of journal articles as well as reviewers and editors. The
ICMJE developed a detailed COI disclosure form that journals can require authors
and those involved in the review of manuscripts to complete so there is transparency
about financial interests. The organization recommends that journals publish the
disclosure forms (or key FCOI information) with articles as well as a statement about
the authors’ access to study data. A large number of journals, including many
leading biomedical journals, claim to have adopted the ICMJE recommendations.
Organizations that conduct clinical trials should adopt, publish, and implement a
credible FCOI policy. The policy should:
• (i) Outline the financial interests that must be disclosed to the organization as well
as when and how they should be disclosed.
• (ii) Describe the process and standards for review of those interests, whether by
the IRB or an ancillary committee charged with addressing FCOIs (COI
Committee).
• (iii) Require the institution to issue a written management plan designed to
manage, reduce, or eliminate risks associated with the FCOI.
• (iv) State that the institution will monitor compliance with the FCOI management
plan and address failures to comply.
Disclosure
COI policies should specify who must make disclosures of potential FCOIs, what
interests and other information must be disclosed, and the time frame for disclosure.
Policies should detail whether all or a subset of the following must be disclosed:
personal income in the form of fees, honoraria, or other payments; patents, patents
pending, and trademarks; royalty income or entitlement to royalty under a license
agreement; equity interests in publicly traded companies; and equity interests in
nonpublicly traded (privately held) companies.
So institutions that take institutional COIs into consideration may need manual or
custom methods of matching institutional financial information with trial data.
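Because institutional holdings typically reside in technology transfer or treasury records separate from the clinical trial management system, the matching often reduces to a join on company names, which is fragile. The sketch below, using invented data structures and names, flags trials whose sponsor or product manufacturer appears among the institution's financial interests:

# Invented data for illustration: institutional holdings vs. active trials.
institutional_interests = {
    # company -> type of institutional interest
    "OsteoHarvest Inc.": "equity from technology license",
    "CardioStent LLC": "royalty entitlement",
}

active_trials = [
    {"id": "TR-001", "sponsor": "OsteoHarvest Inc.", "manufacturer": "OsteoHarvest Inc."},
    {"id": "TR-002", "sponsor": "NIH", "manufacturer": "CardioStent LLC"},
    {"id": "TR-003", "sponsor": "MegaPharma", "manufacturer": "MegaPharma"},
]

def normalize(name: str) -> str:
    # Naive normalization; real matching needs curated aliases
    # (subsidiaries, renames, tickers), hence the manual review noted above.
    return name.lower().replace(",", "").replace(".", "").strip()

interests = {normalize(k): v for k, v in institutional_interests.items()}

for trial in active_trials:
    for role in ("sponsor", "manufacturer"):
        hit = interests.get(normalize(trial[role]))
        if hit:
            print(f"{trial['id']}: possible institutional COI via {role} "
                  f"({trial[role]}: {hit}) - route for review")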
Review for FCOI in Clinical Trials

Substantive review is at the heart of the FCOI process. A robust review should
identify the risks that the investigators’ and institution’s financial interests may
generate in the context of a specific clinical trial and within the framework of
applicable policy and regulations. There should be a well-defined review process,
and the reviewers, whether the IRB members or the members of a COI committee,
should have relevant expertise (e.g., experienced clinical trialists, biostatisticians)
and independence. If the reviewing body is not the IRB, close coordination with the
IRB is essential.
Reviewers should be free of bias. They should disclose any competing personal
interests and should recuse themselves from a particular case if they have an interest
that may bias or appear to bias their review.
An Initial Consideration: Thresholds for Participation

Some financial interests are widely treated as presumptive bars to participation. If
an investigator holds equity in the company that makes an investigational product,
any trial of the product is likely to directly and significantly impact the value of the
equity. Likewise, royalty income and entitlement to future royalty income through
inventorship of a study drug or device can create an incentive to demonstrate the
safety or efficacy of the product since that can affect regulatory approval, sales, and
ultimately personal income. Some institutions allow limited participation for inven-
tors of investigational drugs or devices in an attempt to balance the organization’s
drive for innovation and translational research with the risks of FCOIs. Finally,
service as a board member or officer of a company with an interest in the investi-
gational drug or device is often treated as a bar to participation in trials of the
company’s products because fiduciary roles require the individual to act in the best
interests of the company, a goal that may directly conflict with an investigator’s
obligation to carry out safe and objective research.
Some institutions set thresholds for the institution’s own involvement in a trial.
For example, if an institution, through a technology license to a start-up company,
holds equity in the manufacturer of an investigational drug and is entitled to royalty
on eventual sales of the drug, it may be prudent for trials of safety and efficacy to be
conducted at another institution – one that does not have conflicts of interest. Some
institutions require that the protocol and/or the institutional COI be reviewed by
another institution’s IRB or COI Committee, provided that institution does not itself
have a conflict of interest with the study. Another approach is to permit the conflicted
institution to participate, but not lead, the study. That may involve allowing only a
small percentage of patients to be enrolled at the institution and ensuring that the
institution does not serve as coordinating center or have any other leadership role
in the study.
Institutions should clearly outline which FCOIs disqualify an investigator from
participating in a trial. To the extent a conflicted individual may be permitted to
participate, the institution must undertake a careful, detailed analysis of the study and
especially the features that are vulnerable to FCOI risks. Elements of the review are
described below. The review should result in a plan to ensure that risks to safety and
objectivity are minimized or mitigated and that there is transparency to all key parties
about any conflicts of interest.
Study Design and Planning

Clinical trials should be designed to answer scientific questions and not to support a
predetermined outcome. Certain study designs can help mitigate the risk that a
conflicted investigator will inject bias into the study. One option is to ensure that
the investigators are blinded to treatment and control arms. Investigators with FCOIs
generate greater risk for unblinded studies, especially Phase I or Phase II studies. For
example, in an oncology study comparing standard of care to standard of care
combined with an interventional therapy, an investigator with a financial interest in
the study drug may be tempted – consciously or unconsciously – to adjust dosages of
the standard therapy, as permitted in the protocol, to boost the apparent efficacy of
the intervention.
Study designs with objective endpoints that can be recorded, reviewed, and tested
by those without FCOIs are likely to be safer from bias than those with subjective
endpoints.
Transparency is a powerful tool for protection against bias on the part of a
conflicted investigator. Disclosing to all study team members that an investi-
gator has an FCOI builds in a measure of oversight. The study team should
also be informed of the measures put in place to address the FCOI, especially
since they may be charged with implementing parts of the management plan.
For example, a research coordinator who is a consent designee should know if
the PI has a conflict of interest so he can clearly inform prospective subjects of
the COI (Friedman et al. 2007). Study team members should be advised about
how to raise any concerns related to the FCOI and should be protected if they
do so.
Expanding a study to more than one center and vesting greater authority in
another center, e.g., as coordinating center, especially if the PI and/or her institution
have financial interests in the study, can mitigate the risks that any bias in the conduct
of the study at the conflicted center will unduly influence the outcome.
Data Safety Monitoring Boards. Establishing independent data safety
monitoring boards (DSMBs), especially where the DSMB is informed of the FCOI
and formally charged with addressing any FCOI-related risks, can offer powerful
protection for trials with conflicts of interest. This is especially true if there is an
institutional COI, as long as the DSMB members are independent of the institution.
To ensure a DSMB is truly independent, its members should have no conflicting
financial interests. Ideally, the DSMB members should not be appointed by the
industry sponsor, and if an industry sponsor wants to have a nonvoting representa-
tive on a DSMB, that individual may provide information but should not participate
in or be present during discussion or voting.
Study Conduct
Recruitment and Consent. FCOIs can inject risk into the recruitment and
consenting process. Investigators with FCOIs may be tempted to stretch or
expand enrollment criteria to favor a particular outcome. While such behavior
may represent noncompliance with a protocol, the risk can be lowered by
putting protective measures in place such as prohibiting those with FCOIs
from recruiting subjects to trials. Likewise, informed consent should be
obtained by individuals other than a conflicted investigator and ideally by
individuals who (a) know about the FCOIs and (b) are not supervised by the
investigator with an FCOI. Informed consent documents should clearly state if
there is an FCOI or an institutional COI and provide contact information for
prospective subjects who may have questions.
Intervention. The greatest risk to clinical trial participants may be the inves-
tigational intervention. Because administering a drug is fairly straightforward,
there is usually no reason a conflicted investigator needs to participate. If the
intervention requires special expertise – for example, a surgical or device procedure
that the conflicted investigator is uniquely qualified to perform – limited participation
may be justified with additional safeguards in place.
Publication/Reporting
Documenting and Communicating FCOI Decisions: The Management Plan

The reviewing body should outline a plan for dealing with the FCOIs or institutional
COIs associated with the trial. A written management plan should outline clearly the
activities in which the conflicted individual (or institution) may participate and under
what conditions; which activities the conflicted investigator may not participate in
and who will handle them instead; and the disclosures that should be made in various
settings. The management plan should be provided to the investigator and others
who have a need to know, and there should be infrastructure in place to monitor and
help ensure compliance with it. Ideally, there should be a process for the conflicted
individual or, in the case of an institutional COI, the responsible individual, to
document their agreement to comply with the management plan.
Policies should be flexible enough that in certain circumstances, a reviewing body
may determine that an FCOI generates such significant risks for a trial that the
individual (or the institution) may have no role in any part of the study. If this
determination is made, it should be communicated through a management plan.
Management plans should address, at a minimum, the following items:
Summary and Conclusion

FCOIs are ubiquitous and have the potential to create bias and affect the safety of
clinical research. Financial relationships between the biomedical industry and those
who conduct essential research are likely to be a feature of clinical research well into
the future. Regulations and national standards to minimize or mitigate the risks of
these relationships will evolve, but investigators and research organizations need to
make disclosure, robust review, and careful management of FCOIs a fundamental
part of their culture. Adopting policies and procedures that are easy to understand
and follow and enforcing policies and procedures consistently will foster a culture of
compliance. Institutional leaders should communicate that addressing conflicts of
interest is a top priority and is part of a commitment to integrity in research, and they
should provide sufficient resources to support robust administration of the FCOI
policy. Researchers and institutions that demonstrate a commitment to transparency
and to mitigating undue influence of FCOIs in clinical trials will be viewed by the
scientific community, the public, and patients as credible and objective.
Key Facts
Cross-References
References
Ahn R, Woodbridge A, Abraham A et al (2017) Financial ties of principal investigators and
randomized controlled trial outcomes: cross sectional study. BMJ 356:i6770. https://doi.org/
10.1136/bmj.i6770
Bauchner H, Fontanarosa P, Flanagin A (2018) Conflicts of interests, authors, and journals. JAMA
320:2315. https://doi.org/10.1001/jama.2018.17593
Cain D (2008) Everyone’s a little bit biased (even physicians). JAMA 299:2893. https://doi.org/10.
1001/jama.299.24.2893
CFR – Code of Federal Regulations Title 21 (2019) In: Accessdata.fda.gov. Accessed 28 Jan 2019
Cigarroa F, Masters B, Sharphorn D (2018) Institutional conflicts of interest and public trust. JAMA
320:2305. https://doi.org/10.1001/jama.2018.18482
Dana J, Loewenstein G (2003) A social science perspective on gifts to physicians from industry.
JAMA 290:252. https://doi.org/10.1001/jama.290.2.252
eCFR – Code of Federal Regulations (2019) In: Ecfr.gov. https://www.ecfr.gov/cgi-bin/text-idx?c=ecfr&SID=992817854207767214895b1fa023755d&rgn=div5&view=text&node=42:1.0.1.4.23&idno=42#sp42.1.50.f. Accessed 28 Jan 2019
Friedman J, Sugarman J, Dhillon J et al (2007) Perspectives of clinical research coordinators on
disclosing financial conflicts of interest to potential research participants. Clin Trials 4:272–278.
https://doi.org/10.1177/1740774507079239
Gottlieb JD (2015) Financial conflicts of interest in research. In: Suckow M, Yates B (eds) Research
regulatory compliance. Elsevier Inc., London, pp 253–276
Gottlieb JD, Bressler NM (2017) How should journals handle the conflict of interest of their editors?
JAMA 317:1757. https://doi.org/10.1001/jama.2017.2207
Hhs.gov (2016) Financial conflict of interest: HHS guidance (2004). In: HHS.gov. https://www.hhs.
gov/ohrp/regulations-and-policy/guidance/financial-conflict-of-interest/index.html#. Accessed
4 Feb 2019
ICMJE | Recommendations | Author Responsibilities—Conflicts of Interest (2019) In: Icmje.org.
http://www.icmje.org/recommendations/browse/roles-and-responsibilities/author-responsibili
ties%2D%2Dconflicts-of-interest.html. Accessed 28 Jan 2019
Ioannidis J, Trepanowski J (2018) Disclosures in nutrition research. JAMA 319:547. https://doi.org/
10.1001/jama.2017.18571. Available at: https://jamanetwork.com/journals/jama/article-
abstract/2666008
Lo B, Field MJ (eds) (2009) Principles for identifying and assessing conflicts of interests. In:
Conflict of interest in medical research, education, and practice, 1st edn. National Academies
Press, Washington, DC, pp 44–61. Available at: https://www.nap.edu/read/12598/chapter/4.
Accessed 25 Jan 2019
Lundh A, Bero L (2017) The ties that bind. BMJ 356:j176. https://doi.org/10.1136/bmj.j176
Stolberg S (2019) Youth’s death shakes new field of gene experiments on humans. In: Archive.
nytimes.com. https://archive.nytimes.com/www.nytimes.com/library/national/science/
012700sci-gene-therapy.html. Accessed 28 Jan 2019
Tringale K, Marshall D, Mackey T et al (2017) Types and distribution of payments from industry to
physicians in 2015. JAMA 317:1774–1784. https://doi.org/10.1001/jama.2017.3091
Wilson RF (2009) Estate of Gelsinger v. Trustees of University of Pennsylvania: Money, Prestige,
and Conflicts of Interest In Human Subjects Research. In: Johnson SH, Krause JH, Saver RS,
Wilson RF (eds) Health Law and Bioethics: Cases In Context
World Medical Association (2013) World Medical Association declaration of Helsinki. JAMA
310:2191–2194. https://doi.org/10.1001/jama.2013.281053
Trial Organization and Governance
29
O. Dale Williams and Katrina Epnere
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 560
Key Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
Funding Source and Its Relationship to Trial Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
Individual Organizational Units, Roles, and Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
Committees, Committee Roles, and Committee Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564
Common Threats and Failures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566
Conclusion and Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568
Abstract
The success of many human endeavors depends on the organizational and
management strategy used to carry them out. This is an impor-
tant issue for any clinical trial as well. It is always a challenge to match the needs
required for a successful trial with the resources available in a management
strategy compatible with the experience and personalities of the collection
of investigators and staff involved. Clearly the simplest situation is the single-
site trial with a single investigator and few or no staff. In this situation, the
investigator has only himself or herself to organize and manage. While this is
no guarantee of success, it creates much less of a management burden than does
a multicenter, long-term trial, especially since such endeavors typically include
numerous investigators, central laboratories, reading centers, coordinating center,
O. D. Williams (*)
Department of Biostatistics, University of North Carolina, Chapel Hill, NC, USA
Department of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
e-mail: [email protected]
K. Epnere
WCG Statistics Collaborative, Washington, DC, USA
and a large number of committees, each with its own purpose, requirements, and
personality. The organization and management (OM) issues for this situation are
critically important for the overall success of the trial. This chapter highlights
issues for such long-term, multicenter studies as these situations encompass all
the key, major issues.
Keywords
Organization and management · Organizational structure · Multicenter studies ·
Steering committee · Executive committee · Coordinating center
Introduction
The number of newly registered trials nearly doubled, from 9,321 in 2006 to 18,400 in 2014.
The number of industry-funded trials increased by 43%. Concurrently, the number of
NIH-funded trials decreased by 24% (Ehrhardt et al. 2015). In a recent communica-
tion, Meinert indicated ClinicalTrials.gov included almost 95,000 trials started
between 2014 and 2018 (personal communication Meinert 2019). This is a surpris-
ing number in many ways and raises the interesting question as to how many are well
organized and managed and how many will not meet their stated goals as a
consequence of inadequate OM.
It is often said that the inability to recruit adequate numbers of trial participants is
the most common cause of the failure of a clinical trial. The root cause of failure in
this case, however, is most likely an OM strategy that was not up to the task.
This situation was recognized early on in the history of multicenter trials in
the USA and was addressed in the Greenberg Report (1967) prepared in 1967 and
formally published in Controlled Clinical Trials in 1988. This report includes an
organization chart that has stood the test of time although the situation has evolved
in directions and magnitudes that were perhaps unimaginable in 1967. The key
components of this chart are listed in the discussion of committees below.
It is important to point out that general OM issues often receive inadequate
attention from the earliest phases of trial planning because these issues, critical as
they are, are much less interesting than the scientific and health-care issues under
consideration. The consequence of this lack of appropriate attention can
be catastrophic failure. Farrell and colleagues have repeatedly pointed out that even
though eminent trialists have written persuasively of the need for large, randomized,
controlled trials, the scientific literature has paid little attention to the day-to-day
and strategic management of such trials. They emphasize that the knowledge and
expertise gained in running earlier trials are not widely disseminated, so new trials
often have to begin from scratch. Because a randomized trial involves a huge
investment of time, money, and people, Farrell suggests it should be managed like
any other business (Farrell 1998; Farrell et al. 2010).
Multicenter clinical trials often operate under two separate but related organiza-
tion charts, one representing the funding structure and its accountability and
financial reporting expectations and one representing the committee structure for the
overall trial. The funding structure requirements necessarily address issues related to
inadequate performance of a trial’s individual funded entities. The committee struc-
ture performance expectations, while also critical, tend to be less concretely formu-
lated. The remainder of this chapter focuses on the latter.
The overall committee structure typically reflects a balance among the appro-
priate representation of stakeholders, expertise requirements, and operational
efficiency. The first of these may require large committees if there are large
numbers of clinical field sites and central units, which may be further augmented
should the expertise required not be available from these units. Such large numbers
of persons on committees may make it difficult or impossible to proceed with the
required efficiency. One strategy used is to create a steering committee, consisting
of representatives of all the stakeholders which has overall responsibility for the
trial. The role of the steering committee is to provide oversight of the trial on
behalf of the sponsor and funder and ensure that the trial is conducted in accor-
dance with the principles of GCP and relevant regulations. The steering committee
should focus on the progress of the trial, adherence to the protocol, and participant
safety (McDonald et al. 2014). A subcommittee, sometimes called an executive
committee, which is much smaller may be more directly responsible for day-to-
day issues.
Since the OM strategy for a multicenter, long-term clinical trial typically has
a committee structure at its core, it might be worthwhile to reflect on the old adage
that a camel is a horse designed by a committee. The fact that a key committee exists
does not, unfortunately, necessarily mean it will function commendably. Success
requires the productive cooperation of all key stakeholders operating in a system that
recognizes and takes into account their individual needs as well as those of the
overall trial. The WRIST study group wrote a Guide on Organizing a Multicenter
Clinical Trial and stated that planning of multicenter clinical trials (MCCTs) is a long
and arduous task that requires substantial preparation time. They emphasized that an
essential asset in planning an MCCT is the fluidity with which all collaborators work
together toward a common vision. This means developing a consensus-assisted
study protocol and recruiting centers and co-investigators who are dedicated,
collaborative, and selfless in a team effort to achieve goals that cannot be reached
by a single-center effort (Chung et al. 2010).
Key Factors
A list of factors that may be helpful to consider when developing an OM plan for
a trial includes the following:
A goal for the overall OM scheme can perhaps best be characterized by the simple
statement “Who reports to whom about what and when.” Which bodies need a
report? What types of reports are required? How often are reports required and in
what format? What data are required to be included in the report, for example,
recruitment data, safety data, and blinded or unblinded data? Who will produce the
reports (McDonald et al. 2014)? A scheme that identifies and clarifies roles, respon-
sibilities, and accountability for the entities involved is vitally important.
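One way to make "who reports to whom about what and when" concrete is to record each reporting obligation as structured data that can be reviewed, audited, and updated over the life of the trial. A minimal Python sketch, with invented committee names and schedules:

from dataclasses import dataclass

@dataclass
class ReportingObligation:
    producer: str         # who produces the report
    recipient: str        # which body receives it
    content: str          # what data it contains
    frequency_months: int # how often it is due
    unblinded: bool       # whether treatment assignments are revealed

# Invented example entries for a hypothetical multicenter trial.
obligations = [
    ReportingObligation("coordinating center", "steering committee",
                        "recruitment and protocol adherence", 3, False),
    ReportingObligation("coordinating center", "DSMB",
                        "safety and interim efficacy", 6, True),
    ReportingObligation("clinical centers", "coordinating center",
                        "enrollment and data queries", 1, False),
]

for o in obligations:
    blinding = "unblinded" if o.unblinded else "blinded"
    print(f"{o.producer} -> {o.recipient}: {o.content} "
          f"every {o.frequency_months} mo ({blinding})")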
Funding Source and Its Relationship to Trial Operations

Funding sources for clinical trials include for-profit entities such as pharmaceutical
firms; numerous US government agencies and those of other countries and the
European Union; various not-for-profit entities including foundations, societies,
and others; and international organizations such as the World Health Organization.
Each such source has its own expectations as to how trials it funds will be organized
and managed in the general sense and what its specific functional role will be for
a given trial. It is critically important that these expectations are clearly understood
by the investigative team at the very outset of a trial. The expectations and context
for pharmaceutical industry-sponsored trials are importantly different from those
sponsored by NIH, for example, and the OM scheme to be utilized needs to be fully
cognizant of these differences.
It should also be noted that some of the operational units may be funded through
subcontracts with other trial organizational units which are funded directly from the
funding agency. For example, some laboratories and reading centers may operate
under subcontracts to a coordinating center. In this case, the organizational unit
offering the subcontract has to have the resources and expertise to select and manage
the relationship with the entity under subcontract. Olmstead summarized that several
articles and surveys have addressed concerns of pharmaceutical company research
staff with the performance of their outside contract researchers. He classified these
issues into four categories – credibility, responsiveness, quality of product, and cost.
He concluded that strong emphasis on quality control and improved, automated data
management are key elements of improvement and added that improved organiza-
tion and management efforts on the part of contract researchers themselves will go
far to reduce the most obvious difficulties (Olmstead 2004).
Individual Organizational Units, Roles, and Structure

1. Clinical centers. Clinical centers are the core operational unit for a trial.
The number of such units is usually determined by how many are needed to recruit
the required number of trial participants. Clinical centers recruit and interact with
trial participants as required for the duration of the trial. They also are responsible for all
local research-related approvals and for collecting and transmitting data, typically
to a coordinating center. They also deal with biological samples, sent either to
a local laboratory or to a central laboratory, and for the collection and transmis-
sion of any images, again, to local readers or to a central reading center. They
participate in the trial committee structure as appropriate.
2. Coordinating center. The trial coordinating center is the heart of the trial, whether
it is a single-site or multicenter trial. Sometimes the broader function served by
this unit is divided into a clinical coordinating center and a data coordinating
center. In general, this combined entity is responsible for data-related and study
coordinating issues. The data coordinating component typically is responsible for
key elements of trial design and for data collection systems, data management,
and data analyses. This includes as well the design and testing of data collection
forms and data collection quality assessment. The data coordinating center
component also would prepare reports for trial overview committees such as
Data and Safety Monitoring Boards. The clinical coordinating center component
often is responsible for managing and reviewing the adverse and serious adverse
events. Responsibility for providing staff support for at least some committees is
usual. The data coordinating component team usually consists of chief investi-
gator(s), trial manager, programmer/IT support, database manager and/or data
clerks, and trial statistician (McDonald et al. 2014).
3. Central laboratory. Some trials require, in addition to the use of local laboratories,
more than one central laboratory. In general, central laboratories are responsible
for creating and maintaining shipping procedures for the transmission of samples
from the clinical centers to the lab. They are responsible for high-quality labora-
tory analyses for the parameters under their purview and for transmission of the
resulting data to the coordinating center. If abnormal results are considered
adverse or serious adverse events, they would be required to transmit appropriate
notifications. Importantly, they should participate in the appropriate standardiza-
tion programs and any quality control activities specific to the trial. They also may
serve as an archive for biological materials collected by the trial. Personnel from
the central lab also may participate in trial committees.
4. Central reading center. Some trials require the use of central reading centers for
images critical to the assessment of patient safety or trial outcomes. These centers
typically are responsible for the systems that transmit images from the clinical
centers to the reading center and for transmitting the results of assessments
they complete to the coordinating center. The center is expected to perform
high-quality assessments for the readings they undertake and to participate in
quality control activities as appropriate. They, like the central labs, may also serve
as an archive for images collected by the trial. Personnel from the center may
participate in trial committees.
Each unit answers both to its funding line and to the trial's committee structure,
and its leadership needs the experience and capability to successfully work with
both sets of masters. Keeping in mind that a trial may include more than
30 organization units, it would be ideal if
these 30+ units were each led by someone with appropriate OM experience and
capability. This doesn’t always happen, and a poorly organized and managed unit
can jeopardize the overall trial.
Committees, Committee Roles, and Committee Structure

Since the core management strategy for many clinical trials is based on a committee
structure, the creation of the committees, selection of their members, and their
operational effectiveness and efficiency are of paramount importance. Decisions
need to be made up front as to which committees will be needed at least at the
outset. Often the first committee created is the steering committee, which includes
representatives of all the key stakeholders. Sometimes the chair is designated by
the funding agency and sometimes elected from the members. However this is done,
this person is key to the overall success of the trial and therefore needs to have the
requisite knowledge, experience, and personality for the task. There also needs to be
a succession plan that provides backup as needed.
The likelihood of success may be enhanced by the following considerations:
6. Meeting conduct: Factors that may facilitate the success of the committee meet-
ings include appropriate agendas prepared well ahead of the meeting, accompa-
nied by documents and materials as appropriate; efficiently conducted meetings
to include appropriate control of time devoted to individual items and speakers;
and clear minutes and follow-up on issues addressed in previous meetings.
7. Committee accountability: Most trial committees are in fact subcommittees
of a steering committee or similarly designated committee, so they report to
this higher committee. The steering committee should hold the subcommittees
accountable for meeting their charge in a high-quality, timely fashion. This
typically requires both written reports and presentations at the steering committee
meetings.
The trial's committee structure has the responsibility to inform the funding
component of issues that need to be addressed by specific operational entities.
This may require special reports and/or special meetings.
The designated committees play key roles, and their successful operation is
critical to the success of the overall trial. Examples of committees, which typically
operate as subcommittees of, and thus report to, the steering committee, include:
9. Data form committee. Responsible for the development and testing of data
collection forms; sometimes also oversees data collection training and
certification procedures.
10. Publication and presentation committee. Responsible for overseeing the pub-
lication and presentation process for the trial. This includes efforts to help ensure
that trial publications are completed in a timely manner; the committee also
deals with authorship conflict issues.
11. Data and safety monitoring board. An independent board of experts in the trial's
subject matter and in biostatistics, responsible for trial integrity and participant safety.
Typically reports to the funding entity but also may report jointly to the steering
committee. Usually operates according to a charter established at the outset of
the trial. Reviews adverse and serious adverse events and trial analysis reports.
12. Advisory committee. Some trials may involve an overarching advisory com-
mittee, appointed by and reporting to the funding entity. This committee
may assist with setting overall directions and with broad oversight of
trial progress and success.
As is the case for any complex endeavor, a clinical trial can fail. Some key
issues are:
1. Participant recruitment. One of the most common causes of a clinical trial failing
to reach completion is failure to recruit adequate numbers of participants for
randomization. The trial OM process sometimes is too slow to react to this crisis,
and when it does react, it does so with too little, too late. It is imperative that the
OM process monitor recruitment status from the very outset and react strongly to
indications of recruitment problems. The assumption should be that enrollment
will be slower than projected, and almost every trial should implement proactive
measures to foster enrollment. Frequent monitoring of actual vs projected
enrollment by site to identify trends gives the opportunity to consider protocol
amendments, additional recruitment funds, site closure, etc. (Allen 2015); a sketch
of such a per-site check appears after this list.
The STEPS study analyzed 114 multicenter trials and showed that 45% failed to
reach 80% of the prespecified sample size. Less than one third of the trials recruited
their original target number of participants within the time originally specified, and
around one third had to be extended in time and resources. Trials that recruited
successfully shared a common factor: they had employed a dedicated trial
manager. The STEPS collaborators suggested that anyone undertaking trials should
think about the different needs at different phases in the life of a trial and put greater
emphasis on "conduct" (Campbell et al. 2007; Farrell et al. 2010).
2. Clinical center failure. Especially if the trial includes a rather large number of
clinical centers, one or more may not perform adequately. Such a situation may
jeopardize the trial. Since this may be a leadership problem at the clinical site, the
trial OM system may need to step in quickly and assist or replace as needed.
3. Coordinating center performance. Coordinating centers need to develop data
collection, data management, and data analysis systems and operate them in a
timely and high-quality fashion. If this does not happen, the consequence can be
severe.
Gathering clean data is among the most important steps to a successful clinical
trial. Even if sites are identified and patients recruited, inaccurate data will be of
no use to the sponsor. Consider early and routine review of data, whether
remotely or during a monitoring visit, with priority given to primary endpoint
data; a sketch of such a review appears after this list. Identifying data quality
issues early, correcting them, retraining site personnel, and establishing
preventive measures allow data issues to be addressed and resolved before they
evolve into significant problems (Allen 2015).
4. Committee failure. If a key committee lags behind and is causing delays in trial
development or operation, some corrective action may be needed. Sometimes a
new chair should be appointed.
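The per-site recruitment check referenced in item 1 can be made concrete with a minimal sketch in Python. All names and numbers here are hypothetical assumptions for illustration, including the 80% flag threshold; a real trial would set thresholds and escalation rules in its monitoring plan.

from dataclasses import dataclass

@dataclass
class SiteEnrollment:
    site: str        # clinical site identifier
    projected: int   # participants projected to be randomized by this date
    actual: int      # participants actually randomized to date

def flag_lagging_sites(sites, threshold=0.8):
    """Return (site, ratio) pairs where actual enrollment is below threshold of projection."""
    lagging = []
    for s in sites:
        ratio = s.actual / s.projected if s.projected else 0.0
        if ratio < threshold:
            lagging.append((s.site, ratio))
    return lagging

# Hypothetical monthly snapshot of actual vs projected enrollment
snapshot = [
    SiteEnrollment("site-01", projected=40, actual=35),
    SiteEnrollment("site-02", projected=40, actual=12),
    SiteEnrollment("site-03", projected=25, actual=26),
]

for site, ratio in flag_lagging_sites(snapshot):
    print(f"{site}: at {ratio:.0%} of projection; review recruitment plan")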
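Likewise, the early, endpoint-prioritized data review described in item 3 might look like the following sketch. The endpoint field, plausible range, and participant records are hypothetical assumptions, not values from the chapter.

def review_primary_endpoint(records, field, valid_range):
    """Flag records whose primary endpoint value is missing or implausible."""
    low, high = valid_range
    queries = []
    for rec in records:
        value = rec.get(field)
        if value is None:
            queries.append((rec["participant_id"], "missing value"))
        elif not low <= value <= high:
            queries.append((rec["participant_id"], f"out of plausible range: {value}"))
    return queries

# Hypothetical extract from the trial database (systolic BP at week 12)
data = [
    {"participant_id": "001", "sbp_week12": 128},
    {"participant_id": "002", "sbp_week12": None},  # not yet entered
    {"participant_id": "003", "sbp_week12": 310},   # likely a data entry error
]

for pid, issue in review_primary_endpoint(data, "sbp_week12", (60, 260)):
    print(f"Query site about participant {pid}: {issue}")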
As described above, numerous entities typically are involved in the organization and
management of multicenter long-term trials, so there are numerous opportunities
for failure. A clear and detailed organization and management strategy therefore
needs to be established well before the onset of the trial. The strategy needs to provide an
unambiguous answer to the essential question "who reports to whom about what
and when." Strong and experienced leadership, closely connected with day-to-day
operations, in a system that provides continuous monitoring and the flexibility to adjust
to unexpected situations, is key to success.
Cross-References
References
Campbell MK, Snowdon C, Francis D, Elbourne D, McDonald AM, Knight R, Entwistle V, Garcia
J, Roberts I, Grant A, The STEPS Group (2007) Recruitment to randomised trials: strategies for
trial enrolment and participation study. The STEPS study. Health Technol Assess (Winch Eng)
11(48):iii, ix–105
Chung KC, Song JW, WRIST Study Group (2010) A guide to organizing a multicenter clinical trial.
Plast Reconstr Surg 126(2):515–523
Ehrhardt S, Appel LJ, Meinert CL (2015) Trends in National Institutes of Health funding for clinical
trials registered in ClinicalTrials.gov. JAMA 314(23):2566–2567
Farrell B (1998) Efficient management of randomised controlled trials: nature or nurture. BMJ 317
(7167):1236–1239
Farrell B, Kenyon S, Shakur H (2010) Managing clinical trials. Trials 11(1):78
Greenberg Report (1967) Organization, review, and administration of cooperative studies
(Greenberg report): a report from the heart special project committee to the National Advisory
Heart Council. Control Clin Trials 1988(9):137–148
Online Documents
Allen S (2015) Best practices for clinical trial operations. https://www.pharmoutsourcing.com/
Featured-Articles/180536-Best-Practices-for-Clinical-Trial-Operations/
McDonald A, Lane A, Farrell B, Dunn J, Buckland S, Meredith S, Napp V (2014) Trial managers’
network guide to efficient trial management. https://cdn.ymaws.com/www.tmn.ac.uk/resource/
collection/77CDC3B6-133F-42E6-9610-F33FF5197D2F/tmn-guidelines-web_[amended_
July_2014].pdf
Meinert C (2019) The trend in trials. https://jhuccs1.us/clm/PDFs/NameThatTune.pdf
Olmstead FL (2004) Improved organization and management of clinical trials. http://www.
appliedclinicaltrialsonline.com/improved-organization-and-managementclinical-trials
Advocacy and Patient Involvement
in Clinical Trials 30
Ellen Sigal, Mark Stewart, and Diana Merino
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570
Patient Engagement in Research and Drug Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571
Primary Areas of Engagement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
Challenges Associated with Incorporating Patients into Research and Drug Development . . . 573
Barriers to Patient Engagement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574
The Contribution of Patient Advocacy to Research and Drug Development . . . . . . . . . . . . . . . . . . 575
Trial Designs and Endpoint Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
Capturing and Measuring Patient Experience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
Contributors to Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578
Future Areas of Innovation and the Evolving Clinical Trial Landscape . . . . . . . . . . . . . . . . . . . . . . . 579
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
Abstract
Patient engagement in research and clinical trials has evolved over time. Patients
are no longer simply passive research subjects but are increasingly being inte-
grated into research teams and protocol review teams to help design, implement,
and disseminate clinical trial findings. While potential barriers exist for mean-
ingful patient engagement, mechanisms and methods to effectively engage
patients and advocacy groups are evolving, and resources and best practices are
continually being developed to assist researchers and patients. Additionally,
legislation and regulatory guidance are being instituted to promote patient
engagement and ensure it is a routine process for clinical trial development.
Developing patient-centered clinical trial designs has led to the development of
innovative clinical trial infrastructures and statistical methods. Patient advocates
and organizations are also increasingly developing their own data sources and
clinical trials, which represent unique opportunities for researchers to partner
with patient groups to rapidly advance drug development.
Keywords
Patient advocacy · Drug development · Patient engagement · Patient-centered
clinical trials
Introduction
The role of patients and advocates in clinical research and their involvement in
the regulation and oversight of clinical trials have substantially grown over time.
In just a few decades, patients have gone from being considered passive human
subjects whose clinical measures would contribute to answering research questions
to active participants and engaged stakeholders. This growing movement toward a
more patient-centered approach aims to provide the best healthcare for each patient,
which takes into consideration the patient’s own goals, values, and preferences
(Manganiello and Anderson 2011). This movement is rooted in early advocacy
efforts led by the HIV/AIDS community dating back to 1988, which resulted in
fundamental changes to the medical research paradigm.
The path from initial development of a new drug to entry of the new therapy into
the patient community relies on clinical trials, which represent the final step
in evaluating the safety and efficacy of new therapeutic approaches. Along this
developmental path, patients can provide critical input at many points: collecting
natural history information; endpoint selection; protocol design; consent
and eligibility; clinical trial recruitment and retention strategies; design of post-
market safety studies; and dissemination of trial findings (Fig. 1).
Fig. 1 Opportunities for patient input along the drug development path: Discovery, Preclinical, Clinical Trials (Phase I/II/III), FDA Review, and Post-marketing surveillance. Example activities range from identifying unmet needs and providing data on therapeutic burden in discovery; to input on eligibility criteria, clinical endpoints, and PROs preclinically; to support for patient selection and recruitment, service on Data and Safety Monitoring Boards, and participation in benefit-risk discussions during trials; to public testimony at FDA Advisory Committees during review; and to participation in post-marketing surveillance and support for returning study results to participants

A detailed analysis of several clinical trials indicates that 48% of all sites in a
given trial fail to meet their enrollment targets and more than 11% never enroll a
single patient (Kaitin 2013). It is estimated that less than 5% of adult cancer patients
enroll in a clinical trial despite many indicating their desire to participate in clinical
trials (Comis et al. 2003; Unger et al. 2016). Thus, significant barriers that hinder
trial participation remain, including clinical trial access, demographic and
socioeconomic challenges, inappropriate or excessive procedures, broad exclusion
criteria, lack of patient-centric trial designs, and patient and physician attitudes. While not
every barrier may be readily overcome, engaging patients early and often throughout
the entire research and drug development process can help ensure appropriately
designed trials that are viewed favorably by patients, answer questions important to
the patient community, and ultimately encourage participation.
A growing body of evidence describing the benefits of patient involvement in
research and clinical trials is slowly changing scientific, medical, and regulatory
practices. In their systematic review, Domecq and colleagues found that
patient engagement positively influenced research by increasing study enrollment
rates and helping researchers in securing funding, designing study protocols, and
choosing relevant outcomes (Domecq et al. 2014). Greater patient engagement in
research and clinical trials would help drug developers sponsor trials that are
better informed by the needs of patients, which would translate into more
feasible and streamlined trial designs generating better outcomes (Hanley et al.
2001; Tinetti and Basch 2013). Increased engagement could also reduce patient
accrual time due to improved enrollment, reduce patient attrition, and make findings
more applicable and relevant to the target population (Bombak and Hanson 2017),
which would significantly decrease trial costs. Implementation of mechanisms for
patient engagement can vary.
Patient Engagement in Research and Drug Development
Acknowledging that patients are central to research and drug development, several
national and international organizations have invested in clearly defining the role of
patient involvement in research practices and the need for the development of
innovative infrastructures that will help facilitate the incorporation of the patient
voice in all stages of the research process, including design, execution, and transla-
tion of research (Domecq et al. 2014). The Patient-Centered Outcomes Research
Institute (PCORI) was established in 2010 to improve the quality and relevance of
evidence available to help stakeholders make better-informed health decisions and
requires that all its funded research projects include patient input throughout the
entire research study (www.pcori.org). Patient engagement has been defined by
PCORI as “involvement of patients and other stakeholders throughout the planning,
conduct, and dissemination of the proposed project” and is becoming institutional-
ized and incorporated into several funding schemes (PCORI 2018). Patient-driven
research activities have ranged from pre-discovery funding for development and
acquisition of animal models and cell lines all the way to post-market study design
and value discussions.
Primary Areas of Engagement
The US Food and Drug Administration (FDA) recognizes that patients are experts
on living with their conditions, and as such, their voice is uniquely positioned to inform
stakeholders and provide the right therapeutic context for drug development as well as
perspective on the outcome measures that are most relevant to patients and evaluation
by regulatory agencies (Anderson and McCleary 2016). Patients may voice their
concern or support for the development of certain drugs and provide a firsthand
perspective on the proper balance of risk to benefit for a particular disease or patient
population. For instance, the patient voice was crucial when reintroducing Tysabri, a
monoclonal antibody used to treat multiple sclerosis, which had been previously
removed from the market following reports of lethal side effects. After the thorough
review of safety information, the FDA convened an advisory committee where patients
and caregivers were invited to testify. Weighing all evidence, including the advocates’
testimonies, the FDA found enough support to remarket the drug under a special
prescription program (Schwartz and Woloshin 2015). Additionally, the FDA has
formalized several initiatives to encourage the inclusion of the patient voice in medical
product development. Under the fifth authorization of the Prescription Drug User Fee
Act (PDUFA V) signed into law in 2012, the FDA began the Patient-Focused Drug
Development (PFDD) program with the intent to more systematically incorporate the
patient perspective into drug development (FDA 2018). From 2012 to 2017, the FDA
organized 24 disease-specific PFDD meetings that have helped capture patients’
experiences, perspectives, and priorities and enabled the incorporation of this mean-
ingful information into the drug development process and its evaluation. Duchenne
muscular dystrophy advocacy organizations exemplify how patients and
advocates can successfully inform regulators, provide meaningful input into benefit
and risk assessments, and identify treatment priorities. To build on this success and
enable more patient advocacy organizations to shape and influence drug development,
the 21st Century Cures Act and PDUFA VI have tasked FDA with developing
additional guidance describing approaches to gathering patient experience data,
quantifying benefits and risks, and using patient-reported outcomes in treatment development.
Moreover, the newly formed FDA Oncology Center of Excellence (OCE) has made
PFDD a priority and is exploring innovative regulatory strategies that incorporate
patient input. Additionally, the National Cancer Institute (NCI) also encourages patient
advocates to be involved in the clinical trial process. The SWOG Cancer Research
Network, one of five NCI cooperative cancer research groups, has an advocate assigned
to every research committee who is involved in every stage of the process.
Challenges Associated with Incorporating Patients into Research and Drug Development
In addition to the benefits observed for study sponsors and participants, greater patient involvement is
also driven by a compelling ethical rationale that lies behind the participation of
patients in the democratization of the research process (Domecq et al. 2014).
Data shows a compelling relationship between the incidence of clinical trial
enrollment and improvement in cancer population survival, and a recent survey
indicates the value patient engagement can have on improving patient retention and
accelerating trial accrual (Smith et al. 2015; Unger et al. 2016). However, several
challenges and concerns remain about the way patient engagement is being
conducted (Bombak and Hanson 2017).
Barriers to Patient Engagement
The most commonly described patient engagement barriers relate to logistics
and to concerns about tokenistic engagement (Domecq et al. 2014). Tokenism refers to
involving patients superficially. This can occur when a small number of
participants, who may be only minimally involved in the research process, are taken
to represent a far larger and more diverse patient group. This insincere form of patient
inclusion discourages patients from seeking greater involvement in the research process,
and it lessens the credibility of the patient voice. Indeed, various research studies
have identified that people frequently find that participating in clinical trials is
meaningless or disempowering (Mullins et al. 2014), yet people often want to be
informed, empowered, and engaged in their medical management (Davis et al.
2005). Some programs may require patients to undergo intensive training
and to invest considerable time, interest, and potentially resources (Bombak and Hanson
2017). These requirements may create a preference for demonstrable or quantitative
skills over instinct and intuition and may bias the perspectives shared as part of
the study. The lack of incentives or payment for a patient’s time may also be a barrier
for some patients to become engaged in research. Moreover, various erroneous
perceptions have been identified as barriers for engagement. Some studies have
identified the detrimental perception that patients will not be objective in their
decisions and will become a hurdle in the design and development process or that
patients and advocates are naïve about the research process and funding problems
(Hanley et al. 2001; Bombak and Hanson 2017). These barriers should be assessed
in more detail, and greater efforts should be placed on overcoming any perceived
drawback that would prevent patients from engaging and getting involved in
scientific research.
Historically, few mechanisms existed for systematic engagement of patients in the
drug development continuum, and in the rare cases where structures for
patient participation did exist, they were often disorganized or confusing (Hohman et al.
2015). Efforts to overcome these barriers should be undertaken, and learning modules and
information are available to provide best practices. In recognition of these potential
barriers, many patient advocacy organizations have research training programs
designed specifically for patients to help inform and prepare them to support research
studies. They can also provide mechanisms to connect patients with opportunities to
engage in research.
The Contribution of Patient Advocacy to Research and Drug Development
The incorporation of the patient voice has directly impacted the way trials are
designed and conducted (Mullins et al. 2014). The way in which clinical trials are
designed can transform the evidence generation process to be more patient centered,
providing people with an incentive to participate or continue participating in
clinical trials. Providing better information to participants and incorporating
alternative trial designs will minimize concerns that clinical trials are not patient
centered and will dispel any doubts or concerns that prevent patients from becoming
meaningful participants in the planning and design of clinical trials. Addressing the
concerns and desires of patients has led to innovative strategies and designs to make
trials more patient centric.
Trial Designs and Endpoint Selection
Many new therapies in oncology are molecularly targeted against specific oncogenic
driver mutations that may be present in only a fraction of the patient population.
Although the advent of targeted therapies holds great promise for patients, it also
means that many patients may need to be screened before enough patients harboring
the necessary mutation are found. Additionally, patients may not have the mutation
of interest and will potentially have to seek out a variety of trials before finding a
match. Master protocols are one mechanism to assist with the development and
investigation of targeted therapies (Woodcock and LaVange 2017). Perhaps one of
the greatest efficiencies of the collaborative clinical trial system is the increased
benefit to patients seeking access to genomic screening technologies and
lifesaving access to the innovations under investigation. Recognizing this
problem, Friends of Cancer Research and the Brookings Institution convened a panel
of experts at a 2011 conference to discuss potential methods for streamlining the
FDA approval process for drugs that show large treatment effects early in develop-
ment while still ensuring drug safety and efficacy. The discussion at this conference
informed the creation of the “Advancing Breakthrough Therapies for Patients Act”
which established the FDA’s Breakthrough Therapy Designation (BTD). This des-
ignation defines a breakthrough therapy as a drug intended to treat a serious or life-
threatening disease or condition and for which preliminary evidence indicates that
the drug may demonstrate substantial improvement over existing therapies (FDA
Fact Sheet: Breakthrough Therapies). Once BTD is requested by the drug sponsor,
the FDA and sponsor work together to determine the most efficient path forward, and
if the designation is granted, the FDA will work closely with the sponsor to help
expedite the development and review of the drug. Because innovative designation
and approval pathways such as BTD take into consideration novel approval end-
points for clinical trials demonstrating higher rates of benefit in carefully selected
patients, it is especially critical that patients are involved in identifying and defining
the endpoints most important to them.
Given the broad benefits associated with patient involvement in scientific
research and clinical trials, it is crucial to focus on greater dissemination and
awareness. Strategies for the uptake and implementation of mechanisms for patient
involvement should involve patients and patient advocates, health professionals, and
drug developers. The creation of more educational resources to support researchers
and patients when coordinating the incorporation of the patient voice in clinical
trials would also improve the uptake of these mechanisms.
Capturing and Measuring Patient Experience
Regulatory agencies internationally acknowledge that robust and accurate data
collected from the patient experience can be useful, as it complements existing
measurements of safety and efficacy, but warn that poorly defined PRO methodology
using heterogeneous analytical methods greatly hinders the incorporation of
PRO data in regulatory decision-making (Kluetz et al. 2018; Kuehn 2018;
Bottomley et al. 2018). These authors recommend sustained international collaboration
among regulatory agencies to improve the collection of patient experience data and to
standardize the assessment, analysis, and interpretation of patient data from clinical
trials.
The FDA has recognized that a central aspect of PFDD is the use of patient-
reported outcomes (PROs) as a way to incorporate the patient voice in drug
development and regulatory decisions. PROs are directly reported by the
patient and provide a status of the patient’s health, quality of life, or functional status
(FDA-NIH Biomarker Working Group 2016). PRO measures can provide a better
understanding of treatment outcomes and tolerability from a patient perspective and
complement current measures of safety and efficacy (Kim et al. 2018). In 2009,
the FDA released guidance for industry on the use of PROs in medical product
development to support labeling claims and has worked with other advocacy
organizations, such as the Critical Path Institute, and industry to form working
groups that seek to engage patients and caregivers in the development of robust
symptom-measuring tools, such as the PRO Consortium. Although challenges exist
when seeking to collect patient and caregiver experience data, such as the need
for more personalized and dynamic measuring tools that keep up with the diversity
of novel drug classes with wide variety of toxicities, greater efforts to ensure
consistency, reliability, and applicability of these data are warranted to support
robust use in the drug development space.
Contributors to Data Generation
Patients and advocacy organizations are also actively establishing their own data
sources to support clinical drug development and, in some instances, establishing
their own clinical trials. These include patient registries, online data-sharing
communities, wearable devices, and social media tools for capturing longitudinal
data points. Organizations such as the Genetic Alliance, the National Organization
for Rare Disorders, and Parent Project Muscular Dystrophy have launched regis-
tries to study the natural history of disease, burden of disease, expectations for
treatment benefits, and perspectives on tolerable harms and risks. These tools can
help inform academia and industry and incentivize further study of a particular
disease state. Through public-private partnerships, advocacy organizations are
also initiating clinical trials within their patient communities. For example, the
Leukemia and Lymphoma Society is leading the Beat AML Master Trial, which is
a collaborative trial to test targeted therapies in patients with acute myeloid
leukemia (AML) (Helwick 2018). Principal investigators should look for oppor-
tunities to utilize and integrate these data collection efforts into their research.
Future Areas of Innovation and the Evolving Clinical Trial Landscape
There has been great progress in patient engagement in clinical trials and in the
advancements made by patient advocacy groups, and additional areas
of opportunity continue to be identified. The development of more refined frameworks,
models, best practices, and guidelines will help ensure early investigators have founda-
tional knowledge to meaningfully engage patients and advocacy organizations in their
research questions and drug development programs. Biopharma is investing heavily to
accelerate development timelines. TransCelerate BioPharma Inc., a nonprofit organiza-
tion that creates collaborations across the biopharmaceutical research and development
community, has recently launched a new initiative around patient awareness and access
(TransCelerate 2018). Toolkits are available to assist research teams in engaging patient
advocacy organizations and participants to optimize clinical trial designs. Additionally,
some healthcare systems are partnering with cognitive computing platforms to help
physicians match, enroll, and support patients (Bakkar et al. 2018).
The incorporation of external data sources to streamline, augment, and support
clinical trial development is growing rapidly, due in large part to the advent of
technological solutions that include patient collaboration programs, crowdsourcing,
and the collection of big data and analytics. The US FDA is currently developing
guidance and a framework to describe how real-world evidence can support drug
development and regulatory decision-making. These external data sources represent
an opportunity to augment clinical trial data and can potentially result in more
streamlined drug development with fewer patients. These novel mechanisms of
data collection, as well as their use and implementation, will continue to require
the involvement of active advocates and consumers, who, through their experience,
will contribute greatly to the oversight and eventual success of future clinical trials.
Cross-References
References
Anderson M, McCleary KK (2016) On the path to a science of patient input. Sci Transl Med 8:1–6.
https://doi.org/10.1126/scitranslmed.aaf6730
Bakkar N, Kovalik T, Lorenzini I et al (2018) Artificial intelligence in neurodegenerative disease
research: use of IBM Watson to identify additional RNA-binding proteins altered in
amyotrophic lateral sclerosis. Acta Neuropathol 135:227–247. https://doi.org/10.1007/s00401-
017-1785-8
Bombak AE, Hanson HM (2017) A critical discussion of patient engagement in research. J Patient
Cent Res Rev 4:39–41. https://doi.org/10.17294/2330-0698.1273
Bottomley A, Pe M, Sloan J et al (2018) Moving forward toward standardizing analysis of quality of
life data in randomized cancer clinical trials. Clin Trials 15:624–630. https://doi.org/10.1177/
1740774518795637
Chhatre S, Jefferson A, Cook R et al (2018) Patient-centered recruitment and retention for
a randomized controlled study. Trials 19:205. https://doi.org/10.1186/s13063-018-2578-7
Comis RL, Miller JD, Aldigé CR et al (2003) Public attitudes toward participation in cancer clinical
trials. J Clin Oncol 21:830–835. https://doi.org/10.1200/JCO.2003.02.105
Davis K, Schoenbaum SC, Audet AM (2005) A 2020 vision of patient-centered primary care.
J Gen Intern Med 20:953–957. https://doi.org/10.1111/j.1525-1497.2005.0178.x
Domecq JP, Prutsky G, Elraiyah T et al (2014) Patient engagement in research: a systematic review.
BMC Health Serv Res 14:1–9. https://doi.org/10.1016/j.transproceed.2016.08.016
FDA (2018) FDA voices: perspectives from FDA experts. https://www.fda.gov/newsevents/news
room/fdavoices/default.htm. Accessed 12 Nov 2018
FDA Fact Sheet: Breakthrough Therapies. https://www.fda.gov/regulatoryinformation/lawsenfor
cedbyfda/significantamendmentstothefdcact/fdasia/ucm329491.htm. Accessed 12 Nov 2018
FDA-NIH Biomarker Working Group (2016) BEST (Biomarkers, EndpointS, and other Tools)
Resource [Internet]. Food and Drug Administration (US), Silver Spring; Co-published by
National Institutes of Health (US), Bethesda
Fergusson D, Monfaredi Z, Pussegoda K et al (2018) The prevalence of patient engagement in
published trials: a systematic review. Res Involv Engagem 4:17. https://doi.org/10.1186/
s40900-018-0099-x
Hanley B, Truesdale A, King A et al (2001) Involving consumers in designing, conducting, and
interpreting randomised controlled trials: questionnaire survey. BMJ 322:519–523
Hearld KR, Hearld LR, Hall AG (2017) Engaging patients as partners in research: factors associated
with awareness, interest, and engagement as research partners. SAGE Open Med
5:205031211668670. https://doi.org/10.1534/genetics.107.072090
Helwick C (2018) Beat AML trial seeking to change treatment paradigm. [Internet] The ASCO Post
Hohman R, Shea M, Kozak M et al (2015) Regulatory decision-making meets the real world.
Sci Transl Med 7:313fs46. https://doi.org/10.1126/scitranslmed.aad5233
Kaitin K (2013) 89% of trials meet enrollment, but timelines slip, half of sites under-enroll.
Tufts Cent Study Drug Dev Impact Rep 15:1–4
Kim J, Singh H, Ayalew K et al (2018) Use of pro measures to inform tolerability in oncology trials:
implications for clinical review, IND safety reporting, and clinical site inspections. Clin Cancer
Res 24:1780–1784. https://doi.org/10.1158/1078-0432.CCR-17-2555
Kluetz PG, O’Connor DJ, Soltys K (2018) Incorporating the patient experience into regulatory
decision making in the USA, Europe, and Canada. Lancet Oncol 19:e267–e274. https://doi.org/
10.1016/S1470-2045(18)30097-4
Kuehn CM (2018) Patient experience data in US Food and Drug Administration (FDA) regulatory
decision making: a policy process perspective. Ther Innov Regul Sci 52:661–668. https://doi.
org/10.1177/2168479017753390
Manganiello M, Anderson M (2011) Back to basics: HIV/AIDS advocacy as a model for catalyzing
change. AIDS 1–29. https://www.fastercures.org/assets/Uploads/PDF/Back2BasicsFinal.pdf
Mullins CD, Vandigo J, Zheng Z, Wicks P (2014) Patient-centeredness in the design of clinical
trials. Value Health 17:471–475. https://doi.org/10.1016/j.jval.2014.02.012
PCORI (2018) The value of engagement. https://www.pcori.org/about-us/our-programs/engage
ment/value-engagement. Accessed 12 Nov 2018
Schwartz L, Woloshin S (2015) FDA and the media: lessons from Tysabri about communicating
uncertainty. NAM Perspect 5. https://doi.org/10.31478/201509a
Smith SK, Selig W, Harker M et al (2015) Patient engagement practices in clinical research among
patient groups, industry, and academia in the United States: a survey. PLoS One 10:e0140232
Tinetti ME, Basch E (2013) Patients’ responsibility to participate in decision making and research.
JAMA 309:2331–2332. https://doi.org/10.1001/jama.2013.5592
TransCelerate (2018) Patient experience. http://www.transceleratebiopharmainc.com/initiatives/
patient-experience/. Accessed 12 Nov 2018
Unger JM, Cook E, Tai E, Bleyer A (2016) The role of clinical trial participation in cancer research:
barriers, evidence, and strategies. Am Soc Clin Oncol Educ Book 35:185–198. https://doi.org/
10.14694/EDBK_156686
Vroom E (2012) Is more involvement needed in the clinical trial design & endpoints?
Orphanet J Rare Dis 7:A38. https://doi.org/10.1186/1750-1172-7-S2-A38
Woodcock J, LaVange LM (2017) Master protocols to study multiple therapies, multiple diseases,
or both. N Engl J Med 377:62–70. https://doi.org/10.1056/NEJMra1510062
Training the Investigatorship
31
Claire Weber
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
Trial Sponsor Team . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
The Sponsor Quality Manual and Quality System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585
External Training and Study-Specific Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
GCP CSPs [Including Contract Research Organizations (CROs)] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
Investigator Site Team . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
PI Delegation of Authority . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587
Site Monitoring Visits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
Other Training Meetings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
Other External Teams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
Training Documentation and Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591
Abstract
The Investigatorship for clinical trials is a team of specialists who are qualified
by training and experience to successfully execute clinical trials. The
Investigatorship includes the trial sponsor, Good Clinical Practice (GCP) Con-
tract Service Providers (CSPs), and site Investigators, who may also include
C. Weber (*)
Excellence Consulting, LLC, Moraga, CA, USA
e-mail: [email protected]
outside experts such as Key Opinion Leaders (KOLs) and Data Monitoring
Boards (DMBs). GCP is the fundamental required training for all team members
conducting clinical trials. Training occurs throughout the lifecycle of the trial, and
each member of the team must have records of adequate training and qualifica-
tions to conduct the study for their identified role. This chapter explains the
Investigatorship members, the types of training conducted, and how training is
documented.
Keywords
Documentation · Monitoring · Delegation · File · Quality system
Introduction
The trial sponsor team is made up of individuals who, based on their training and
experience, will submit the clinical trial application (CTA) for the investigational
product (IP) under study and plan and implement the trial, ensuring compliance with
International Council for Harmonisation (ICH) GCP and regulatory requirements.
The trial sponsor team includes qualified individuals across multiple functional areas.
The Sponsor Quality Manual and Quality System
The trial sponsor develops and maintains a quality manual or equivalent document
that describes the quality system in their organization. The manual explains the
organizational structure and quality assurance of the sponsor team functions. The
quality manual also refers to required trial sponsor team training requirements and
types of procedural training for controlled documents. In addition, it is customary for
the quality system to describe how issues are escalated and how continuous improve-
ment areas are identified and addressed for managing the quality and training for
implementing clinical trials.
The qualifications of the trial sponsor team are documented for each team
member in curricula vitae (CVs) and licenses relevant to their job description.
Each trial sponsor team member is required to have adequate training documentation
for the duties identified in their job description.
The hierarchy for training of controlled documents in the trial sponsor quality
system is described in Fig. 1, with the quality manual as the highest-level document
and policies, procedures, work instructions, and records at successively lower levels,
in that order.
Training requirements are recorded in a training curriculum for each sponsor team
member. An example training curriculum is described in Table 1.
Controlled document training can be performed in person as on-the-job training,
group training, or read and understand training. The trainee signs training documen-
tation confirming the date of the training and the documentation is maintained in a
sponsor trial master file.
Fig. 1 Organization of quality system documents
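To picture how a training curriculum and signed training records might be tracked, the following is a minimal sketch in Python. The roles, course names, and records are hypothetical assumptions; an actual sponsor quality system would define these in controlled documents.

# Hypothetical required curricula by role
CURRICULA = {
    "trial manager": {"GCP", "protocol training", "EDC system", "SOP-001"},
    "trial statistician": {"GCP", "protocol training", "SAP training"},
}

def outstanding_training(role, completed):
    """Return required curriculum items with no signed training record."""
    return CURRICULA.get(role, set()) - completed

# Signed training records captured to date (hypothetical)
records = {
    "A. Smith": ("trial manager", {"GCP", "protocol training"}),
    "B. Jones": ("trial statistician", {"GCP", "protocol training", "SAP training"}),
}

for person, (role, done) in records.items():
    missing = outstanding_training(role, done)
    if missing:
        print(f"{person} ({role}) has no training record for: {sorted(missing)}")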
External Training and Study-Specific Training
Sponsor teams may arrange trial-specific trainings and may attend courses such as
seminars, webinars, or conferences to further their education and skills specific to
their job duties. For these trainings, the trainee retains a certificate of attendance or
the agenda/sign-in log, which is maintained in the sponsor trial master file.
GCP CSPs [Including Contract Research Organizations (CROs)]
GCP CSPs are another important part of the Investigatorship. GCP CSPs are pro-
viders who perform trial development and execution services such as data manage-
ment, statistical analysis, Randomization and Trial Supply Management (RTSM),
and laboratory analysis. Contract Research Organizations (CROs) are one type of
GCP CSP, specific to study monitoring. A CRO is defined as:
A person or an organization (commercial, academic, or other) contracted by the sponsor to
perform one or more of a sponsor's trial-related duties and functions (ICH E6 (R2) Glossary
Section 1.20).
Each GCP CSP team member is also qualified by training and experience to
perform their job duties.
Each GCP CSP will maintain a quality system, training curricula, and documen-
tation in a similar way to the trial sponsor team. The trial sponsor team maintains
adequate oversight of the GCP CSPs to ensure that the GCP CSP staff are qualified
and have a training system and documented training records. For US Investigational
New Drug (IND) trials, the transfer of regulatory obligations from the trial sponsor
team to the GCP CSP for important functions identified in the Code of Federal
Regulations (CFR) is documented by the trial sponsor on FDA Form 1571, the
Investigational New Drug Application (Section 15), and forwarded to FDA.
The trial sponsor team provides specialized training (e.g., detailed IP and trial-
specific training) to the GCP CSPs throughout the trial, and this training is
documented and maintained in the trial master file. It is important to note that the
trial sponsor team and GCP CSP team partner on many aspects to implement the
clinical trial, and training development and management of training records are an
essential part of this collaboration.
Investigator Site Team
The Investigator site team is made up of lead Investigators [also known as principal
investigators (PIs)], subinvestigators, and other site study personnel who are respon-
sible for executing the trial according to GCP, health authority, institutional review
board (IRB)/Ethics Committee (EC), and local regulations and guidelines. The PI is
defined as:
A person responsible for the conduct of the clinical trial at a trial site. If a trial is conducted
by a team of individuals at a trial site, the investigator is the responsible leader of the team
and may be called the principal investigator (ICH E6 (R2) Glossary Section 1.34).
A subinvestigator is defined as:
Any individual member of the clinical trial team designated and supervised by the investi-
gator at a trial site to perform critical trial-related procedures and/or to make important trial-
related decisions (e.g., associates, residents, research fellows) (ICH E6 (R2) Glossary
Section 1.56).
The PI and subinvestigators who are responsible for the conduct of the study
under a US IND are documented on the FDA 1572 form or equivalent. The PI
supervises the investigation at the Investigator site, and other members of the
Investigator team may include subinvestigators, study coordinators, pharmacists,
and laboratory personnel.
Each Investigator site team member must be qualified by education and experi-
ence to perform their functions at the study site and the PI has overall responsibility
for delegating tasks to other qualified team members. The site/institution will also
have a quality system and procedures requiring training.
PI Delegation of Authority
The National Institutes of Health (NIH) Delegation of Authority Log (Version 2.0, 24 April 2014) provides a representative example. Its stated purpose is to (a) serve as the Delegation of Authority Log and (b) ensure that the individuals performing study-related tasks/procedures are appropriately trained and authorized by the investigator to perform those tasks/procedures. The form should be completed prior to the initiation of any study-related tasks/procedures, maintained at the site in the regulatory/study binder, and updated during the course of the study as needed.
For each named team member, the log records the delegated study-related tasks/procedures, for example: medical history; medication history; physical examination; assessment of inclusion and exclusion criteria; AE/SAE interpretation; administration of IP; IP accountability; laboratory specimen collection/shipping; regulatory document maintenance; administrative tasks; and other specified duties.
The PI certifies the log with a signed and dated statement: "I certify that the above individuals are appropriately trained, have read the Protocol and pertinent sections of 21 CFR 50 and 56 and ICH GCPs, and are authorized to perform the above study-related tasks/procedures. Although I have delegated significant trial-related duties, as the principal investigator, I still maintain full responsibility for this trial."
Source National Institutes of Health (NIH) Delegation of Authority Log Version 2.0 24 April 2014
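As one way to see how a delegation log supports routine checks, the sketch below (Python) verifies that a study task was delegated to the person performing it before the activity is accepted. The names and task labels are hypothetical, echoing the task categories on the example log above.

# Hypothetical delegation log: team member -> tasks delegated by the PI
DELEGATION_LOG = {
    "sub-I Lee": {"AE/SAE interpretation", "physical examination"},
    "coordinator Park": {"medication history", "laboratory specimen collection/shipping"},
}

def is_authorized(person, task):
    """True only if the task appears on the person's delegation log entry."""
    return task in DELEGATION_LOG.get(person, set())

assert is_authorized("sub-I Lee", "AE/SAE interpretation")
assert not is_authorized("coordinator Park", "AE/SAE interpretation")
print("delegation checks passed")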
The trial sponsor team and/or the GCP CSP team collect training qualification
documentation from the Investigator site team members (e.g., CVs, licenses, docu-
mentation of GCP training, etc.). They also provide the Investigator site team with
specialized trial-specific trainings at site monitoring visits, Investigator group meet-
ings, and other trial meetings. Each of these trainings is documented and forwarded
to the trial master file and maintained in the Investigator site files.
Site Monitoring Visits
A site monitor identified by the trial sponsor team and/or the CSP has the
responsibility of monitoring the Investigator site. Monitoring is defined as:
The act of overseeing the progress of a clinical trial, and of ensuring that it is conducted,
recorded, and reported in accordance with the protocol, standard operating procedures
(SOPs), GCP, and the applicable regulatory requirement(s). (ICH E6 (R2) Glossary Section
1.38).
A monitoring report is defined as:
A written report from the monitor to the sponsor after each site visit and/or other trial related
communication according to the sponsors SOPs. (ICH E6 R2 Glossary Section 1.39)
To document initial training prior to the commencement of the trial, the site
initiation monitoring visit report is required to be filed at the Investigator site:
To document that trial procedures were reviewed with the investigator and the investigator’s
trial staff. (ICH E6 R2, Section 8.2.20)
During the study, the site monitor conducts monitoring visits at regular intervals,
and at the end of the study, the site monitor conducts a close-out visit. Each interim
visit may include training as applicable, and further reviews of the Delegation of
Authority Log to ensure the Investigator site team continues to be trained and
qualified. The close-out visit includes specific training about final trial documenta-
tion and closure.
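The monitoring visit lifecycle just described lends itself to a simple completeness check. The following Python sketch, with hypothetical site identifiers and report lists, flags sites missing an initiation visit report or showing activity after close-out.

# Hypothetical monitoring visit reports filed per site, in chronological order
VISIT_REPORTS = {
    "site-01": ["initiation", "interim", "interim", "close-out"],
    "site-02": ["initiation", "interim"],  # still active
    "site-03": ["interim"],                # initiation report missing
}

def audit_visit_reports(reports):
    """Flag sites whose filed reports do not follow the expected visit lifecycle."""
    findings = []
    for site, visits in reports.items():
        if not visits or visits[0] != "initiation":
            findings.append((site, "no initiation visit report on file"))
        if "close-out" in visits and visits[-1] != "close-out":
            findings.append((site, "reports filed after the close-out visit"))
    return findings

for site, finding in audit_visit_reports(VISIT_REPORTS):
    print(f"{site}: {finding}")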
Other Training Meetings
Investigator meetings and communications between the trial sponsor team, GCP
CSP team, and Investigator site team throughout the study are held to ensure
adequate study training for new and existing team members as applicable.
Other External Teams
Other external team members such as KOLs and DMBs are qualified by education
and experience to provide their expertise for the study. Training and communications
with these external teams are documented according to the sponsor's required
training procedures and filed in the sponsor trial master file.
Training Documentation and Files
Training records maintained for the Investigatorship typically include:
• Quality system and controlled document trainings (e.g., procedures, policies, etc.)
• GCP training certifications
• Professional training certifications
• Continuing education unit (CEU) accreditations
• CVs, job descriptions, and relevant licenses documenting qualifications
• Electronic system training including granting and revoking system access to
systems (e.g., EDC, RTSM, pharmacovigilance system, etc.)
• Trial-specific training including agendas and attendance at Investigator meetings
and other communications
• Site pre-qualification visits, initiation visits, interim visits, and close-out visits
• Monitoring visit reports and follow-up letters to Investigators
Training files for the entire Investigatorship are maintained as part of the sponsor
trial master file and Investigator site files and must be made available to regulatory
authorities upon request.
Summary and Conclusion
The Investigatorship is made up of the trial sponsor team, GCP CSPs, and
Investigator site teams, which may include Investigator external experts. Each team member
must be adequately qualified and trained to perform their duties to conduct the
clinical trial, and Investigators must adequately delegate authority to appropriate
team members. GCP training is the foundation of all training for clinical trials.
Training documentation is an important aspect of the trial and is maintained
throughout the trial by the sponsor in the sponsor trial master file, and by the
Investigators in the Investigator site file.
Key Facts
The facts covered in this chapter include definitions of the Investigatorship team;
their roles, duties, and requirements; and the overall guiding principles of ICH GCP
for ensuring training and maintenance of the Investigatorship site files.
Cross-References
▶ Investigator Responsibilities
▶ Selection of Study Centers and Investigators
▶ Trial Organization and Governance
References
Code of Federal Regulations, Title 21, Part 312
Department of Health and Human Services, Food and Drug Administration, Investigational New
Drug Application (IND) (Title 21 Code of Federal Regulations (CFR) Part 312) FDA 1572
(21 CFR 312.53(c))
Department of Health and Human Services, Food and Drug Administration, Investigational
New Drug Application (IND) (Title 21 Code of Federal Regulations (CFR) Part 312) Form
1571 03/19
ICH E6 (R2) International Council for Harmonisation Guideline for Good Clinical Practice, Section
8.2.20
ICH E6 (R2) International Council for Harmonisation Guideline for Good Clinical Practice,
Glossary Section 1.20
ICH E6 (R2) International Council for Harmonisation Guideline for Good Clinical Practice,
Glossary Section 1.56
ICH E6 (R2) International Council for Harmonisation Guideline for Good Clinical Practice,
Glossary Section 1.38
ICH E6 (R2) International Council for Harmonisation Guideline for Good Clinical Practice,
Glossary Section 1.34
ICH E6 (R2) International Council for Harmonisation Guideline for Good Clinical Practice,
Glossary Section 1.39
National Institutes of Health (NIH) Delegation of Authority Log Version 2.0 24 April 2014
Responsibilities and Management of the
Clinical Coordinating Center 32
Trinidad Ajazi
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
Responsibilities of the Clinical Coordinating Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
Clinical Trial Development and Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
Site Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
Site Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598
Regulatory Compliance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
Trial Sponsorship and Investigational New Drug Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
Inspection Readiness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
Quality Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
Standard Operating Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
Quality Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603
Quality Assurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
Oversight of CROs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
Adverse Event and Safety Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
Management of Clinical Coordinating Centers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
Research Administration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605
Industry Collaborations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606
Institutional Approval of Clinical Coordinating Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
Resource Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608
Operational Efficiency and Project Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608
Risk Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
Clinical Coordinating Center and Trial Governance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609
Clinical and Data Coordinating Center Integrated Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 610
Clinical Trial Network Group Coordinating Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 610
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 612
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
T. Ajazi (*)
Alliance for Clinical Trials in Oncology, University of Chicago, Chicago, IL, USA
e-mail: [email protected]
Abstract
Clinical coordinating centers of investigator-initiated multi-site clinical trials
have a myriad of responsibilities throughout the life cycle of clinical trials from
trial concept development to completion. At the core of all clinical research is the
dual mandate to protect human subjects and ensure trial data integrity. National
regulations and international guidelines are designed to enable regulatory com-
pliance and achievement of these mandates. Integrated within clinical coordinat-
ing center activities are quality management mechanisms designed to monitor,
control, and assure patient safety and data integrity.
This chapter summarizes the responsibilities of the clinical coordinating center
with emphasis on efficient trial development and site selection, presents regula-
tory compliance requirements, focuses on practices for quality management, and
describes clinical coordinating center management and network groups.
Keywords
Multi-site · Investigator-initiated · Clinical coordinating center · Clinical trial
operations · Site management · Quality management · Regulatory compliance ·
Sponsor oversight · Research administration
Introduction
Clinical trials are often initiated by investigators in collaboration with their col-
leagues across multiple institutions. These multi-site investigator-initiated trials
require coordinating centers to lead the implementation of the trial. The primary
types of multi-site coordinating centers are the clinical coordinating center (CCC)
and the data coordinating center (DCC). The clinical coordinating center is respon-
sible for clinical trial operations, and the data coordinating center is responsible for
statistics and data management functions. Some of the functions performed by the
CCC and DCC may be combined under one coordinating center. The clinical and
data coordinating centers may be integrated under one organization or reside in
separate organizations. Collectively the CCC and DCC are responsible for imple-
mentation of multi-site clinical trials.
The clinical coordinating center of a multicenter research network provides the
infrastructure for operationalizing clinical trials across participating centers. The
primary responsibilities of the clinical coordinating centers include clinical trial
operations and site management, quality management, regulatory compliance, com-
munications, administration, and study results publication. The CCC is broadly
responsible for oversight of all trial-related activities, inclusive of applicable clinical
trial sponsor responsibilities.
In general, investigators seek to advance scientific discovery and positively
impact healthcare practices. Clinical research in a global environment has grown
increasingly complex. Regulatory oversight is paramount in all aspects of clinical
trials. At the core of all operational activities is the dual mandate to protect the rights
and well-being of human subjects and to ensure research data integrity. Management
of the CCC requires strong leadership and clinical research expertise. This chapter
focuses on the responsibilities and management of the clinical coordinating center
and is written from the perspective of nonprofit, academic-based clinical trial centers.

Clinical Trial Development and Operations

The protocol is developed collaboratively, with input from the study team,
community investigators, and patient advocates. The study PI is the primary author
of the protocol.
The protocol document and related model informed consent document undergo
several stages of authoring, review, and revision, based on the standard operating
procedure (SOP) of the coordinating center. Case report forms are developed in
parallel. Prior to release and activation of the study to participating sites for imple-
mentation, the relevant funding agency (e.g., NIH), regulatory authorities (e.g., Food
and Drug Administration (FDA)), central institutional review board (CIRB), and
other regulatory review bodies must review and approve the protocol. Prior to
implementation at the site level, the institutional review board (IRB) or ethics
committee of record must review and approve the protocol, according to relevant
regulations. See Code of Federal Regulations (CFR) Title 21, Part 50 (Protection of
Human Subjects) and Part 56 (Institutional Review Boards).
During the life cycle of a clinical trial, the protocol, clinical operations, and
statistical teams are responsible for amending the protocol when monitoring of the
clinical trial signals the need to adjust trial parameters to improve safeguards for
patient safety or mitigate logistical barriers in the conduct of the trial.
Site Selection
Selection of high-performing clinical trial sites is key to the successful initiation and
completion of a clinical trial. Site personnel directly control trial participant recruit-
ment, informed consent, visit schedules, trial procedures, data collection, and data
submission. The CCC chooses sites with demonstrated ability to conduct clinical
trials, in order to best ensure the integrity of research. High-performing sites have the
structure, resources, and standard processes to recruit participants, enroll patients on
study, adhere to protocol requirements, comply with applicable regulations, and
submit data in a timely manner. Investigators at high-performing sites promote
clinical trial participation, assess adverse events in real time, and provide necessary
oversight of sub-investigators and research personnel.
Substandard sites are prone to good clinical practice (GCP) non-compliance, low
accrual rates, protocol deviations, and delinquent, low-quality data submission. Such
sites consume disproportionate CCC resources to monitor and manage, with little
return. It is
imperative to select qualified sites for successful clinical trial initiation, execution,
and completion.
Depending on the complexity of trial requirements and the number of sites
needed for successful trial completion, site selection may occur in multiple steps.
During site selection and startup, coordinating center staff evaluate each candidate
site's qualifications, resources, and track record.
Site Management
Following site selection, the CCC works with participating sites to prepare them for
activation of the trial. Site startup activities may include conducting a site initiation
visit (SIV) to confirm that sites have the appropriate facilities and equipment, to
provide study-specific training to investigators and site staff, and to review and
collect regulatory documentation. Release of investigational product may be contingent on
successful completion of an SIV, in addition to execution of the clinical trial
agreement, IRB approval, access and training in the electronic data capture system,
and other clinical trial management systems, as appropriate.
The CCC is responsible for tracking participating sites and active research
personnel. A clinical trial management system (CTMS) is useful for maintaining
site contacts and other relevant information. The CCC is responsible for maintaining
a trial master file (TMF) of essential documents and ensuring that participating sites
maintain essential documents in the investigator site file (ISF) or regulatory binder
(electronic or hard copy). These documents must be maintained for continued
enrollment of research subjects and clinical trial participation. ICH GCP Section 8 pro-
vides guidance on required essential documents. Essential documents include the
statement of investigator on FDA Form 1572, IRB approvals (initial, amendment,
and continuing review) for research, and the delegation of authority log (DOA). The
CCC and DCC should have checks in place to ensure that sites with expired or
missing essential documents do not register research subjects to the trial. Upon
submission of the DOA to the CCC and during monitoring visits, the CCC confirms
that tasks delegated by the clinical investigator are appropriate and staff have
received the proper training, prior to performance of research-related tasks.
Upon activation of a clinical site, coordinating center personnel maintain com-
munication with sites to promote accrual of subjects to the clinical trial. In conjunc-
tion with the principal investigator, CCC staff develop a communication plan to
educate sites regarding the clinical trial and develop accrual enhancement materials
such as trial websites, social media posts, brochures, and newsletters. Hosting regular
meetings with participating site investigators and clinical research professionals helps
address questions and concerns, report on trial developments, and share best practices
and frequently asked questions.
It is important that the trial principal investigator is available to address site inquiries
related to eligibility, patient and treatment management, disease status evaluation,
adverse event reporting, and other inquiries that affect safety of research participants
or the conduct of the clinical trial. Documentation of these inquiries and any related
actions and decisions should be tracked by CCC personnel. During these interactions,
the PI and CCC staff should monitor for potential protocol deviations. They should
advise sites of any deviations that should be reported to the IRB and in the electronic
data capture system. CCC personnel should be in close contact with the data managers
of the trial to share concerns for site management and data management. Research
nurses and clinical trial managers at the CCC assist the PI in addressing inquiries from
participating sites. The regulatory manager is also available to respond to inquiries
related to informed consents and IRB-related questions.
Regulatory Compliance
All clinical trial activities must comply with applicable regulatory requirements.
The CCC ensures that the clinical trial is conducted according to the regulations
described below. In the United States, clinical trials are developed and implemented
according to the Code of Federal Regulations (CFR). These regulations are codified
to provide the rules for implementing laws enacted by Congress.
All clinical trials funded in whole or in part by HHS agencies are required to
follow Title 45 CFR Part 46 – Protection of Human Subjects. 45 CFR 46 includes the
following subparts: A, Basic HHS Policy for Protection of Human Subjects; B,
Additional Protections for Pregnant Women, Human Fetuses and Neonates Involved
in Research; C, Additional Protections Pertaining to Biomedical and Behavioral
Research Involving Prisoners as Subjects; D, Additional Protections for Children
Involved as Subjects in Research; and E, Registration of Institutional Review Boards.
Inspection Readiness
Coordinating centers (CCC and DCC) must be ready for inspection by a regulatory
authority (RA), such as the FDA. Inspection readiness must be built into clinical
trial operations, quality monitoring, and quality assurance activities. Good docu-
mentation practice is a basic requirement for inspection readiness, in all areas of
research. Regulatory authorities will look for contemporaneous documentation
showing sponsor oversight. Evidence that the coordinating center monitored the
trial and maintained regulatory documentation in a timely manner is critical during
an inspection. The FDA Bioresearch Monitoring Program (BIMO) publishes the
Compliance Program Guidance Manual (CPGM) used in FDA inspections. The
CCC should use the CPGM as a guide for ensuring inspection readiness.
Quality Management
The CCC and DCC are responsible for quality management at multiple levels.
Quality management should be conducted as a partnership among clinical opera-
tions, statistics and data management, and research collaborators. The principles and
practice of quality management are evolving, and the quality focus varies with the
type of research organization and its aims, structure, and resources. In all cases, the
coordinating centers seek compliance with regulations, integrity of safety and
efficacy data, and the protection and well-being of subjects. What follows is only
a slice of the broader discussion of quality management, focused on the aspects
conducted by the CCC and DCC.
Quality Control
The CCC is responsible for quality control (QC) and trial monitoring activities, as
well as quality assurance (QA) activities. The latter is usually conducted in the form
of systematic audits to ensure compliance. In November 2016, the International
Council for Harmonisation (ICH) Integrated Addendum to ICH E6(R1) Guideline
for Good Clinical Practice, E6(R2), was released. The FDA released its related
Guidance for Industry for E6(R2) in March 2018. The addendum to ICH GCP E6
sect. 5 promotes implementation of a system to manage quality throughout all stages
of the trial. According to ICH GCP E6(R2), a risk-based approach should define the
quality management system. This includes provisions for critical processes and data
identification of processes and data critical to ensure human subject protection and
reliability of trial results. Risk management includes risk identification, evaluation,
control, communication, review, and risk reporting.
Since the addendum was released, research organizations have developed strat-
egies and processes for risk-based centralized monitoring. There are varying appli-
cations of this approach. At the core of these activities is the development of
analytical reports and metrics that can be reviewed centrally and/or remotely.
Evaluation of key performance indicators (KPIs), also known as key risk indicators
(KRIs), such as data quality, data timeliness, query rates, protocol deviation rates,
serious adverse event rates, and subject enrollment levels, assist the coordinating
center in identifying sites that might be at risk. The determination leads to decisions
regarding increased monitoring, auditing, and/or other corrective or preventive
actions. Central review of monitoring KPIs is a joint activity between the CCC
and DCC.
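As an illustration of how such KPI/KRI review might be operationalized, the sketch below flags sites that breach predefined thresholds. The metric names and threshold values are hypothetical; a real monitoring plan would define and justify its own indicators and limits in the trial's documented risk assessment.

```python
from dataclasses import dataclass

@dataclass
class SiteMetrics:
    """Per-site metrics reviewed centrally (illustrative fields only)."""
    site_id: str
    query_rate: float       # open data queries per enrolled subject
    pct_data_late: float    # fraction of CRFs submitted past the deadline
    deviation_rate: float   # protocol deviations per enrolled subject

# Hypothetical thresholds; a real plan would justify its own limits.
THRESHOLDS = {"query_rate": 2.0, "pct_data_late": 0.20, "deviation_rate": 0.5}

def flag_at_risk(sites):
    """Return site IDs breaching any KRI threshold, with the reasons."""
    flagged = {}
    for site in sites:
        reasons = [name for name, limit in THRESHOLDS.items()
                   if getattr(site, name) > limit]
        if reasons:
            flagged[site.site_id] = reasons
    return flagged

sites = [
    SiteMetrics("Site-001", query_rate=0.8, pct_data_late=0.05, deviation_rate=0.1),
    SiteMetrics("Site-002", query_rate=3.1, pct_data_late=0.35, deviation_rate=0.2),
]
print(flag_at_risk(sites))  # {'Site-002': ['query_rate', 'pct_data_late']}
```

Flagged sites would then be triaged for increased monitoring, auditing, or other corrective or preventive actions, as described above.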
With the use of electronic systems, source data review and verification can be
conducted remotely. The electronic tools for central monitoring are integrated with
the electronic data capture system (EDC) maintained by the DCC. A central monitor
assigned either by the DCC or CCC can review certified copies of source documents
uploaded into the EDC system or a source document portal. A central monitor can
remotely access a site’s electronic medical record (EMR) system. Electronic source
data documentation is reviewed against data recorded into the EDC. The success of
central monitoring is dependent on careful planning. The critical data points for
eligibility, trial intervention or treatment, adverse event/safety monitoring, and trial
endpoint evaluation should be identified to focus the central monitoring activity.
Central review of data, coupled with KPI metric reviews, enables the CCC and DCC
to identify sites that require intervention, follow-up, increased on-site monitoring,
auditing, or other forms of remediation.
Medical oversight by a designated medical officer or medical monitor is key to
quality management activities. The medical officer provides clinical expertise and
judgment. The medical officer is a point of escalation for other members of the study
team. The medical officer is also responsible for considering potential impact on
patient safety and recommending changes to the protocol based on trend analysis
reports. Such recommendations are discussed with the trial statistician and other
members of the study team.
The monitoring plan for a clinical trial should account for both centralized (remote)
monitoring and on-site monitoring. Routine on-site monitoring may be planned at a
decreased frequency when combined with centralized monitoring. The level of
source data verification to be conducted through on-site monitoring should be
documented in the clinical monitoring plan. At-risk sites, identified through central
monitoring, are given priority for on-site monitoring.
Quality Assurance
Oversight of CROs
If the multicenter trial is large enough (e.g., 1,000 research subjects and 50 sites), it
might be necessary to outsource monitoring functions to a contract research orga-
nization (CRO). The ICH E6(R2) addendum also emphasized the need for sponsors
to provide oversight to third parties contracted to perform responsibilities on their
behalf. A few tips regarding oversight include review of CRO qualification and staff
training, as well as frequent communication with CROs to share information on
changes to the trial, provide instruction, and address inquiries. If the CRO is
responsible for monitoring, review the monitoring reports and set and review metrics
for trip report completion, site follow-up, and compliance with the monitoring plan.
If the CRO is responsible for the trial master file, generate reports on timely
completion and check accuracy of the TMF. Implement a pathway for the CRO to
escalate issues to the coordinating center staff. Establish guidelines for how critical
and major non-compliance will be addressed and which SOPs govern corrective and
preventive actions (CAPAs).
Be clear on what SOPs are applicable. Develop a quality management plan with the
CRO. Document all continuous oversight activities.
Adverse Event and Safety Monitoring
The multi-site principal investigator and responsible personnel at the CCC and DCC
are required to continuously review expedited or serious adverse events, as they are
reported by participating sites. Following the FDA Guidance for Industry and
Investigators for Safety Reporting Requirements for INDs and BA/BE Studies,
investigators must assess if a reported adverse event meets the requirement for
expedited reporting to the FDA of a Serious and Unexpected Suspected Adverse
Reaction (SUSAR), in accordance with 21 CFR 312.32. The principal investigator is
also responsible for reporting the SUSAR to the pharmaceutical partner. The CCC
notifies participating sites of SUSARs (IND safety reports) and provides guidance on
any related changes to risks associated with an investigational product and the
informed consent document.
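The expedited reporting determination described above reduces to three questions under 21 CFR 312.32: whether the event is serious, whether it is unexpected given the investigator's brochure, and whether there is a reasonable possibility that the investigational product caused it. The minimal sketch below illustrates that decision logic; the class and field names are hypothetical, and a production system would also track the 7-day and 15-day reporting clocks.

```python
from dataclasses import dataclass

@dataclass
class AdverseEvent:
    serious: bool     # death, life-threatening event, hospitalization, etc.
    unexpected: bool  # not listed, or more severe than listed, in the IB
    suspected: bool   # reasonable possibility of a causal relationship

def requires_expedited_ind_report(ae: AdverseEvent) -> bool:
    """SUSAR test under 21 CFR 312.32: serious AND unexpected AND suspected."""
    return ae.serious and ae.unexpected and ae.suspected

# A serious, unexpected event judged possibly related to the study drug
ae = AdverseEvent(serious=True, unexpected=True, suspected=True)
print(requires_expedited_ind_report(ae))  # True -> file an IND safety report
```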
Management of Clinical Coordinating Centers

Management of the clinical coordinating center involves oversight by both the CCC
leadership and research administration at the institutional level. Research adminis-
tration provides the infrastructure and determines the policies that govern the
responsibilities and management of a CCC.
Research Administration
Industry Collaborations
An industry collaboration typically involves the following documents:

1. Clinical trial agreement components: conduct of the study; human subject enroll-
ment requirement; participating site requirements; quality assurance and regula-
tory inspection readiness; vendor and subcontractor compliance; study data
ownership and data sharing; record keeping; confidentiality; publication; safety
reporting; inventions; compliance with law; term and termination; indemnifica-
tion and insurance; payment and payment schedule.
2. Budget (exhibit).
3. Statement of work (exhibit).
The budget must account for research tests and procedures that are not covered by
insurance; this could be part of a Medicare coverage analysis review. The study team
also determines if funding
requests are needed for correlative science. The CCC and research administration
teams are responsible for tracking the terms and milestones of all agreements. This is
a shared administrative and finance function. The scientific and resource benefits that
translate into successful trial completion are well worth the time and effort to
negotiate the collaborations.
Institutional Approval of Clinical Coordinating Center

Prior to initiation of the clinical trial and CCC activities, the multi-site PI's institution
may require approval of the CCC, per institutional policies. Research administration
and the IRB will review the CCC to ensure compliance with regulatory requirements
for the protection of human subjects. Institutional requirements include review of the
clinical trial protocol and informed consent and of the data and safety monitoring plan,
as appropriate for a multi-site trial. The institution may require a feasibility questionnaire
for external sites. Institutional evaluation includes review of coordinating center
responsibilities, qualifications and training of research staff, site selection, site
management, data management procedures, statistical analysis plan, investigational
product distribution and accountability, pharmacovigilance and safety reporting,
protocol deviation monitoring, central and on-site monitoring, accrual plan, project
management, and multi-site communication plans, as applicable. The relevant infor-
mation may be contained in the protocol and other study plan documents. If
applicable for interventional trials, the institution may require the implementation
of an independent data and safety monitoring board (DSMB).
The requirements for approval of a CCC may be obtained from research admin-
istration. For reference, a good example of CCC institutional requirements can be
found on the website of Dana Farber/Harvard Cancer Center (DF/HCC).
Management
The coordinating center, multi-site PI, and participating site PIs are collectively
responsible for meeting the criteria for grant awards and contracts. This requires
collaborative management practices by the leadership and management team. The
management team of the coordinating center is responsible for efficient and effective
management of all centralized clinical trial activities. Continuous monitoring, doc-
umentation, and progress evaluations are necessary.
The multi-site principal investigator is responsible for fulfilling the obligations of
the sponsor-investigator, including the initiation and management of the clinical trial
at all clinical trial sites. The CCC may already be established within a program at the
institution with a medical director who regularly oversees the activities of the CCC,
or it may be newly established under the direction of the multi-site PI. The structure
of the CCC management and staff may differ slightly between institutions.
Resource Management
In conjunction with the DCC, the clinical coordinating center is responsible for
timely trial development to keep pace with scientific advancements and meet the
requirements of the funding agency. Taking too long to launch a clinical trial may
affect the relevancy of the scientific question and the impact of the results on clinical
practice. In partnerships with industry, inefficient and slow timelines affect the
ability of the industry partner to submit marketing applications for investigational
products. In a competitive environment, delays place the CCC and its partner at a
disadvantage.
Operational Efficiency and Project Management

Operational efficiency is achieved and monitored in part through careful project
planning and management. The coordinating center needs to develop target timelines
and milestones for protocol development. The CCC utilizes tools and computer
applications for careful and constant monitoring of deadlines, with the goal of meeting
target timelines. Project managers are accountable to CCC leadership for ensuring
trial activation, progress, and completion. Project managers coordinate the activities
of internal and external personnel and coordinate execution of many processes.
During the study, project managers work closely with data analysis staff to generate
trend analysis reports used to oversee the conduct of the study. Contingency
and escalation plans are part of the CCC SOPs and may be incorporated into the
project plan or other study-specific plans.
Risk Management
Ultimately, the multi-site PI is responsible for the conduct of a clinical trial at all
participating sites. Beyond the study design and compliance with clinical trial
regulations, successful completion of a clinical trial starts with careful planning by
CCC management. This includes ensuring that adequate human and financial
resources are available throughout the life cycle of the trial. The management team
needs to budget for personnel at the CCC and participating sites and personnel and
resources for data coordinating center activities. There are a multitude of costs that
must be included in the budget calculations, including clinical trial management
systems, electronic data capture system, data and record storage, trial supply distri-
bution, training, travel, site recruitment, patient recruitment, monitoring, project
management, special equipment, laboratory services, site payments, and other
costs related to subcontracts and vendor management. Throughout the life of the
clinical trial, the management team is responsible for tracking expenses and ensuring
that the trial remains within budget.
The CCC management team is primarily responsible for risk assessment at the beginning
of the project and for risk management while the clinical trial is ongoing. Recent
developments in clinical research, exemplified by changes to ICH GCP guidelines and
FDA guidance, place a major emphasis on risk management. For example, a 2019
funding opportunity (PAR-19-329) posted by the National Heart, Lung, and Blood
Institute (NHLBI) titled Clinical Coordinating Center for Multi-Site Investigator-
Initiated Clinical Trials emphasizes the focus on both risk management and opera-
tional efficiency, as well as the role of project management to proactively mitigate
risks. The NHLBI requires a trial management plan that describes the strategy of the
CCC to “ensure that management activities of the clinical trial are met including
directly supporting the needs of scientific leadership to identify barriers, make timely
responses, and optimize the allocation of limited resources” with a risk assessment and
management plan that identifies a range of contingencies and solutions.
Identification of potential risks at the beginning of the trial is key to successful
risk management. Monitoring of key risk indicators (KRIs), as part of quality
management, ensures corrective and preventative action in a continuous improve-
ment manner. Key risk considerations and indicators often depend on the type and
complexity of the trial. CCCs often consider the patient population, competing
clinical trial opportunities, accrual rate projections, expected screen failure rate,
investigational new drug or device status, data timeliness and quality, adverse
event rates, and protocol deviation rates. An example of a risk assessment tool is
the Risk Assessment and Categorization Tool (RACT) developed by TransCelerate
to assist with risk-based monitoring implementation.
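As a toy illustration of the kind of structured scoring such tools support (this sketch is not the RACT itself), each identified risk can be scored on likelihood and impact and bucketed into a category that drives mitigation planning; the scales and cutoffs below are assumptions chosen for the example.

```python
# Each identified risk is scored as likelihood x impact (1-5 each) and
# bucketed into a category that drives mitigation and monitoring intensity.
def categorize(likelihood: int, impact: int) -> str:
    score = likelihood * impact
    if score >= 15:
        return "high"    # e.g., escalate to the executive committee
    if score >= 8:
        return "medium"  # e.g., targeted monitoring, contingency plan
    return "low"         # e.g., track through routine KRIs

risks = {
    "slow accrual vs. competing trials": (4, 4),
    "high expected screen-failure rate": (3, 3),
    "site data-submission delays": (2, 3),
}
for name, (likelihood, impact) in risks.items():
    print(f"{name}: {categorize(likelihood, impact)}")
```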
Clinical Coordinating Center and Trial Governance

The multi-site clinical trial and the CCC may be governed by a steering committee or
executive committee, chaired by the multi-site PI. Members of the executive
committee may include PIs of participating sites and statistical center represen-
tation. At the beginning of the trial, the executive committee is concerned with trial
design, funding, and site selection. Upon trial initiation, this committee is charged with
monitoring the progress of the trial and making decisions regarding issues escalated by
CCC management. These issues often fall along the lines of key risk indicators
described above. The multi-site PI and members of the executive committee serve
as champions of the trial with other investigators and other research collaborators.
Upon analysis of trial results, they are responsible for overseeing implementation of
DSMB recommendations and publication of study results. The governance responsi-
bilities of the executive committee may be documented in a charter.
Clinical and Data Coordinating Center Integrated Functions

The coordinating center may include the CCC and the DCC within the same
institution, or the DCC may be part of a separate institution. The primary responsi-
bilities of the DCC include statistical design, data management, data analysis, and
publication of study results. Collectively the functions of the CCC and DCC
encompass the breadth of centralized clinical trial management activities.
The functions of the DCC must be integrated with clinical operations throughout
the life cycle of a clinical trial: development and activation, accrual phase, follow-
up/data maturation, data analysis and reporting, and close-out. The DCC and CCC
leaders are part of the multi-site management team and included in all levels of
management discussion, planning, and review. DCC personnel including statisti-
cians and data managers are members of the study team and key to both protocol
development and trial management. The SOPs and workflow between the CCC and
DCC are developed to dovetail and support all units. Frequent and open communi-
cation occurs on a day-to-day basis for ongoing study management.
Alignment of CCC and DCC resources is important to meeting multi-site coordinating
center obligations for timely trial activation and completion, as well as ongoing trial
management. The DCC is crucial not only for data analysis and results reporting but
also for ongoing quality management analyses. Figure 1 depicts examples of integra-
tion between the clinical and data coordinating centers, over the life cycle of a
clinical trial.
Fig. 1 Clinical coordinating center (CCC) and data coordinating center (DCC) integrated function
examples
Clinical Trial Network Group Coordinating Center

A network group coordinating center supports regulatory compliance for the protection
of human subjects, trial integrity, and data quality.
Research collaborations with other research groups, including international partners,
as well as industry partnerships are important to the success of collaborative clinical
research. Within the academic research and the investigator-initiated clinical research
community, the choice for multicenter clinical trials includes implementation through
a network group or an institution-based coordinating center.
Summary
The clinical coordinating center is responsible for managing all stages of a clinical
trial life cycle: trial development, activation, accrual, follow-up, results reporting,
and closure. Multi-site clinical trial management functions include risk management,
project management, protocol development, site selection and management, regulatory
compliance, quality management, governance, and research and financial administration.

(Figure: the clinical coordinating center at the hub of its functions: clinical trial
development and operations; information technology and systems; regulatory affairs
and compliance; quality management, comprising control, monitoring, and assurance;
governance; and research and financial administration.)
Key Facts
Cross-References
References
FDA Site Investigational New Drug Application Resources. Available online. Accessed 07 Sep
2020. https://www.fda.gov/drugs/types-applications/investigational-new-drug-ind-application
E6(R2) Good Clinical Practice: Integrated Addendum to ICH E6(R1). Available online. Accessed
07 Sep 2020. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/
e6r2-good-clinical-practice-integrated-addendum-ich-e6r1
Food and Drug Administration (2017) Compliance Program 7348.810 Bioresearch Monitoring,
Sponsors, Contract Research Organizations and Monitors. Available online. Accessed 07 Sep
2020. https://www.fda.gov/inspections-compliance-enforcement-and-criminal-investigations/
fda-bioresearch-monitoring-information/compliance-program-7348810-bioresearch-
monitoring
Dana Farber/Harvard Cancer Center, Investigator-Sponsored Multi-center Clinical Trials. Available
online. Accessed 07 Sep 2020. https://www.dfhcc.harvard.edu/research/clinical-research-
support/office-of-data-quality/services-support/dfhcc-multi-center-trials/
NHLBI Funding Opportunity, PAR-19-329, Clinical Coordinating Center for Multi-Site
Investigator-Initiated Clinical Trials. Available online. Accessed 07 Sep 2020. https://grants.nih.gov/
grants/guide/pa-files/par-19-329.html
TransCelerate Risk-Assessment and Categorization Tool. Available online. Accessed 07 Sep 2020.
https://www.transceleratebiopharmainc.com/initiatives/risk-based-monitoring/
Efficient Management of a Publicly Funded
Cancer Clinical Trials Portfolio 33
Catherine Tangen and Michael LeBlanc
Contents
Introduction: SWOG Statistics and Data Management Within the NCTN . . . . . . . . . . . . . . . . . . . . 616
Disease Committee Structure, Interactions, and Study Development . . . . . . . . . . . . . . . . . . . . . . . . . . 617
Disease Committee Structure Within the Statistical Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
Communications Within and Between the Statistical Center . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
Protocol Development Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 619
Using Expert Cross-Disease Teams (Cores), Strategic Meetings, and Standardized
Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 620
Recruitment and Retention Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
Patient-Reported Outcomes (PRO) Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
FDA Application Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
Clinical Trial Methods Core and Translational Medicine Methods Core . . . . . . . . . . . . . . . . . . 622
Rave® Study Build Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622
Training Opportunities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
Standardized Publication Policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623
Standardized Data Collection, Coding, and a Comprehensive Portfolio Database . . . . . . . . . . . . 623
In-House Tools for Design, Study Monitoring, and Analysis of Clinical Trials . . . . . . . . . . . . . . . 625
Comprehensive Statistical Reporting Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
Specimen Tracking Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626
Public Use Statistical Design Calculators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
Automatic Monthly Study Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 627
Site Performance Metrics Reports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 628
Portfolio-Wide Data Safety Monitoring Committee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
General Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
Standard Report Formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630
General Interim Analysis Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
Statistical Center External Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
Abstract
The implementation, management, and reporting of any single well-designed
cancer clinical trial is an extremely complex and expensive undertaking. However,
within our group, there is the opportunity to simultaneously design, implement,
monitor, and analyze up to approximately 100 publicly funded clinical trials across
the development spectrum. Operationally, we aggressively seek to optimize and
standardize processes and software that are common across studies, increase effi-
ciency, and focus on the quality and reliability of study results. Implementing novel
software applications increases the quality and efficiency of data evaluation, mon-
itoring, and statistical analysis across multiple disease and study types. Conventions
and cross-study tools are the key to quality monitoring and analysis of a complex
portfolio of studies. A strategy of fully utilizing the commonalities across the trials
leads to better-quality results for any given study in the portfolio as well as more
efficient utilization of public funds to conduct the studies.
In this chapter the structure and processes are described in the context
of the SWOG Cancer Research Network, one of the four National Clinical Trials
Network (NCTN) adult cancer clinical trial groups funded by the US National
Institutes of Health (NIH) under a cooperative agreement with the US National
Cancer Institute (NCI).
Keywords
Clinical trial · Cancer · Statistics · Portfolio · Data management · Protocol
development · Statistical design · Translational medicine · Software · Data safety
monitoring committee · Database
Introduction: SWOG Statistics and Data Management Within the NCTN

The mission of the SWOG Cancer Research Network, as a partner in the NCTN (see
references for website link), is to significantly improve lives through cancer clinical
trials and translational research. The SWOG Network Operations Center, which
includes the office of the Group Chair, the contracts and legal team, and communi-
cations, is located in Portland, Oregon, and the SWOG Operations Center which
oversees protocol development and audit functions is located in San Antonio,
Texas. We will focus on trial portfolio strategies at the SWOG Statistics and Data
Management Center (Statistical Center) located in Seattle, WA, where all trial and
ancillary study data reside. The Statistical Center is led by the Director, referred to as
the Group Statistician. There are currently 12 statistical faculty who receive some
fraction of funding from SWOG. The faculty typically have other non-SWOG
research interests and receive additional funding from other grant activities outside
of SWOG. An additional 15 Statistical Research Associates (SRAs; MS degree
statisticians) and 2 Statistical Unit Assistants (BS degree statisticians), who are fully funded by
SWOG activities, round out the statistical team. The goal of the Statistical Center
is to provide leadership in the statistical design and data management of oncology
clinical trials for SWOG and to safely and efficiently monitor and report on clinical
investigations over a portfolio of clinical trials. Critically, the Group must analyze
the clinical outcomes in a consistent and reproducible way. The portfolio of managed
trials includes both trials to evaluate new cancer treatments (both single and multi-
arm Phase II and randomized Phase II, Phase II/III, and Phase III trials) and other
studies involving cancer prevention, supportive care and symptom management,
palliative care, as well as trials of comparative effectiveness of treatments. These
nontreatment studies include both randomized and cohort studies and are conducted
in collaboration with the NCI Community Oncology Research Program (NCORP)
(see references for website link).
The SWOG Statistical Center designs, implements, and manages their trial
portfolio through (A) the SWOG disease committee structure, interactions, and
protocol development; (B) use of expert teams, strategic meetings, and standardized
policies; (C) use of standardized data collection, coding, and a comprehensive
portfolio database; (D) development of in-house tools for design, monitoring,
and analysis of clinical trials; (E) utilization of a portfolio-wide Data and Safety
Monitoring Committee; and (F) standardization of our interactions with outside
groups including biospecimen and data sharing. Expanded descriptions follow.
Disease Committee Structure, Interactions, and Study Development

Disease Committee Structure Within the Statistical Center

The Statistical Center structure, and the primary work of SWOG, is accomplished
through anatomic disease committees. Each committee is assigned at least one Ph.D.
statistician (faculty), one or more master’s level statistician(s) referred to as
Statistical Research Associates (SRA), and one or more data coordinator(s) who
work as part of a larger team with the clinical and translational medicine members of
the disease committee. Within the Statistical Center, these committees function
under the direction of the faculty statistician(s), with priorities set in consultation
with the respective clinical disease committee chair and the Group Statistician.
During study development, statisticians work with the study chair to develop the
trial design and help lead the protocol through the SWOG and NCI approval
processes and protocol implementation. Assessments of feasibility, experimental
design, sample size, randomization schemes, data analysis plans, and key elements
of data collection are further refined by the statisticians and the study team during the
development process. Statisticians and the protocol coordinators work with the study
chair to launch the proposed trial. Several statisticians have responsibilities and
methodological skills that address general needs across diseases, which facilitates
standardization within the Statistical Center (see the “Using Expert Cross-Disease
Teams (Cores), Strategic Meetings, and Standardized Policies” section).
Communications Within and Between the Statistical Center

Communications are critical for integrating the work of the statisticians with data
management staff. Important in-person meetings at the Statistical Center include chief
meetings with senior faculty, senior SRA, and data management and applications
development management. Chief meetings are used to set priorities within the Statis-
tical Center and to discuss how to respond to new initiatives, regulatory changes, and
other challenges. Other important meetings that primarily involve Statistical Center
faculty and staff include a twice-monthly meeting of all Statistical Center statisticians to
discuss policy issues, programming and software needs, standards and guidelines, and
statistical issues. One of these twice-monthly meetings includes a statistical analysis or
methodology presentation and discussion. There is a monthly meeting of the SRAs,
where they discuss study implementation issues, evaluate workloads and priorities, and
share ideas. This also serves as a forum for continued training. Monthly disease
committee meetings are led by a faculty statistician with attendance by the respective
disease committee statisticians and data coordinators. This meeting serves to set
analysis and data evaluation priorities, to summarize accrual and adverse event issues,
and to discuss any concerns for ongoing trials. In addition, electronic case report form
(CRF) development ideas, design concerns, study structure setup in the database, and
evaluation requirements are discussed for studies in development. There is also a
weekly statistical capsule review meeting and protocol review meeting. These meet-
ings are concept- or protocol-specific (see the "Protocol Development Process" section
below). Finally, the biannual Statistical Center all-staff meeting is a format to showcase
scientific accomplishments over the past 6 months, to introduce new staff, and to
discuss performance goals for the upcoming 6 months.
Strong links exist between the Operations Center and the Statistical Center.
Importantly, the Group Statistician and Group Chair have a scheduled weekly one-
on-one teleconference, and they are in almost daily email contact to promote
the efficient scientific and administrative functioning of SWOG. Senior statistical
faculty also have close contact with the Director of Operations, who works with the
Group Chair to oversee and direct the operational and administrative activities of
SWOG in support of its scientific missions.
Protocol Development Process

Statistical and other data-related issues are fundamental to any new study's
approval. The Group Statistician and Statistical Center Deputy Director are members
of the SWOG Executive Committee (“triage”), along with the Group Chair, execu-
tive officers, Director of Operations, and Director of Protocols. The statistical design
review by the Group Statistician and Deputy Director at this meeting is an integral
component of each study evaluation. The committee meets weekly and reviews
each new study proposal.

(Figure: organizational relationships among the Network Group Operations Center
(protocol coordination, contracts, and budgets), the Network Group Chair's Office,
participating sites, and study chairs.)

Approved studies then proceed to the Protocol Review Committee (PRC) at the Statistical
Center. The PRC meets weekly or as needed; committee members receive a copy of
the protocol, draft data collection forms, and a copy of the NCI Protocol Submission
Worksheet. This committee is chaired by one of the faculty statisticians and consists
of at least one other faculty statistician, the Deputy Director, and 5–6 master’s degree
level statisticians, with the goal of having multiple independent reviewers for each
study. The study team, the Director of Protocols, and the protocol coordinator from
the SWOG Operations Center attend by teleconference. Study chairs are encouraged
to participate and do so as their schedules permit.
The review provides critiques and recommendations to eliminate internal incon-
sistencies and provide clarification in the protocol document, especially for eligibil-
ity, data and specimen collection, and other implementation issues. Moreover, this
review ensures consistency across studies and disease committees for our approach
to the design and conduct of trials.
Using Expert Cross-Disease Teams (Cores), Strategic Meetings, and
Standardized Policies

There are many commonalities among trials conducted across diseases within
our group. Trials are becoming more complex with the addition of extensive
biospecimen collection, high-dimensional laboratory assessments, and other
emerging demands.

Recruitment and Retention Core
This group provides enhanced statistical input for accruing minority and medically
underserved populations. We have a committed Statistical Center team of staff and
faculty with expertise in these populations to support the activity of SWOG.
The SWOG Statistical Center directly serves trial recruitment goals with a full-
time staff expert (Recruitment and Retention Coordinator) for accrual-related issues
on trials and a faculty statistician with research interests in the design and analysis of studies
involving accrual and representativeness. The Recruitment and Retention Coordi-
nator works closely with study leadership, study teams, and the SWOG Recruitment
and Retention Committee to provide expertise and support in the development
of study-specific recruitment and retention strategies and materials. This coordinator
also participates in NCI working groups related to this mission.
Patient-Reported Outcomes (PRO) Core

This core team oversees the scientific review and conduct of PRO sub-studies
for treatment trials. The PRO core administers the review process for new PRO
proposals, including assessing scientific merit, feasibility, and resource allocation
within the Symptom Control and Quality of Life Committee. For approved sub-
studies, the PRO core provides the statistical design, monitoring, and analysis
resources for the conduct of the PRO study, guided by a set of key design and
analysis principles developed for PRO studies. The staff, funded primarily through
NCORP, includes a faculty statistician, master’s level statisticians, and PRO expert
data coordinators.
FDA Application Core

The goal of this core team is to ensure efficiency in process and procedures for FDA
registration trials across disease committees. This is accomplished through review of
case report forms with extended data requirements and validated data systems,
including biomarker-driven treatment assignment. The team also provides related
training.

Clinical Trial Methods Core and Translational Medicine Methods Core
These cores have been introduced with the goal of introducing new designs and
analysis strategies, enhancing consistency across disease committees, and providing
a sounding board for ideas. Members of these groups include SWOG statistical
faculty as well as non-SWOG faculty who may not be directly supported by our
grants but who have methodological interests in clinical trials or translational
medicine. A goal of the cores is to assess solutions for trial design and TM analyses
that may be appropriate across committees. Each core identifies and facilitates short
topics of discussion and relevant journal papers for the broader group to review
during statistics meetings, thereby aiding the dissemination of new approaches and
methods. These cores are also responsible for maintaining guideline documentation
for best practices within SWOG. These cores ensure that state-of-the-art solutions
are used and help identify new statistical methodologies that are needed for the best
conduct of clinical trial activities.
Data science skills are supplemented by leveraging unique expertise located
outside of the Statistical Center but within the parent institution. These individuals
are included as necessary for special projects and funding flows from appropriate
grant sources for a finite amount of time to address special topics such as biomarker
treatment designs, tumor heterogeneity, genomics, mobile data, computational lin-
guistics, and natural language processing.
Rave® Study Build Core

Where blinding is required, the study build restricts access of
all users to treatment assignment to avoid any potential bias in the review of the
patient’s data. The Statistical Center has developed expertise in Rave® in the areas of
custom functions, calendaring, case report form presentation, and edit checks, all of
which help to personalize the user experience to the specific patient and study while
at the same time providing consistent quality processes across studies. Our philos-
ophy is to maximize data accuracy and to create case report forms that are easy to
understand and personalized to the patient and that include comprehensive and
informative edit checks. Our approach is consistently rigorous, whether the trial
has FDA registration potential or not.
Training Opportunities
Standardized Data Collection, Coding, and a Comprehensive Portfolio Database

The first steps of quality and statistical standardization are derived from (1) devel-
opment of protocols that are clearly stated and inclusive of all criteria and procedures
and (2) data collection necessary to address the key objectives of the trial.
The SWOG data capture system has evolved over time, starting with CRFs scanned
into the database via an optical character recognition system and then moving to an
in-house EDC system that allowed users to enter and amend data from a web-based
portal as well as upload source documentation. This was used until
the acquisition and implementation of Rave®, mandated by the NCI in 2014 for all
network trial groups. Medidata Rave® is a configurable EDC system that includes
data capture, management, and monitoring.
Rave® has excellent features for single studies. However, to best support cross-
portfolio strategies, data are dynamically integrated across trials: CRF data from
Rave® is uploaded to the SWOG database. The variables are then mapped into
standardized coding across studies. As described in the “Comprehensive Statistical
Reporting Tool” section below, this allows for cross-study reporting, improved
umbrella trial support, more extensive patient follow-up, and further exploratory
analysis opportunities. A key feature of the SWOG Statistical Center study design
processes is the mapping of data elements to a standard set of domains and codes,
which facilitates efficient analysis. Where possible, many of the coding conventions
are standardized across types and stages of cancer.
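The sketch below illustrates the general idea of mapping study-specific CRF variables onto a single portfolio-wide schema and code set; the field names, codes, and mapping tables are hypothetical stand-ins rather than SWOG's actual conventions.

```python
# Hypothetical per-study maps from raw CRF field names to standard portfolio
# variables, plus standard value codes, so cross-study queries see one schema.
STUDY_MAPS = {
    "S1234": {"ps_ecog": "performance_status", "resp_code": "best_response"},
    "S5678": {"ecog": "performance_status", "response": "best_response"},
}
STANDARD_CODES = {"best_response": {"CR": 1, "PR": 2, "SD": 3, "PD": 4}}

def standardize(study_id: str, record: dict) -> dict:
    """Rename study-specific fields and recode values to portfolio standards."""
    out = {}
    for raw_name, value in record.items():
        std_name = STUDY_MAPS[study_id].get(raw_name, raw_name)
        codes = STANDARD_CODES.get(std_name)
        out[std_name] = codes.get(value, value) if codes else value
    return out

print(standardize("S1234", {"ps_ecog": 1, "resp_code": "PR"}))
# {'performance_status': 1, 'best_response': 2}
```

Once records from every study share one schema, cross-study reports and pooled analyses can be written once and run against the whole portfolio.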
Unlike some other NCTN groups that leave their data stored at a remote, central
location, the Statistical Center, in cooperation with Medidata, uses Rave® Web
Services to create a process for downloading data from Rave® into the SWOG
database on a nightly basis. An important advantage of our approach is that it allows
for unified monitoring and reporting across SWOG coordinated studies. Having the
data from Rave® CRFs stored in the SWOG database is critical to the Statistical
Center data management and statistical analysis processes. It allows continued use of
our suite of custom-built applications, such as patient evaluation tools and the
Statisticians’ Report Worksheet (SRW, described in a later section), and other reports
that are informed by data in the SWOG database. Having Rave® data in the SWOG
database also allows us to combine both Rave® and clinical trials data collected on
earlier pre-Rave® EDC for further analysis. This facilitates our ability to conduct
SWOG database analyses that combine multiple SWOG trials over a long period of
time.
Having an organization that manages a portfolio of trials also allows for a unified
approach with respect to network security, information exchange security, access
controls, and disaster recovery and contingency plans. The Statistical Center
approaches data security and confidentiality with a focus on the confidentiality,
integrity, and availability of data. Processes and procedures involving network
services and data management applications are influenced by federal requirements
for computer, network, and data security. Network security is based on best practices
for electronic computing and networking and regulatory compliance. Security is
addressed through a defense-in-depth approach. Multiple layers of defense are
utilized to address potential security vulnerabilities. Policies and procedures for
disaster recovery are modified as needed and reviewed annually. Outside profes-
sional consultation, review, and auditing provide additional feedback and result in
updates to policies, procedures, and training as appropriate. All web application
traffic is secured and protected from tampering or eavesdropping by use of industry
standard cryptographic protocols. Operating system and database controls restrict
inappropriate access privileges to data, files, and other objects that require protection
from modification.
In-House Tools for Design, Study Monitoring, and Analysis of Clinical Trials

Comprehensive Statistical Reporting Tool

Fig. 2 A tiled representation of a report from the Statisticians’ Report Worksheet (SRW) program

The Statisticians’ Report Worksheet draws on data elements that are consistently
named across studies. The reporting mechanism normalizes how Phase II and Phase
III study data are presented but with study-specific flexibility with respect to tables.
A collage representation of a study report is presented in Fig. 2.
Specimen Tracking Application

For all SWOG studies, clinical research associates (CRAs) use the specimen tracking
application to log specimens and indicate when those specimens are shipped to the
appropriate biorepository or laboratory.
Specimen-specific questions can be configured to gather information about the
specimens for use by the laboratory processing the sample (e.g., when slides were
cut, how long a sample was frozen). Laboratory staff use the application to indicate
when those shipments are received and in what condition and to indicate if the
specimens were aliquoted and/or shipped to another destination. Laboratory staff can
also enter assay test results which are communicated in real time to CRAs at the
institutions or the Statistical Center for eligibility, stratification, and/or treatment
decisions.
Every patient registered to a SWOG study is assigned a pseudo patient ID that can
be used when transmitting data to a laboratory. Only Statistical Center staff can link
these pseudo patient IDs to the clinical data. Thus, a laboratory performing an assay
with specimens received from the bank will not have access to treatment assignments or
clinical outcomes when performing the assay. Prior to merging clinical data
with lab data, the Statistical Center requires that the lab send us their data so that it
can be stored in our database for future use.
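The linkage scheme just described can be made concrete with a small sketch. The
Python below keeps the real-to-pseudo ID map at the Statistical Center and performs
the merge there; the ID format and names are illustrative assumptions, not SWOG's
actual implementation.

import secrets

class PseudoIdRegistry:
    """Statistical Center-side map between real patient IDs and pseudo IDs."""

    def __init__(self):
        self._real_to_pseudo = {}
        self._pseudo_to_real = {}

    def pseudo_id_for(self, patient_id):
        # Assign a random, non-reversible pseudo ID on first use.
        if patient_id not in self._real_to_pseudo:
            pseudo = "P-" + secrets.token_hex(6)
            self._real_to_pseudo[patient_id] = pseudo
            self._pseudo_to_real[pseudo] = patient_id
        return self._real_to_pseudo[patient_id]

    def merge_lab_results(self, lab_rows, clinical_by_patient):
        """Join lab data (keyed by pseudo ID) back to clinical data.

        Only the Statistical Center holds this registry, so only it can
        perform the merge; the lab never sees treatment or outcome data.
        """
        merged = []
        for row in lab_rows:
            real_id = self._pseudo_to_real[row["pseudo_id"]]
            merged.append({**clinical_by_patient[real_id], **row})
        return merged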
Every year, a large number of prospective clinical trial and translational medicine
statistical design specifications must be evaluated in order to identify the optimal
design. To facilitate this statistical development and to encourage standards across
the portfolio, SWOG has designed a suite of trial power and sample-size calculators.
Clearly, efficient methods are needed to facilitate evaluation of the potential design
and underlying model scenarios. While many sample size and power calculators are
available for simple trial designs, continued development of improved tools for
design and analyses involving multiple subgroups and complex trial monitoring
is needed. Re-implementation and expansion of existing tools for mobile accessibility
are ongoing, moving them to a cloud-based setting using OpenCPU, a system
for embedded scientific computing and reproducible research (Ooms 2014). Inte-
gration of both JavaScript-based and R-based calculations can be accomplished
while retaining a common look and feel in the mobile environment. As an example,
Fig. 3 shows a statistical calculator that provides an interaction test for a predictive
marker and randomized treatment assignment in terms of survival outcome data.
While each type of calculator requires different input parameters, there is a relatively
standard presentation of input and output parameters across the set of power and
sample-size calculators.
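As an illustration of the kind of calculation such a tool performs, the sketch below
computes the total number of events needed to detect a treatment-by-marker interaction
on survival, using a standard normal-approximation formula. The chapter does not
specify the formula behind SWOG's calculator, so this is a generic sketch, not the
production implementation.

from math import log, ceil
from scipy.stats import norm

def events_for_interaction(theta, alpha=0.05, power=0.80,
                           trt_frac=0.5, marker_prev=0.5):
    """Total events needed to detect a treatment-by-marker interaction.

    theta is the ratio of hazard ratios: the treatment hazard ratio in
    marker-positive patients divided by that in marker-negative patients.
    """
    z_a = norm.ppf(1 - alpha / 2)  # two-sided test of the interaction
    z_b = norm.ppf(power)
    p, q = trt_frac, marker_prev
    # Var(log interaction HR) is roughly 1 / (D * p(1-p) * q(1-q)) with D
    # total events spread across the four treatment-by-marker cells.
    d = (z_a + z_b) ** 2 / (p * (1 - p) * q * (1 - q) * log(theta) ** 2)
    return ceil(d)

# Example: detecting a ratio of hazard ratios of 0.6 with 80% power and
# two-sided alpha 0.05 gives events_for_interaction(0.6) == 482 events.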
The comprehensive SWOG database also facilitates the generation of ongoing study
monitoring data summaries which are reviewed by statisticians, data coordinators,
and in some cases study chairs. For instance, summary reports of adverse events,
SAEs, and treatment data are generated monthly and emailed to study team members
for careful monitoring of trial data.

Fig. 3 An example of one of the web-based calculators used in the design of clinical and
translational medicine studies for SWOG. Each calculator uses standard coloring for input and
output parameters and includes a linked help file
The SWOG Institutional Performance Report (IPR) measures the timeliness of data and
specimen submission across all SWOG studies. Additional metrics are in develop-
ment including assessment of responsiveness to queries, serious adverse event
(SAE) reporting timeliness, patient eligibility rates, and specimen quality indicators.
Site principal investigators and their staff as well as SWOG leadership receive these
reports on a regular basis, monitor progress of concerning sites, and provide inter-
vention and support as appropriate and disciplinary action as a last resort.
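A timeliness metric of this kind is straightforward to compute once submissions are
tracked centrally. The sketch below tallies the percent of on-time submissions per
site; the field names and the 14-day window are assumptions for illustration, not the
IPR's actual definitions.

from datetime import date
from collections import defaultdict

def timeliness_by_site(submissions, window_days=14):
    """Percent of form submissions received within `window_days` of due date."""
    on_time = defaultdict(int)
    total = defaultdict(int)
    for s in submissions:
        total[s["site"]] += 1
        if (s["received"] - s["due"]).days <= window_days:
            on_time[s["site"]] += 1
    return {site: 100.0 * on_time[site] / total[site] for site in total}

example = [
    {"site": "S001", "due": date(2020, 3, 1), "received": date(2020, 3, 10)},
    {"site": "S001", "due": date(2020, 3, 1), "received": date(2020, 4, 2)},
]
# timeliness_by_site(example) -> {'S001': 50.0}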
General Structure
Table 1 Example interim analysis table provided to DSMC for an ongoing monitored trial. Phase III two-armed trial testing superiority of an experimental agent

Interim analysis | Expected time since start of trial | Expected events, standard arm | Expected events, experimental arm (assuming Ha: HR = 0.75) | % of expected death information | Superiority one-sided α (Ho: HR = 1.0) | Futility one-sided α (Ho: HR = 0.75)
1     | 3.2 years  | 107 | 88  | 38%  | N/A   | 0.01
2     | 4 years    | 190 | 158 | 67%  | 0.005 | 0.01
3     | 5 years    | 246 | 207 | 86%  | 0.005 | 0.01
Final | 5.75 years | 283 | 240 | 100% | 0.022 | N/A
The DSMC meeting includes three parts. The first part is an open session in which
members of the study team and respective disease committee leadership may be invited
by the DSMC to answer questions or present their requests. Following the open session,
there is a closed session limited to DSMC members and possibly the study statistician in
which outcome results will be presented either by a member of the DSMC, the
designated SWOG Statistician, or the study statistician. A fully closed executive session
follows in which the DSMC discusses outcome results and then votes. At the fully
closed executive session, those present are limited to DSMC members.
The DSMC provides written recommendations to the SWOG Group Chair. If the
Group Chair agrees, the recommendations are forwarded to the National Cancer Institute
for evaluation. Details of this process of communication and required actions
are covered in our DSMC policy.
Individuals invited to serve on the DSMC (voting and nonvoting) disclose to the
Group Chair any potential, real, or perceived conflicts of interest. The Statistical
Center representative to the DSMC is also a member of SWOG’s Conflict Manage-
ment Committee, serving as a liaison between the two committees.
While there is some study flexibility, the SWOG Statistical Center sets standards
with respect to interim analysis strategies. Stopping rules are based on group
sequential designs to preserve overall error rates but allow for early stopping if
extreme results are observed. In addition to the specification of Type I and Type II
errors, a typical design for a Phase III study would call for the specification of a small
number of interim analyses, between two and five, with a small probability of
concluding that treatment is efficacious under the null hypothesis. The timing
of interim analyses is based on overall information or event calculation for time-
to-event studies. SWOG statisticians also typically define a one-sided test of “futil-
ity” using a similar early stopping rule based on testing the alternative hypothesis,
rather than performing a test based on conditional power. For many studies, critical
p-values are chosen for interim stopping at a small number of conservative early
assessments that test the alternative hypothesis (e.g., Green et al. 2016; Fleming et al.
1984). Some plans also include an assessment at 50% information time, stopping if the
estimated hazard ratio does not favor the experimental treatment. This guideline
is easy to describe and can be less conservative than testing the alternative hypoth-
esis. With respect to the time scale of interim analyses, the recommendation of
Freidlin et al. (2016) to use combined-arm event information is followed in most instances. However,
there is flexibility depending on the specific trial features. Regardless of the interim
analysis strategy, the design properties such as power, Type 1 error, and stopping
probabilities must be assessed and presented in the statistical section of the protocol.
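To make the stopping rules above concrete, the sketch below converts the one-sided α
levels from Table 1 into critical observed hazard ratios at a given interim, using the
usual normal approximation Var(log HR) ≈ 4/D for 1:1 allocation with D combined
events. This is an illustrative sketch of the testing logic, not SWOG's production code.

from math import log, exp, sqrt
from scipy.stats import norm

def critical_hrs(events, alpha_sup, alpha_fut, hr_alt=0.75):
    """Critical observed hazard ratios at one interim analysis.

    Superiority tests Ho: HR = 1.0; futility tests the alternative
    Ha: HR = hr_alt, as described in the text.
    """
    se = 2.0 / sqrt(events)  # SE of log HR with `events` combined events
    # Superiority: observed HR must be at least this far below 1.0.
    hr_sup = exp(log(1.0) - norm.ppf(1 - alpha_sup) * se)
    # Futility: stop if the observed HR is significantly worse than hr_alt.
    hr_fut = exp(log(hr_alt) + norm.ppf(1 - alpha_fut) * se)
    return hr_sup, hr_fut

# E.g., at the second interim in Table 1 (~348 combined events, superiority
# alpha 0.005, futility alpha 0.01):
# critical_hrs(348, 0.005, 0.01) -> (~0.76, ~0.96)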
Most Phase III trials involve two treatment arms. Trials with more than two arms
or biomarker-based subgroups are fully addressed in the interim analysis plans.
To facilitate interpretation by the SWOG DSMC, the report includes a table of
estimates/p-values and actions defined by the statistical analysis plan in the protocol.
SWOG’s Statistical Center staff need to effectively interact with external entities to
carry out their clinical trial mission. Interested investigators approach SWOG
to access biospecimens, trial data, or both. Standard, transparent processes need to
be in place to handle and evaluate these queries.
Data Sharing
There are two paths for obtaining trial data from the group’s studies: through the
NCTN/NCORP Data Archive and by direct application to SWOG to request
data. The Statistical Center follows established procedures for archiving SWOG
data with the official NCTN/NCORP Data Archive. Data from qualifying
studies must be archived at the NCI within 6 months of publication. Qualifying
datasets include data from recently reported Phase III trials. The scope of
required data sharing has also increased to include data used in many second-
ary analyses of NCTN and NCORP trial data. Standard operating procedures
(SOPs) are developed at the Statistical Center to document detailed steps of the
processes. A statistician creates the files required to be archived with the NCI
that are sufficient to reproduce all results reported in the primary manuscript.
The Statistical Center administrator assists with creating the data dictionary and
reviews the datasets to ensure compliance with the guidelines. For trial data
that are not stored in the NCTN/NCORP Data Archive, a second path is used
to obtain data. Investigators submit a brief proposal to SWOG that includes
some background for their proposed data analysis, objectives, statistical anal-
ysis methods to be used, and data elements requested. After evaluating for
feasibility and ensuring no overlap with ongoing work, the SWOG Executive
Committee approves the proposal, and a data usage agreement is executed
between the investigator and SWOG. Requested data are then shared with the
investigator. A pseudo-patient ID is used to link records. Data are shared in the
preferred format of the investigator, which typically involves Excel spreadsheets
or SAS datasets.
Biospecimen Sharing
Requests for SWOG specimens are common and may arise from SWOG, other
NCTN groups, or nonaffiliated investigators. For SWOG-led intergroup studies,
the specimens are usually housed in the SWOG biorepository at Nationwide
Children’s Hospital. Appropriate permissions are required, usually from the
NCI Correlative Sciences Steering Committee. Once a material usage agree-
ment (MUA) and data usage agreement (DUA) (if applicable) are executed and
communicated to the biobank and Statistical Center, statisticians use our linked
biospecimen inventory at the Statistical Center to produce pull lists that have
the proper required consent for the translational study; these lists are then
communicated to the biorepository. With the introduction of the NCI Navigator
system, there will be increased opportunities and effort for Statistical Center
statisticians to both support assessing the feasibility of proposed translational
medicine studies and, where appropriate, collaborate on the design and analysis
of the resulting studies.
Most trial specimens can be requested via a new resource recently launched
by the NCI: NCTN Navigator. Cancer researchers interested in conducting
studies using biological specimens and clinical data collected from cancer
treatment trials in the NCTN can use this resource. It includes information
about specimens, such as tumor and blood samples, donated by patients in
NCI-sponsored clinical trials. The clinical trials included in Navigator are
published Phase III studies that evaluated cancer treatments. Investigators can
use the NCTN Navigator website to search the inventory for specimens with
specific characteristics. Investigators who develop proposals and get approval
can use the specimens, along with the trial participants’ clinical information, in
their research. SWOG has a full specimen inventory database from our unified
biobank at Nationwide Children’s Hospital. Specimen data are linked to SWOG
clinical trial data elements to enable efficient overall specimen data manage-
ment. This simplifies specimen utilization for patients meeting various clinical
criteria as well as the creation of reports or datasets that combine data from the
clinical and biorepository databases. It also enhances the efficiency of interac-
tions with the NCI Navigator system. Our enhanced database includes coding
for projects for which specimens are requested and indicates the disbursement
of specimens, project completion, and the return of unused specimens to the
SWOG biorepository.
Key Facts
To provide statistical design and data management and efficiently monitor and report
over a portfolio of clinical trials, communications need to be facilitated between
statisticians and data management staff including regularly scheduled meetings
between senior faculty, senior statistical research associates, data management, and
applications development management.
A standardized, multidisciplinary process for development and review of commonly
structured protocols produces clear, scientifically sound documents that promote
quality and efficiency in the conduct of our trials and provide enrolling sites with a
format with which they are familiar.
Portfolio-wide processes are developed to avoid the need to develop expertise
(e.g., recruitment and retention, patient-reported outcomes, FDA application intent,
trial design, study build, translational medicine) within each disease committee. We
form expert teams (or cores) that can be accessed by all committees when designing,
conducting, and analyzing clinical trials.
A key feature of our study design processes is the mapping of data elements to a
standard set of domains and codes, which facilitates efficient analysis and enables
our ability to conduct analyses that combine multiple trials over a long period of
time.
Creation of custom software applications helps to address the needs of a group
running a wide array of clinical trials. Because of standardized data collection and a
comprehensive database structure, applications such as a standardized yet flexible
statistical report writing tool, a specimen tracking system, and site performance
metrics reports can be developed.
Having an organization that manages a portfolio of trials allows for a unified
approach with respect to network security, information exchange security, confiden-
tiality, access controls, and disaster recovery and contingency plans. Additionally,
cross-portfolio training can be applied to address common features and issues,
recognizing that study-specific training also is necessary. A portfolio-wide Data
Safety Monitoring Committee can also be utilized.
Standard, transparent processes need to be in place to handle and evaluate queries
from external entities wishing to access biospecimens, trial data, or both.
References
Fleming TR, Harrington DP, O’Brien PC (1984) Designs for group sequential tests. Control Clin
Trials 5(4):348–361
Freidlin B, Othus M, Korn EL (2016) Information time scales for interim analyses of randomized
clinical trials. Clin Trials 13(4):391–399. https://doi.org/10.1177/1740774516644752. PMID
27136947
Gentleman R, Temple Lang D (2007) Statistical analyses and reproducible research. J Comput
Graph Stat 16:1–23. https://doi.org/10.1198/106186007X178663
Green S, Benedetti J, Smith A, Crowley J (2016) Clinical trials in oncology, 3rd edn. CRC Press,
Boca Raton
Iqbal SA, Wallach JD, Khoury MJ, Schully SD, Ioannidis J (2016) Reproducible research practices
and transparency across the biomedical literature. PLoS Biol 14(1):e1002333. https://doi.org/
10.1371/journal.pbio.1002333. PMID: 26726926. PMCID: PMC4699702
NCI Community Oncology Research Program (NCORP). www.cancer.gov/research/areas/clinical-
trials/ncorp
NCI’s National Clinical Trials Network (NCTN). www.cancer.gov/research/areas/clinical-trials/
nctn
Ooms J (2014) The OpenCPU system: towards a universal interface for scientific computing
through separation of concerns. https://arxiv.org/abs/1406.4806
Archiving Records and Materials
34
Winifred Werther and Curtis L. Meinert
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
Trial Master File (TMF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640
Key Study Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641
Study Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641
Consent Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642
Data Collection Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642
Investigator’s Brochure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642
Key Communications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
IRB Transmissions and Communications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
Reports of Adverse Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
Directives from Sponsors and Regulatory Agencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
Inquiries from Persons or Journalists Concerning the Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643
The Trial Data System and Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644
Registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644
Other Study Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645
Access to the Archive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646
TMF Retention Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646
Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647
W. Werther (*)
Center for Observational Research, Amgen Inc, South San Francisco, CA, USA
e-mail: [email protected]
C. L. Meinert
Department of Epidemiology, School of Public Health, Johns Hopkins University, Baltimore,
MD, USA
e-mail: [email protected]
Abstract
An archive, in the context of a trial, is a collection of documents and records
relevant to the design and conduct of the trial maintained as a historical reposi-
tory. Archiving is a process that starts before the first person is enrolled and
continues to the end of the trial when all analyses are complete and the investi-
gator group disbands.
So, when a trial is finished, money has run out, and investigators have
dispersed, what do you have archived and where? The answer to the first question
is “everything you may need later,” and the answer to the second is “someplace
readily accessible far into the foreseeable future.” Both answers are correct but
not helpful because the first question requires a crystal ball of what might be
needed and the second requires a place like the Smithsonian and there are no
Smithsonians for archiving records of clinical trials.
This chapter is about the process of archiving and about what to archive.
Keywords
Trial master file · Archiving · Electronic
Introduction
An archive is a place where records or historical materials are stored and preserved.
The place may be a physical location, like the National Archives where records can
be accessed and viewed, or an electronic address serving the same purpose, the latter
usually the case for clinical trials. The International Council for Harmonisation
(ICH) good clinical practice (GCP) guidelines are foundational for guidance on
archiving, as set forth by the European Medicines Agency (EMA 2018).
But even if there were no legal requirements for documentation and archiving,
investigators would document on their own. They need documentation should they
need to retrace steps or check on what they did. They need documentation if
questions arise from outside the trial regarding what they did or how they did it.
Archiving can be a safeguard for questions that might occur during and after
conduct of the trial. A few examples of questions follow. First, investigators in
VIGOR (Vioxx Gastrointestinal Outcomes Research) published their results in
November 2000 in the NEJM (Bombardier et al. 2000). The NEJM expressions of
concern regarding counts in VIGOR came 5 years later (Curfman et al. 2005, 2006).
Second, troubles in the National Surgical Adjuvant Breast and Bowel Project
(NSABP) came from falsified data in breast cancer trials (Crewdson 1994). Third,
the University Group Diabetes Program (UGDP) was a randomized multicenter
secondary prevention trial designed to test whether commonly used treatments for
type 2 diabetes were useful in delaying the cardiovascular and neurological sequelae
of the disease (UGDP 1970a). The trial started in 1960 and finished in 1978. About
mid-way through investigators stopped the use of one of the treatments, tolbutamide
(an oral drug widely regarded as safe and effective in the diabetic community),
because there were concerns regarding safety (UGDP 1970b). The decision brought
an avalanche of criticisms from diabetologists and ultimately led to a review of the
trial by a special committee commissioned by the International Biometric Society.
The committee met several times from 1972 through 1974 and published its report in
JAMA in 1975 (Gilbert et al. 1975). The first meeting of the committee was at the
coordinating center for the trial in the fall of 1972. The first thing the committee
wanted to see was a description of the randomization procedure used in the trial,
written in 1960, before the trial started. The problem was that, somehow, after all
those years in a dark filing cabinet, various sentences, “crystal clear” when written,
had morphed into puzzling statements.
Lesson: Foundational documents, like the system for randomization, should be
read and reviewed by multiple members of the investigational team before archiving.
As these historical examples illustrate, investigators and sponsors can and should
expect many different reasons for needing and using the archive of a clinical trial.

Trial Master File (TMF)
The TMF, broadly, is a collection of documents and files created over the course of a
trial that enables sponsors, monitors, agencies, authorities, or persons to check and
reconstruct what was done. The TMF is discussed in the European Medicines
Agency Guideline on the content, management, and archiving of the clinical trial
master file (paper and/or electronic) (EMA 2018). The TMF is often managed and
maintained electronically and is referred to as the electronic TMF or eTMF.
The executive summary of the Guideline on the content, management, and
archiving of the clinical trial master file provides an overview of the intention of
the TMF and is quoted here:
Trial master file (TMF) plays a key role in the successful management of a trial by the
investigator/institutions and sponsors. The essential documents and data records stored in the
TMF enable the operational staff as well as monitors, auditors and inspectors to evaluate
compliance with the protocol, the trial’s safe conduct and the quality of the data obtained.
This guideline is intended to assist the sponsors and investigators/institutions in complying
with the requirements of the current legislation (Directive 2001/20/EC and Directive 2005/
28/EC), as well as ICH E6 Good Clinical Practice (GCP) Guideline (‘ICH GCP guideline’),
regarding the structure, content, management and archiving of the clinical trial master file
(TMF). The guidance also applies to the legal representatives and contract research organi-
sation (CROs), which according to the ICH GCP guideline includes any third party such as
vendors and service providers to the extent of their assumed sponsor trial-related duties and
functions. The ICH GCP guideline provides information in relation to essential documents to
be collected during the conduct of a clinical trial. The risk-based approach to quality
management also has an impact on the content of the TMF. To ensure continued guidance
once the Clinical Trials Regulation (EU) No. 536/2014 (‘Regulation’) comes into
application, this guidance already prospectively considers the specific requirements of the
Regulation with respect to the TMF.
The table of contents of the Guideline on the content, management, and archiving
of the clinical trial master file is a good reference point when planning a TMF and is
provided below:
1. Executive summary
2. Introduction
3. Trial master file structure and contents
3.1. Sponsor and investigator trial master file
3.2. Contract research organisations
3.3. Third parties-contracted by investigator/institution
3.4. Trial master file structure
3.5. Trial master file contents
3.5.1. Essential documents
3.5.2. Superseded documents
3.5.3. Correspondence
3.5.4. Contemporariness of trial master file
4. Security and control of trial master file
4.1. Access to trial master file
4.1.1. Storage areas for trial master file
4.1.2. Sponsor/CRO electronic trial master file
4.1.3. Investigator electronic trial master file
4.2. Quality of trial master file
5. Scanning or transfers to other media
5.1. Certified copies
5.2. Other copies
5.3. Scanning or transfer to other media
5.4. Validation of the digitisation and transfer process
5.5. Destruction of original documents after digitisation and transfer
6. Archiving and retention of trial master file
6.1. Archiving of sponsor trial master file
6.2. Archiving of investigator/institution trial master file
6.3. Retention times of trial master file
6.4. Archiving, retention and change of ownership/responsibility
7. References
Prerequisites
The study team needs to document in real time for the TMF to be sufficient to allow
people or authorities to reconstruct how the trial was conducted after it is finished. To
accomplish this there must be understanding prior to starting the trial of what gets
documented and by whom. Also required are understandings as to where documents
are to be filed and stored.

Key Study Documents
The study protocol, consent forms, and data collection forms are at the heart of the
trial. All three should be open to the public, except for details that have the potential
of biasing results, for example, details regarding masking and randomization
schemes. The Investigator’s Brochure is another key study document that is neces-
sary when the clinical trial involves an investigational drug.
Study Protocol
The protocol is roughly akin to a blueprint for a building but far less detailed than
blueprints. The protocol, unlike blueprints, allows room for clinical judgment. To
facilitate inclusion of the proper details for the trial conduct, SPIRIT (Standard
Protocol Items: Recommendations for Interventional Trials) is a published document
with a 33-item list of information to be included in protocols (Chan et al. 2013).
Protocols should be open to the public. One way to accomplish this is by posting
on registration sites, such as ClinicalTrials.gov. ClinicalTrials.gov has a field to
include protocols, but it is only sparingly used: 5,721 postings out of 145,844
completed trials, 3.9% of completed trials, as of 3 April 2020 (US NLM 2020).
Publishing protocols as standalone manuscripts in a clinical trial journal is one
way to make them public, but perhaps the best way is as supplemental material in
results publications. Even if done for every results publication, however, the practice
would cover only a fraction of trials, since the majority of trials are never published.
Consent Forms
Consents are necessary prerequisites to enrolling persons into clinical trials. All
transactions concerning consents must be archived in case questions arise later
regarding the content of each version of the informed consent, when each version was
used at the trial sites, and when it was signed by trial participants.
Consents may be oral or written depending on settings and circumstance. The
language and content of the informed consent document are controlled by local
IRBs, even if the trial is done with a central IRB (NIH 2016; FDA 2006).
Clinics in multicenter trials may be provided with prototype consent statements
prepared by the coordinating center or some other leadership center in the trial, but
individual clinics and local IRBs are free to change language or add statements to the
prototype, provided the primary information transmitted remains unaltered. Individ-
ual clinics are responsible for archiving their own consent statements, as approved
by local IRBs. The trial leadership is responsible for archiving master consent
statement and changes thereto in the TMF.
Data Collection Forms
The data collection schedule is outlined in the protocol. Copies of data collection
forms (electronic or paper) and changes thereto during the trial should be
documented in sufficient detail to permit reconstruction of the data collection effort,
if necessary, after the trial is finished. Electronic data collection systems need to have
audit trail functions to track changes to data collection during the conduct of the trial.
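As a minimal sketch of the audit trail function just described, the Python below keeps
an append-only log of field changes from which any field's history can be
reconstructed; the entry fields and method names are illustrative assumptions, not
those of any particular EDC system.

import datetime

class AuditTrail:
    """Records every change to a form field; entries are never modified."""

    def __init__(self):
        self._log = []

    def record_change(self, form, field, old, new, user, reason=""):
        self._log.append({
            "timestamp": datetime.datetime.utcnow().isoformat(),
            "form": form, "field": field,
            "old_value": old, "new_value": new,
            "user": user, "reason": reason,
        })

    def history(self, form, field):
        """Reconstruct the sequence of values a field has taken."""
        return [e for e in self._log if e["form"] == form and e["field"] == field]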
Investigator’s Brochure
The Investigator’s Brochure is a compilation of the clinical and nonclinical data on
the investigational product that are relevant to investigators studying the product.
Changes to the Investigator’s Brochure over time should be
archived in the TMF.
Key Communications
IRB Transmissions and Communications
Individual study centers are responsible for communications to and from their
respective IRBs and for archiving same. The study coordinating center, office of
the chair, sponsor, or some other party in multicenter trials is responsible for
archiving communications to and from the study parent IRB, including communi-
cations concerning the study protocol and changes to it, prototype consent forms,
and data collection forms.
Reports of Adverse Events
Adverse events must be reported to IRBs and all clinics in the trial. Typically, the
clinic in which the event occurred reports the event to the study coordinating center
or like leadership center in multicenter trials, and it in turn reports the event to all
study centers and sponsors. Sponsors have an obligation to maintain an adverse
event database for the investigational or marketed product and an obligation to report
adverse events to regulatory authorities with specific timelines. Regulatory author-
ities may place clinical trials on hold based on adverse event reporting. Communi-
cations on adverse event reporting should be included in the study archive.
Directives from Sponsors and Regulatory Agencies
Directives from sponsors and regulatory agencies concerning the trial must be
communicated to study IRBs and study centers and implemented as indicated.
Archiving is the responsibility of the study coordinating center or like leadership center
in the trial.
Inquiries from Persons or Journalists Concerning the Trial
Queries from persons or the press concerning the trial should be logged with details
as to resolution. Questions from patients in trials usually are addressed at the clinic
level. Multicenter trials will have structures for dealing with questions from the press
or others not involved in the trial. The usual course is to refer those queries to the
study chair or some other responsible person in the organization structure of the trial.
Correspondence should be logged and archived.
The Trial Data System and Database
The trial data system, including the database, is the soul of the trial. In this electronic
age, it will likely comprise dozens of programs to construct and manage the
data system and as many to monitor and analyze data during and after completion of
the trial, many of which will be updated or changed over the course of the trial. The
developer and operator of the data system are responsible for archiving data system
programs. Data analysts are responsible for archiving analysis programs.
The prize possession of a trial is its data. Obviously, the finished, identified dataset
must be archived, but outstanding edits and checks must be resolved before
archiving. Once the dataset is frozen for archiving, changes or updates are nuisances,
especially if the changes impact on counts or results in published papers.
The archive must be secure, password protected, and in a location likely to allow
access for a minimum of 20 years after deposit.
But nothing is forever, and, hence, eventually files may be unreadable because
technologies change. When the UGDP ended, the dataset was deposited at the
National Technical Information Service (NTIS) on magnetic tape. Even if it still
exists there, it would be hard to find anyone capable of reading magnetic tapes.
Datasets can have value long after trials are finished. The Coronary Drug Project
(CDP) ran from 1966 to 1985 (CDP 1973). Just recently a person requested the
dataset to do a follow-up study of enrollees. The dataset disappeared when the
institution housing the coordinating center for the trial ceased to exist in 2010.
Investigators must decide whether to produce a deidentified dataset for use by people
outside the research group. Increasingly, the expectation is that there will be a
deidentified dataset available, but deidentifying data is no mean task. It takes time,
costs money, and requires skilled people to do the deidentifying. Clinical research
teams can hire experts on deidentification to ensure proper procedures are followed.
If done, the set may be available on request or may be deposited at a commercial
enterprise specialized in such services.
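As a toy sketch of a few common deidentification steps (dropping direct identifiers,
replacing patient IDs with salted hashes, shifting dates by a per-patient offset), the
Python below gives the flavor of the work; real deidentification is far more involved,
as noted above, and the column names here are illustrative assumptions.

import hashlib
import random
import pandas as pd

DIRECT_IDENTIFIERS = ["name", "address", "phone", "mrn"]

def deidentify(df, salt, date_cols=("enroll_date",)):
    out = df.drop(columns=[c for c in DIRECT_IDENTIFIERS if c in df.columns])
    # Replace the patient ID with a salted one-way hash.
    out["patient_id"] = out["patient_id"].map(
        lambda pid: hashlib.sha256((salt + str(pid)).encode()).hexdigest()[:12]
    )
    # Shift all dates for a patient by the same random offset (-183..183 days),
    # preserving within-patient intervals while masking calendar time.
    rng = random.Random(salt)
    offsets = {pid: pd.Timedelta(days=rng.randint(-183, 183))
               for pid in out["patient_id"].unique()}
    for col in date_cols:
        out[col] = pd.to_datetime(out[col]) + out["patient_id"].map(offsets)
    return out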
Another use of deidentified data may be participation in meta-analyses or pooling
of placebo-treated patients across trials to better understand the underlying patient
population. Cooperative groups and groups led by medical associations are leading
some efforts to pool deidentified patient level data.
Registration
EudraCT is the European Union database of interventional clinical trials on medicinal
products authorized in the European Union and outside the EU if they are part of the
Pediatric Investigation Plan, from 1 May 2004 onwards. It has been
established in accordance with Directive 2001/20/EC. Protocol and results informa-
tion on interventional clinical trials have been made publicly available through the Euro-
pean Union Clinical Trials Register since September 2011 (EMA 2020).
Trials are to be registered prior to start of enrollment, and registrations are to be
updated through completion of the trial. The ClinicalTrials.gov website has a field for
posting protocols and logging updates to it over the course of the trial and for listing
citations to publications from the investigator group.
Results, without written comments, are to be posted to the website within 1 year
after completion of the trial. The bad news is that only a small fraction of the
registrations contains posted results. For example, for trials completed in 2018,
only 13% had posted results, as of 9 April 2020 (US NLM 2020).
Other Study Documents
Trials, like people, need a curriculum vitae (CV) to list key facts, activities, and accom-
plishments. An example is the CV for the National Emphysema Treatment Trial (NETT
1999) posted at trialsmeinertsway.com (Meinert 2020). Its content is as below:
• Manuals of operations and study handbooks, data collection forms, and revision
histories
Access to the Archive
Trials, especially multicenter trials, have a website built specifically for use by
investigators in the trial. Typical study websites will include the Investigator’s
Brochure, current version of the study protocol, study handbooks and manuals,
copies of data collection forms, and other information of importance to investigators
in the trial. Access must be password protected. If public access to the protocol,
consent forms, and data collection forms is provided, it will be provided on a public
website.
Access to the TMF or eTMF is controlled by the coordinating center, and staff
working on the trial will be given access according to roles; for example, an editor
can upload documents, while staff who need to access files only will be given read-
only access.
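A minimal sketch of such role-based access might map roles to permitted actions as
below; the role and permission names are illustrative, not those of any particular
eTMF product.

ROLE_PERMISSIONS = {
    "editor": {"read", "upload"},
    "reader": {"read"},
}

def can(user_role, action):
    """Return True if the given role is allowed to perform the action."""
    return action in ROLE_PERMISSIONS.get(user_role, set())

# can("editor", "upload") -> True; can("reader", "upload") -> False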
TMF Retention Time

For trials conducted under Directive 2001/20 of the European Union, the sponsor
and investigator must ensure that the documents in the TMF are retained for at least
5 years after the conclusion of the trial or in accordance with national regulations; for
example, Germany requires a 10-year period of retention (de Mey 2018). Some
countries require 20 years or longer.
Summary and Conclusions

Archiving records and materials is a critical activity during the conduct of clinical
trials. The repository for the archive is referred to as the TMF. The most used
guideline on creating and maintaining the TMF is published by the EMA. Successful
archiving includes specified roles and responsibilities of the staff charged with
archiving for the trial. Special attention should be paid to key documents of the
trial including the protocol and consent forms and the various versions used during
the trial. Correspondence and communications with trial sponsors, press, and others
are included in the archive. The database system and database demand special
considerations when locking for archiving. One method for archiving publicly is
to include the trial protocol and results in a clinical trial registry, such as
ClinicalTrials.gov. Access to the archive is controlled by the leadership of the trial.
Retention for archives varies by country and region and needs to be taken into
consideration when planning the archive.
Key Facts
• The trial master file (TMF) serves as the archive for a clinical trial and can be
paper and/or electronic.
• Guidelines on the TMF have been published by the EMA.
• Many considerations go into creating and maintaining the TMF including key
documents, communications, data systems, and other documents. All versions of
documents used during the trial are included in the archive.
• Registration of the trial offers an opportunity to provide documents from the
archive to the public.
• Access to the TMF is controlled by the clinical trial leadership.
• Retention time is a consideration when choosing where to house the TMF.
Cross-References
References
Bombardier C, Laine L, Reicin A, Shapiro D, Burgos-Vargas R, Davis B, Day R, Ferraz MB,
Hawkey CJ, Hochberg MC, Kvien TK, Schnitzer TJ for the VIGOR Study Group (2000)
Comparison of upper gastrointestinal toxicity of rofecoxib and naproxen in patients with
rheumatoid arthritis. N Engl J Med 343:1520–1528
Chan A-W, Tetzlaff JM, Altman DG, Laupacis A, Gøtzsche PC, Krleža-Jeric K, Hróbjartsson A,
Mann H, Dickersin K, Berlin J, Doré C, Parulekar W, Summerskill W, Groves T, Schulz K, Sox
H, Rockhold FW, Rennie D, Moher D (2013) SPIRIT 2013 statement: defining standard
protocol items for clinical trials. Ann Intern Med 158:200–207
Coronary Drug Project Research Group (1973) The coronary drug project: design, methods, and
baseline results. Circulation 47(Suppl I):I-1–I-50
Crewdson J (1994) Fraud in breast cancer study. Chicago Tribune, 13 March 1994
Curfman GD, Morrissey S, Drazen JM (2005) Expression of concern: Bombardier et al., compar-
ison of upper gastrointestinal toxicity of Rofecoxib and naproxen in patients with rheumatoid
arthritis. N Engl J Med 343:1520–1528. N Engl J Med 2005;353:2813–2814
Curfman GD, Morrissey S, Drazen JM (2006) Expression of concern reaffirmed. N Engl J Med 354:
1193
De Mey C (2018) Archiving – how long? https://www.acps-network.com/2018/11/08/ct-lost-in-delegation-2/. Accessed 23 Dec 2020
European Medicines Agency (EMA) Good Clinical Practice Inspectors Working Group (2018)
Guideline on the content, management and archiving of the clinical trial master file (paper and/
or electronic) https://www.ema.europa.eu/en/documents/scientific-guideline/guideline-content-
management-archiving-clinical-trial-master-file-paper/electronic_en.pdf. Accessed 23 Dec
2020
European Medicines Agency (EMA) (2020) EudraCT public home page. https://eudract.ema.europa.eu/. Accessed 23 Dec 2020
Gilbert JP, Meier P, Rümke CL, Saracci R, Zelen M, White C (1975) Report of the Committee for
the Assessment of biometric aspects of controlled trials of hypoglycemic agents. JAMA 231:
583–608
Meinert CL (2020) Trials Meinerts Way. https://jhuccs1.us/clm/default.asp. Accessed 9 Apr 2020
National Emphysema Treatment Trial Research Group (1999) Rationale and design of the National
Emphysema Treatment Trial (NETT): a prospective randomized trial of lung volume reduction
surgery. Chest 116:1750–1761
United States Food and Drug Administration (FDA) (2006) Using a Centralized IRB Review
Process in Multicenter Clinical Trials Guidance for Industry. https://www.fda.gov/regulatory-
information/search-fda-guidance-documents/using-centralized-irb-review-process-multicenter-
clinical-trials. Accessed 23 Dec 2020
United States National Institutes of Health (NIH) (2016) Final NIH policy on the use of a single institutional review board for multi-site research. https://grants.nih.gov/grants/guide/notice-files/not-od-16-094.html. Accessed 23 Dec 2020
United States National Library of Medicine (US NLM) (2020) ClinicalTrials.gov https://www.
clinicaltrials.gov/. Accessed 3 Apr 2020
University Group Diabetes Program Research Group (1970a) A study of the effects of hypoglyce-
mic agents on vascular complications in patients with adult-onset diabetes: I. Design, methods,
and baseline characteristics. Diabetes 19(suppl 2):747–783
University Group Diabetes Program Research Group (1970b) A study of the effects of hypoglyce-
mic agents on vascular complications in patients with adult-onset diabetes: II. Mortality results.
Diabetes 19(Suppl 2):785–830
Good Clinical Practice
35
Claire Weber
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 650
GCP Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 650
ICH and GCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 650
GCP Historical Timeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651
Key Aspects of ICH GCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653
GCP Documents Also Known as Essential Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
Abstract
Good clinical practice (GCP) is an international quality standard that is provided
by the International Council for Harmonisation (ICH), an international body that
defines standards, which governments can transpose into regulations for all
phases of clinical trials involving human subjects. GCP applies to the trial
sponsor team, the institutional review boards (IRB)/ethics committees (EC) and
the investigator site teams. This chapter describes the GCP concepts, a GCP
historical timeline, and how GCP in all phases of clinical trials and drug devel-
opment through regulatory approval is the standard for clinical research.
Keywords
ICH · GCP · Ethical · Consent · Privacy · Regulations · Guidelines · Sponsor ·
IRB/EC · Investigator
C. Weber (*)
Excellence Consulting, LLC, Moraga, CA, USA
e-mail: [email protected]
Introduction
GCP Definition
An international ethical and scientific quality standard for designing, conducting, recording,
and reporting trials that involve the participation of human subjects. Compliance with this
standard provides public assurance that the rights, safety, and well-being of trial subjects are
protected, consistent with the principles that have origin in the Declaration of Helsinki, and
that the clinical trial data are credible (ICH E6 [R2] Introduction Page 1).
A standard for the design, conduct, performance, monitoring, auditing, recording, analyses,
and reporting of clinical trials that provide assurance that the data and reported results are
credible and accurate, and that the rights, integrity, and confidentiality of trial subjects are
protected (ICH E6 [R2] Glossary Section 1.24).
GCP, together with good manufacturing practice (GMP), good laboratory
practice (GLP), good pharmacovigilance practice (GPVP), good
distribution practice (GDP), and good documentation practice (GDoP) standards, is
collectively referred to as GxP. GxP applies to all aspects of drug development. This chapter only pertains to
GCP, but it should be noted that GCP shares some common elements with definitions
of other areas of GxP, since they each are standards to ensure drug products are safe,
pure and not adulterated, and effective. Although GCP was developed for clinical
investigational drug studies, the principles are also used in medical device studies,
and other social/behavioral studies.
The following key events were instrumental for the development of GCP:
The Nuremberg Code of 1947:
• On August 20, 1947, the judges delivered their verdict in the “Doctors Trial”
against Karl Brandt and 22 others. These trials focused on doctors involved in the
human experiments in concentration camps. The suspects were involved in over
3,500,000 sterilizations of German citizens.
• Instituted informed consent, absence of coercion, and voluntary participation.
The Declaration of Geneva of 1948:
• One of the first and most important actions of the World Medical Association
(WMA), regarding a physicians’ dedication to the humanitarian goals of medicine
• Physician’s oath, to be sworn at the time a person enters the Medical profession,
was added to the Declaration of Geneva and adopted by the General Assembly of
the World Medical Association
• Pledge in view of the medical crimes that been committed in Nazi Germany and
includes “I will maintain the utmost respect for human life; even under threat, I
will not use my medical knowledge contrary to the laws of humanity”
The Drug Amendments Act of 1962 (Kefauver-Harris Amendment):
• Informed consent.
• FDA officials began to lobby members of Congress and draft legislation in the
late 1950s to address gaps in oversight in drug manufacturing and marketing.
• Drug manufacturers are required to prove to the FDA the effectiveness and safety
of the product before marketing.
• These efforts coincided with an FDA official’s refusal to give a positive
opinion on thalidomide, a drug used to treat morning sickness and nausea that
was found to have caused hundreds of birth defects in Western Europe.
• Raised the issue of the importance of keeping good records and documentations.
• FDA had veto power over new drugs entering the market.
• Drugs now had to demonstrate evidence of effectiveness as well as safety,
dramatically increasing the amount of time, resources, and scientific expertise
required to develop a new drug. The Modern Clinical Trial System was
implemented, the1962 Amendment required interpretation of effectiveness to
include “substantial evidence” in “adequate and well-controlled investigations.”
The Medicines Act of 1968 from the Department of Health and Social Services
(DHSS):
Key Aspects of ICH GCP
Consent
Consent is a critical aspect of GCP and is defined as:
A process by which a subject voluntarily confirms his or her willingness to participate in a
particular trial, after having been informed of all aspects of the trial that are relevant to the
subject’s decision to participate. Informed consent is documented by means of a written,
signed and dated informed consent form (ICH E6 [R2] Glossary Section 1.28).
In addition to consent, trials with children and with impaired individuals may
need to include an assent signed by the subject, in addition to the consent signed by
the legal representative.
GCP requires that all elements of consent/assent are appropriately obtained and
documented.
IRB/EC
The IRB/EC is defined as:
An independent body constituted of medical, scientific, and non-scientific members, whose
responsibility is to ensure the protection of the rights, safety, and well-being of human
subjects involved in a trial by, among other things, reviewing, approving, and providing
continuing review of trial protocol and amendments and of the methods and material to be
used in obtaining and documenting informed consent of the trial subjects (ICH E6 [R2]
Glossary Section 1.31).
GCP requires that all studies are reviewed and overseen by IRB/ECs.
Privacy
The US Health Insurance Portability and Accountability Act (HIPAA) of 1996 Privacy
Rule and the European Union (EU) General Data Protection Regulation (GDPR),
and other similar international regulations are important aspects of GCP as safe-
guards to protect the privacy of personal health information and rights to examine
and obtain a copy of health records.
GCP requires that the subject’s privacy is protected.
[Figure: standard operating procedures; international laws and regulations (CFR, CTD); guidelines]

GCP Documents Also Known as Essential Documents
Documents which individually and collectively permit evaluation of the conduct of a study
and the quality of the data produced (ICH E6 [R2] Glossary Section 1.23).
• Protocol
• Consent form
• Regulatory authority approvals (Country, IRB/IEC)
• Investigator’s brochure
• Plans – e.g., monitoring, medical oversight, risk management, statistical analysis
plan (SAP), blinding/masking, pharmacovigilance, etc.
• Investigator site source documents
• Investigator statement – FDA form 1572 and financial disclosure
• Electronic or paper case report form (CRF)
• Clinical study report (CSR)
• Standard operating procedures and training
• Trial master file (sponsor and investigator site)
Essential documents demonstrate that GCP is followed, and the trial master file is
the master archive of the essential documents.
Summary and Conclusion
GCP is an international quality standard that is provided by the ICH for all phases of
clinical trials, and drug development through regulatory approval. The goals of GCP
are to protect all research participants and assure that only worthy treatments are
approved for use for future patients. ICH E6 (R2) includes GCP principles (referred
to as ICH GCP) that are widely recognized as authoritative in defining obligations of
sponsors, investigator site teams, and IRB/ECs. The IRB/EC and implementation of
consent and privacy are critical aspects of GCP. ICH GCP principles are included in
standard operating procedures, as well as international, local, and regional laws,
directives, and regulations. Essential documents demonstrate that GCP is followed.
The trial sponsor team, the investigator site team, and IRB/EC are all responsible for
protecting human subjects who volunteer to participate and must be trained on and
follow GCP. If the clinical study follows GCP, the data generated from the trial will
be mutually accepted by many of the regulatory agencies around the world in
support of an approval to market the drug.
Key Facts
The facts covered in this chapter include: Goals and definitions of GCP and ICH,
historical timeline of key events leading to GCP, key aspects of GCP, clinical
research teams responsible for following GCP, and GCP essential documents.
Cross-References
References
Act of October 10, 1962 (Drug Amendments Act of 1962), Public Law 87-781, 76 STAT
780, which amended the Federal Food, Drug, and Cosmetic Act to assure the safety, effective-
ness, and reliability of drugs, authorize standardization of drug names, and clarify and
strengthen existing inspection authority
Clinical Trial Directive: Directive 2001/20/EC of the European Parliament and of the Council of
4 April 2001 on the approximation of the laws, regulations and administrative provisions of the
Member States relating to the implementation of good clinical practice in the conduct of clinical
trials on medicinal products for human use
Declaration of Helsinki of 1964
EMA GCP Directive 2005/28/EC
FDA Code of Federal Regulations (CFR), Title 21, Part 312
Health Insurance Portability and Accountability Act of 1996 (HIPAA)
ICH E6 (R2) International Council for Harmonisation Guideline for Good Clinical Practice
Guideline for Good Clinical Practice (Introduction, p 1, Glossary Section 1.24, pp 8–10,
Glossary Section 1.28, Glossary Section 1.31, Glossary Section 1.23)
Medicines Act 1968 c.67
Regulation (EU) 2016/679 (General Data Protection Regulation), and Directive 95/46/EC
The Belmont report: ethical principles and guidelines for the protection of human subjects of
research (1978). The Commission, Bethesda
The Nazi doctors and the Nuremberg Code: human rights in human experimentation (1995). Oxford
University Press, New York
World Medical Association (2001) World Medical Association declaration of Helsinki. Ethical
principles for medical research involving human subjects. Bull World Health Organ 79(4):
373–374. World Health Organization
Institutional Review Boards and Ethics
Committees 36
Keren R. Dunn
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658
History of Research Ethics and Emergence of IRBs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658
The National Commission and the Belmont Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659
Ethics Violations and Calls for Reform at the Turn of the Century . . . . . . . . . . . . . . . . . . . . . . . . 661
Revision of the Common Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 662
IRB Functions and Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 663
IRB Review of Research and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664
IRB Review Levels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 666
IRB Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 667
Criteria for IRB Approval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 668
Informed Consent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
Documentation of Informed Consent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 669
Waivers of Informed Consent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673
Single IRB Review and IRB Reliance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674
Ethics Committees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 676
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 677
Abstract
Institutional review boards (IRBs) are committees established in accordance with
US federal regulations to review and monitor clinical trials and other research
with human subjects. IRBs evolved from a history of egregious ethical violations
in research with human subjects and the ethics codes and declarations that ensued,
and were first mandated by US law in 1974, with the passing of the National
Research Act. IRBs help to ensure the protection of the rights and welfare of
human subjects by applying the ethical principles of the Belmont Report, respect
for persons, beneficence, and justice, in their review of research projects. They
have the authority to approve, require modifications to, or disapprove proposed
research. IRBs review plans to obtain and document informed consent from
research participants and can waive the requirements for informed consent in
certain circumstances. IRBs may exist within the institution where research is
being conducted or institutions can rely on an external IRB with a written
agreement. While the term IRB is unique to the USA, clinical trials internation-
ally adhere to the ethical principles of the Declaration of Helsinki, which requires
independent review by an ethics committee.
Keywords
Institutional review board · IRB · Ethics committee · Belmont Report · Common
Rule · Informed consent
K. R. Dunn (*)
Office of Research Compliance and Quality Improvement, Cedars-Sinai Medical Center, Los
Angeles, CA, USA
e-mail: [email protected]
Introduction
IRBs in the United States and ethics committees around the world conduct indepen-
dent review of research with human subjects and provide a core protection for the
rights and welfare of participants in clinical research. IRB or ethics committee
review is also critical in gaining and maintaining public trust of clinical research
due to a history of egregious ethical violations in the conduct of clinical research.
This chapter provides an overview of the history of ethical violations in clinical
research and emergence of IRBs and ethics committees, a summary of IRB functions
and operations, an outline of the requirements for informed consent, and an overview
of recent changes to the system of IRB review and oversight. A timeline of key
milestones in research ethics and the establishment of IRBs is shown in Fig. 3.
The foundation of modern-day research ethics in the USA and around the world
begins with the Nuremberg Code, which emerged in 1947 from the Nuremberg
trials, in which Nazi physicians were tried for their conduct of atrocious medical and
scientific experiments on prisoners in concentration camps (White 2020). The
Nuremberg Code includes ten basic principles for the conduct of ethical research
with human subjects, covering voluntary and informed consent, risk/benefit assess-
ment that is favorable, subject right to withdraw, and research expertise and respon-
sibility (U.S. Government Printing Office 1949; Rice 2008). In 1964, the World
Medical Association created the Declaration of Helsinki, an ethical code of conduct
that built upon the principles outlined in the Nuremberg Code, but added the tenets
that the interests of subjects must be placed above the interests of society and that
every subject should be given the best known treatment (Rice 2008). Additionally,
the Declaration of Helsinki expanded upon the requirement for voluntary and
informed consent from the Nuremberg Code to address the ethical participation of
children and compromised adults in research (White 2020). Despite US involvement
in the development of both the Nuremberg Code and Declaration of Helsinki, there
were multiple documented cases of serious ethical violations in US research
throughout the 1950s and 1960s (White 2020).
In 1966, a well-respected anesthesiologist from Massachusetts General Hospital,
Henry Beecher, published an article in the New England Journal of Medicine
outlining multiple examples of ethical violations he had garnered from a review of
publications in an “excellent journal” (Harkness et al. 2001). Beecher’s examples
included, among other violations, studies where known effective treatment was
withheld from subjects and studies where subjects were exposed to excessive and
unjustified risk of harm (Beecher 1966). In conclusion, Beecher advocated that “it is
absolutely essential to strive for (informed consent) for moral, sociologic and legal
reasons” (Beecher 1966). Additionally, he concluded, “there is the more reliable
safeguard provided by the presence of an intelligent, informed, conscientious,
compassionate, responsible investigator” (Beecher 1966). Notably, Beecher was
not an advocate for independent review and oversight, despite the influence his
work had on the emergence of the system of institutional review boards (IRBs)
(Harkness et al. 2001).
Perhaps the most infamous research ethics violation in the USA in the twentieth
century, “Tuskegee Study of Untreated Syphilis in the Negro Male,” was exposed in
an article published in the Washington Star by Jean Heller in 1972 (White 2020). The
study began in 1932, when there were no safe and effective treatments available for
syphilis, and enrolled 600 African American men from the community around
Tuskegee, Alabama (White 2020). Although penicillin was proven to be an effective
treatment for syphilis by 1945 and was widely used, the men in the study were not
informed and not offered treatment so that the researchers could continue to learn
about the natural course of the disease (White 2020). The study continued for
40 years until it was publicly exposed in 1972 (White 2020).
Public outcry about the Tuskegee Study and other ethics violations, as well as
concern from the medical community following Beecher's article, led Congress to
pass the National Research Act in 1974, which established federal regulations for the
protection of human subjects (45 CFR 46) and paved the way for the modern system
of institutional review boards (IRBs) for the oversight of research with human
subjects (Rice 2008). The National Research Act mandated that any entity applying
for an NIH grant or contract must provide assurances that it has established an IRB to
protect the rights of human subjects in biomedical and behavioral research
(US Congress Senate 1974).
The National Research Act also established the National Commission for the Protection
of Human Subjects of Biomedical and Behavioral Research (the Commission). The
duties of the Commission were: (1) to identify the basic ethical principles which
should guide the conduct of research with human subjects; (2) to develop guidelines
for the conduct of research with human subjects in accordance with those ethical
principles; and (3) to advise the secretary on administrative actions to apply the
guidelines and on any other matters related to the protection of human research
subjects (US Congress Senate 1974).

The Commission's deliberations begin with the premise that investigators should not have
sole responsibility for determining whether research involving human subjects fulfills ethical
standards. Others, who are independent of the research, must share this responsibility,
because investigators are always in positions of potential conflict by virtue of their concern
with the pursuit of knowledge as well as the welfare of the human subjects of their research.
The Commission believes that the rights of subjects should be protected by local review
committees operating pursuant to federal regulations and located in institutions where
research involving human subjects is conducted. Compared to the possible alternatives of a
regional or national review process, local committees have the advantage of greater
familiarity with the actual conditions surrounding the conduct of research. Such committees
can work closely with investigators to assure that the rights and welfare of human subjects
are protected and, at the same time, that the application of policies is fair to the
investigators. They can contribute to the education of the research community and the
public regarding the ethical conduct of research. The committees can become resource
centers for information concerning ethical standards and federal requirements and can
communicate with federal officials and with other local committees about matters of
common concern.

Fig. 1 Transcribed excerpt from the Commission's report: Institutional Review Boards
The Commission published multiple reports between 1975 and 1979. On
September 1, 1978, the Commission published a report entitled Institutional Review
Boards, which outlined recommendations for the IRB review mechanism and
evaluation of IRB performance, as well as steps to improve the ethical review
process (National Commission 1978). Figure 1 includes an excerpt from this report
on IRBs.
The Belmont Report, named for the location of the Commission’s four-day
intensive meetings at the Belmont Conference Center in 1976, was issued September
30, 1978 and published in the Federal Register April 18, 1979 (National Commis-
sion 1979). The Belmont Report described the boundaries between the practice of
medicine and research, outlined the basic ethical principles to guide research with
human subjects, and delineated the application of these ethical principles (National
Commission 1979). Figure 2 includes a summary of the ethical principles and their
applications outlined in the Belmont Report.
In 1981, revised regulations for the protection of human subjects (45 CFR 46)
incorporating most of the recommendations of the Belmont Report were signed by
the secretary of the Department of Health and Human Services (DHHS) and the
Food and Drug Administration (FDA) adopted similar regulations covering
requirements for IRBs (21 CFR 56) and informed consent (21 CFR 50) in FDA-
regulated clinical investigations (White 2020). In an effort to harmonize regulations
across the federal government, the Federal Policy for the Protection of Human
Subjects (45 CFR 46) was adopted in 1991 by 15 federal departments and agencies
and became known as the "Common Rule" (White 2020). The FDA has not signed on
to the Common Rule, but has committed to amending its own regulations at 21 CFR
parts 50 and 56 to align with the Common Rule to the extent possible (White 2020).

Fig. 2 Summary of the Belmont Report's ethical principles and their applications
Ethics Violations and Calls for Reform at the Turn of the Century
In 1999 and 2001, there were two highly publicized tragic deaths of research subjects
participating in studies at separate renowned institutions (White 2020). Jesse
Gelsinger was born with a mild form of ornithine transcarbamylase (OTC) defi-
ciency that was well managed with diet and medication and had just turned 18 when
he volunteered to participate in a phase 1 gene therapy study for the treatment of OTC
deficiency (White 2020). Shortly after receiving the experimental gene therapy,
Gelsinger experienced an acute inflammatory response leading to multiorgan failure
and died just 4 days later (White 2020). This led to an investigation, which raised
questions about significant ethics and regulatory violations, including, among others,
whether Gelsinger was enrolled in violation of the eligibility criteria in the
IRB-approved protocol and a conflict of interest for the director of the gene studies
program that was not disclosed (White 2020). Ellen Roche was a healthy 24-year-old
lab technician when she volunteered in 2001 to participate in a physiology study in
which subjects were administered inhaled hexamethonium (White 2020). Within
24 h Roche developed significant pulmonary abnormalities, which progressed to
multiorgan failure, and ultimately, she died within a month (White 2020). Concerns
raised by the investigation of this case included failure to identify reported
complications associated with hexamethonium in the literature, failure to submit an
investigational new drug (IND) application or to inquire with the FDA about the
need for an IND, lack of information in the consent form about the regulatory status
of hexamethonium, missing reports of complications in prior publications, and use
of a chemical-grade agent rather than a pharmaceutical-grade one (White 2020).
In September 2000, in response to the Gelsinger tragedy (and before the tragedy
of Roche’s death), Donna Shalala, secretary of Health and Human Services,
published a plan and urgent call to action to strengthen protections for human
research subjects in the New England Journal of Medicine (Shalala 2000). Shalala
outlined several steps taken by the government, including expansion of the role of
the Office for Protection from Research Risks (OPRR) and renaming it the Office for
Human Research Protections (OHRP), along with the appointment of new leadership
(Shalala 2000). However, Shalala made the case that ultimate responsibility to
protect human subjects lies with the institutions performing research (Shalala
2000). With respect to IRBs, Shalala stated:
IRBs, the key element of the system to protect research subjects, are under increasing strain.
In June 1998, the Office of Inspector General of the Department of Health and Human
Services issued four investigative reports, which indicated that IRBs have excessive work-
loads and inadequate resources. At a number of institutions, IRB oversight was inadequate,
and on occasion, researchers were not providing the boards with sufficient information for
them to evaluate clinical trials fully.
Revision of the Common Rule

A path to revise the Common Rule began in 2011, when the federal government
sought input from the public with the release of an advance notice of proposed
rulemaking (ANPRM). The ANPRM described shortcomings of the current regula-
tions, citing changes to the research enterprise since the Common Rule was first
enacted 20 years earlier, including "the proliferation of multi-site clinical trials . . ."
(Department of Health and Human Services 2011).
Fig. 3 Timeline of key milestones in research ethics and the establishment of IRBs: 1947, the
Nuremberg Code; 1974, the National Research Act and HHS regulations (45 CFR 46)
established; 1981, FDA regulations adopted; 2001, Association for the Accreditation of Human
Research Protection Programs established

IRB Functions and Operations

IRB Review of Research and Definitions
All research involving human subjects (see definitions in Table 1) that is conducted
or supported by a federal department or agency is subject to the regulations in the
Common Rule (45 CFR 46), including the requirements for IRB review and
approval and informed consent. Although these regulations only apply to federally
conducted or supported research, institutions with a Federalwide Assurance (FWA)
are required to apply similar protections to all their research involving human
subjects (OHRP 2021). Additionally, the International Committee of Medical Journal
Editors (ICMJE) notes that authors should seek approval to conduct research from an
independent review body such as an IRB or ethics committee (ICMJE 2019), an
additional incentive for researchers to seek, and for institutions to require, IRB
approval of all research with human subjects. Since FDA oversight is focused on
drugs, biologics, and medical devices, different terminology is used to define the
research subject to IRB review. The FDA regulations at 21 CFR Parts 50 and 56
(informed consent and IRB review) apply to clinical investigations, as defined in
Table 1.

Table 1 Selected definitions transcribed from the Common Rule (45 CFR 46.102) and FDA
regulations (21 CFR 56.102)

Human subject. Common Rule: A living individual about whom an investigator (whether
professional or student) conducting research (i) obtains information or biospecimens through
intervention or interaction with the individual, and uses, studies, or analyzes the information or
biospecimens; or (ii) obtains, uses, studies, analyzes, or generates identifiable private information
or identifiable biospecimens. FDA: An individual who is or becomes a participant in research,
either as a recipient of the test article or as a control. A subject may be either a healthy human or
a patient.

Intervention. Includes both physical procedures by which information or biospecimens are
gathered (e.g., venipuncture) and manipulations of the subject or the subject's environment that
are performed for research purposes.

Interaction. Includes communication or interpersonal contact between investigator and subject.

Private information. Includes information about behavior that occurs in a context in which an
individual can reasonably expect that no observation or recording is taking place, and
information that has been provided for specific purposes by an individual and that the individual
can reasonably expect will not be made public (e.g., a medical record).

Identifiable private information. Private information for which the identity of the subject is or
may readily be ascertained by the investigator or associated with the information.

Identifiable biospecimen. A biospecimen for which the identity of the subject is or may readily
be ascertained by the investigator or associated with the biospecimen.

Research. A systematic investigation, including research development, testing, and evaluation,
designed to develop or contribute to generalizable knowledge.

Clinical trial (Common Rule only). A research study in which one or more human subjects are
prospectively assigned to one or more interventions (which may include placebo or other
control) to evaluate the effects of the interventions on biomedical or behavioral health-related
outcomes.

Clinical investigation (FDA only). Any experiment that involves a test article and one or more
human subjects and that either is subject to requirements for prior submission to the Food and
Drug Administration under section 505(i) or 520(g) of the act, or is not subject to requirements
for prior submission to the Food and Drug Administration under these sections of the act, but the
results of which are intended to be submitted later to, or held for inspection by, the Food and
Drug Administration as part of an application for a research or marketing permit.

Test article (FDA only). Any drug for human use, biological product for human use, medical
device for human use, human food additive, color additive, electronic product, or any other
article subject to regulation under the act or under sections 351 or 354-360F of the Public Health
Service Act.

Minimal risk. The probability and magnitude of harm or discomfort anticipated in the research
are not greater in and of themselves than those ordinarily encountered in daily life or during the
performance of routine physical or psychological examinations or tests.
IRB Review Levels

IRBs have the authority to approve, require modifications to, or disapprove
research and are required to conduct continuing review of research at least annually,
except that most minimal risk research and ongoing research that remains open only
for long-term data collection and analysis does not require continuing review in
accordance with the revised Common Rule (45 CFR 46.109 and 21 CFR 56.109).
IRBs are required to notify investigators and the institution in writing of their decisions
to approve or disapprove research activities. The reason for disapproval must be
explained in writing and the investigator must be given an opportunity to respond in
writing (45 CFR 46.109 and 21 CFR 56.109).
A complete list of the categories of research eligible for expedited IRB review is
posted in the Federal Register (DHHS NIH 1998). Minor changes to previously
approved research can also be reviewed by expedited IRB review. While the IRB
chairperson or designated member has the authority to approve or require modifica-
tions to research activities eligible for expedited review, only the convened IRB has
the authority to disapprove research (45 CFR 46.110 and 21 CFR 56.110).
Certain categories of research with human subjects are exempt from the require-
ments for IRB review and informed consent under the Common Rule (45 CFR
46.104). Exempt research includes the following categories of research under
specified circumstances for each category:
• Education research
• Surveys, interviews, educational assessments, and observation of public behavior
• Benign behavioral interventions
• Research with information or biospecimens collected for other purposes (e.g.,
clinical)
• Federal research and demonstration projects
• Taste and food quality evaluation
• Storage of information or biospecimens for secondary research with broad
consent
• Secondary research with information or biospecimens under broad consent
Certain exempt research categories require that the IRB conduct “limited IRB
review,” a form of expedited review focused on protections of privacy and confi-
dentiality (45 CFR 46.104).
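To make the pathway concrete, the sketch below arranges these determinations in the order an IRB office might apply them: exemption first, then eligibility for expedited review, with review by the convened board as the default. It is an illustrative outline only, not a regulatory determination tool; the function and flag names are hypothetical, and real determinations turn on the detailed conditions in 45 CFR 46.104, 45 CFR 46.110, and the expedited-review categories published in the Federal Register.

```python
from enum import Enum

class ReviewLevel(Enum):
    EXEMPT = "exempt (limited IRB review may still apply)"
    EXPEDITED = "expedited review"
    FULL_BOARD = "review by the convened IRB"

def triage_review_level(fits_exempt_category: bool,
                        is_minimal_risk: bool,
                        fits_expedited_category: bool) -> ReviewLevel:
    # Exemption is checked first; certain exempt categories still
    # require limited IRB review of privacy and confidentiality.
    if fits_exempt_category:
        return ReviewLevel.EXEMPT
    # Minimal-risk research in a published expedited category may be
    # reviewed by the IRB chairperson or a designated member.
    if is_minimal_risk and fits_expedited_category:
        return ReviewLevel.EXPEDITED
    # Everything else goes to the convened board, which alone has the
    # authority to disapprove research.
    return ReviewLevel.FULL_BOARD
```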
IRB Records
Both the Common Rule and FDA regulations require that IRBs prepare and maintain
records in paper or electronic form, documenting their activities (45 CFR 46.115 and
21 CFR 56.115). IRB records must be maintained for at least 3 years after comple-
tion of the research and they must be made available to applicable federal oversight
agencies for inspection and copying upon request. Both the Common Rule and FDA
regulations also enumerate the specific records that IRBs must maintain, including
copies of research proposals reviewed, minutes of IRB meetings, records of
continuing review activities, copies of correspondence, rosters of IRB members, and
written IRB procedures.
Criteria for IRB Approval

To approve research, IRBs must obtain and review information sufficient to determine whether the proposed research
meets criteria for IRB approval. Table 2 outlines the general criteria for IRB approval
of research. In addition to the general criteria for IRB approval, regulations outline
additional protections for pregnant women, fetuses, and neonates, prisoners, and
children in research, including additional requirements for IRB membership, criteria
for inclusion of these potentially vulnerable populations, and additional consider-
ations for informed consent and child assent (45 CFR 46 Subparts B, C, and D and
21 CFR 50 Subpart D).
Informed Consent
One of the key assertions outlined in the Nuremberg Code, the Declaration of
Helsinki, and the Belmont Report is that informed consent is critical to the ethical
conduct of research with human subjects. Therefore, it is not surprising that the
process and plans for documentation of informed consent are a significant focus of
the IRB review process. The informed consent of the subject or their legally
authorized representative is required for all research subject to IRB review unless
the IRB determines the research is eligible for a waiver or alteration of the require-
ments for informed consent. Researchers are required to provide information in
language that is understandable to the subject and subjects must be given sufficient
opportunity to discuss and consider their decision to participate in a setting and
manner that minimizes any possibility of coercion or undue influence. Additionally,
regulations specify that the informed consent cannot include any exculpatory lan-
guage where subjects appear to give up any legal rights (45 CFR 46.116 and 21 CFR
50.20). The revised Common Rule also specifies that subjects should be given
information that a “reasonable person” would want to make a decision about
participation in the research, that the consent must begin with a concise summary
of key information, and that the informed consent must be organized in a manner that
facilitates understanding of the reasons why one might or might not want to participate (45 CFR
46.116).
Table 2 Criteria for IRB approval transcribed from 45 CFR 46.111 and 21 CFR 56.111

Minimizing risks. Risks to subjects are minimized (i) by using procedures that are consistent
with sound research design and that do not unnecessarily expose subjects to risk, and
(ii) whenever appropriate, by using procedures already being performed on the subjects for
diagnostic or treatment purposes.

Favorable benefit/risk ratio. Risks to subjects are reasonable in relation to anticipated benefits,
if any, to subjects, and the importance of the knowledge that may reasonably be expected to
result. In evaluating risks and benefits, the IRB should consider only those risks and benefits that
may result from the research (as distinguished from risks and benefits of therapies subjects
would receive even if not participating in the research). The IRB should not consider possible
long-range effects of applying knowledge gained in the research (e.g., the possible effects of the
research on public policy) as among those research risks that fall within the purview of its
responsibility.

Equitable selection of subjects. Selection of subjects is equitable. In making this assessment the
IRB should take into account the purposes of the research and the setting in which the research
will be conducted. The IRB should be particularly cognizant of the special problems of research
that involves a category of subjects who are vulnerable to coercion or undue influence, such as
children, prisoners, individuals with impaired decision-making capacity, or economically or
educationally disadvantaged persons. Note: The language describing potentially vulnerable
populations was updated in the revised Common Rule. FDA regulations still contain original
language, which also includes specific reference to pregnant women, handicapped, or mentally
disabled persons.

Informed consent. Informed consent will be sought from each prospective subject or the
subject's legally authorized representative.

Documentation of informed consent. Informed consent will be appropriately documented or
appropriately waived. Note: FDA regulations do not include provisions for IRBs to waive
consent; however, there has been FDA guidance issued on this topic, which is described in the
informed consent section of this chapter.

Data and safety monitoring. When appropriate, the research plan makes adequate provision for
monitoring the data collected to ensure the safety of subjects.

Privacy and confidentiality. When appropriate, there are adequate provisions to protect the
privacy of subjects and to maintain the confidentiality of data.

Vulnerable subjects. When some or all of the subjects are likely to be vulnerable to coercion or
undue influence, such as children, prisoners, individuals with impaired decision-making
capacity, or economically or educationally disadvantaged persons, additional safeguards have
been included in the study to protect the rights and welfare of these subjects. Note: Like the
section on equitable selection of subjects, the language in this section was updated in the revised
Common Rule. FDA regulations still contain original language to describe potentially vulnerable
populations.
Table 3 Elements of informed consent transcribed from 45 CFR 46.116 and 21 CFR 50.25

Basic elements of informed consent (45 CFR 46.116(b) and 21 CFR 50.25(a)):
1. A statement that the study involves research, an explanation of the purposes of the research
and the expected duration of the subject's participation, a description of the procedures to be
followed, and identification of any procedures that are experimental.
2. A description of any reasonably foreseeable risks or discomforts to the subject.
3. A description of any benefits to the subject or to others that may reasonably be expected from
the research.
4. A disclosure of appropriate alternative procedures or courses of treatment, if any, that might
be advantageous to the subject.
5. A statement describing the extent, if any, to which confidentiality of records identifying the
subject will be maintained.
6. For research involving more than minimal risk, an explanation as to whether any
compensation and an explanation as to whether any medical treatments are available if injury
occurs and, if so, what they consist of, or where further information may be obtained.
7. An explanation of whom to contact for answers to pertinent questions about the research and
research subjects' rights, and whom to contact in the event of a research-related injury to the
subject.
8. A statement that participation is voluntary, refusal to participate will involve no penalty or
loss of benefits to which the subject is otherwise entitled, and the subject may discontinue
participation at any time without penalty or loss of benefits to which the subject is otherwise
entitled.
9. One of the following statements about any research that involves the collection of identifiable
private information or identifiable biospecimens:
(a) A statement that identifiers might be removed from the identifiable private information or
identifiable biospecimens and that, after such removal, the information or biospecimens could be
used for future research studies or distributed to another investigator for future research studies
without additional informed consent from the subject or the legally authorized representative, if
this might be a possibility; or
(b) A statement that the subject's information or biospecimens collected as part of the research,
even if identifiers are removed, will not be used or distributed for future research studies.
Notes: Item number 9 was added in the revised Common Rule and is not included in FDA
regulations. There is one additional basic element of consent required under FDA regulations:
the possibility that the FDA may inspect the records.

Additional elements of informed consent, to be included when appropriate (45 CFR 46.116(c)
and 21 CFR 50.25(b)):
1. A statement that the particular treatment or procedure may involve risks to the subject (or to
the embryo or fetus, if the subject is or may become pregnant) that are currently unforeseeable.
2. Anticipated circumstances under which the subject's participation may be terminated by the
investigator without regard to the subject's or the legally authorized representative's consent.
3. Any additional costs to the subject that may result from participation in the research.
4. The consequences of a subject's decision to withdraw from the research and procedures for
orderly termination of participation by the subject.
5. A statement that significant new findings developed during the course of the research that
may relate to the subject's willingness to continue participation will be provided to the subject.
6. The approximate number of subjects involved in the study.
7. A statement that the subject's biospecimens (even if identifiers are removed) may be used for
commercial profit and whether the subject will or will not share in this commercial profit.
8. A statement regarding whether clinically relevant research results, including individual
research results, will be disclosed to subjects, and if so, under what conditions.
9. For research involving biospecimens, whether the research will (if known) or might include
whole genome sequencing (i.e., sequencing of a human germline or somatic specimen with the
intent to generate the genome or exome sequence of that specimen).
Notes: Item numbers 7, 8, and 9 were added in the revised Common Rule and are not included in
FDA regulations. There is one additional element of consent required under FDA regulations for
applicable clinical trials (21 CFR 50.25(c)): a statement and brief description about registration
of the clinical trial on ClinicalTrials.gov, a clinical trial registry. There are separate requirements
for "broad consent" for storage, maintenance, and secondary research use of identifiable private
information or identifiable biospecimens defined at 45 CFR 46.116(d). These requirements have
been omitted from this table for brevity.
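In practice, reviewers often work through Table 3 as a checklist against a draft consent form. The sketch below shows one way such an audit might be encoded; the abbreviated labels and the function name are ours rather than regulatory text, and a complete review would also cover the additional and FDA-specific elements.

```python
# Shorthand labels for the basic elements of 45 CFR 46.116(b);
# these paraphrase, and do not reproduce, the regulatory text.
BASIC_ELEMENTS = {
    1: "research statement, purposes, duration, procedures",
    2: "reasonably foreseeable risks or discomforts",
    3: "reasonably expected benefits",
    4: "appropriate alternative procedures or treatments",
    5: "extent of confidentiality of identifying records",
    6: "compensation/medical treatment for research injury",
    7: "contacts for questions and research-related injury",
    8: "voluntariness and right to discontinue without penalty",
    9: "future use of de-identified information/biospecimens",
}

def missing_basic_elements(addressed: set[int]) -> list[str]:
    """Return the basic consent elements a draft form has not yet addressed."""
    return [label for number, label in BASIC_ELEMENTS.items()
            if number not in addressed]

# A draft covering elements 1-5 and 7-8 still lacks items 6 and 9.
print(missing_basic_elements({1, 2, 3, 4, 5, 7, 8}))
```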
A "short form" written consent document, stating that the required elements of
informed consent have been presented orally, may be approved by IRBs as an
option for documenting informed consent, along with the
use of an interpreter, for subjects who require an informed consent process in a
non-English language where the need for a written translation of the full informed
consent form had not been anticipated. The Common Rule also allows the IRB to
waive the requirement for obtaining a signed informed consent form for certain
minimal risk research or where a breach of confidentiality is the primary risk and the
signed consent form would be the only record identifying the subject (45 CFR
46.117).
Waivers of Informed Consent

While the Common Rule allows an IRB to waive or alter requirements for
informed consent, FDA regulations do not. However, FDA issued guidance in
2017 noting an intent to update its regulations to allow waivers and alterations of
informed consent for certain minimal risk clinical investigations, which will align
with waivers allowed under the Common Rule. In the meantime, the FDA notes it
will not object to IRBs allowing such waivers (DHHS, FDA 2017). FDA regulations
do allow for an exception from the requirement for informed consent to treat a
patient/subject with an investigational drug or device in a life-threatening emergency
situation. When this exception from the requirement for informed consent is used, a
report must be submitted to the IRB within 5 business days (21 CFR 50.23).
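As a small worked example of that timeline, the sketch below counts five business days forward from the date the exception is used, skipping weekends. The function name is hypothetical, and institutional holidays are ignored for simplicity; an IRB office would apply its own calendar.

```python
from datetime import date, timedelta

def irb_report_deadline(use_date: date, business_days: int = 5) -> date:
    """Date by which the report to the IRB is due, counting forward the
    given number of business days (Monday-Friday) from use_date."""
    deadline, remaining = use_date, business_days
    while remaining > 0:
        deadline += timedelta(days=1)
        if deadline.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return deadline

# An exception used on Thursday 2021-07-01 is reportable by Thursday
# 2021-07-08; the intervening weekend does not count.
print(irb_report_deadline(date(2021, 7, 1)))
```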
Both DHHS and the FDA also allow the IRB to approve a waiver of the
requirements for informed consent in research involving human subjects in emer-
gency medical situations (e.g., heart attack, stroke, and trauma) where it may not be
possible to obtain consent from the subject or their legally authorized representative.
Application of this exception requires significant consideration, time, effort, and
planning by the researchers and the IRB. Additional required steps and protections
necessary for the IRB to grant a waiver of consent for emergency research include
community consultation, public disclosure about the research, and procedures to
inform subjects or their representative about the research and their right to discon-
tinue participation at the earliest opportunity (21 CFR 50.24).
Single IRB Review and IRB Reliance

The NIH released a policy that became effective in January 2018, requiring the use
of a single IRB for the review of multisite human subjects research funded by the
NIH (2016). Subsequently, the revised Common Rule required the use of a single
IRB for multisite research, a requirement that became effective in January 2020. In a
single IRB review model, institutions are required to document their reliance on an
IRB they do not operate and the delineation of responsibilities for each entity, the
relying institution and the reviewing IRB (45 CFR 46.103(e)). FDA regulations
allow single IRB review of multisite clinical investigations but do not require it.
The purpose of single IRB review was summarized in the NIH notice as follows
(NIH 2016):
The goal of this policy is to enhance and streamline the IRB review process in the context of
multi-site research so that research can proceed as effectively and expeditiously as possible.
Eliminating duplicative IRB review is expected to reduce unnecessary administrative bur-
dens and systemic inefficiencies without diminishing human subjects protections. The shift
in workload away from conducting redundant reviews is also expected to allow IRBs to
concentrate more time and attention on the review of single site protocols, thereby enhancing
research oversight.
In preparation for the federal mandates and to support and facilitate single IRB
review, beginning in 2014, the NIH funded an initiative to develop a standard
national master IRB reliance agreement (Cobb et al. 2019). This evolved into the
SMART IRB platform, a national resource for IRB review of multisite studies
(Cobb et al. 2019).
Ethics Committees
While the term IRB is unique to the USA, clinical trials internationally adhere to the
Declaration of Helsinki, which requires independent review by a research ethics
committee; Fig. 4 reproduces the Declaration's provisions on ethics committees
(WMA 2013).

The research protocol must be submitted for consideration, comment, guidance and
approval to the concerned research ethics committee before the study begins. This
committee must be transparent in its functioning, must be independent of the researcher,
the sponsor and any other undue influence and must be duly qualified. It must take into
consideration the laws and regulations of the country or countries in which the research is to
be performed as well as applicable international norms and standards but these must not be
allowed to reduce or eliminate any of the protections for research subjects set forth in this
Declaration.
The committee must have the right to monitor ongoing studies. The researcher must provide
monitoring information to the committee, especially information about any serious adverse
events. No amendment to the protocol may be made without consideration and approval by
the committee. After the end of the study, the researchers must submit a final report to the
committee containing a summary of the study’s findings and conclusions.
Fig. 4 Excerpt on ethics committees from the Declaration of Helsinki (WMA 2013)
Summary and Conclusion
IRBs in the USA and ethics committees internationally have been a cornerstone in the
review and oversight of clinical trials and other research involving human subjects
since the 1970s, with the passing of the National Research Act in 1974 and amendment
of the Declaration of Helsinki in 1975. IRBs are guided by the ethical principles of the
Belmont Report, and they carry out their responsibilities for the protection of the rights
and welfare of human research subjects in accordance with the Common Rule (45 CFR
46) and/or FDA regulations at 21 CFR parts 50 and 56, depending on the scope and
funding source of the research they are reviewing. In order to approve research, IRBs
are required to ensure that risks are minimized and there is a favorable risk/benefit
ratio, there is an adequate plan for data and safety monitoring and protection of privacy
and confidentiality, that selection of subjects is equitable, and there is a plan to obtain
and document informed consent from subjects. Additionally, IRBs give special con-
sideration to the protection of potentially vulnerable subject populations. The changing
research landscape and some highly publicized tragedies led to calls for reform around
the turn of the 21st century, which resulted in development of programs for the
accreditation of IRBs and institutional human research protection programs, a revised
Common Rule that became effective in 2019, and the trend toward single IRB review
of multisite studies. While responsibility for the protection of human subjects is shared
among multiple parties, IRBs and ethics committees play a critical role in the review
and oversight of clinical trials and research with human subjects.
Key Facts
Cross-References
References
Beecher HK (1966) Ethics and clinical research. N Engl J Med 274(24):1354–1360. https://doi.org/
10.1056/NEJM196606162742405
Cobb N, Witte E, Cervone M, Kirby A, MacFadden D, Nadler L, Bierer BE (2019) The SMART
IRB platform: a national resource for IRB review for multisite studies. J Clin Transl Sci 3(4):
129–139. https://doi.org/10.1017/cts.2019.394
Department of Health and Human Services (2011) Human subjects research protections: enhancing
protections for research subjects and reducing burden, delay, and ambiguity for investigators.
Fed Register 76(143):44512–44531. https://www.federalregister.gov/documents/2011/07/26/
2011-18792/human-subjects-research-protections-enhancing-protections-for-research-subjects-
and-reducing-burden. Accessed 26 Jun 2021
Department of Health and Human Services, FDA (2017) IRB Waiver or alteration of informed
consent for clinical investigations involving no more than minimal risk to human subjects
guidance for sponsors, investigators, and institutional review boards. https://www.fda.gov/
media/106587/download. Accessed 26 Jun 2021
Department of Health and Human Services, NIH (1998) Protection of human subjects: catego-
ries of research that may be reviewed by the Institutional Review Board (IRB) through an
expedited review procedure. Fed Register 63(216):60364–60367. https://www.hhs.gov/
ohrp/regulations-and-policy/guidance/categories-of-research-expedited-review-procedure-
1998/index.html. Accessed 4 Jun 2021
Department of Health and Human Services OHRP and FDA (2017) Minutes of Institutional
Review Board (IRB) meetings guidance for institutions and IRBs. https://www.hhs.gov/
ohrp/minutes-institutional-review-board-irb-meetings-guidance-institutions-and-irbs.
html-0. Accessed 26 Jun 2021
Emanuel EJ, Wood A, Fleischman A, Bowen A, Getz KA, Grady C, Levine C, Hammerschmidt
DE, Faden R, Eckenwiler L, Muse CT, Sugarman J (2004) Oversight of human participants
research: identifying problems to evaluate reform proposals. Ann Intern Med 141(4):282–291.
https://doi.org/10.7326/0003-4819-141-4-200408170-00008
Harkness J, Lederer SE, Wikler D (2001) Laying ethical foundations for clinical research. Bull
World Health Organ 79(4):365–366
International Committee of Medical Journal Editors (2019) Recommendations for the conduct,
reporting, editing, and publication of scholarly work in medical journals. http://www.icmje.org/
icmje-recommendations.pdf. Accessed 1 Jul 2021
Menikoff J, Kaneshiro J, Pritchard I (2017) The common rule, updated. N Engl J Med 376:613–
615. https://doi.org/10.1056/NEJMp1700736
National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research
(1978) Reports and recommendations institutional review boards. https://www.hhs.gov/ohrp/
regulations-and-policy/belmont-report/access-other-reports-by-the-national-commission/index.
html. Accessed 30 Jun 2021
National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research
(1979) The Belmont report. https://www.hhs.gov/ohrp/sites/default/files/the-belmont-report-
508c_FINAL.pdf. Accessed 4 Jun 2021
National Institutes of Health (2016) Final NIH policy on the use of a single institutional review
board for multi-site research. NOT-OD-16-094. https://grants.nih.gov/grants/guide/notice-files/
NOT-OD-16-094.html. Accessed 1 Jul 2021
Office for Human Research Protections (OHRP) (2021) Assurance process frequently asked
questions. https://www.hhs.gov/ohrp/register-irbs-and-obtain-fwas/fwas/assurance-process-
faq/index.html. Accessed 1 Jul 2021
Rice TW (2008) The historical, ethical, and legal background of human-subjects research. Respir
Care 53(10):1325–1329
Shalala D (2000) Protecting research subjects – what must be done. N Engl J Med 343(11):808–
810. https://doi.org/10.1056/NEJM200009143431112
Steinbrook R (2002) Improving protection for research subjects. N Engl J Med 346(18):1425–1430.
https://doi.org/10.1056/NEJM200205023461828
US Congress Senate (1974) (Reprint of) National Research Act. https://www.govinfo.gov/content/
pkg/STATUTE-88/pdf/STATUTE-88-Pg342.pdf. Accessed 4 Jun 2021
U.S. Government Printing Office (1949) The Nuremberg Code: Trials of war criminals before the
Nuremberg military tribunals under control council law No. 10, vol 2, pp 181–182.
https://history.nih.gov/display/history/Nuremberg+Code. Accessed 4 Jun 2021
White MG (2020) Why human subjects research protection is important. Ochsner J 20(1):16–33.
https://doi.org/10.31486/toj.20.5012
World Medical Association (2013) WMA Declaration of Helsinki – ethical principles for medical
research involving human subjects as amended by the 64th WMA General Assembly, Fortaleza,
Brazil. https://www.wma.net/policies-post/wma-declaration-of-helsinki-ethical-principles-for-
medical-research-involving-human-subjects/. Accessed 4 Jun 2021
Data and Safety Monitoring and Reporting
37
Sheriza Baksh and Lijuan Zeng
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
DSMB Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 682
Formation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 682
Charter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684
Meeting Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 684
Meeting Settings: In-person vs Remote . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687
Quorum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687
Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 688
Confidentiality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 688
DSMB Meetings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
Structure of Meetings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690
Recommendations and Follow-up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693
Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 695
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 696
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 696
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 696
Abstract
Data and safety monitoring boards (DSMBs) are composed of clinical
experts, statisticians, and other representatives with pertinent experience who
collectively monitor the data and conduct of ongoing clinical trials to ensure the
safety of trial participants and the integrity of the trial. Over the years, the
frequency of the use of a DSMB has increased; its mandate has been expanded
to evaluate interim efficacy results, make recommendations for early termination
of a trial, conduct sample size reassessments, and support the technical aspects of
a trial through other recommendations. Given the complex issues a DSMB may
face, it is important that the board receive the support it needs from relevant
parties in order to function effectively and independently and to make informed
judgments. This chapter starts by introducing when a DSMB is warranted, and
provides guidance on the formation of a DSMB, highlighting approaches to
ensuring adherence to data confidentiality and principles of independence. The
chapter then provides an overview of different types of DSMB meetings,
templates for a DSMB charter, and considerations for open and closed reports.
Lastly, a listing of guidance documents on DSMBs from regulatory agencies and
others is provided for reference.
Keywords
Data Monitoring Committee · Data Safety Monitoring Board · Interim data
sharing
S. Baksh (*)
Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
e-mail: [email protected]
L. Zeng
Statistics Collaborative, Inc., Washington, DC, USA
Introduction
A data and safety monitoring board (DSMB), also known as a data monitoring
committee (DMC), or an independent data and safety monitoring committee
(IDMC), serves as an integral part of many trials in ensuring study participant
safety, assessing data integrity, and monitoring study progress. In this chapter,
we will use DSMB as an umbrella term to refer to data and safety monitoring
boards, data monitoring committees, and independent data and safety monitoring
committees.
A study’s DSMB serves as an independent resource for study investigators
and sponsors to ensure the integrity of the data, the ethical conduct of the study,
and the safety of study participants. DSMBs are often formed as part of large,
phase 3, multicenter clinical trials, but they may also be used in smaller, phase
1 and 2 clinical trials, where study participants are considered to comprise a
vulnerable population or when interventions are high risk. Additionally, DSMBs
may be needed in emergency trials, when consent might be waived (Eckstein
2015). The independence of a DSMB enables study recommendations to be
made in the best interests of the study population and for the maximum benefit
of the intended target population. Note that not all trials need a DSMB. For
example, having a DSMB may not be practical for trials with fast enrollment
or short duration, nor would it be necessary for trials for non-critical indications
or low-risk investigational drugs.
The DSMB can have a variety of duties based upon the needs of a particular study
or request from the study Sponsor. These are often outlined in a DSMB charter or a
data and safety monitoring plan (DSMP) that the DSMB, study investigators, and
study Sponsor agree upon at the beginning of the study. Depending on the timing of
the formation of the DSMB, the DSMB may have varying input on the development
of the study protocol. Among their duties are reviewing study protocols, statistical
analysis plans, consent documents, and other participant-facing documents, advising
the trial’s Steering Committee, evaluating data for stopping the trial, and reviewing
interim analyses (Clemens et al. 2005). Through the course of the study, the DSMB
periodically meets to review and discuss the emerging data and the study perfor-
mance so as to provide recommendations in line with the jurisdiction outlined in the
charter. Recommendations may stem from the discussions in these meetings. A
summary of the recommendations from the meeting may also be shared with the
institutional review boards and other regulatory bodies to keep them abreast of any
potential safety concerns for study participants.
While not explicitly required for all clinical trials, the jurisdiction of DSMBs
has been spelled out by various regulatory and governmental agencies across the
world. Although requirements vary slightly from agency to agency, each
of these governing bodies outlines the following as integral to a functional and
effective DSMB: primacy of patient safety, ensuring data integrity, and continual
oversight of study performance metrics. Table 1 lists key guidance documents
that outline the purview of DSMBs from various regulatory agencies across the
world. Investigators undertaking clinical trials in specific countries should seek
to abide by requirements outlined in the guidance documents pertinent to the
countries in which their trials are being conducted. While this list is not exhaus-
tive, it provides a sample of what one can expect when organizing a DSMB
across countries.
DSMB Organization
Formation
Once a study has been funded and initial planning is underway, Sponsors may elect
to appoint a DSMB to assist with study oversight. One goal in forming an effective
DSMB is to ensure the expertise necessary for monitoring the risks and benefits to
study participants with a limited number of individuals. In some instances, including
an ethicist or patient advocate, or both, on the DSMB might be prudent.
Members with this perspective can be especially helpful when the study involves
participants for whom consent is waived or for whom their condition or the studied
intervention is of a sensitive or controversial nature. While the optimal number of
DSMB members is often up for debate, the expertise of the members should be
balanced for discussion of trial issues and consensus formation.
The independence of a DSMB is essential so that members consider both the
safety of trial participants and the potential risks and benefits to the intended
target patient population for the intervention under study. This holistic approach to
trial integrity depends on the DSMB’s independence from competing interests,
research activities, and financial incentives. Without these assurances, neither those
charged with study oversight nor the general public can be confident that the
recommendations stemming from the DSMB are in the best interest of patient safety
and their corresponding benefit-risk profile. There are many ways to protect against
potential or perceived bias. Among these strategies is the disclosure of conflicts of
interest (COI) at the time of DSMB formation and at the beginning of each data
review.
In some situations, a DSMB may choose to designate voting and non-voting
members. In these situations, both voting and non-voting members participate in
discussions of study data; however, the voting members are tasked with deciding
upon study recommendations, including determination of continuation with recruit-
ment. Best practices typically recommend against this, however, and instead advo-
cate for recommendations stemming from consensus views in the closed session
(Fleming et al. 2017). While compositions may vary from trial to trial, DSMBs may
have a clinical expert, statistician, clinical trialist, patient advocate or representative,
and/or a Sponsor representative (Fig. 1). Non-voting members of the DSMB tend to
be those from the investigative team, and the voting members are generally those
who remain independent from the study activities. Including sponsor representatives
in the closed sessions of a DSMB is more common in government-sponsored trials
than in industry-sponsored trials, where industry sponsors usually hire an Indepen-
dent Statistical Reporting Group (ISRG) for preparing and presenting closed and/or
open reports to the DSMB (Fig. 2). Given that the principal investigator is steeped in the
clinical area, he/she may recommend individuals best suited to adjudicate patient
safety and interests for a disease area, but the Sponsor ultimately signs off on the
members for the DSMB. Members of the clinical study team are not typically present
in either the open or closed session of the DSMB meeting; however, the principal
investigator may attend the open portion of the meeting to provide a scientific and
operational update of the study, and answer questions from the DSMB.
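One way to make these composition rules concrete is to model the roster with explicit voting and independence flags, as in the sketch below. The class and field names are hypothetical; the check simply encodes the convention described above, that voting members remain independent of study activities.

```python
from dataclasses import dataclass

@dataclass
class DSMBMember:
    name: str
    role: str          # e.g., "chair", "statistician", "patient advocate"
    voting: bool
    independent: bool  # free of study activities and competing interests

def voting_members_independent(roster: list[DSMBMember]) -> bool:
    """True if every voting member on the roster is independent."""
    return all(member.independent for member in roster if member.voting)

roster = [
    DSMBMember("A. Chair", "chair", voting=True, independent=True),
    DSMBMember("B. Statistician", "statistician", voting=True, independent=True),
    DSMBMember("C. Study Stat", "study statistician", voting=False, independent=False),
]
assert voting_members_independent(roster)
```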
Fig. 1 Example of DSMB composition in government-sponsored trials, with a chair, clinical
specialist (1-2 members), statistician, and ethicist/patient advocate, along with a sponsor
representative and study statistician

Fig. 2 Example of DSMB composition in industry-sponsored trials, with a chair, statistician, and
ethicist/patient advocate. Voting members are shown in blue rectangles, and non-voting members
are shown in red ovals
Charter
The DSMB charter serves as a guideline for DSMB operations, outlining DSMB
responsibilities, providing principles for guiding DSMB decisions, and describing
procedures and workflow for the DSMB (Herson 2017; Fleming et al. 2017). The
trial sponsor usually prepares a draft charter which is later reviewed collectively by
the sponsor, DSMB, ISRG, and any other key parties involved. Table 2 provides an
outline of the organization of a typical DSMB charter.
Although the DSMB may vary in composition and practices depending on the
study, core elements of the DSMB charter remain similar across studies. Templates
for DSMB charters have been proposed in reference books (e.g., Ellenberg et al.
2019; Herson 2017). The DAMOCLES (Data Monitoring Committees: Lessons, Ethics,
Statistics) Study Group (DAMOCLES Study Group 2005) also provides templates
for DSMB charters.
Meeting Types
The types of DSMB meetings that are held during the trial should be described in the
DSMB Charter. The objectives, frequency, and schedule of meetings are generally
decided upon during the formation of the DSMB in conjunction with the investiga-
tors and Sponsor. The main meeting types are as follows:
Initial/Organizational/Kick-Off Meeting
The initial DSMB meeting, also known as the organizational or kick-off meeting,
should ideally be held prior to the first patient first visit. During this meeting, DSMB
members can get acquainted with each other and the sponsor’s study team, exchange
thoughts on the study design, and share their own experiences and insights. The
sponsor or investigator usually presents the current version of the protocol and
DSMB charter, and the independent reporting statistician may present the draft
report templates to solicit any feedback from the DSMB during the early stages of
interaction. This is a valuable opportunity for study investigators to gather input
from other leaders in the clinical field.
To have a productive meeting, the study materials such as protocols, important
forms, patient-facing materials, the draft DSMB charter, and other relevant materials
should be made available to the DSMB prior to the initial meeting. Shortly after the
meeting, the Sponsor or DSMB, or both, will approve and sign the charter, according
to the Sponsor’s SOPs.
Ad-hoc Meetings
Between pre-specified safety and efficacy review meetings, the DSMB and Sponsor
may request ad-hoc meetings to review ad-hoc analyses, address emerging safety
issues from monthly safety reports (or SAE narratives), or discuss important new
information external to trials. When the DSMB requests an ad-hoc meeting, the
details of the meetings should not be communicated to the Sponsor until the
conclusion of the trial unless the DSMB issues a recommendation to modify or
terminate the trial in response to findings from the meetings. The documentation for
the ad-hoc meetings should still follow the same process as the periodic data review
meetings. Note that the implications of additional analyses should be considered and
factored into the alpha-spending as specified in the SAP a priori.
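To make the alpha-spending idea concrete, here is a minimal sketch, not from this chapter, of the Lan-DeMets O'Brien-Fleming-type spending function; the overall one-sided alpha of 0.025 and the information fractions are illustrative assumptions:

from scipy.stats import norm

def obf_alpha_spent(t, alpha=0.025):
    # Cumulative one-sided alpha spent at information fraction t (0 < t <= 1),
    # using the Lan-DeMets O'Brien-Fleming-type spending function.
    z = norm.ppf(1 - alpha / 2)
    return 2 * (1 - norm.cdf(z / t ** 0.5))

spent = 0.0
for t in [0.25, 0.50, 0.75, 1.00]:  # hypothetical planned looks
    cum = obf_alpha_spent(t)
    print(f"t={t:.2f}  cumulative alpha={cum:.5f}  incremental={cum - spent:.5f}")
    spent = cum

Because this function spends almost no alpha early, an additional early look that is pre-specified in the SAP consumes very little of the overall type I error; the point of a priori specification is that the spending at each look is fixed before the data are seen.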
Meeting Format
The actual format of the various DSMB meetings will depend on scheduling, DSMB
preferences, and the complexity of issues to be discussed at the meeting. In-person
DSMB meetings, which often allow for more effective interactions and communication,
are preferred for the initial meeting, interim efficacy/futility analysis meetings,
the final meeting, and/or other pre-specified review meetings.
For example, it is generally preferable to have an in-person meeting for the
study kick-off. This allows DSMB members to get familiarized with each other and
share their experiences. When an important decision is made regarding whether the
DSMB is recommending early termination of a study due to safety, efficacy, or
futility, it is valuable to have DSMB members in the same room, if possible, in
order to assess the benefit and risk profiles of study drugs carefully, thoughtfully
exchange their opinions and concerns, and come to a consensus when there is
conflicting feedback. Moreover, having the DSMB meet in-person on a
regular basis (e.g., annually) is recommended. However, meeting in-person may
not always be necessary or efficient. Once the DSMB becomes very familiar with
the trial or has observed no major safety issue after numerous meetings, it may be
sufficient to meet by teleconference or videoconference. Meeting in-person may
not be possible for ad-hoc discussions of emerging trial issues given the short
notice, and it is not practically feasible if the DSMB needs to closely monitor the trial
population and meet frequently (e.g., every other week) to review new information.
Quorum
DSMB members should make every attempt to attend each meeting either
in-person or by teleconference. However, in cases where not all members can be
present, the DSMB Chair, or designee, should contact any absent individual before
or after the meeting, or both, to obtain their opinion in writing after their review of
all materials discussed during the meeting. The inclusion of opinions from absent
members is at the discretion of the DSMB, as outlined and pre-specified in the
charter.
Usually, at a minimum, the DSMB Chair and the DSMB Statistician should be
present to hold a meeting. However, many charters require that all voting members be
present.
Independence
To provide an objective assessment of the benefit and risk profile of study drugs and
make recommendations on the studies, members of the DSMB must remain inde-
pendent and avoid all COIs that could affect their decision making. COIs can arise in
many situations – some are easier to ascertain (for example, financial or research
interests), while others can be harder to avoid or cannot be fully eliminated. For
example, owning shares or investing in the sponsor’s company stocks are obvious
financial COIs, which preclude one from serving on the committee.
DSMB members usually receive some honorarium (financial compensation) from
the sponsors for their time serving on the boards; however, the amount of the
honorarium should not be so high that it might potentially bias the DSMB’s decision
making. For more details on financial COIs, refer to ▶ Chap. 28, “Financial Con-
flicts of Interest in Clinical Trials.”
Other than financial incentives, potential research-driven or intellectual COIs are
also common among clinical and statistical experts who are usually involved in or
serve as consultants for multiple research projects. In general, the investigators in a
trial may not serve on a DSMB for a competing trial. One may not even serve on the
DSMBs of competing trials at the same time, to avoid inadvertently sharing confidential
information across trials.
It is not, however, uncommon for a single DSMB to monitor multiple ongoing
trials in the same or related programs, as this allows the DSMB to more efficiently
make informed recommendations based on information from the associated trials.
Requiring members to be completely free from any COIs is difficult to achieve given
the varying subject matter expertise required on each board. As such, full disclosure
of any potential COI is critical to avoid compromising the DSMB’s recommenda-
tions as the trial proceeds. If, through the course of the study, any of these tenets of
independence have changed, the members should disclose their status to the chair of
the DSMB and the study Sponsor who will decide whether the member is still
sufficiently independent to remain on the Board.
Confidentiality
Unblinded comparative safety and efficacy data should be accessible only to the
DSMB and ISRG. The FDA Guidance (2006) states: “Even for trials not conducted
in a double-blind fashion, where investigators and patients are aware of individual
treatment assignment and outcome at their sites, the summary evaluations of com-
parative unblinded treatment results across all participating centers would usually
not be available to anyone other than the DSMB.”
Although it may be tempting to use positive trial data from interim analyses to
inform subsequent planning of product development, caution is needed when
interpreting and relying on immature trial results from interim analyses, as
studies (Woloshin et al. 2018; Wayant and Vassar 2018) have shown inconsistency
in the magnitude, and even the direction, of results between interim assessments and
final analyses at the end of trials. The spread of unreliable interim comparative
efficacy data may adversely affect patient adherence to study drugs, recruitment,
and long-term follow-up (Ellenberg et al. 2019). Inappropriate release of interim data
could even lead to early termination due to breach of confidentiality (see example
below regarding the LIGHT trial, Nissen et al. 2016). The FDA Guidance on
Establishment and Operation of Clinical Trial Data Monitoring Committees
(2006) states the following:
Knowledge of unblinded interim comparisons from a clinical trial is generally not necessary
for those conducting or sponsoring the trial; further, such knowledge can bias the outcome of
the study by inappropriately influencing its continuing conduct or the plan of analyses.
Unblinded interim data and the results of comparative interim analysis, therefore, should
generally not be accessible by anyone other than DSMB members or the statistician(s)
performing these analyses and presenting to the DSMB.
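A small simulation sketch, with entirely hypothetical parameters, illustrates why interim estimates can mislead: conditioning on a promising interim look selects trials whose early data overshot the truth, so the final estimate tends to come in lower.

import numpy as np

rng = np.random.default_rng(1)
true_delta, sigma, n_half, sims = 0.2, 1.0, 200, 20000
# Effect estimates from the first and second halves of each simulated trial.
interim = rng.normal(true_delta, sigma / np.sqrt(n_half), sims)
second = rng.normal(true_delta, sigma / np.sqrt(n_half), sims)
final = (interim + second) / 2
promising = interim > 0.3  # trials that looked strong at the interim
print(f"mean interim estimate, promising trials: {interim[promising].mean():.3f}")
print(f"mean final estimate, same trials:        {final[promising].mean():.3f}")
print(f"true effect: {true_delta}")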
In some cases, the DSMB needs to notify the Sponsor and release safety data to
Sponsors who will inform regulatory agencies. When DSMBs observe an increased
risk in certain safety events, the committee may raise the concern to the Sponsor and
recommend informing investigators and patients. The DSMB may recommend
collecting additional information to support further safety reviews, or on some
occasions, modifying the trial procedures to protect patients in the trial. DSMBs,
in this case, may share relevant safety data with limited individuals from the Sponsor
to support subsequent procedures.
In anticipation of the need to release unblinded data and meeting materials from the
DSMB, a data access plan or other relevant SOPs are important to limit the spread of
confidential information by specifying when the data can be shared, who will have
access, and how unblinded materials and results will be communicated or trans-
ferred. Often, the Sponsor may appoint a ‘firewalled’ group, internally or externally
(e.g., members from Executive Committees or Steering Committees), to receive
these data if the DSMB recommendation warrants this.
The following example shows the detrimental impact of inappropriate handling of
confidential interim trial data and highlights the importance of maintaining
confidentiality to preserve the integrity of an ongoing trial.
Example of early termination due to inappropriate public release of confidential
interim data by the sponsor – the LIGHT trial (Nissen et al. 2016)
DSMB Meetings
Structure of Meetings
DSMB meetings, much like many other types of study meetings, are often a
dance between study investigators, experts in the field, and other vested parties,
such as the study Sponsor. As such, the meeting structure reflects this power
dynamic and enables important information to reach the DSMB members in the
closed session, while offering others from the investigative team an opportunity
to weigh in during the open session. Additionally, any interpretation of the
recommendations, summaries, or subsequent actions taken must be done in
light of these interpersonal dynamics. A typical data review meeting consists
of an open session, a closed session, and an optional executive session
(described below) afterwards. Each session has a predetermined roster, agreed
upon and outlined in the DSMB charter. Adhering to these agreements maintains
data integrity while allowing for recommendations and decision-making in light
of study data presented by treatment group. Both the open and closed sessions
may have an accompanying report with the data to be reviewed and discussed.
We have provided a sample table of contents in Table 3. While not exhaustive,
this list contains elements one might consider presenting in a meeting following
the first patient, first visit.
Open Session
The open session, attended by representatives from sponsors or investigators, the
DSMB, and the ISRG, provides the DSMB opportunities to discuss with the Sponsor
issues related to data quality, trial conduct, and trial management in a blinded
manner. Topics include but are not limited to enrollment, dropouts, timeliness of
data from different sources, protocol deviations, and inclusion/exclusion questions.
The Sponsor can use this opportunity to seek advice from the DSMB on emerging
trial issues.
The open session usually starts by checking with the DSMB to assess whether any new
conflicts of interest have arisen. The study team or Sponsor representative then may
take the lead in presenting their perspectives on the trial progress and new informa-
tion from relevant clinical programs or literature external to the trial that may have an
impact on the study.
The Sponsor representatives should also provide updates regarding action items
from the Sponsor from previous meetings if they were not resolved soon after the
meetings. In general, the open session for a periodic safety data review should be
concise to ensure that the DSMB has enough time to discuss contents by treatment
group in the closed session.
Closed Session
In the closed session, attended by DSMB and ISRG only, the DSMB reviews data on
such issues as enrollment, trial status, safety, and efficacy presented by treatment
group, and discusses overall benefit and risk profile of the study drugs. Variations
abound as to how to approach a closed session: some DSMB Chairs may lead the
discussions; others may assign members with different expertise to lead topics
related to different issues; others may delegate the high-level review to an ISRG
statistician who is most familiar with data and reports and is able to highlight new
information from the previous reviews, answer questions related to data, and inter-
pret the presentations included in the closed session report. Regardless of the
meeting styles, all DSMB members should have thoroughly reviewed the reports
prior to the meetings.
To facilitate a productive data review and discussion during the closed session,
the closed report should contain comprehensive data that are presented in a com-
prehensible manner (Buhr et al. 2018). Depending on the study objective and the
focus of the review, the structure and contents of the reports may vary. Typically, the
closed report starts with an executive summary table of the study, highlighting high-
level study status, safety, and/or efficacy events by treatment group, followed by
more detailed summaries presented in tables and figures. When detailed information
on patients and events is needed, listings of events of interest are provided to
supplement the review. Data in the closed reports should be presented by
unblinded treatment group to inform assessments of the relative benefit-to-risk
profiles. In general, the closed reports should cover pre-specified efficacy assess-
ments, adverse events with corresponding clinical details by treatment group, pro-
tocol deviations and subsequent actions, and any unanticipated events that may have
occurred. These data are also discussed in light of the information presented in the
scientific updates from the open session. The DSMB may discuss areas where they
may like to request additional analyses or subgroup analyses to better understand the
patterns that are emerging.
At the end of the closed session, the DSMB should strive to come to a consensus
with respect to recommendations for the trial continuation, instead of using a
majority vote approach. In situations where consensus cannot be reached, a supermajority
may be recorded, with the discussions and rationale for the decision
documented in the closed session minutes. In addition, the DSMB can discuss and
request ad-hoc analyses for the ISRG or Sponsor to address in follow-up meetings or
correspondences.
Executive Session
After the closed session, there may be an optional executive session limited to the
actual members of the DSMB, where the DSMB has the opportunity to escalate
action items to the Sponsor and communicate the meeting recommendations ver-
bally. Whether an executive session is warranted is at the discretion of the DSMB.
Typically, no data are prepared specifically for discussion during the executive
session. The outcome of this executive session, however, is recorded in the
meeting minutes and might be shared with the IRB or other regulatory authorities.
Recommendations

A DSMB meeting can result in a variety of outcomes with implications for the trajectory of the
study. Typically, the DSMB can recommend one of four things: 1) continuation of
the study without modification, 2) continuation of the study with recommended
modifications, 3) termination of the study, or 4) suspension of enrollment pending
resolution of issues or concerns. Each of these options carries considerable risks and
benefits to the final interpretation of trial results and overall conclusions for the
patient population. In addition to these overarching recommendations, the DSMB
may also recommend that the investigative team amend the current protocol, change
enrollment strategies, improve the speed and accuracy of data entry, open or close
clinical sites, audit clinical sites, as well as other changes to trial activities. These
suggested changes may emanate from changes in trial data or other external factors
discussed during the meeting. After the conclusion of the meeting, the DSMB chair,
in consultation with the other members of the DSMB, typically prepares a letter
recommending continuation or termination of the trial along with any other
suggested changes or additional analyses. This letter is then submitted to all the
IRBs involved in the trial. Below, we highlight a few real-world examples of DSMB
recommendations for consideration.
Example of external regulatory authorities intervening in trial conduct – the
ATMOSPHERE trial (Swedberg et al. 2016)
The Aliskiren Trial to Minimize Outcomes in Patients with Heart Failure
(ATMOSPHERE) trial provides an example of several external factors determining
the trajectory of a clinical trial. In this study, participants were randomized to
enalapril, aliskiren, or a combination of both drugs for the prevention of death
from cardiovascular causes or hospitalization for heart failure. Aliskiren had previ-
ously been approved for patients with hypertension in the United States and
European Union. Concurrent with ATMOSPHERE, aliskiren was also used in two
similar trials, ASTRONAUT and ALTITUDE for slightly different populations. The
DSMB for ATMOSPHERE also served as the DSMB for ASTRONAUT and they
were aware of the accumulating data from ALTITUDE. ASTRONAUT, which had
closed recruitment, showed a higher proportion of participants with renal dysfunc-
tion on aliskiren than on placebo (14.1% vs. 10.2%). ALTITUDE had accumulated
69% of projected events and reported increased adverse events associated with
aliskiren. After reviewing the data from ALTITUDE and ASTRONAUT, but not
ATMOSPHERE, the Clinical Trials Facilitation Group of the European Union
requested that the sponsor, Novartis, discontinue aliskiren in all patients with diabetes
in ATMOSPHERE. Despite the assurance from the DSMB for ATMOSPHERE that
they had carefully considered the data from ALTITUDE and ASTRONAUT in their
recommendation to proceed, Novartis complied with the request of the Clinical
Trials Facilitation Group to pause the treatment among diabetic patients. Because
of the censoring of follow-up time during the treatment pause, the study had to
extend for an additional year to meet the targeted number of events.
Example of DSMB stopping trial based on primary outcome – the EOLIA trial
(Harrington and Drazen 2018)
In the ECMO to Rescue Lung Injury in Severe ARDS (EOLIA) trial, investiga-
tors studied the use of extracorporeal membrane oxygenation (ECMO) compared to
standard of care in the treatment of severe acute respiratory distress syndrome
(ARDS) (Combes et al. 2018). Because of the nature of the intervention, treating
clinicians were unmasked to the treatment groups. Consequently, under the
protocol, those randomized to standard of care could, theoretically, be switched to
ECMO during the course of treatment for rescue use. By the end of the trial, 28% of
those randomized to the standard of care had switched to ECMO, with 57% of these
crossover participants dying. The investigators noted that the high proportion of
crossover inhibited their ability to draw conclusions about the use of ECMO for the
primary outcome of mortality at 60 days. They did, however, see a significant effect
of ECMO on the secondary outcome of treatment failure, defined as death in the
ECMO group versus death or crossover in the standard of care group. After roughly
three-quarters of the projected participants were enrolled, the DSMB stopped the
trial for futility at the fourth interim analysis. Critics of this decision contend that had
the trial continued, investigators might have had greater evidence for the secondary
outcomes, some of which were trending toward favoring ECMO as a treatment.
These critics encourage future DSMB members to treat the stopping guidelines as
true guidelines and consider the impact of these decisions on other outcomes, both
safety and efficacy, in the trial (Harrington and Drazen 2018).
Example of emerging evidence influencing DSMB – the MOXCON trial (Pocock
et al. 2004)
The MOXonidine CONgestive Heart Failure (MOXCON) trial provides a classic
example of a trial that was halted for safety concerns. The study was designed to
investigate the use of moxonidine for the prevention of all-cause mortality in patients
with NYHA class II–IV heart failure (Cohn et al. 2003). Initially powered to detect a
20% reduction in all-cause mortality, the study required 724 deaths. Of note, a
concurrent dose-finding trial of moxonidine was not completed at the time of the
start of MOXCON. While concerns were raised, MOXCON was permitted to start,
despite the fact that it was studying the highest dose used in the dose-finding study. An
interim analysis conducted when roughly one-quarter of the expected enrollment had
occurred showed a trend of increased mortality with the use of moxonidine. Despite
the early indicators of increased risk for mortality, the small numbers of deaths
combined with the lack of safety concerns in the then-completed dose-finding trial
led the DSMB to recommend continuation with a planned teleconference before the
next 6-month safety analysis. At this analysis, a nominal p-value of less than 0.05
was observed, with 37 deaths in the moxonidine group and 20 deaths in the placebo
group. After an investigation of the potential causes of the deaths, time to death,
dosing, other serious adverse events, and baseline characteristics, the DSMB was put
in the discomforting position of recommending termination after less than 10% of
the expected deaths had occurred.
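As a rough, hedged check of the nominal significance reported above, and assuming 1:1 randomization with comparable follow-up across arms (assumptions not stated in the trial report), the split of the 57 observed deaths can be compared against a Binomial(57, 0.5) distribution:

from scipy.stats import binomtest

# 37 of 57 deaths in the moxonidine arm; under the null, each death is
# equally likely to fall in either arm.
result = binomtest(37, n=57, p=0.5, alternative='two-sided')
print(f"nominal two-sided p-value: {result.pvalue:.3f}")  # roughly 0.03, below 0.05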
The DSMB then discussed this concern with the MOXCON Executive Commit-
tee and came to a consensus to recommend stopping randomization and closing out
participants currently enrolled and being treated in the trial. They ultimately left the
final decisions up to the MOXCON Senior Management. In their published debrief
of this experience, the DSMB also noted that they recognized the difficult position
the Executive Committee faced: continuing to proceed with MOXCON despite a
contrary recommendation from the DSMB could potentially raise serious concerns
about the interpretation of the results at the completion of the trial. The importance of
this power dynamic and delineation of roles is especially evident at the time of
decision-making.
Each of these examples provides a unique snapshot of how DSMB recommendations
can be unpredictable and determinative of a study's direction. Regardless of
how impactful or mundane the recommendations, they remain recommendations. Ulti-
mately, decision-making for the study rests with the study Sponsor, as do the
consequences of those decisions.
Summary and Conclusions

A DSMB plays an integral role in the conduct of many multicenter clinical trials.
It serves as a check on competing interests in the name of participant safety and a
balance to the inherent biases trial investigators and Sponsors may hold regarding
the outcome of the trial. While they are not an overarching governing body for a
clinical trial, their recommendations do carry weight and, when presented to the
outside scientific community, can influence the interpretation of trial results, as
illustrated in the examples in this chapter. Developing a DSMB charter and defining
the purview of this group can seem like a rudimentary task at the outset of a trial,
yet it can impede or enhance the utility of the DSMB in the conduct of the trial.
It is imperative that the members of the DSMB, trial investigators, and trial
Sponsors carefully consider the needs of the study, the independence of the DSMB,
and the potential concerns of the patient group impacted by the study results when
developing this document. Through this collaborative and intentional effort, the
DSMB can best serve in its capacity to monitor the trial for study integrity, safety,
and efficacy.
Key Facts
1. DSMBs best serve as an independent check on data integrity and patient safety
and as a balance to the decision-making power of study leadership.
2. While DSMBs provide recommendations for the trajectory of a trial, their
suggestions command respect from the broader scientific community, as the
recommendations balance competing interests of other study stakeholders.
3. Investing in the development of a comprehensive DSMB Charter, with consideration
for the needs of the study and the patient population, will not anticipate
every possible decision to be made, but it will provide the parameters for
effective decision-making when difficult situations arise.
Cross-References
References
ANVISA (2015) Resolution of the board of directors – RDC no. 9. Ministry of Health. Retrieved
from http://antigo.anvisa.gov.br/documents/10181/3503972/RDC_09_2015_COMP.pdf/
e26e9a44-9cf4-4b30-95bc-feb39e1bacc6
Buhr KA, Downs M, Rhorer J, Bechhofer R, Wittes J (2018) Reports to independent data
monitoring committees: an appeal for clarity, completeness, and comprehensibility. Ther
Innov Regul Sci 52(4):459–468. https://doi.org/10.1177/2168479017739268. Epub 2017
Nov 13
Clemens F, Elbourne D, Darbyshire J, Pocock S (2005) Data monitoring in randomized controlled
trials: surveys of recent practice and policies. Clin Trials 2(1):22–33. https://doi.org/10.1191/
1740774505cn064oa
Cohn JN, Pfeffer MA, Rouleau J, Sharpe N, Swedberg K, Straub M, ... Wright TJ (2003) Adverse
mortality effect of central sympathetic inhibition with sustained-release moxonidine in patients
with heart failure (MOXCON). Eur J Heart Fail 5(5):659–667. https://doi.org/10.1016/s1388-
9842(03)00163-6
Combes A, Hajage D, Capellier G, Demoule A, Lavoué S, Guervilly C, ... Mercat A (2018)
Extracorporeal membrane oxygenation for severe acute respiratory distress syndrome. N Engl
J Med 378(21):1965–1975. https://doi.org/10.1056/NEJMoa1800385
DAMOCLES Study Group (2005) A proposed charter for clinical trial data monitoring committees:
helping them do their job well. Lancet 365:711–722
Eckstein L (2015) Building a more connected DSMB: better integrating ethics review and safety
monitoring. Account Res 22(2):81–105. https://doi.org/10.1080/08989621.2014.919230
Ellenberg SS, Fleming TR, DeMets DL (2019) Data monitoring committees in clinical trials: a
practical perspective, 2nd edn. Wiley, Hoboken, NJ
European Medicines Agency (2005) Guideline on data monitoring committees. (EMEA/CHMP/
EWP/5872/03 Corr). European Medicines Agency, London. Retrieved from https://www.ema.
europa.eu/en/documents/scientific-guideline/guideline-data-monitoring-committees_en.pdf
Fleming TR, DeMets DL, Roe MT, Wittes J, Calis KA, Vora AN, Meisel A, Bain RP, Konstam MA,
Pencina MJ, Gordon DJ, Mahaffey KW, Hennekens CH, Neaton JD, Pearson GD, Andersson
TL, Pfeffer MA, Ellenberg SS (2017) Data monitoring committees: promoting best practices to
address emerging challenges. Clin Trials 14(2):115–123. https://doi.org/10.1177/
1740774516688915. Epub 2017 Feb 1. PMID: 28359194; PMCID: PMC5380168
Harrington D, Drazen JM (2018) Learning from a trial stopped by a data and safety monitoring
board. N Engl J Med 378(21):2031–2032. https://doi.org/10.1056/NEJMe1805123
Herson J (2017) Data and safety monitoring committees in clinical trials, 2nd edn. Taylor & Francis,
Boca Raton, FL
National Health and Medical Research Council (2018) Data safety monitoring boards (DSMBs).
(978-1-86496-004-4). National Health and Medical Research Council. Retrieved from www.
nhmrc.gov.au/guidelines-publications/EH59C
National Institutes of Health (1998) NIH policy for data and safety monitoring. National Institutes
of Health. Retrieved from https://grants.nih.gov/grants/guide/notice-files/not98-084.html
Nissen SE, Wolski KE, Prcela L et al (2016) Effect of naltrexone-bupropion on major adverse
cardiovascular events in overweight and obese patients with cardiovascular risk factors: a
randomized clinical trial. JAMA 315(10):990–1004. https://doi.org/10.1001/jama.2016.1558
Pharmaceutical and Food Safety Bureau (2013) Guideline on data monitoring committee (PFSB/
ELD notification No.0404-1). Ministry of Health, Labour and Welfare, Japan. Retrieved from
https://www.pmda.go.jp/files/000232300.pdf
Pocock S, Wilhelmsen L, Dickstein K, Francis G, Wittes J (2004) The data monitoring experience
in the MOXCON trial. Eur Heart J 25(22):1974–1978. https://doi.org/10.1016/j.ehj.2004.
09.015
Swedberg K, Borer JS, Pitt B, Pocock S, Rouleau J (2016) Challenges to data monitoring
committees when regulatory authorities intervene. N Engl J Med 374(16):1580–1584. https://
doi.org/10.1056/NEJMsb1601674
Tanzania Food and Drugs Authority (2017) Guidelines for application to conduct clinical trials in
Tanzania, 3rd edn. Retrieved from https://www.tmda.go.tz/uploads/publications/
en1554368837-TANZANIA%20CLINICAL%20TRIAL%20GUIDELINES-%202017.pdf
U.S. Food and Drug Administration (2006) Guidance for clinical trial sponsors: establishment and
operation of clinical trial data monitoring committees. March 2006. Available at: https://www.
fda.gov/media/75398/download
U.S. Securities and Exchange Commission (2015) Form 8-K. Orexigen Therapeutics, Inc. File
number 001-33415. March 3, 2015. Available at: https://www.sec.gov/Archives/edgar/data/
1382911/000119312515074251/d882841d8k.htm
Wayant C, Vassar M (2018) A comparison of matched interim analysis publications and final
analysis publications in oncology clinical trials. Ann Oncol 29:2384–2390. https://doi.org/10.
1093/annonc/mdy447
Woloshin S, Schwartz LM, Bagley PJ, Blunt HB, White B (2018) Characteristics of interim
publications of randomized clinical trials and comparison with final publications. JAMA 319
(4):404–406. https://doi.org/10.1001/jama.2017.20653
Post-Approval Regulatory Requirements
38
Winifred Werther and Anita M. Loughlin
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 700
History of US and EU Regulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 702
History of Post-Approval Studies in the USA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 702
History and Legal Framework of Post-Approval Studies in Europe . . . . . . . . . . . . . . . . . . . . . . . 703
Post-Approval Terminology and Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 706
Post-Approval Study Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 706
Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 706
Observational Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 713
Enforcement of Post-Approval Studies by Regulatory Agencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719
US PMC and PMR Enforcement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 720
EU PAM Enforcement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 720
Systematic Reviews of Post-Approval Studies in the USA and EU . . . . . . . . . . . . . . . . . . . . . . . . . . . 721
Reviews of Post-Approval Studies in the USA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 721
Reviews of Post-Approval Studies in the EU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 721
Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 722
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 723
Abstract
Health authorities throughout the world have regulations for requesting additional
research in the post-approval setting. This chapter focuses on the regulations in
the USA and European Union (EU). The history of post-approval studies can be
traced through changing regulations enforced by the US Food and Drug Admin-
istration (FDA) and the EU European Medicines Agency (EMA).
W. Werther (*)
Center for Observational Research, Amgen Inc, South San Francisco, CA, USA
e-mail: [email protected]
A. M. Loughlin
Corrona LLC, Waltham, MA, USA
e-mail: [email protected]
Keywords
Post approval · Post marketing · Post authorization · Pharmacovigilance ·
Pharmacoepidemiologic
List of Abbreviations
CFR Code of Federal Regulations
EMA European Medicines Agency
EU European Union
FDA Food and Drug Administration
FDAAA Food and Drug Administration Amendments Act
MAH Market authorization holder
PAES Post-authorization efficacy study
PAM Post-authorization measure
PAS Post-authorization study
PASS Post-authorization safety study
PMC Post-marketing commitment
PMR Post-marketing requirement
PREA Pediatric Research Equity Act
REMS Risk evaluation and mitigation strategy
USA United States
Introduction
The collection of information on safety and efficacy of medical treatments often does
not end with the approval of medical products. Health authorities throughout the
world have regulations for requesting additional research in the post-approval setting.
First, in the USA, the 1997 FDA Modernization Act introduced the requirement for
post-approval studies, which are referred to in US regulations as post-marketing
studies or post-marketing requirements (PMRs). In 1999, the FDA
published the rule regarding post-marketing commitment (PMC), which was defined
as studies, including clinical trials, conducted by an applicant after FDA has approved
a drug for marketing or licensing that were intended to further refine the safety,
efficacy, or optimal use of a product or to ensure consistency and reliability of product
quality. In 2006, as a complement to the final rule from 1999, the FDA issued a
guidance for industry on PMCs. In 2007, the FDA Amendments Act (FDAAA), which
clarified reasons for post-marketing studies, was signed into law by the US president.
FDAAA included a new provision that gave the FDA the authority to require a risk
evaluation and mitigation strategy (REMS), in addition to PMRs and PMCs. In 2011, a
new guidance for industry was released on post-marketing studies and clinical trials
with implementation into Section 505(o)(3) of the Federal Food, Drug, and Cosmetic
Act which stated that the FDA can require post-approval clinical trials and studies
(FDA 2011). In this guidance, clinical trials were defined as any prospective investi-
gation in which the applicant or investigator determines the method of assigning the
drug product or other interventions to one or more human subjects, and studies were
defined as all other investigations.
In the USA, with the 2007 FDAAA, there was a change in the rationale for
requesting a post-approval study from a sponsor. Before 2007, post-marketing
studies could be required for three reasons: under Accelerated Approval, under the
Animal Rule, and under the Pediatric Research Equity Act. After FDAAA in 2007,
the reasons for a post-marketing study were broadened to:

• To assess a known serious risk related to the use of the drug
• To assess signals of serious risk related to the use of the drug
• To identify an unexpected serious risk when available data indicate the potential for serious risk

There are four mechanisms that provide the FDA with the authority to require
PMRs: Accelerated Approval, the Animal Rule, the Pediatric Research Equity Act,
and FDAAA. These authorities are described in Table 1.
History and Legal Framework of Post-Approval Studies in Europe
The European Medicines Agency (EMA), the EU Member States, and the European
Commission are responsible for implementing and operating the legislation that
deals with post-approval studies, as referred to in EMA legislation as post-authori-
zation studies, including pharmacovigilance studies. Pharmacovigilance studies are
research studies with the objective of studying drug safety. The EMA plays a key
role in coordinating activities relating to post-approval studies by working with a
wide range of stakeholders including the European Commission, pharmaceutical
companies, national medicines regulatory authorities, patients, and healthcare pro-
fessionals to ensure effective implementation and operation of the pharmacov-
igilance legislation, which includes post-authorization safety studies (PASS). Post-
authorization efficacy studies (PAES) are another type of study conducted in the
post-approval setting. However, PAES are not part of the pharmacovigilance
legislation.
Per the directive and regulation implemented in 2012 and the EMA website
(EMA 2020a), a PASS is a study that is carried out after a drug has been authorized.
The purpose of the PASS is to evaluate the safety and benefit-risk profile of a drug
and support regulatory decision-making. A PASS aims to (1) identify, characterize,
or quantify a safety hazard; (2) confirm the safety profile of a drug; or (3) measure the
effectiveness of risk management measures. Risk management measures are activ-
ities carried out by the sponsor to assess the risks associated with drugs. Risk
management measures are tracked in risk management plans (RMP). Sponsors are
required to submit an RMP to the EMA when applying for a marketing authorization.
A PAS design is either a clinical trial or an observational study. A PAS is either imposed or
voluntary. The EMA’s Pharmacovigilance Risk Assessment Committee (PRAC) is
responsible for assessing the protocols of imposed PASS and for assessing their
results. A voluntary PASS is conducted by sponsors on their own initiative. Non-
imposed PAS that are requested by the EMA in RMPs are deemed voluntary PASS.
An RMP includes activities agreed upon by EMA and sponsors to continually study
the risks of a drug.
EMA has published guidance on the format and content of study protocols and
final study reports for non-interventional studies, together with the PRAC assess-
ment report templates. The guidance is based on Commission Implementing Regu-
lation No 520/2012 of 19 June 2012, which was implemented in January 2013. For
clinical trials, sponsors should follow the instructions in volume 10 of the rules
governing medicinal products in the European Union (EU). Further guidance for
PASS is available in the following document: Guideline on good pharmacovigilance
practices: Module VIII – Post-authorisation safety studies (EMA 2017).

Within the scope of Delegated Regulation (EU) No 357/2014, a PAES may be imposed in the following situations:
• At the time of granting the initial marketing authorization (MA) where concerns
relating to some aspects of the efficacy of the medicinal product are identified and
can be resolved only after the medicinal product has been marketed [Art 9(4)(cc)
of REG/Art 21a(f) of DIR]
• After granting of a MA where the understanding of the disease or the clinical
methodology or the use of the medicinal product under real-life conditions
indicates that previous efficacy evaluations might have to be revised significantly
[Art 10a(1)(b) of REG/Art 22a(1)(b) of DIR]
PAES may also be imposed in specific situations outside of the scope of Delegated
Regulation (EU) No 357/2014.
The recommended study designs for PAES include randomized and non-random-
ized designs. Consideration for clinical trial and observational study methodologies
is described in the PAES guidance document (EMA 2016) as follows:
Clinical trial design options for the design of PAES could include explanatory and
pragmatic trials. Explanatory trials generally measure the benefit of a treatment
under ideal conditions to establish whether the treatment works. Pragmatic trials
examine interventions under circumstances that approach real-world practice,
with more heterogeneous patient populations, possibly less-standardized treat-
ment protocols and delivery in routine clinical settings as opposed to a research setting.

Post-Approval Terminology and Definitions
Terminology used by the FDA and EMA is described in the tables below and includes
term, definition, terminology usage, examples, timing, reporting, and registration.
Briefly, the FDA uses the term risk evaluation and mitigation strategy (REMS) to
track post-approval safety studies that can be either post-marketing requirements
(PMRs) or post-marketing commitments (PMCs). However, PMRs and PMCs can be
conducted outside of REMS. Timing and reporting are described in the tables. The
EMA uses post-authorization measures (PAMs) to track post-authorization safety
studies (PASS) and post-authorization efficacy studies (PAES) (Tables 2 and 3).
Post-Approval Study Designs

Clinical Trials
Providing results from clinical trials in the post-approval setting can be necessary
under many conditions. To name a few, confirmatory efficacy findings are required,
patients with unique characteristics are studied, patients with new indications are
studied, and efficacy measures have changed significantly during the conduct of the
registrational trials.
Clinical trial designs may include pragmatic trials as well as synthetic trials.
Large pragmatic trials are trials in which simple designs are used to study large
numbers of patients with high external validity (Patsopoulos 2011). Synthetic trials
are clinical trials that use real-world data or pooled clinical trial data to recreate
clinical trial arms with the intent to provide comparative analyses within the data
source or as comparators to external clinical trials. Synthetic trials provide
comparative effectiveness results by analyzing existing data sources without
collecting new information (Berry et al. 2017; Zauderer et al. 2019).

Table 2 Terminology for post-approval studies for the United States

Post-marketing requirement (PMR)
Definition: Studies or clinical trials required for any or all of three purposes: to assess a known serious risk related to the use of the drug; to assess signals of serious risk related to the use of the drug; to identify an unexpected serious risk when available data indicate the potential for serious risk
Terminology usage: Describes studies or clinical trials, including those required by four authorities: FDAAA, the Pediatric Research Equity Act, Accelerated Approval, and the Animal Rule
Examples/components: Meta-analyses; clinical trials with a safety endpoint evaluated; safety studies in animals; in vitro laboratory safety studies; pharmacokinetic studies or clinical trials; studies or clinical trials to evaluate drug interactions or bioavailability
Timing: When FDA becomes aware of new safety information
Registration: ClinicalTrials.gov for clinical trials and studies

Post-marketing commitment (PMC)
Definition: Studies (including clinical trials), conducted by an applicant after FDA has approved a drug for marketing or licensing, that were intended to further refine the safety, efficacy, or optimal use of a product or to ensure consistency and reliability of product quality
Terminology usage: Describes studies and clinical trials that applicants have agreed to conduct, but that will generally not be considered as meeting a statutory purpose and so will not be required
Examples/components: Drug and biologic quality studies; pharmacoepidemiologic studies on natural history of disease or background rates for adverse events in a population not treated with the drug; studies and clinical trials for non-serious risk or safety signals; clinical trials with a primary endpoint related to further defining efficacy
Timing: At time of approval
Reporting: Annual to FDA
Registration: Voluntary registration at ClinicalTrials.gov for clinical trials and studies

Reference: FDA REMS Overview_121110-cln.pdf from FDA website https://www.fda.gov/aboutfda/transparency/basics/ucm325201.htm

Table 3 Terminology for post-approval studies for the European Union

Post-authorization measures (PAM)
Definition: Additional data post authorization, as it is necessary from a public health perspective to complement the available data with additional data about the safety and, in certain cases, the efficacy or quality of authorized medicinal products
Terminology usage: PAMs fall within one of the following categories [EMA codes]: specific obligation [SOB]; Annex II condition [ANX]; additional pharmacovigilance activity in the risk management plan; legally binding measure (e.g., Corrective and Preventive Action (CAPA), pediatric [P46] submissions, MAH's justification for not submitting a requested variation); recommendation [REC], e.g., quality improvement

Post-authorization safety study (PASS)
Definition: Any study relating to an authorized medicinal product conducted with the aim of identifying, characterizing, or quantifying a safety hazard, of confirming the safety profile of the medicinal product, or of measuring the effectiveness of risk management measures
Terminology usage: Includes clinical trials and non-interventional studies; may be imposed (required) or voluntary (not required)
Examples/components: PASS categories in the RMP, for clinical trials or non-interventional studies — Category 1: imposed PASS; Category 2: specific obligation; Category 3: required as part of the RMP (categorization is from GVP V.B.6.3)
Timing: As a condition of granting marketing authorization, or after granting market authorization if there are concerns about risks of the authorized medicinal product
Reporting: Imposed, non-interventional PASS reporting to PRAC within 12 months of end of data collection; abstract of results posted to the EU PAS Register (www.encepp.eu)
Registration: Clinical trials must be registered at the European Union Clinical Trials Portal and Database (www.clinicaltrialsregister.eu); non-interventional studies must be registered at the European Union Post-Authorization Study (EU PAS) Register (www.encepp.eu)

Post-authorization efficacy study (PAES)
Definition: Post-authorization efficacy studies (PAES) of medicinal products are studies conducted within the authorized therapeutic indication to complement available efficacy data in the light of well-reasoned scientific uncertainties on aspects of the evidence of benefits that should be, or can only be, addressed post-authorization (not a legal definition; a working definition from EMA/PDCO/CAT/CMDh/PRAC/CHMP/261500/2015, Draft scientific guidance on post-authorisation efficacy studies, first version)
Terminology usage: Clinical trials and non-interventional studies
Timing: From 2014 to 2017, all PAES have been imposed for conditional marketing authorization or marketing authorization under exceptional circumstance and other conditions
Registration: Clinical trials must be registered at the European Union Clinical Trials Portal and Database
There are operational concerns when conducting trials in the post-approval
setting including the need to maintain equipoise. In the accelerated/expedited
approval setting, ongoing trials are typically listed as required post-approval trials
so that the trials continue to completion. This is especially important if approvals are
granted on surrogate outcomes. The ongoing trials will often provide hard outcomes,
such as survival, as compared to disease progression.
Clinical trials in special populations, such as pediatric patients, may be conducted
in the post-approval setting as part of a pediatric investigation plan. For diseases that
occur rarely in pediatric populations, these trials are not typically completed at the
time of filing for regulatory approval and thus continue in the post-approval setting.
Observational Studies
Table 5 Sources for guidance for observational studies for drug safety research
Guidance for safety studies
European Network of Centers for Pharmacoepidemiology and Pharmacovigilance (ENCePP),
Guidelines on Good Pharmacovigilance (GVP) – Module VIII – Post-authorization safety studies
(Revision 3). 2017. Online: https://www.ema.europa.eu/en/documents/scientific-guideline/
guideline-good-pharmacovigilance-practices-gvp-module-viii-post-authorisation-safety-studies-
rev-3_en.pdf. Accessed 12 Jun 2020.
FDA. Best Practices for Conducting and Reporting Pharmacoepidemiology Safety Studies
Using Electronic Healthcare Data Sets. May 2013. Online: https://www.fda.gov/regulatory-
information/search-fda-guidance-documents/best-practices-conducting-and-reporting-
pharmacoepidemiologic-safety-studies-using-electronic. Accessed 12 Jun 2020.
Guidelines for pharmacoepidemiologic studies
ENCePP, Guidelines on Methodological Standards in Pharmacoepidemiology (Revision 7),
2010. Online: http://www.encepp.eu/standards_and_guidances/documents/
ENCePPGuideonMethStandardsinPE_Rev7.pdf, Accessed 12 Jun 2020.
International Society for Pharmacoepidemiology (ISPE), Guidelines for Good
Pharmacoepidemiology Practice (Revision 3), 2015. Online: https://www.pharmacoepi.org/
resources/policies/guidelines-08027/#1. Accessed 12 Jun 2020.
Berger ML, Sox H, Willke RJ, et al. Good practices for real-world data studies of treatment and/
or comparative effectiveness: Recommendations from the joint ISPOR-ISPE Special Task Force
on real-world evidence in health care decision making. Pharmacoepidemiol Drug Saf. 2017;26
(9):1033–1039.
Berger ML, Dreyer N, Anderson F, Towse A, Sedrakyan A, Normand SL. Prospective
observational studies to assess comparative effectiveness: the ISPOR good research practices task
force report. Value Health. 2012;15(2):217–230.
Dreyer NA, Schneeweiss S, McNeil B, et al.; on behalf of the GRACE Initiative. GRACE
principles: recognizing high-quality observational studies of comparative effectiveness. Am J
Manag Care 2010;16:467–71.
The design of the observational study should be guided by the research question,
the availability of data to answer the question, and an understanding of the limita-
tions of the study to be conducted. This section will describe study designs com-
monly used for post-authorization studies but is not intended to replace textbooks on
epidemiologic study design.
A thorough study protocol provides the parameters of the observational study
including the following sections that define the design and conduct of the study
(FDA 2013; ENCePP 2010; ISPE 2015; Dreyer et al. 2010).
• Research Question and Objectives – this section defines both the issue that
leads to the study and the specific hypotheses or outcomes that will be
measured. This should include both primary and secondary objectives.
• Study Design – this section provides the overall research design (e.g., cohort,
case-control) that will be used to answer the research question.
• Study Population – this section describes the source population and the com-
parison groups. In defining the source population, the protocol will outline the
inclusion and exclusion criteria for the study population. For example, the study
population may restrict to patients with a specific diagnosis likely to receive the
drug or may include a subset of this population (e.g., pregnant women and their
infants). Within the source population, the comparison groups in post-approval
studies are defined by either exposure to a specific drug(s) of interest or by the
presence of specific safety outcome of interest.
• Data Source – this section describes the data used to assess the research question.
Data source includes primary data collection, patient-based or exposure-based
registries, and secondary data sources (e.g., administrative claims data and elec-
tronic health records).
• Data Collection and Covariates – these sections describe both how drug
exposures of interest and safety outcomes are operationally defined and mea-
sured, as well as how other risk factors, comorbidity, co-medication, potential
confounders, and effect modifying variables are defined and measured.
• Analysis Plan – this section describes both how confounding or other biases
are assessed and/or controlled and the statistical methods used to describe and
compare the comparison groups, including the occurrence of the outcome (e.g.,
incidence) and the measure of association (e.g., relative risk, odds ratio, mean
difference) and its confidence interval. It may describe additional planned
analyses, such as intention-to-treat analyses, as-treated analyses, subgroup analyses,
sensitivity analyses, and meta-analytic techniques to combine findings across
data sources.
• Limitations of Research Methods – this section describes any potential limita-
tions of study design, data sources, and analytic methods, including issues of
confounding, bias, generalizability, and random error, as well as efforts made to
reduce limitations with the proposed research plan.
Changes to the study protocol should be documented in the final report, and the impact of
the changes on the interpretation of the study should be discussed (FDA 2013).
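As a trivial illustration of how such an outline can be used operationally, the following sketch (section names taken from the list above; the draft contents are hypothetical) checks a draft protocol against the expected sections:

# Required sections, as listed above.
PROTOCOL_SECTIONS = [
    "Research Question and Objectives",
    "Study Design",
    "Study Population",
    "Data Source",
    "Data Collection and Covariates",
    "Analysis Plan",
    "Limitations of Research Methods",
]

draft = {"Study Design": "Retrospective cohort", "Data Source": "Administrative claims"}
missing = [s for s in PROTOCOL_SECTIONS if s not in draft]
print("Sections still to be drafted:", missing)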
One strength of primary data collection, such as in registries, is that patients are
well characterized over time, and needed information on patient exposure, outcomes,
and potentially confounding factors can be captured. A second strength for registries
is the ability to collect information on patients using survey tools and validated
instruments. Patient-reported health data may include quality of life, symptoms
(e.g., pain scores or fatigue), use of over-the-counter medications, patient prefer-
ences, behavioral data, family history, and biological specimens. Yet, there are
limitations; following many patients for a long time and the collection of prospective
data are both time-consuming and very expensive. As in clinical trials, protocol-
specified inclusion and exclusion criteria help to limit systematic selection bias in
registries, and the use of validated and standardized assessments can reduce mis-
classification of disease and outcomes; yet while patient-reported events are essential
to registry data, these data are subjective and certain forms of bias, such as recall
bias, may influence the data.
Secondary data sources include administrative claims databases and electronic health record
systems (e.g., Optum EHR Research Database, Flatiron) that integrate electronic
medical records across different systems into a common data platform, inclusive of
data from a generalizable national sample of private practices, hospitals, and inte-
grated health networks.
Cohort Studies
The efficiencies gained by using large healthcare databases make cohort studies a
viable alternative to clinical trials for large comparative effectiveness and safety
studies. Cohort studies identify a population at risk and an exposure to medical
products of interest and follow patients over time for the occurrence of events. In
cohort studies, the comparison cohorts are selected from the same population at risk
yet are unexposed at the time of enrollment into the cohort and are similarly followed
over time for the occurrence of events. Cohort studies provide the opportunity to
determine the incidence rate of adverse events in addition to the relative risk of an
adverse event. They are useful for identifying multiple events in the same study.
In addition, cohort studies are useful for examining safety concerns in special
populations, such as children, the elderly, pregnant women, or patients with
comorbid conditions, groups that are often underrepresented in clinical trials
(EMA 2017; FDA 2013).
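To make these quantities concrete, the following sketch in R computes incidence rates and their ratio from a hypothetical cohort; all counts and person-time values are invented for illustration.

# Hypothetical cohort: adverse events and person-years of follow-up
exposed_events <- 30; exposed_py <- 5000
unexposed_events <- 12; unexposed_py <- 4800

# Incidence rates, expressed per 1,000 person-years
ir_exposed <- exposed_events / exposed_py * 1000        # 6.0
ir_unexposed <- unexposed_events / unexposed_py * 1000  # 2.5

# Relative risk of the adverse event (here, an incidence rate ratio)
ir_exposed / ir_unexposed                               # 2.4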
Prospective and retrospective cohort studies serve different purposes in the post-
approval setting. Prospective cohort studies are used for safety studies early after
approval and release of a drug into the market. Retrospective cohort studies are
conducted in secondary data sources when a drug has been on the market for some
time and there is a new safety concern. See Table 4 for the strengths and weaknesses
of prospective and retrospective cohort study designs.
Confounding results from an imbalance of determinants of disease (or their proxies) across compared
groups (FDA 2013). Channeling refers to the situation where drugs are prescribed to
patients differently based on the presence or absence of factors prognostic of patient
outcomes. Confounding by indication is a type of channeling bias that occurs when the
indication, which is associated with drug exposure, is an independent risk factor for the
outcome. Biases that threaten pharmacoepidemiologic safety studies conducted in
secondary data sources, and methods to handle these biases, need to be taken into
consideration when planning these studies.
New-user designs, active-comparator designs, and matching by disease risk score
are methods that reduce these biases, in that the comparison is made between
patients with the same indication initiating different treatments (ENCePP
2010).
Study design choices that make the study groups more similar are important tools
for controlling confounding and bias. The goal of the study design is to
facilitate comparisons of people with a similar chance of benefiting from the treatment
or experiencing harm. A few epidemiologic and statistical methods are used to
handle confounding in pharmacoepidemiologic studies (e.g., restriction, matching,
adjustment, and weighting). Propensity score (PS) matching and inverse probability
of treatment weighting (IPTW) using the PS are two common ways to reduce bias
in comparative safety studies using real-world large secondary data sources
(Austin 2011; Austin and Stuart 2015; Rosenbaum and Rubin 1983).
The PS is defined as the conditional probability of being treated with the drug of
interest given an observed set of pretreatment characteristics. The PS is typically
estimated using logistic regression, with treatment group as the dependent variable.
Potential independent variables in the logistic regression include a priori specified
characteristics, potential confounding factors, and effect-modifying factors. Exposed
and unexposed treatment groups are then matched on the PS, for example using a
greedy matching algorithm. Treatment groups matched by PS should be well balanced
with respect to known (and possibly unknown) confounders; therefore, the outcomes
observed across treatment groups can be directly compared.
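As a minimal sketch in R of the approach just described, the code below simulates a confounded treatment assignment, estimates the PS by logistic regression, and performs greedy 1:1 nearest-neighbor matching; the variable names and simulated data are illustrative, and real analyses often rely on dedicated matching packages.

set.seed(42)
n <- 2000
age <- rnorm(n, 60, 10)   # illustrative pretreatment covariates
male <- rbinom(n, 1, 0.5)
# Treatment depends on covariates, creating confounding by design
treat <- rbinom(n, 1, plogis(-4 + 0.06 * age + 0.3 * male))

# PS: conditional probability of treatment given covariates, estimated
# by logistic regression with treatment group as the dependent variable
ps <- fitted(glm(treat ~ age + male, family = binomial))

# Greedy 1:1 nearest-neighbor matching on the PS, without replacement
treated <- which(treat == 1)
pool <- which(treat == 0)
matched <- integer(length(treated))
for (k in seq_along(treated)) {
  j <- pool[which.min(abs(ps[pool] - ps[treated[k]]))]
  matched[k] <- j
  pool <- setdiff(pool, j)
}

# Covariate balance after matching (difference in mean age)
mean(age[treated]) - mean(age[matched])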
When they are imposed, post-approval studies are reviewed for compliance by the
regulatory agencies. Clinical trials that are ongoing at the time of approval are often
classified as PMCs in the USA or PAMs in the EU. Findings of these trials
can be submitted to the health authorities for addition to the prescribing information,
including expansion of the indications and/or updates to the efficacy data. In
contrast, imposed observational studies for safety concerns can lead to changes in
prescribing information and additions to the list of potential adverse effects. The FDA
and EMA track progress on PMCs/PMRs and PAMs, respectively.
In the USA, PMC and PMR studies are registered at ClinicalTrials.gov, whether
they are clinical trials or observational studies. In Europe, the EMA publishes the
protocols, abstracts, and final study reports of PAS in the EU PAS register hosted on
the ENCePP website.
The FDA publishes an annual report that provides a summary of the progress of
PMC and PMR that were agreed upon at the time of medicinal product approval.
This annual report is required under the FDA Modernization Act of 1997 and
is published in the Federal Register. The report includes data from the PMR/PMC
database maintained by the FDA (2020b). The PMR/PMC database is searchable
and available to the public at the FDA website https://www.accessdata.fda.gov/
scripts/cder/pmc/.
The most recent FDA annual report covers fiscal year 2018. PMC/PMRs are
categorized as pending, ongoing, delayed, terminated, submitted, fulfilled, and
released. In addition, PMRs/PMCs may be characterized as open or closed. Open
PMRs/PMCs comprise those that are pending, ongoing, delayed, submitted, or termi-
nated, whereas closed PMRs/PMCs are either fulfilled or released. Open PMRs are
described as on- or off-schedule. On-schedule PMRs/PMCs are those that are pending,
ongoing, or submitted. Off-schedule PMRs/PMCs are those that have missed one of
the milestone dates in the original schedule and are categorized as either delayed or
terminated.
The fiscal year 2018 annual report shows that 69% of PMR/PMC annual status
reports were received on time. Among those that were open but not yet due, 79% of
new drug application PMRs and 86% of biologics license application PMRs were
progressing on schedule, and most open PMCs – 76% for new drug applications and
84% for biologics license applications – were also on schedule (FDA 2019).
EU PAM Enforcement
The EMA assesses compliance to specific obligations specified in PAMs through the
analysis of their database including due dates. The assessment is conducted annually,
for both the annual renewal (for conditional marketing authorizations) and annual
reassessment (for marketing authorizations under exceptional circumstances) (EMA
2020b).
When issues of non-compliance with a PAM are identified, the relevant EMA
committees can take one or more enforcement actions. In addition, if the medicinal
product has conditional approval, the marketing authorization can be varied,
suspended, or revoked.
Several analyses of post-approval studies conducted in the USA and EU have been
published. A summary of these analyses is described below.
In the USA, following accelerated or expedited approval, the FDA may require
additional confirmatory clinical trials. These required trials and their results have
been the subject of several systematic review studies (Beaver et al. 2018; Naci et al.
2017; Wallach et al. 2018).
Naci et al. studied characteristics of pre- and post-approval clinical trials reviewed
at the US FDA from 2009 to 2013 (Naci et al. 2017). They reported on trials for 22
drugs with 24 indications examined. Of these post-approval trials, 42% (10 of 24
indications studied) confirmed the efficacy of a previously analyzed surrogate
endpoint within 3 years of Accelerated Approval. Among the 58% of post-approval
trials that had not yet confirmed clinical benefit at the time of review, half were still
ongoing and the other half either were terminated, failed to confirm results,
or were delayed by more than a year. There were two indications where the post-
approval trial failed to confirm clinical benefit, yet these findings did not result in
reversal of the approval, and no additional trials were imposed.
Wallach et al. studied post-approval studies required by the US FDA between
2009 and 2012, allowing for at least 4 years of follow-up (Wallach et al. 2018).
Among the 134 prospective cohort studies, registries, and clinical trials, 102 (76%)
were registered on ClinicalTrials.gov. There were 65 completed studies, and 47
(72%) of these had reported results in ClinicalTrials.gov or in a publication. How-
ever, most (32 of 47, 68%) did not report results in the timeframe stated in the post-
marketing requirement.
Beaver et al. reviewed the accelerated US FDA approvals of oncology and
malignant hematology medicinal products from 1992 to 2017 (Beaver et al. 2018).
They identified 93 products with Accelerated Approvals for new indications. Of
these, 51 (55%) completed their post-approval studies and confirmed benefit within
a median of 3.4 years, while 5 (5%) post-approval studies concerned indications that
were withdrawn from the market. The remainder have ongoing confirmatory studies.
In the EU, several studies have been conducted to measure compliance with the EU
regulation and registration of PAS.
Blake et al. conducted the first review of the PAS register maintained by ENCePP
(Blake et al. 2011). This analysis included PAS required by the EMA between 2007 and
2009. As assessed in 2009, 60 PAS had been registered for 32 medicinal products:
52 had progressed to data collection, 7 were deemed no longer necessary by the
Committee for Medicinal Products for Human Use (CHMP), and the final study had
not yet received a decision. Of the 47 studies being “carried out” at the time of publication,
14 were randomized controlled trials; the remainder were either non-controlled trials
or observational (non-interventional) studies.
Engel et al. specifically studied PASS protocols reviewed under the EU
pharmacovigilance legislation (Engel et al. 2017). During 2012 to 2015, PRAC
reviewed 189 PASS protocols, of which 58 (31%) were imposed and 131 were
voluntary but required in the RMP for the medicinal product. Of 57 studies with
protocols available in ENCePP, 67% were primary data collection and 33% used
secondary data; in addition, the authors report that 65% did not include a comparator
population. Only 2 of the 57 protocols explicitly stated that hypothesis-testing analyses
were planned, suggesting that very few PASS used a clinical trial design. The
authors did not report results on interventional vs. non-interventional study design.
Health authorities throughout the world have regulations for requesting additional
research in the post-approval setting. This chapter focuses on the regulations in the
USA and EU. The history of post-approval studies can be traced through changing
regulations enforced by the FDA and EMA.
Specific terminology for post-approval studies is used by the FDA and EMA.
Briefly, the FDA uses the term risk evaluation and mitigation strategy (REMS) to
track post-approval safety studies that can be either post-marketing requirements
(PMRs) or post-marketing commitments (PMCs). However, PMR and PMC can be
conducted outside of REMS. The EMA uses post-authorization measures (PAM) to
track post-authorization safety studies (PASS) and post-authorization efficacy stud-
ies (PAES).
Post-approval studies are either clinical trials (interventional) or observational
(non-interventional) studies. The choice of study design may be influenced by the
strengths and weaknesses of the design options, as described in Table 4.
Imposed post-approval studies are reviewed for compliance by the regulatory
agencies. For clinical trials that are ongoing at the time of approval, often these are
classified as PMC in the USA or PAM in the EU. Findings of these trials can be
submitted to the health authorities for addition to the prescribing information. The
FDA and EMA both track progress on PMC/PMRs and PAMs, respectively.
In conclusion, post-approval studies are necessary to continually gather data on
the safety and effectiveness of approved drugs. These studies are regulated by health
authorities, included in registries (e.g., ClinicalTrials.gov, ENCePP), and tracked to
completion. Given the inclusion of post-approval studies in public databases, they
are the subject of systematic reviews.
Key Facts
Cross-References
▶ Introduction to Meta-Analysis
▶ Pragmatic Randomized Trials Using Claims or Electronic Health Record Data
▶ Regulatory Requirements in Clinical Trials
References
Austin PC (2011) An introduction to propensity score methods for reducing the effects of
confounding in observational studies. Multivariate Behav Res 46(3):399–424
Austin PC, Stuart EA (2015) Moving towards best practice when using inverse probability of
treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in
observational studies. Stat Med 34(28):3661–3679
Beaver JA, Howie LN, Pelosof L, Kim T, Liu J, Goldberg KB, Sridhara R, Blumenthal GM,
Farrell AT, Keegan P, Pazdur R, Kluetz PG (2018) A 25-year experience of US Food and Drug
Administration approval of malignant hematology and oncology drugs and biologics. JAMA
Oncol. https://doi.org/10.1001/jamaoncol.2017.5618. Published online March 1, 2018
Berger ML, Dreyer N, Anderson F, Towse A, Sedrakyan A, Normand SL (2012) Prospective
observational studies to assess comparative effectiveness: the ISPOR good research practices
task force report. Value Health 15(2):217–230
Berger ML, Sox H, Willke RJ, Brixner DL, Eichler HG, Goettsch W, Madigan D, Makady A,
Schneeweiss S, Tarricone R, Wang SV, Watkins J, Mullins CD (2017) Good practices for real-
world data studies of treatment and/or comparative effectiveness: recommendations from the
joint ISPOR-ISPE Special Task Force on real-world evidence in health care decision making.
Pharmacoepidemiol Drug Saf 26(9):1033–1039
Berry DA, Elanshoff M, Blotner S, Davi R, Beineke P, Chandler M, Lee DS, Chen LC, Sarkar S
(2017) Creating a synthetic control arm from previous clinical trials: Application to establishing
early end points as indicators of overall survival in acute myeloid leukemia (AML). ASCO
abstract. J Clin Oncol 35(15_Suppl):7021. https://doi.org/10.1200/JCO.2017.35.15_suppl.
7021. Published online May 30, 2017
Blake KV, Prilla S, Accadebled S, Guimier M, Biscaro M, Persson I, Arlett P, Blackburn S, Fitt H
(2011) European Medicines Agency review of post-authorisation studies with implications for
the European Network of Centres for Pharmacoepidemiology and Pharmacovigilance.
Pharmacoepidemiol Drug Saf 20:1021–1029
Blumenthal S (2017) The use of clinical registries in the United States: a landscape survey. EGEMS
(Wash DC) 5(1):26. https://doi.org/10.5334/egems.248. Published 2017 Dec 7
Chou R, Helfand M (2005) Challenges in systematic reviews that assess treatment harms. Ann
Intern Med 142(12 Pt 2):1090–1099
Dreyer NA, Schneeweiss S, McNeil B, Berger ML, Walker AM, Ollendorf DA, Gliklich RE, on
behalf of the GRACE Initiative (2010) GRACE principles: recognizing high-quality observa-
tional studies of comparative effectiveness. Am J Manag Care 16:467–471
Engel P, Almas MF, DeBruin ML, Starzyk K, Blackburn S, Dreyer NA (2017) Lessons learned on
the design and the conduct of Post-Authorization Safety Studies: review of 3 years of PRAC
oversight. Br J Clin Pharmacol 83:884–893
European Medicines Agency (2012) Legal framework: pharmacovigilance. Available at https://
www.ema.europa.eu/en/human-regulatory/overview/pharmacovigilance/legal-framework-
pharmacovigilance. Accessed 12 June 2020
European Medicines Agency (2016) Scientific guidance on post-authorisation efficacy studies.
Available at https://www.ema.europa.eu/en/documents/scientific-guideline/scientific-guidance-
post-authorisation-efficacy-studies-first-version_en.pdf. Accessed 12 June 2020
European Medicines Agency (2017) Guideline on good pharmacovigilance practice (GVP). Avail-
able at https://www.ema.europa.eu/en/documents/scientific-guideline/guideline-good-
pharmacovigilance-practices-gvp-module-viii-post-authorisation-safety-studies-rev-3_en.pdf.
Accessed 12 June 2020
European Medicines Agency (2020a) Post-authorisation safety studies (PASS). Available at https://
www.ema.europa.eu/en/human-regulatory/post-authorisation/pharmacovigilance/post-authori
sation-safety-studies-pass-0. Accessed 12 June 2020
European Medicines Agency (2020b) Post-authorisation measures: questions and answers. https://
www.ema.europa.eu/en/human-regulatory/post-authorisation/post-authorisation-measures-ques
tions-answers. Accessed 12 June 2020
European Network of Centers for Pharmacoepidemiology and Pharmacovigilance (ENCePP)
(2010) Guidelines on methodological standards in Pharmacoepidemiology (Revision 7), 2010.
Online: http://www.encepp.eu/standards_and_guidances/documents/ENCePPGuideonMeth
StandardsinPE_Rev7.pdf. Accessed 12 June 2020
European Network of Centers for Pharmacoepidemiology and Pharmacovigilance (ENCePP)
(2017) Guidelines on Good Pharmacovigilance (GVP) – Module VIII – Post-authorization
safety studies (Revision 3). Online: https://www.ema.europa.eu/en/documents/scientific-
guideline/guideline-good-pharmacovigilance-practices-gvp-module-viii-post-authorisation-
safety-studies-rev-3_en.pdf. Accessed 12 June 2020
Food and Drug Administration (2011) Guidance for industry postmarketing studies and clinical
trials – implementation of section 505(o)(3) of the Federal Food, Drug, and Cosmetic Act.
Available at https://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryIn
formation/Guidances/UCM172001.pdf or https://www.fda.gov/regulatory-information/search-
fda-guidance-documents/postmarketing-studies-and-clinical-trials-implementation-section-
505o3-federal-food-drug-and. Accessed 12 June 2020
Food and Drug Administration (2013) Best practices for conducting and reporting pharmacoepi-
demiology safety studies using electronic healthcare data sets. May 2013. Online: https://www.
fda.gov/regulatory-information/search-fda-guidance-documents/best-practices-conducting-
and-reporting-pharmacoepidemiologic-safety-studies-using-electronic. Accessed 12 June 2020
Food and Drug Administration (2018) FDA drug safety communication: new risk factor for
Progressive Multifocal Leukoencephalopathy (PML) associated with Tysabri (natalizumab).
Available at https://www.fda.gov/drugs/drug-safety-and-availability/fda-drug-safety-communi
cation-new-risk-factor-progressive-multifocal-leukoencephalopathy-pml. Accessed 12 June
2020
Food and Drug Administration (2019) FDA in brief: FDA issues annual report on efforts to hold
industry accountable for fulfilling critical post-marketing studies of the benefits, safety of new
drugs. https://www.fda.gov/news-events/fda-brief/fda-brief-fda-issues-annual-report-efforts-
hold-industry-accountable-fulfilling-critical-post. Accessed 12 June 2020
Food and Drug Administration (2020a) List of pregnancy exposure registries updated 17 Jan 2020.
Online: https://www.fda.gov/science-research/womens-health-research/list-pregnancy-expo
sure-registries. Accessed 20 Feb 2020
Food and Drug Administration (2020b) Postmarketing requirements and commitments: reports.
https://www.fda.gov/drugs/postmarket-requirements-and-commitments/postmarketing-require
ments-and-commitments-reports. Accessed 12 June 2020
Gliklich R, Dreyer N, Leavy M (eds) (2014) Registries for evaluating patient outcomes: a user’s
guide, 3rd edn. Two volumes. (Prepared by the Outcome DEcIDE Center [Outcome Sciences,
Inc., a Quintiles company] under Contract No. 290 2005 00351 TO7.) AHRQ Publication No.
13(14)-EHC111. Agency for Healthcare Research and Quality, Rockville. http://www.
effectivehealthcare.ahrq.gov/registries-guide-3.cfm
Goedecke T (2017) EU PASS/PAES Requirements for Disclosure. Available at https://www.ema.
europa.eu/en/documents/presentation/presentation-eu-pass/paes-requirements-disclosure-
thomas-goedecke_en.pdf. Accessed 12 June 2020
Hall GC, Sauer B, Bourke A, Brown JS, Reynolds MW, LoCasale R (2012) Guidelines for good
database selection and use in pharmacoepidemiology research [published correction appears in
Pharmacoepidemiol Drug Saf. 2012;21(11):1249. Casale, Robert Lo [corrected to LoCasale,
Robert]]. Pharmacoepidemiol Drug Saf 21(1):1–10. https://doi.org/10.1002/pds.2229
International Society for Pharmacoepidemiology (2015) Guidelines for good pharmacoepi-
demiology practice (Revision 3), 2015. Online: https://www.pharmacoepi.org/resources/
policies/guidelines-08027/#1. Accessed 12 June 2020
Kappos L, Bates D, Edan G, Eraksoy M, Garcia-Merino A, Grigoriadis N, Hartung HP, Havrdová
E, Hillert J, Hohlfeld R, Kremenchutzky M, Lyon-Caen O, Miller A, Pozzilli C, Ravnborg M,
Saida T, Sindic C, Vass K, Clifford DB, Hauser S, Major EO, O’Connor PW, Weiner HL, Clanet
M, Gold R, Hirsch HH, Radü EW, Sørensen PS, King J (2011) Natalizumab treatment for
multiple sclerosis: updated recommendations for patient selection and monitoring. Lancet
Neurol 10(8):745–758
Krumholz HM, Ross JS, Presler AH, Egilman DS (2007) What have we learnt from Vioxx? BMJ
334(7585):120–123. https://doi.org/10.1136/bmj.39024.487720.68
Naci H, Smalley KR, Kesselheim AS (2017) Characteristics of preapproval and postapproval
studies for drugs granted accelerated approval by the US Food and Drug Administration.
JAMA 318(7):626–636. https://doi.org/10.1001/jama.2017.9415
National Institutes of Health (2019) List of registries, last reviewed 18 Nov 2019. Available at:
https://www.nih.gov/health-information/nih-clinical-research-trials-you/list-registries.
Accessed 12 June 2020
Patsopoulos NA (2011) A pragmatic view on pragmatic trials. Dialogues Clin Neurosci
13:217–224
Prakash S, Valentine V (2007) Timeline: the rise and fall of Vioxx November 10, 2007. https://
www.npr.org/2007/11/10/5470430/timeline-the-rise-and-fall-of-vioxx
Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies
for causal effects. Biometrika 70(1):41–55
Suissa S (2007) Immortal time bias in observational studies of drug effects. Pharmacoepidemiol
Drug Saf 16(3):241–249
Suissa S (2008) Immortal time bias in pharmaco-epidemiology. Am J Epidemiol 167(4):492–499
Wallach JD, Egilman AC, Dhruva SS, McCarthy ME, Miller JE, Woloshin S, Schwartz LM, Ross
JS (2018) Postmarket studies required by the US Food and Drug Administration for new drugs
and biologics approved between 2009 and 2012: cross sectional analysis. BMJ 361:k2031
Zauderer MG, Grigorenko A, May P, Kastango N, Wagner I, Caroline A (2019) Creating a synthetic
clinical trial: comparative effectiveness analyses using electronic medical record. JCO Clin
Cancer Inform. https://doi.org/10.1200/CCI.19.00037. Published online June 21, 2019
Part IV
Bias Control and Precision
Controlling for Multiplicity, Eligibility,
and Exclusions 39
Amber Salter and J. Philip Miller
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 730
Multiplicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 730
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 730
Sources of Multiplicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 731
Adjustment for Single Sources of Multiplicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 732
Adjustments for Multiple Sources of Multiplicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734
Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 734
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 735
Eligibility and Exclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 735
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736
Abstract
Multiple comparison procedures play an important role in controlling the accuracy
of clinical trial results, while trial eligibility criteria and exclusions have the potential
to introduce bias and reduce external validity. This chapter introduces the issues
and sources of multiplicity and describes many of the procedures that can be used
to address multiplicity, primarily in the confirmatory clinical trial setting.
Additionally, trial inclusion/exclusion criteria and enrichment strategies
are reviewed.
Keywords
Multiple comparison procedures · Inclusion/exclusion criteria · Enrichment
strategies
Introduction
Clinical trial design continues to evolve and become increasingly complex due, in
part, to efforts to make the evaluation of new treatments more efficient. The use of
multiple outcomes, dose levels, and/or populations creates challenges for decision-
making, especially concern over drawing incorrect conclusions about the efficacy or
safety of a treatment. As multiplicity increases, the probability of making a false
conclusion increases. Multiple strategies have been developed to maintain
strong control over the error rate in clinical trials. Regulatory bodies such as the
Food and Drug Administration (FDA) and the European Medicines Agency (EMA)
have recognized this issue and provided guidance on aspects of multiplicity in
confirmatory clinical trials. In addition to issues of multiplicity, eligibility criteria
and exclusions have the potential to add bias and reduce the external validity of a
clinical trial.
Multiplicity
Introduction
Multiplicity increases the type I error in hypothesis testing, and the statistical
methods used to control multiplicity may differentially affect the power of a trial.
The effects on power should be examined in the trial planning stages.
Recognition of the importance of controlling multiplicity is increasing. While
not every confirmatory trial is conducted to obtain regulatory approval, the EMA
published a guidance document on multiplicity in clinical trials (EMA (European
Medicines Agency) 2017) and the FDA released guidance on multiple outcomes in
clinical trials in 2017 (FDA (U.S. Food and Drug Administration) 2017). The
development of these guidance documents was partly a result of sponsors' increased
efforts to improve the efficiency of clinical trials. This efficiency can be gained by
having a trial evaluate more than one outcome or more than one population.
However, the need to control multiplicity in this setting is critical to maintain
scientific rigor.
Safety measures are an important component in clinical trials, and specific safety
outcomes or concerns are based on experiences in earlier phase trials. Specific safety
outcomes are one type of safety measure and may be a specific outcome to be tested
in a Phase III trial in conjunction with an efficacy outcome. Other safety measures,
such as adverse events, may be considered more descriptive or exploratory in nature
(Dmitrienko and D’Agostino 2018). If adverse event data are compared between
groups, this could be considered a multiplicity issue with the potential to
identify false-positive safety concerns. Application of methods such as the double
false discovery rate has been proposed to more rigorously evaluate adverse events
(Mehrotra and Heyse 2004) and reduce the complexity of safety profiles, especially
in large drug or vaccine trials.
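The double false discovery rate procedure itself groups adverse events by body system; as a simpler sketch of the underlying idea, the R code below applies an ordinary Benjamini-Hochberg false discovery rate adjustment to a set of hypothetical adverse event p-values.

# Hypothetical unadjusted p-values from between-group comparisons
# of ten adverse event types (invented for illustration)
p_raw <- c(0.001, 0.004, 0.019, 0.030, 0.040, 0.070, 0.210, 0.450, 0.620, 0.880)

# Benjamini-Hochberg adjustment controls the expected proportion of
# flagged adverse events that are false positives
p_fdr <- p.adjust(p_raw, method = "BH")
data.frame(p_raw, p_fdr, flagged = p_fdr < 0.05)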
Statistical adjustment for multiplicity is not always necessary. The objectives of the
analysis need to be considered, and some scenarios do not require adjustment. For
instance, no adjustment is needed for co-primary outcomes, where success of the
trial requires every outcome to be significant at the nominal level, or for supplemental
analyses of a single outcome (e.g., covariate-adjusted or per-protocol analyses)
(FDA (U.S. Food and Drug Administration) 2017; Proschan and Waclawiw 2000).
While scenarios such as co-primary outcomes may not affect the type I error rate,
their possible effect on power needs to be addressed in the design stage.
Sources of Multiplicity
Multiplicity can arise from a single source or from several sources at once, for
example, evaluating several dose levels in both a general population and a targeted
subgroup of the population.
Different methods are used for single and multiple sources of multiplicity. The choice
of adjustment procedure draws on both clinical and statistical information. The
procedure chosen should be in line with the clinical trial objectives, and its effect
on statistical power should be investigated; simulations are often used to evaluate
the effect of the procedures on power.
Adjustment for Single Sources of Multiplicity
For single sources of multiplicity, adjustment methods fall into two main categories:
single-step and hypothesis-ordered methods. Single-step methods test all hypotheses
simultaneously, while ordered methods test hypotheses in a stepwise manner, with
the order either based on the data (the size of the p-values) or prespecified based on
strong clinical information or prior studies. These methods can step up or step down:
the significance level changes as the procedure progresses through the set of null
hypotheses because the error rate is transferred from each rejected null hypothesis to
the next. A step-up procedure orders the hypotheses from largest to smallest p-value.
A step-down procedure with data-driven ordering arranges the hypotheses from
smallest to largest p-value, and testing ceases when a hypothesis fails to be
rejected. Within these categories of adjustment methods, distributional information
about the hypothesis tests is relevant to the choice of multiplicity method. Increased
knowledge regarding the joint distribution of the test statistics among the hypotheses
being tested leads to more powerful procedures being chosen (Dmitrienko and
D’Agostino 2013). Nonparametric procedures make no assumptions regarding the
joint distribution while semiparametric procedures assume the hypothesis tests
follow a distribution but have an unknown correlation structure (Dmitrienko et al.
2013). Additionally, there are parametric procedures, such as Dunnett's test
(Dunnett 1955), which assume an explicit distribution for the joint distribution of
hypothesis tests and are associated with classical regression and analysis of variance
and covariance models.
Single-step procedures control the error rate using simple decision rules to
adjust the significance level. The Bonferroni correction is a classic example of a
single-step nonparametric multiplicity adjustment, in which the overall error rate is
divided by the number of tests to obtain an adjusted significance level
for all tests (α/m) (Dunn 1961). For example, if the overall error rate is 0.05 and three
tests are conducted, the adjusted significance level for each test is 0.0167.
The Bonferroni method can also be applied by assigning prespecified weights to
different tests to account for clinical importance or other factors in the multiplicity
adjustment (FDA (U.S. Food and Drug Administration) 2017).
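The arithmetic above is easily verified in R; base R's p.adjust applies the equivalent p-value inflation (the p-values below are hypothetical).

alpha <- 0.05
m <- 3
alpha / m                               # adjusted significance level: 0.0167

# Equivalently, inflate the p-values and compare them to the overall alpha
p_raw <- c(0.010, 0.020, 0.030)
p.adjust(p_raw, method = "bonferroni")  # 0.03 0.06 0.09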
Other single-step procedures include the Simes and Šidák semiparametric multiplicity
adjustment methods (Šidák 1967; Simes 1986); both are uniformly more
powerful than the Bonferroni procedure. The Simes procedure is a global test of the
null hypothesis that all individual null hypotheses are true.
Hypothesis-ordered methods can also follow an ordering prespecified before the
trial begins. The fixed-sequence method places the more important hypotheses first,
each tested at the trial significance level. If a hypothesis is rejected, then the next
hypothesis is tested; however, if a hypothesis fails to be rejected, testing ceases and
the remaining hypotheses fail to be
rejected. The fallback procedure is a more flexible approach to prespecified ordering
that allows for other hypotheses to be tested in the event the preceding hypothesis
fails to be rejected (FDA (U.S. Food and Drug Administration) 2017; Wiens 2003).
The fixed sequence of hypotheses is maintained, but the type I error is divided
among the hypotheses being tested. The division of the significance level uses
weights (wi) that are nonnegative and sum to 1. The first hypothesis is tested at the
adjusted significance level αw1 (i.e., rejected if p1 ≤ αw1); if it fails to be rejected,
the second hypothesis is tested at αw2. However, if the first hypothesis is rejected,
the second hypothesis is tested at α(w1 + w2), which for two hypotheses equals
the overall error rate, as the unused alpha is passed on to the next test in the sequence.
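A minimal sketch of the fallback procedure in R, assuming two or more hypotheses in a prespecified order; the p-values and weights are hypothetical.

# Fallback procedure (Wiens 2003): alpha from each rejected hypothesis
# is carried forward to the next hypothesis in the prespecified sequence
fallback <- function(p, w, alpha = 0.05) {
  reject <- logical(length(p))
  carry <- 0                          # unused alpha carried forward
  for (i in seq_along(p)) {
    level <- alpha * w[i] + carry
    reject[i] <- p[i] <= level
    carry <- if (reject[i]) level else 0
  }
  reject
}

# First hypothesis tested at 0.05 * 0.6 = 0.03 and rejected, so the
# second is tested at 0.05 * 0.4 + 0.03 = 0.05, the overall error rate
fallback(p = c(0.020, 0.045), w = c(0.6, 0.4))  # TRUE TRUE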
Adjustments for Multiple Sources of Multiplicity
The procedures for multiple sources of multiplicity carry the added complexity that
several sources must be addressed simultaneously. A common manifestation is
having multiple families of hypotheses in the form of a hierarchy of clinical trial
objectives (primary, secondary, and tertiary objectives). The strategy usually
employed in this setting is
called a gatekeeping procedure and tests the hypotheses in the first (primary objec-
tives) family with a single source adjustment method. The second family of hypoth-
eses is tested with a multiplicity adjustment only if the primary family has
demonstrated statistical success. The first family of hypothesis tests acts as a
gatekeeper to testing the second family of hypotheses. The gatekeepers can be
designed to be serial or parallel where serial gatekeepers require all hypotheses in
the first family to be rejected before proceeding to the second family of hypotheses,
while parallel gatekeepers only require at least one hypothesis to be rejected. These
procedures allow for the error rate to be transferred to subsequent families of testing.
These approaches can be further extended to allow retesting by transferring error
back from subsequent families to previous families (e.g., from the second family of
hypotheses back to the first). The choice of procedure again depends on the
clinical objectives of the trial. Trials which have used these procedures include the
lurasidone trial in schizophrenia and CLEAN-TAVI in severe aortic stenosis
(Haussig et al. 2016; Meltzer et al. 2011).
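A minimal sketch in R of a serial gatekeeper with a Bonferroni adjustment within each family; the families and p-values are hypothetical, and real gatekeeping strategies are prespecified in far more detail.

# Serial gatekeeping: the secondary family is tested only if every
# hypothesis in the primary family is rejected (Bonferroni within family)
serial_gatekeeper <- function(p_primary, p_secondary, alpha = 0.05) {
  rej_primary <- p_primary <= alpha / length(p_primary)
  rej_secondary <- if (all(rej_primary)) {
    p_secondary <= alpha / length(p_secondary)
  } else {
    rep(FALSE, length(p_secondary))   # gate stays closed
  }
  list(primary = rej_primary, secondary = rej_secondary)
}

serial_gatekeeper(p_primary = c(0.010, 0.020),
                  p_secondary = c(0.030, 0.004))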
Software
Summary
Multiplicity issues arise from various sources in clinical trials and are defined as the
evaluation of different aspects of treatment efficacy simultaneously (Dmitrienko
et al. 2013). The more commonly encountered multiplicity problems are found in
the use of multiple outcomes, composite outcomes and their components, multiple
doses, and multiple subgroups or populations. One or a combination of these may be
found in a clinical trial, and as the number of multiple comparisons increases, the
probability of making a false conclusion, or type I error, increases. This inflation of
the type I error has the consequence of incorrectly concluding that a treatment is
efficacious or safe. Many methods for controlling multiplicity have been
developed (Dmitrienko et al. 2013). These methods range from the simple to the
complex, the latter designed to handle multiple sources of multiplicity in a clinical
trial. The choice of adjustment should be based on clinical and statistical
information and predefined in the statistical analysis plan for a clinical trial (Gamble
et al. 2017).
Eligibility and Exclusion
A clinical trial aims to have a sample which is representative of the population that
would have the treatment applied clinically if the trial is positive. The selection of
individuals for inclusion into a clinical trial is based on predefined eligibility criteria.
Criteria in clinical trials may be used to limit the heterogeneity of the trial population,
to exclude individuals unlikely to provide the outcome measure (e.g., individuals
with comorbidities at high risk of dying from an unrelated cause prior to the outcome
assessment), and to address safety concerns (e.g., pregnant women or individuals at
increased risk of an adverse outcome). Yet the potential consequences of eligibility
criteria are selection bias in the study population and reduced external
validity for the trial. One of the primary ways to control selection bias is through
randomization and allocation concealment. By using randomization to assign indi-
viduals to a treatment or intervention, there will be balance in the known and
unknown factors, on average, between the groups. Yet, randomization alone does
not completely eliminate selection bias. There is still a chance for individuals to be
selectively enrolled in a trial should those in charge of recruitment know what the
next treatment allocation will be. Without concealment, those enrolling patients may
select eligible individuals whom they perceive may do worse and randomize them
when they believe a placebo assignment is likely. It is established that clinical trial
populations differ from the larger clinical population. The eligibility criteria often
create a highly selected population that can limit the external validity of the trial.
Recommendations to improve the reporting of the eligibility criteria are encouraged
in order to increase awareness among clinicians reading the findings to more
appropriately assess who the results apply to.
Run-in periods, with placebo or active treatment, and enrichment strategies are
examples of exclusion mechanisms in clinical trials. These occur prior to randomization
and are used to select or exclude individuals from a trial. A run-in period can
help reduce the number of noncompliant individuals who are randomized or exclude
those who experience adverse events (Rothwell 2005). Enrichment strategies
focus on recruiting individuals who are likely to respond well in a trial such as
nonresponders to a previous treatment. These exclusion criteria have the potential to
reduce the external validity of the study.
There is a need for inclusion/exclusion criteria in clinical trials, but their use may
result in potential bias or lack of generalizability of the study results. Enrichment
strategies are useful in limiting individuals who may discontinue participation in a
trial or not respond well to a treatment. While these strategies may result in more
evaluable subjects (and consequently more power), the external validity of the trial may be
compromised.
Key Facts
• Awareness is increasing for the need to identify and address multiplicity issues in
confirmatory clinical trials.
• There are many procedures which control the type I error inflation resulting from
multiplicity issues in clinical trials. The choice of procedure should be based on
clinical and statistical information and determined during the design phase of a
clinical trial.
• Eligibility criteria and exclusion strategies may be necessary to implement in a
clinical trial; however, careful review of the potential biases as a result should be
conducted.
Cross-References
References
Dmitrienko A, D’Agostino R (2013) Traditional multiplicity adjustment methods in clinical trials.
Stat Med 32(29):5172–5218. https://doi.org/10.1002/sim.5990
Dmitrienko A, D’Agostino RB (2018) Multiplicity considerations in clinical trials. N Engl J Med
378(22):2115–2122. https://doi.org/10.1056/NEJMra1709701
Dmitrienko A, D’Agostino RB Sr, Huque MF (2013) Key multiplicity issues in clinical drug
development. Stat Med 32(7):1079–1111. https://doi.org/10.1002/sim.5642
Dunn OJ (1961) Multiple comparisons among means. J Am Stat Assoc 56(293):52–64. https://doi.
org/10.1080/01621459.1961.10482090
Dunnett CW (1955) A multiple comparison procedure for comparing several treatments with a
control. J Am Stat Assoc 50(272):1096–1121. https://doi.org/10.1080/01621459.
1955.10501294
EMA (European Medicines Agency) (2017) Guideline on multiplicity issues in clinical trials.
Retrieved from www.ema.europa.eu/contact
FDA (U.S. Food and Drug Administration) (2017) Multiple endpoints in clinical trials: guidance
for industry. Retrieved from http://www.fda.gov/Drugs/GuidanceComplianceRegulatoryIn
formation/Guidances/default.htm
Gamble C, Krishan A, Stocken D, Lewis S, Juszczak E, Doré C, . . . Loder E (2017) Guidelines for
the content of statistical analysis plans in clinical trials. JAMA 318(23): 2337. https://doi.org/
10.1001/jama.2017.18556
Haussig S, Mangner N, Dwyer MG, Lehmkuhl L, Lücke C, Woitek F, . . . Linke A (2016). Effect
of a cerebral protection device on brain lesions following transcatheter aortic valve implantation
in patients with severe aortic stenosis. JAMA 316(6): 592. https://doi.org/10.1001/jama.
2016.10302
Hochberg Y (1988) A sharper Bonferroni procedure for multiple tests of significance. Biometrika
75(4):800–802. https://doi.org/10.1093/biomet/75.4.800
Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6(2):65–70.
Retrieved from https://www.jstor.org/stable/4615733
Hommel G (1988) A stagewise rejective multiple test procedure based on a modified Bonferroni
test. Biometrika 75(2):383–386. https://doi.org/10.1093/biomet/75.2.383
Mehrotra DV, Heyse JF (2004) Use of the false discovery rate for evaluating clinical safety data.
Stat Methods Med Res 13(3):227–238. https://doi.org/10.1191/0962280204sm363ra
Meltzer HY, Cucchiaro J, Silva R, Ogasa M, Phillips D, Xu J, . . . Loebel A (2011) Lurasidone in the
treatment of schizophrenia: a randomized, double-blind, placebo- and olanzapine-controlled
study. Am J Psychiatry 168(9): 957–967. https://doi.org/10.1176/appi.ajp.2011.10060907
Proschan MA, Waclawiw MA (2000) Practical guidelines for multiplicity adjustment in clinical
trials. Control Clin Trials 21(6):527–539. https://doi.org/10.1016/S0197-2456(00)00106-9
Rothwell PM (2005) External validity of randomised controlled trials: “to whom do the results of
this trial apply?”. Lancet 365(9453):82–93. https://doi.org/10.1016/S0140-6736(04)17670-8
Šidák Z (1967) Rectangular confidence regions for the means of multivariate normal distributions. J
Am Stat Assoc 62(318):626–633. https://doi.org/10.1080/01621459.1967.10482935
Simes RJ (1986) An improved Bonferroni procedure for multiple tests of significance. Biometrika
73(3):751–754. https://doi.org/10.1093/biomet/73.3.751
Wiens BL (2003) A fixed sequence Bonferroni procedure for testing multiple endpoints. Pharm Stat
2(3):211–215. https://doi.org/10.1002/pst.064
Principles of Clinical Trials: Bias and
Precision Control 40
Randomization, Stratification, and Minimization
Fan-fan Yu
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 740
Assignment Without Chance: A Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 741
Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 742
Methods of Randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 743
Simple Randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 744
Restricted Randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745
Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754
Synonyms: Covariate-Adaptive Randomization, Dynamic Randomization, Strict
Minimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 754
Other Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
Practicalities and Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
Unequal Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
Checks on the Actual Randomization Schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 758
Assessing Balance of Prognostic Factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 759
Accounting for the Randomization in Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 760
Conclusion and Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 763
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 764
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 764
Abstract
The fundamental difference distinguishing observational studies from clinical
trials is randomization. This chapter provides a practical guide to concepts of
randomization that are widely used in clinical trials. It starts by describing bias
and potential confounding arising from allocating people to treatment groups in
a predictable way. It then presents the concept of randomization, starting from a
simple coin flip, and sequentially introduces methods with additional restrictions
F.-f. Yu (*)
Statistics Collaborative, Inc., Washington, DC, USA
e-mail: [email protected]
to account for better balance of the groups with respect to known (measured) and
unknown (unmeasured) variables. These include descriptions and examples of
complete randomization and permuted block designs. The text briefly describes
biased coin designs that extend this family of designs. Stratification is introduced
as a way to provide treatment balance on specific covariates and covariate
combinations, and an adaptive counterpart of biased coin designs, minimization,
is described. The chapter concludes with some practical considerations when
creating and implementing randomization schedules.
By the chapter’s end, statisticians or clinicians designing a trial should be able to
distinguish generally which assignment methods may fit the needs of their trial and
whether or not stratifying by prognostic variables may be appropriate. The statistical
properties of the methods are left to the individual references at the end.
Keywords
Selection bias · Assignment bias · Randomization · Allocation concealment ·
Random assignment · Permuted block · Biased coin · Stratification ·
Minimization · Covariate-adaptive randomization
Introduction
In a “classic” parallel-group clinical trial, people are assigned one of two different
therapies, often to compare a new treatment intervention to placebo or standard of
care. For example, in the case of beta-carotene, the Physicians’ Health Study randomized
two groups of men to a 12-year supplementation regimen of either beta-carotene
or beta-carotene placebo (Hennekens et al. 1996). The outcome for such a trial is
compared between the groups, with the goal of obtaining an estimate of treatment
effect that is free of bias and confounding (which will be addressed later in this
chapter).
In an observational study, the adjustment of the exposure effect between groups of
people often occurs in the analysis through a stratified analysis or by including
potential confounders in a regression model. Although a clinical trial analysis can
apply the same adjustment methods, designers of a clinical trial can control bias and
confounding at the start of the trial. One way to do so is through randomization, the
process of assigning individuals to treatment groups using principles of chance for
assignment. Because the presence of bias can greatly affect the interpretation and
generalizability of a clinical trial, many aspects of trial design, including randomization,
exist to ensure its minimization.
Randomization is the fundamental difference distinguishing observational studies
from clinical trials. In controlling for biases and confounding, randomization forms
the basis for valid inference. Careful thought should be given to its elements,
described in this chapter (block size, strata, assignment ratio, and others) and,
naturally, discussed between statisticians and a trial’s clinical leadership.
Assignment Without Chance: A Motivating Example
Suppose 100 people are to be assigned to one of two treatments, A or B, without
randomization. Two simple approaches are:
1. Assign the first 50 people who consent to the trial to treatment A and the next 50
to treatment B.
2. Assign alternating treatments: odd-numbered participants receive A and even-
numbered participants receive B.
These approaches are simple and systematic but with drawbacks. In the first
approach, there is a high likelihood that participants within groups are more alike
than those in opposing groups. This could occur if Dr. X had all the earlier
appointments and Dr. Y the later ones.
In the second approach, the pattern is predictable. From observing the previous
set of participants, the caring Dr. Compassion has figured out that A is the novel,
active treatment. A patient he has known for many years has been very sick, and Dr.
Compassion believes that this patient may benefit from the experimental therapy in
the trial. The patient would be tenth in line and therefore slated to receive B, the
control; Dr. Compassion could delay this patient's enrollment so that the patient
receives A instead.
Bias
These situations show two examples of bias that could easily occur during assign-
ment of treatment groups. In both cases, the treatment schedule is predictable. If the
trial is unmasked, or if the novel treatment is obvious despite masking, then, like Dr.
Compassion, investigators may manipulate the timing of participant enrollment so
that certain participants receive certain treatments. Both cases are prone to selection
bias. This bias occurs when investigators have knowledge of the treatment assign-
ment and the selection of a participant for a trial is based on that knowledge. Such
selection could occur in an unmasked trial or if the randomization scheme’s assign-
ment pattern is predictable.
This issue of predictability raises the importance of allocation concealment.
The risk of investigator-influenced assignments and selection bias can be minimized
if the investigators do not know what the next assignment will be. Note that this is
different from blinding or masking, which seeks to conceal the treatment altogether.
Two easy ways to conceal assignments before they are handed out are (1) avoiding
easy assignment patterns and (2) avoiding publicly available lists, such as the
notorious example of one tacked up on the nurses’ station bulletin board. Allocation
concealment is possible in an unmasked trial, as long as investigators are unaware of
the assignment before a participant receives the intervention.
In the first case, the participants who enroll early may share certain baseline
characteristics that differ from those of participants who enroll later. These baseline
characteristics are often prognostic factors for the disease under study. For trials enrolling
over long periods of time, demographic shifts do occur. The characteristics of
participants, which sometimes reflect changed and improved standards of care,
may differ temporally depending on when they enter the trial. Byar et al. (1976)
described the Veterans’ Administration Cooperative Urological Research Group trial
in participants with prostate cancer. Earlier recruited participants had shorter survival
than those who entered later. A similar contrast arises with prevalent versus incident
cases of disease. Those available at first might have had the disease for a long time;
incident cases that arise during the trial may be more rapidly (or more slowly)
progressive. When the assignment results in prognostic factors that are unequally
distributed across the treatment groups, then the effect of the treatment on the final
outcome may be confounded with the effect of the factor. This is an example of
assignment bias.
Mitigating bias results in more accurate estimates of treatment differences. To
see this mathematically, consider a hypothetical trial in diabetic children as
presented in Matthews (2000). Hemoglobin A1c is a measure of average blood
glucose levels over the past 2–3 months. HbA1c levels tend to be higher in
adolescents (9–10%) than in young children (6–7%). Consider a trial comparing
active treatment (A) to placebo (B) showing no treatment effect. Matthews nicely
shows how assignment bias with respect to age grouping affects the treatment
difference. Assuming there is no treatment difference, the expected value of
HbA1c is μ1 for children and μ2 for adolescents. The mean HbA1c in group A
may be expressed as the sum of all n_A children's observations plus the sum of all
(N − n_A) adolescent observations:

$$\bar{H}_A = \frac{\sum_{i=1}^{n_A} X_i + \sum_{i=n_A+1}^{N} X_i}{N}$$

The mean for group B is similarly calculated. The expected HbA1c in each group is

$$E\big[\bar{H}_A\big] = \frac{n_A\,\mu_1 + (N - n_A)\,\mu_2}{N}, \qquad E\big[\bar{H}_B\big] = \frac{n_B\,\mu_1 + (N - n_B)\,\mu_2}{N}$$

so that

$$E\big[\bar{H}_A\big] - E\big[\bar{H}_B\big] = \frac{(n_A - n_B)}{N}\,(\mu_1 - \mu_2)$$
Recall that there is no actual treatment effect; thus an unbiased estimate of the
treatment difference should equal 0.
Since children and adolescents have different HbA1c levels, μ1 < μ2. If
the number of children in groups A and B is balanced (nA = nB), then the
expected difference above is equal to 0. Balancing the prognostic factor, age, provides
an unbiased estimate of the treatment effect. If the numbers of children in groups
A and B are not balanced (nA ≠ nB), the expected difference is non-zero. The
estimated treatment effect is biased, showing a treatment difference when there
actually is none.
Randomization may help to avoid this type of bias, making sure the two groups
are similar. In that way, the observable difference between the two groups is the
result of the treatment.
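Matthews' algebra can be checked by simulation; the R sketch below uses the HbA1c means quoted above, no true treatment effect, and a deliberately unbalanced allocation of children across the groups (the sample sizes and standard deviation are invented for illustration).

set.seed(1)
N <- 1000                  # participants per group
mu1 <- 6.5; mu2 <- 9.5     # mean HbA1c: young children vs. adolescents
nA <- 700; nB <- 300       # children in groups A and B (unbalanced)

# HbA1c for a group with a given number of children; no treatment effect
hba1c <- function(n_child) {
  c(rnorm(n_child, mu1, 0.5), rnorm(N - n_child, mu2, 0.5))
}

# Expected bias: ((nA - nB) / N) * (mu1 - mu2) = 0.4 * (-3) = -1.2
mean(hba1c(nA)) - mean(hba1c(nB))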
Methods of Randomization
Simple Randomization
Simple randomization assigns each participant to a treatment by an independent
coin flip. To construct a schedule:
(a) First, generate a sequence of numbers from 1 to n, where n is the total sample
size.
(b) Then, perform an independent coin flip (probability 0.5 for each treatment) for
each number in the sequence. For example, in the R software, the command
rbinom(n, 1, 0.5) generates a sequence of 0s and 1s; assign treatment A when
the value is 1 and treatment B when it is 0.
Note that this process introduces no selection bias, because each successive
participant’s assignment is completely random and independent of each other.
A graphical depiction of simple randomization appears in Fig. 1, which uses a
game-board spinner to depict the coin probabilities.
Fun fact. A computer-generated coin flip is actually pseudorandom: it is the
result of an algorithm-based number generator, which makes the result reproducible
given an initial number or "seed." A series of ideal coin flips is randomness in its
purest form, but it is not reproducible.
Because a coin flip is binomial, the large sample theory of binomial distribu-
tions applies. An assignment of 50% in each group becomes more likely as the
number of coin flips, or trials, increases. With smaller sample sizes, the assign-
ment of people to one group or another is likely to be unequal for some period of
time.
Clinical trials aiming for a 1:1 assignment seek to attain equal experience with
both treatments. As discussed above, using simple randomization may not guarantee
that assignment when the sample size is small. In reality, however, the assignment
from simple randomization will not be far from 1:1; the larger the sample size, the
greater the likelihood of balance. Lachin (1988b) states one need not worry about the
imbalance for trials of more than 200 people.
Fig. 1 Simple randomization depicted with a game-board spinner: a sequence of numbers 1 to n is generated, and a computer coin flip in R (rbinom) determines each assignment, with values below 0.5 assigned treatment A and values of 0.5 or above assigned treatment B
Restricted Randomization
Random Assignment
A step beyond simple randomization’s fair coin flip is random assignment.
An analogy is a typical American elementary school game of musical chairs,
modernized so that no one is eliminated and everyone wins: for an equal number
of boys and girls (1:1 assignment of treatments A and B) and the same number of
chairs, music plays while the children dance freely inside the circle of chairs
(randomization placement); after the music stops, the children scramble to find a
new chair, forming a new seating arrangement (random assignment of treatments).
More formally, for a 1:1 assignment, this procedure pre-specifies the exact sample
size in advance and then restricts the randomization to half the participants receiving
A and the other half receiving B. Then, true to the method’s name, it randomly
allocates each participant’s placement in the list sequence. A list constructed pro-
grammatically could do the following for 100 planned assignments:
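For example, a minimal R sketch of such a program (the seed is an illustrative choice):

# Random assignment: fix exactly 50 As and 50 Bs, then randomly permute the order.
set.seed(465)
planned <- rep(c("A", "B"), each = 50)   # restrict the randomization to a 50/50 split
assignments <- sample(planned)           # randomly allocate each placement in the list
table(assignments)                       # the 1:1 split is guaranteed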
The schema below (Table 1) illustrates the process on a smaller scale for 10
assignments, using a seed number of 465 with the SAS RANUNI function for the
reordering.
In practice, pre-generated randomization lists provide more assignments than the
planned sample size in order to account for higher-than-expected enrollment, poten-
tial errors, and other just-in-case scenarios.
Fun fact. Random assignment is the simplest form of a permuted block design –
it’s a single block of size n.
Permuted Block Designs

A permuted block design builds the randomization list from a series of small blocks, each containing a balanced mix of the treatment assignments in random order. With the last block at the base and the first block on top, the blocks are then "stacked" to produce a tower, which comprises the randomization list.
In the simplest case, 1:1 assignments use even block sizes, while 2:1 randomiza-
tions use multiples of 3.
How does one build the tower of blocks? First, figure out the different sequences
for particular block sizes. For instance, a block size of 2 has only two sequence
options, AB and BA, while a block size of 4 has six:

Sequence number   1      2      3      4      5      6
Sequence          AABB   BBAA   ABAB   BABA   ABBA   BAAB
In many studies, randomization is stratified by trial site. If many sites are expected
to enroll few subjects, or if a trial is small, a smaller block size may be appropriate.
This helps to ensure better treatment balance within each block and prevents the
majority of participants from receiving a single treatment at one site. Larger studies
with many randomized at each site are able to accommodate larger block sizes.
Example. A trial of 100 people has 10 sites but expects enrollment to occur at the
2 main sites located in major metropolitan areas. Trial coordinators at the smaller
sites expect few enrollees. The randomization, which stratifies by trial site, uses a
block size of 8. The first block in the randomization list for one of the smaller sites
has assignment sequence AAAABBBB. Only four participants enroll at this site; all
four therefore receive treatment A. The analysis of the outcome cannot disentangle
the effect of this site from the effect of treatment. Because the effect of treatment is
potentially confounded with the effect of site, a smaller block size would be
appropriate here.
As a rule of thumb: block sizes of 4 to 6 are typical for studies with sites that are
expected to enroll only a few participants, 2 is a small block, and 8 is considered
large. Block sizes greater than 8 need careful consideration in relation to the size of
the trial. They are not recommended for small studies. Long runs, such as
AAAABBBB|BBBBAAAA, may occur with a block size of 8, and a block may
not fill completely, as seen in the example above. In a 12-participant trial with this
randomization, the allocation could end up 2:1 instead of the intended 1:1 (e.g., if the
second block begins AAAA). This defeats the purpose of blocking to achieve better balance
between treatment groups. While this example may be extreme, it is still an impor-
tant consideration within blocks and for a stratified randomization (more on this later
in section “Stratified Randomization”).
For trials with more than two treatment groups, the block size should be a
multiple of the number of treatment groups if the allocation is equal across groups. A trial with
three treatment groups and block sizes of 2 and 4 makes less sense than a trial with
block sizes of 3 and 6.
• A trial has several centers but only a few enrollees per center are expected.
Initiation of sites often occurs in groups and sequentially over time. Thus,
enrollment at certain times in the trial – for example, in the first few months –
may occur only at a few sites. To minimize the chance that sites enroll participants
from the same treatment group, consider a block size of 2.
• A small trial has multiple sites with anywhere from four to eight participants
expected per site. Randomization will be stratified by site. Use a block size of 2
first to guarantee treatment balance for the first two randomized at each site, and then mix
in blocks of size 4.
• A block size of 6 may run the risk of having this assignment: AAABBB|
BBBAAA, a run of 6 Bs in a row at a single site. Rather than using a block
size of 6, mix block sizes within a randomization; for example, combine a block
size of 4 with a block size of 2. With mixed block sequences such as AABB|BA, the
maximum run of any one treatment is 3.
An alternative for a larger trial is to mix more than two block sizes – for example,
sizes of 2, 4, and 6.
Mix It Up: Using Random Permuted Blocks with Unequal Block Sizes
A way to address the selection bias that may occur by predicting treatment
assignments at the ends of blocks is to mix up the block sizes and use random
block sizes rather than fixed ones. Rather than choosing among the six possible
blocks of size 4, one could choose among blocks of size 4 or 2. For a list of 100
numbers, one could, for example, draw blocks of size 4 or 2 at random until 100
assignments are produced, so that the block boundaries are no longer predictable.

Biased Coin Designs

Another design that reduces predictability is Efron's biased coin design, in which the
probability of the next assignment depends on the current allocation difference, A−B:

• If A−B < 0 (more Bs), randomize to A with probability p, where p > 0.5.
• If A−B > 0 (more As), randomize to A with probability 1 − p.
• If A−B = 0, randomize to A with probability 0.5.
Note that p is constant even when there is imbalance. The design is summarized
visually in Fig. 2, again using a game-board spinner instead of a coin for illustration
purposes.
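As a minimal sketch, the rule can be written as an R function (p = 0.70 here mirrors the 30:70 spinner of Fig. 2):

# Efron's biased coin: the next assignment depends only on the sign of A - B.
set.seed(465)
efron_next <- function(n_A, n_B, p = 0.70) {
  prob_A <- if (n_A < n_B) p else if (n_A > n_B) 1 - p else 0.5
  sample(c("A", "B"), size = 1, prob = c(prob_A, 1 - prob_A))
}
efron_next(n_A = 3, n_B = 5)   # more Bs, so A is favored with probability 0.70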
Two other designs are Wei’s urn design (1978) and its generalization, Smith’s
generalized biased coin design. Wei’s urn design is similar to Efron’s except that p
fluctuates depending upon the balance between the two groups. Both are urn models
with n balls labeled A and n balls labeled B. For Wei’s urn design, when the k-th
person is randomized, a "ball" is picked from the urn. If the ball is labeled A, then the
participant is assigned treatment A, the ball is returned to the urn, and an additional
ball labeled B is added. The growing excess of balls for the underrepresented treatment
makes the next draw more likely to restore balance.
Fun fact. Complete randomization is the special case in which no opposite-labeled ball is ever added to the urn.
Fig. 2 Efron's biased coin design, using a 30:70 "biased coin" as depicted by a game-board spinner: when the current allocation difference has more Bs (A−B < 0), the next participant is randomized to A with p = 0.70; when it has more As (A−B > 0), with p = 0.30; when allocation is equal (A−B = 0), with p = 0.50
Because the probability of assignment is biased toward the group with fewer
assignments, urn-adaptive designs adapt the probability of choosing the next treatment
on the basis of the assignment ratio thus far. This helps maintain balance while the trial is
ongoing but does not guarantee complete balance at the end of the trial. In terms of bias,
the procedure reduces the predictability of the assignments and thereby reduces the
associated selection bias. Urn designs provide balance along the way in smaller trials
and behave more like complete randomization as the sample size gets large.
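A minimal R sketch of Wei's urn design follows (starting the urn with one ball per arm and adding one opposite-labeled ball per draw are illustrative choices):

# Wei's urn: draw a ball, assign that treatment, return the ball,
# and add one ball of the opposite label to the urn.
set.seed(465)
weis_urn <- function(n_participants) {
  urn <- c("A", "B")                    # initial urn contents
  assignments <- character(n_participants)
  for (k in seq_len(n_participants)) {
    drawn <- sample(urn, size = 1)      # pick a ball at random
    assignments[k] <- drawn
    urn <- c(urn, if (drawn == "A") "B" else "A")
  }
  assignments
}
table(weis_urn(20))                     # allocation stays close to 1:1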
Stratified Randomization
Let’s return to the earlier example of a trial of diabetic children, where HbA1c tends
to be higher in adolescents as compared to young children. An imbalance in the
number of children in group A versus group B could occur using the randomization
methods just described and could lead to a biased treatment effect. An alternative is
to achieve treatment balance within each age grouping, rather than achieving balance
between treatments over all participants. This method, stratified randomization,
achieves balance within pre-chosen strata (young children vs. adolescents) defined
by important prognostic factors (age grouping), whose levels may affect the outcome
(HbA1c). In the simplest case of a two-strata factor, the randomization list is
essentially two lists, one for each stratum.
One major goal of stratification is to minimize the chances of one treatment
occurring primarily within a single factor – for instance, the majority of adolescent
trial participants receiving treatment B – such that the analysis cannot disentangle the
effect of the factor from the effect of treatment. This helps avoid correlation between
predictors (for factors not associated with the outcome) and confounding (for factors
associated with the outcome).
A trial with two two-level factors, such as age and gender, has four strata: male
pediatric, male adult, female pediatric, and female adult. Statisticians will often
picture this as a 2 by 2 table and refer to each stratum as a “cell.” A trial with two
treatment groups will therefore have eight cells. This trial will have four separate
randomization lists, one for each stratum. Within each stratum, randomization may
occur using random assignment or permuted blocks.
Fun fact. For random assignment, stratification may be viewed as blocking, with
each stratum acting as one large block using simple randomization.
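To make this concrete, here is a minimal R sketch that builds a separate permuted-block list for each stratum (the block size of 4 and six blocks per stratum are illustrative choices):

# Stratified randomization: one permuted-block list per stratum.
set.seed(465)
one_block <- function() sample(c("A", "A", "B", "B"))  # a random permuted block of size 4
one_list  <- function(n_blocks) unlist(replicate(n_blocks, one_block(), simplify = FALSE))

strata <- c("male pediatric", "male adult", "female pediatric", "female adult")
lists  <- setNames(lapply(strata, function(s) one_list(n_blocks = 6)), strata)
head(lists[["female adult"]], 8)   # the first eight assignments for one stratum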
A real challenge in designing stratified trials is selecting the most important strata
on which to achieve balance. A true story is a discussion among researchers who
were planning a trial. Each clinician felt very strongly about a prognostic factor
whose levels would affect the outcome. The list grew to include gender, baseline
disease status, age category, a disease-specific clinical characteristic, and a bio-
marker. When the trial statistician pointed out that there were now at least 32 strata,
and therefore 64 cells, for a 100-person trial, the researchers had to step back to re-
prioritize as a group.
As seen in Table 3, the number of strata quickly multiplies as the number of
prognostic variables increases. A risk of including so many factors is that numerous
strata may result in certain “cells” having few or no people. Having empty cells, or
many cells with a single person, not only goes against achieving balance but also
presents problems when analyzing the data. For the analysis, many trial teams
choose to pool strata that have only one or two people randomized.

Table 3 Number of prognostic factors and strata for a trial with two treatment groups

Two-level factors   Example                                                             Number of strata   Number of cells
1                   Gender                                                              2                  4
2                   Gender, age (pediatric vs. adult)                                   4                  8
3                   Gender, age, baseline disease status (WHO class I/II vs. III/IV)    8                  16
4                   Gender, age, baseline disease status, genetic biomarker             16                 32
N                                                                                       2^N                2^(N+1)
Another operational consideration for limiting the number of strata is the
possibility of mis-stratification. Investigators are human; they may enter the
wrong stratum criterion when randomizing a participant. Deciding how to handle
mis-stratifications then becomes a challenge in the conduct and interpretation of
an analysis. For example, if a woman is stratified as a man, should she be
analyzed as a man, to reflect the actual randomization, or as a woman, because
that is what she is?
There is some debate as to whether, and when, studies should stratify. Lachin
et al. (1988) recommend stratification for trials with fewer than 100 participants. For
larger trials, the efficiency advantages are negligible; they recommend stratifying
by center but not by other prognostic factors. Others argue that investigators may
want to stratify for other scientific reasons – such as characteristics of a disease – that
may affect trial outcome. An example is the breast cancer trial in Table 5, which
stratified by first- versus second-line therapy. With an outcome of progression-free
survival, it was important to verify that the randomization achieved balance
between those who were farther along in their treatment (second-line therapy) and
those who were not.
A special consideration is stratification by clinical site, which was addressed
earlier in the discussion of block sizes in section “Permuted Block Designs.”
Because of the sequential nature of site initiation in a trial and the similarities in
patient care within a site, many studies will stratify randomization by site. This helps
to avoid confounding of the treatment effect by site and ensures balance within site.
Because blocking is usually employed within site, unmasked studies should
probably avoid stratifying by site. The prediction of treatment patterns at the ends
of blocks is much easier in this setting.
Minimization
When a trial has many prognostic factors needing balancing, minimization may
provide a good assignment alternative compared to more traditional randomization
methods. Recall the trial mentioned earlier with the five different two-level prog-
nostic variables and the resulting problematic 64 cells for 100 people. That trial may
have been a candidate for minimization if the clinicians decided that each of the five
variables was equally important for stratification.
Minimization refers to minimizing the treatment imbalance over several
covariates by the use of a dynamic, primarily nonrandom method. As an alternative
to stratified block randomization, minimization allows balancing on many prognos-
tic variables in real time. The method uses information on prognostic factors – the
stratification variables used with the randomization methods above – to determine
where the imbalance is. Then, generally, the method chooses the arm that best
minimizes the imbalance and assigns the next participant to that arm. Similar to
biased coin designs, the next assignment is partially determined by the treatment
group with fewer people. Here, the assignment is done by defining a weighted metric
that combines the treatment differences across all the strata for a covariate. This
weighted metric then determines the probability of assignment to that
arm. The goal of minimization, like urn-adaptive randomization, is to ensure a small
absolute difference between the numbers randomized in each treatment group. The
difference between urn-adaptive methods and minimization is that minimization
uses stratum-specific differences to minimize the differences between treatment
groups.
First introduced by Taves (1974), a deterministic version of the method with the
four strata in Table 4 would randomize the first set of participants using complete
randomization. Within each stratum, calculate the differences for A−B as
seen in the last column of the table; positive values indicate more participants in A
and vice versa. To determine the assignment for the 16th participant, add the
differences for that participant's strata; the deterministic rule then assigns the arm
that reduces the resulting imbalance. Pocock and Simon (1975) generalized the
approach by making the rule probabilistic: the arm currently in "deficit" is assigned
with probability p rather than with certainty. A p of 0.8 gives a relatively high
probability of receiving the treatment currently in "deficit." Pocock and Simon
generally prefer a p of 0.75.
This method balances marginally over all covariates rather than within stratum as
for stratification using permuted block designs. While the method achieves better
balance in real time on the selected factors, unlike conventional randomization
methods, it does not guarantee balance on unspecified factors. It works best in
small trials (e.g., trials of <100 people). Pocock and Simon have generalized this
method to three or more groups, which is not covered here (Fig. 3).
Fig. 3 Minimization depicted with game-board spinners: a negative weighted sum (more Bs) favors assignment to A; a positive sum (more As) favors B; a sum of 0 gives equal allocation
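A minimal R sketch of the deterministic (Taves-style) rule follows; the factor names, levels, and equal weights are illustrative assumptions:

# Taves-style minimization for two arms over several prognostic factors.
set.seed(465)
assign_minimization <- function(new_levels, history) {
  # history: data frame with a column 'arm' plus one column per factor
  # new_levels: the new participant's factor levels
  imbalance <- 0
  for (f in names(new_levels)) {
    same <- history[history[[f]] == new_levels[[f]], ]
    imbalance <- imbalance + sum(same$arm == "A") - sum(same$arm == "B")
  }
  if (imbalance < 0) "A"                  # more Bs among matching strata: assign A
  else if (imbalance > 0) "B"             # more As: assign B
  else sample(c("A", "B"), size = 1)      # tie: fair coin
}
history <- data.frame(arm       = c("A", "B", "A"),
                      gender    = c("F", "F", "M"),
                      age_group = c("pediatric", "adult", "pediatric"))
assign_minimization(list(gender = "F", age_group = "pediatric"), history)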
Other Methods
Additional approaches to randomization include urn models in which the distribution
of treatment "balls" within the urns is based on the responses observed so far. These
response-adaptive randomizations include randomized play-the-winner (Wei and
Durham 1978) and drop-the-loser (Ivanova 2003), among others. The basic premise
is that if one treatment is showing a better response than the other, then the assignment
probabilities can shift to favor the better treatment. This type of randomization is more
suitable for trials where responses are observed quickly, and it addresses ethical concerns
about exposing participants to treatments that may not be effective. This
chapter does not address these methods further.
Unequal Allocation
Although this chapter focused on 1:1 assignments, some trials may choose to use
other assignment ratios. A common alternative is 2:1, which in some cases has less
power than its 1:1 counterpart. Figure 4 displays the total sample size needed
for a continuous, normally distributed outcome, for which the power is represented by

Φ[ Δ / (σ √(1/n1 + 1/n2)) − 1.96 ],

where Δ/σ = 0.65 and power is 80%, 90%, and 95%.
For an often small loss in power (or increase in sample size; Fig. 4), investigators in
some trials will prefer unequal allocation for nonstatistical reasons. Some trials may
face high costs in obtaining the control treatment from sponsors. In rare diseases,
unequal allocation gives more people access to the novel treatment, and a single trial
will therefore accumulate more experience with the novel treatment.
An example of unequal allocation comes from a trial of a novel gene therapy,
delivered by subretinal injection to the eye, which showed promise in restoring vision to
blind participants who had a particular mutation. The sponsor and investigators did
not want to burden control participants with a sham injection procedure, especially
when many participants would be children. As a result, masking the treatment groups
was not possible. With the potential to regain vision from blindness or to stop the path
toward blindness, everyone recruited was keen on receiving the novel treatment in a
rare disease population where finding patients was already difficult. The final design
used a 2:1 randomization with an extension period. Despite a small loss in power
compared to a 1:1 randomization in an already small trial, this design limited the
number of control participants, and the extension period allowed the opportunity for
controls to receive treatment after a year on the main trial (Russell et al. 2017).

Fig. 4 Total sample size as a function of the assignment ratio for 80%, 85%, and 90% power
When reviewing a generated randomization list, several checks are worthwhile:

• Check for patterns to ensure that the distribution of possible block permutations is
not unusual. One example is to ensure that the A-first blocks do not all occur early in
the list with the B-first blocks at the end. Another is to avoid long strings of a single
treatment: with a block size of 6, a run of AAABBB|BBBAAA|AAABBB
gives two long runs of As and Bs, respectively. One might consider a
different seed to achieve runs that alternate more between the treatment groups.
• Check whether the distribution of the positions of treatment assignments within blocks is
well balanced, for example, whether, for blocks of size 6, As occur more often in positions 5 and 6.
• Check that there are no patterns in the transitions between treatment assignments,
for example, more A-to-B transitions than B-to-A transitions across all blocks.
• If there is an abbreviated treatment group identifier (A, B or 1, 2), then the list
should also include a decoded variable ("active," "placebo").
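A few of these checks are easy to script; for example, in R, assuming assignments holds the generated list as a character vector of A/B labels:

# Quick diagnostics on a randomization list.
assignments <- c("A","A","B","B", "B","A","B","A", "A","B","A","B")  # illustrative list
runs <- rle(assignments)
max(runs$lengths)                                  # longest run of a single treatment
transitions <- paste0(head(assignments, -1), tail(assignments, -1))
table(transitions)                                 # counts of AA, AB, BA, BB transitions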
The population model assumes that people differ in terms of their (baseline) characteristics and are
sampled from multiple populations. The underlying distribution of participant
responses is a function of the participant’s characteristics.
The invoked population model. Randomized groups may be similar with respect
to baseline variables, but each group may not necessarily be a perfect sampling
distribution from the larger population. The reality is that recruitment of a trial's
population is far from a random sampling procedure from an infinitely large population.
In fact, much of it is nonrandom, targeting specific hospitals and communities and
selecting participants who satisfy certain eligibility criteria. The only random ele-
ment comes from the act of randomization itself (Rosenberger et al. 2018). The data
are still analyzed as if they were a random sample representative of the infinitely
larger population. While the randomized participants may be somewhat representa-
tive of the larger population, this belief still requires a leap of faith and is appropri-
ately called the invoked population model (Lachin 1988a). It invokes the assumption
that the analysis and inferences are from samples of the larger, homogeneous
population where the underlying distributions are the same.
The randomization model. Another approach is to say that the underlying distri-
butions of the treatment groups are not expected to be similar or are unknown.
In fact, there is no way ever to know the underlying distributions or even to verify the
assumptions made about them under the invoked population model. Although this sounds
philosophical, this situation is tangible in the real-world setting of a small trial. Here,
the sample size may be too small to assume normality, which is the basis of many
tests. An alternative is to make no assumptions about the underlying distributions,
and therefore tests of treatment differences do not rely on those assumptions. Instead,
the test solely compares whether the outcome is related to the treatment.
In a randomization test, the basic idea is to assume that treatment label has
nothing to do with a person’s outcome. The null hypothesis is that the participant’s
responses are unaffected by the treatment. The observed difference, then, is only the
result of how the participants were allocated. The test is actually multiple rounds of
reshuffling and is sometimes referred to as "re-randomization." To perform the test,
reassign the treatment labels many times according to the original randomization
scheme, recompute the treatment difference for each reassignment, and compare the
observed difference with the distribution of reassigned differences; the p-value is the
proportion of reassignments producing a difference at least as extreme as the one
observed. If the two groups are really not different, the observed difference will look
typical among the reassignments.
The benefit of the randomization test is that it requires no distributional assump-
tions; the disadvantage is that the computationally intensive process can be time-
consuming and complex programmatically.
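As an illustration, here is a minimal R sketch of a randomization test under random assignment (the data are simulated purely for illustration):

# Randomization (re-randomization) test for a two-group comparison of means.
set.seed(465)
outcome <- c(rnorm(20, mean = 7), rnorm(20, mean = 8))   # simulated HbA1c-like values
arm     <- rep(c("A", "B"), each = 20)
observed <- mean(outcome[arm == "A"]) - mean(outcome[arm == "B"])
reshuffled <- replicate(10000, {
  relabeled <- sample(arm)                               # reshuffle the treatment labels
  mean(outcome[relabeled == "A"]) - mean(outcome[relabeled == "B"])
})
mean(abs(reshuffled) >= abs(observed))                   # two-sided p-value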
Fun fact. A randomization test is often referred to as a permutation test, but they
are not technically the same thing. A permutation test assumes that the data are
exchangeable and that all outcomes in the permutation have the same likelihood.
Rosenberger et al. show that this may be the case for random assignment, but not
under other randomization designs.
Randomization Method
Earlier parts of the chapter reference adjusting for prognostic factors when the
treatment groups are not balanced. In most cases, the analysis needs to account for
the randomization in order to control the type I error properly.
In a stratified randomization, many statisticians advocate including the stratifica-
tion factors as covariates in the analysis. One reason for this is that in stratifying,
participants within a stratum are more alike; the stratification induces correlation
among those participants (Kahan and Morris 2012). This affects the variances of the
treatment difference. If the analysis ignores the stratification, and therefore the
correlation, then the standard error of the treatment difference is larger than the
truth. This in turn makes the power lower, and the p-values larger, than they would
be if the stratification were accounted for.
If one is performing a permutation or randomization test for the analysis, then re-
randomization should use the same method for the original randomization.
For example, if assignment occurred using a permuted block design, then re-randomization
should use the same design in order for the tests to have the proper type I error
rate.
For a trial where assignment is determined using minimization, the method of
analysis is less clearly defined. Taves advocated including the factors used for minimi-
zation as covariates in the analysis. Others have argued for a randomization test to
control alpha, although conducting one may be complicated. Re-randomization for trials
using minimization may have issues if there is unequal allocation (Proschan et al. 2011) and
may also be unnecessary, producing similar results as a t-test or test of proportions
(Buyse 2000). Further review of this discussion appears in Scott et al. (2002).
regimen is the main comparison. A sensible extra analysis would analyze the
data using the actual treatment received.
The choice of randomization method depends, of course, on the size and needs of the
trial, with input from the trial sponsor and investigators. The table below (Table 6)
summarizes considerations for the different types of assignment methods discussed
in this chapter.
References
Buyse M (2000) Centralized treatment allocation in comparative clinical trials. Applied Clinical
Trials 9:32–37
Byar D, Simon R, Friedewald W, Schlesselman J, DeMets D, Ellenberg J, Gail M, Ware J
(1976) Randomized clinical trials – perspectives on some recent ideas. N Engl J Med
295:74–80
Hennekens C, Buring J, Manson J, Stampfer M, Rosner B, Cook NR, Belanger C, LaMotte F,
Gaziano J, Ridker P, Willett W, Peto R (1996) Lack of effect of long-term supplementation with
beta carotene on the incidence of malignant neoplasms and cardiovascular disease. N Engl J
Med 334:1145–1149
Ivanova A (2003) A play-the-winner type urn model with reduced variability. Metrika 58:1–13
Kahan B, Morris T (2012) Improper analysis of trials randomized using stratified blocks or
minimisation. Stat Med 31:328–340
Lachin J (1988a) Statistical properties of randomization in clinical trials. Control Clin Trials
9:289–311
Lachin J (1988b) Properties of simple randomization in clinical trials. Control Clin Trials
9:312–326
Lachin JM, Matts JP, Wei LJ (1988) Randomization in clinical trials: Conclusions and recommen-
dations. Control Clin Trials 9(4):365–374
Leyland-Jones B, Bondarenko I, Nemsadze G, Smirnov V, Litvin I, Kokhreidze I, Abshilava L,
Janjalia M, Li R, Lakshmaiah KC, Samkharadze B, Tarasova O, Mohapatra RK, Sparyk Y,
Polenkov S, Vladimirov V, Xiu L, Zhu E, Kimelblatt B, Deprince K, Safonov I, Bowers P,
Vercammen E (2016) A randomized, open-label, multicenter, phase III study of epoetin alfa
versus best standard of care in anemic patients with metastatic breast cancer receiving standard
chemotherapy. J Clin Oncol 34:1197–1207
Matthews J (2000) An introduction to randomized controlled clinical trials. Oxford University
Press, Inc., New York
Matts J, Lachin J (1988) Properties of permuted-block randomization in clinical trials. Control Clin
Trials 9:345–364
Pocock S, Simon R (1975) Sequential treatment assignment with balancing for prognostic factors in
the controlled clinical trial. Biometrics 31:103–115
Proschan M, Brittain E, Kammerman L (2011) Minimize the use of minimization with
unequal allocation. Biometrics 67(3):1135–1141. https://doi.org/10.1111/j.1541-0420.2010.
01545.x
Rosenberger W, Uschner D, Wang Y (2018) Randomization: the forgotten component of the
randomized clinical trial. Stat Med 38(1):1–12
Russell S, Bennett J, Wellman J, Chung D, Yu Z, Tillman A, Wittes J, Pappas J, Elci O, McCague S,
Cross D, Marshall K, Walshire J, Kehoe T, Reichert H, Davis M, Raffini L, Lindsey G, Hudson
F, Dingfield L, Zhu X, Haller J, Sohn E, Mahajin V, Pfeifer W, Weckmann M, Johnson C,
Gewaily D, Drack A, Stone E, Wachtel K, Simonelli F, Leroy B, Wright J, High K, Maguire A
(2017) Efficacy and safety of voretigene neparvovec (AAV2-hRPE65v2) in patients with
RPE65-mediated inherited retinal dystrophy: a randomised, controlled, open-label, phase
3 trial. Lancet 390:849–860
Power and Sample Size

E. Garrett-Mayer

Contents
Introduction
  Type I and Type II Errors
  Illustrations of Power
Trade-Offs in Power Calculations
  Clinically Meaningful Effect Sizes and Sample Size
  Choosing Alpha and Beta (or Power) Levels
Power Calculations for Common Trial Designs
  Comparative Studies
  Single-Treatment Studies
Power Calculations for Non-inferiority Studies
Approaches for Calculation of Power and Sample Size
  Available Software and Websites
  Simulation Studies for Power Calculations
Power Calculations for Fixed Sample Size Studies
Alternatives to Power
  Precision
  Sample Size Calculations in Bayesian Settings
Practical Considerations
  Evaluability of Patients
  Interim Analyses and Early Stopping Rules
Summary and Conclusions
References
E. Garrett-Mayer (*)
American Society of Clinical Oncology, Alexandria, VA, USA
e-mail: [email protected]

Abstract
A critical component of clinical trial design is determining the appropriate sample
size. Because clinical trials are planned in advance and require substantial
resources per patient, the number of patients to be enrolled can be selected to
ensure that enough patients are enrolled to adequately address the research
objectives and that unnecessary resources are not spent by enrolling too many
patients. The most common approach for determining the optimal sample size in
clinical trials is power calculation. Approaches for power calculations depend on
trial characteristics, including the type of outcome measure and the number of
treatment groups. Practical considerations such as trial budget, accrual rates, and
drop-out rates also affect the study team's plan for determining the planned
sample size for a trial. These aspects of sample size determination are also
discussed.
Keywords
Power · Sample size · Type I error · Type II error · Clinically meaningful ·
Effect size
Introduction
A critical component of clinical trial design is determining the appropriate sample size.
Because clinical trials are planned in advance and require substantial resources per
patient, the number of patients to be enrolled can be selected to ensure that enough
patients are enrolled to adequately address the research objectives and that unneces-
sary resources are not spent by enrolling too many patients. The most common
approach for determining the optimal sample size in clinical trials is power calculation.
In clinical trials evaluating a new treatment regimen relative to a standard treatment,
power is the probability of concluding that the new treatment is superior to the
standard treatment if the new treatment really is superior to the standard treatment.
In designing a trial, the research team wants to ensure that the power of the trial is
sufficiently high. If the trial does not have sufficient power, the team risks
incorrectly concluding that a promising treatment has low efficacy.
The concept of power is based on hypothesis testing, a method used in most phase
II and phase III clinical trials. As an example, consider a randomized trial with two
treatment groups, an experimental treatment and a standard-of-care treatment, and
assume that the outcome of interest is a binary indicator of response (i.e., a patient
responds or does not respond to the assigned treatment). When a research team
embarks on a trial, they have a hypothesis about the level of response for the
treatment under study that would be considered “a success” relative to the control
group. If the researchers are treating a condition where the standard treatment leads
to a 10% response rate in patients, then perhaps a 25% response rate would be
considered sufficiently high in the experimental treatment to pursue further study. In
this example, to design the trial, the known information and assumptions regarding
response rates in the standard of care and new treatment are used to set up the
hypothesis test with two hypotheses: the null hypothesis (H0) and the alternative
hypothesis (H1). H0 represents the response rate if the new treatment is no better than
the standard of care; H1 represents the response rate if the new treatment is better
than the standard of care. When developing a power calculation, these are usually
written in the format
H0: p1 = 0.10; p2 = 0.10
H1: p1 = 0.10; p2 = 0.25,
where p1 is the assumed response rate (or response probability) in the standard
treatment and p2 is the assumed response rate in the experimental treatment.
In our hypothesis-testing example, assuming one of the two hypotheses is true, at the
end of the trial the research team will choose either the correct or the incorrect
hypothesis. If the null hypothesis is true, but the data collected lead the research team
to choose the alternative hypothesis, then the team has made a type I error. If the
alternative hypothesis is true, but the data lead the research team to choose the null
hypothesis, then the team has made a type II error. Table 1 shows the possible
outcomes of the research team's decision.
When designing a clinical trial, the research team wants to minimize making
errors and sets the type I and II error rates to relatively low levels. Traditionally, type
I error rates are set to values between 2.5% and 10%; type II errors are usually in the
range of 10–20%. Note that the type I error rate is also called the alpha (α) level of
the hypothesis test, and the type II error rate the beta (β) level of the test. Power is 1
minus beta (1 − β). Because it is desirable to keep the type II error relatively low
(≤20% in most trials), the power is usually at least 80% in well-designed studies.
In our example, there are four elements to include to calculate our optimal sample
size: (1) the response rate under the null hypothesis, (2) the response rate under the
alternative hypothesis, (3) the type I error rate, and (4) the type II error rate. As will
be seen in later sections, when you have other types of outcomes, you may need
additional information to perform the power calculation (e.g., the assumed variance
if the outcome is a continuous variable).
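In R, for example, the base function power.prop.test performs this calculation under a normal approximation (exact results depend on the approximation used, as Table 2 below illustrates):

# Power for the example: p1 = 0.10 (standard), p2 = 0.25 (experimental), alpha = 0.05.
power.prop.test(n = 100, p1 = 0.10, p2 = 0.25, sig.level = 0.05)       # power for 100 per group
power.prop.test(power = 0.90, p1 = 0.10, p2 = 0.25, sig.level = 0.05)  # solve for n instead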
Illustrations of Power
Graphical displays to illustrate power are shown in Fig. 1. In panel A, there are two
bell curves (i.e., distributions) where the x-axis is the difference in proportions from
Fig. 1 Illustration of power and alpha levels for varying sample sizes with response rate as the
outcome in a randomized trial with two treatment groups. Panels a, b, c and d are ordered vertically
from top to bottom
our example. Each curve represents the response rate in the experimental group
minus the response rate in the control group under one of our hypotheses. The black
distribution represents the null hypothesis, where the difference in response rates is
0 (i.e., the response rate in both groups is 0.10 if the null hypothesis is true); the red
distribution represents the alternative hypothesis, where the difference in response
rates is 0.15. As the sample size increases across the panels, the distributions narrow
and overlap less. In the largest trial shown, there
is almost no region of the red curve that is to the left of the rejection threshold, and
the power is 98%. While the research team will be pleased to know that they have a
high chance of finding a significant difference in treatments if the treatments are
different, many would argue that this trial is wasteful because it utilizes too many
resources and could be completed without enrolling so many patients.
Trade-Offs in Power Calculations

From the previous section, there were four quantities that were specified to calculate
the power of the trial: (1) the response rates under H0; (2) the response rates under
H1; (3) the alpha level; and (4) the sample size. (As noted above, with other
outcomes, the assumed variance may also be required.) In theory, one can specify
the power and solve for any of the other four quantities. However, in most trials the
null hypothesis and the alpha level are prespecified. Ideally, the research team would
solve for the sample size based on the other quantities, but due to resource con-
straints, many trials have an upper limit on a feasible sample size, and thus power or
the alternative hypothesis is determined based on the sample size limitations.
Clinically Meaningful Effect Sizes and Sample Size

A well-known example is a large trial that achieved statistical significance for what many considered a clinically trivial, rather than meaningful, improvement in survival (Sobrero and Bruzzi 2009). Studies like this led
to efforts to define clinically meaningful differences in cancer clinical trials, with the
goal of ensuring that trials would be designed with appropriate levels of power and
sample size to ensure that detectable effect sizes would be clinically meaningful
(Ellis et al. 2014).
Choosing Alpha and Beta (or Power) Levels

There are conventions in clinical trials that have been used for many decades,
leading to almost no consideration given to the appropriate selection of alpha and beta
levels. Most commonly, one will see alpha set to 5% and beta set to 20% (i.e., power
set to 80%). These are not the "correct" levels; they are simply the most commonly chosen.
Strident arguments can be made that setting alpha low for a phase III trial is
appropriate: making a type I error when deciding whether or not to approve an
experimental agent is a very serious error. That is, a type I error would lead to
approving an agent when the agent is not better as compared to the control group.
From an approval standpoint, making a type II error is less grievous; not approving
an effective treatment is less worrisome than approving ineffective treatments. Thus,
for trials that are intended to provide direct evidence for approval of the agent,
setting alpha substantially lower than beta may be sensible. However, in earlier
phase trials, setting alpha and beta to similar levels may be a better strategy. In many
early efficacy trials in cancer research, alpha and beta are both set to 10%, suggesting
that each type of error has equally bad implications. In this setting, the research team
is more willing to take an ineffective agent to the next phase of research (higher
alpha), but less willing to discard an effective agent (lower beta).
Power Calculations for Common Trial Designs

Different areas of medical research tend to use different primary outcomes in their trials,
leading to differences in the test statistics used in hypothesis tests and thus in how power
calculations are performed. Most outcomes fall into one of three categories: continuous,
binary, or time-to-event outcomes. The example in the previous section was based
on a comparison of response rates and a binary outcome. In the following sections,
comparative and single-treatment studies are reviewed for each of these outcomes.
Comparative Studies
Binary Outcomes
A randomized trial with a binary outcome example was developed in section “Intro-
duction.” For binary outcomes, there are various options and assumptions that can be
used in power calculations. In Fig. 1, a normal approximation was used, which is
simple to calculate and works well when the response probabilities are not close to 0 or
Table 2 Differences in power using different power calculation approaches for a randomized trial
with a binary indicator of response as the outcome, assuming response probabilities of 0.10 and 0.25
in the control and experimental treatments under the alternative hypotheses, respectively
Power calculation type Sample size per treatment Power (%)
Normal test, approximation 1 40 54
Normal test, approximation 2 40 31
Chi-square test 40 42
Fisher’s exact test 40 33
Normal approximation 1 100 90
Normal approximation 2 100 80
Chi-square test 100 83
Fisher’s exact test 100 76
Normal approximation 1 160 98
Normal approximation 2 160 93
Chi-square test 160 96
Fisher’s exact test 160 94
1 in either group, and the sample size is relatively large. Other normal approximations
are also used which differ in their approach for estimating the denominator of the test
statistic (i.e., the standard error of the difference in proportions). Depending on the
sample size and the assumed response rates, the power estimates from different
approximations may be very similar or quite different. When planning a trial, the
approach used to calculate the power or sample size should be consistent with the
approach used to analyze the data at the end of the trial (Table 2).
Continuous Outcomes
In the previous example with a binary outcome, in addition to knowing the power
and alpha, one only needed to know the expected response rates under the null and
alternative hypotheses. When the outcome of the trial is a continuous variable, and
the goal is to compare the means between two groups, the research team must set null
and alternative hypotheses for the means in the groups, and they must also make an
assumption about the variance of the outcome. For example, assume that a trial is
being planned to evaluate the efficacy of vitamin D supplementation in individuals
with vitamin D deficiency where individuals are randomized to a low dose of
vitamin D (400 IU) in one group and a high dose in another group (2000 IU). The
outcome is 25(OH)D, which is a measure of vitamin D in the blood. The research
team assumes (based on their previous research) that the standard deviation of 25
(OH)D is approximately 14 ng/mL in individuals who do not have deficiency. The
research team plans to compare 25(OH)D levels between the two groups after
6 months of supplementation using a two-sample t-test.
In the previous example, the width of the curves that determined power
(Fig. 1) was determined based on the both assumed response rates in the null
and alternative hypotheses and the sample sizes in each group. When using a
continuous outcome, the means in the null and alternative hypotheses and the
Fig. 2 Illustration of the effect of the standard deviation on power in a trial with a continuous outcome.
Panel a is on the top; panel b is on the bottom
sample size factor into the power calculation, but so does the assumed standard
deviation. Thus, to calculate power, the following are required: alpha, sample
size, difference in means under the null hypothesis (usually 0), the difference in
means under the alternative hypothesis, and the standard deviation in each group.
The researchers expect that the mean 25(OH)D levels will be 55 ng/mL in the
low-dose group and 65 ng/mL in the high-dose group after 6 months of supple-
mentation. Under the null hypothesis, the means would be the same; and under
the alternative hypothesis, the difference in means would be 10 ng/mL:
H0: μ2 − μ1 = 0 ng/mL
H1: μ2 − μ1 = 10 ng/mL
To achieve 90% power with a two-sided alpha level of 5%, and assuming that the
standard deviation is 14 ng/mL in each group, the research team would need to enroll
42 patients in each group. Figure 2a shows the distributions of the difference in
776 E. Garrett-Mayer
means under the null and the alternatives, assuming a sample size of 42 per group,
and a standard deviation (SD) of 14 ng/mL in each group. All else being the same, if
the assumed SD were larger, the power would decrease. Having a larger SD in each
group adds more variance and thus more imprecision in the estimates. Figure 2b
shows the effect of the larger SD on power if the sample size remains 42 per group.
Notice that the curves are wider, the overlap is greater, and the area under the
alternative distribution curve representing power (i.e., the red shaded portion) is
smaller. If the SD is 16 ng/mL in each group instead of 14, the power decreases from
90% to 82%. In the example above, it was assumed that the standard deviation in the
two groups is the same (14 ng/mL). Power calculations can also be performed
assuming different standard deviations in each group.
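Base R's power.t.test function reproduces these calculations (the inputs below mirror the design assumptions in the example):

# Two-sample t-test power: delta = 10 ng/mL, sd = 14 ng/mL, two-sided alpha = 0.05.
power.t.test(delta = 10, sd = 14, sig.level = 0.05, power = 0.90)  # solves for n per group (about 42)
power.t.test(n = 42, delta = 10, sd = 16, sig.level = 0.05)        # power drops when sd = 16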
Time-to-Event Outcomes
In many trials, the outcome of interest is a time interval. For example, in many phase
III randomized clinical trials in cancer research, survival time is the outcome, that is,
the time from enrollment on the trial until death. The challenging aspect of survival
time as an outcome is that not all patients have the event of interest (death) when the
data are analyzed. The patients who are still alive at the time of data analysis have
survival times that are “censored,” meaning that we know that they lived for a certain
amount of time but we do not know their actual survival time (which will occur in the
future). Statisticians have approaches for analyzing time-to-event outcomes, such as
survival time. Randomized trials with time-to-event outcomes whose inferences are
based on the hazard ratio (i.e., the ratio of the event rates in the two groups being compared) only
require a few elements: the hazard ratio under the null hypothesis (usually assumed
to be 1, meaning equal event rates), the hazard ratio under the alternative, and the
type I and type II errors. With these quantities, one can solve for the number of
events required to achieve the desired power. While the simplicity is convenient,
when planning a trial, knowing the number of events needed is not sufficient: clinical
trial protocols require that the number of patients enrolled be stated. Using additional
information, including the expected accrual rate and the minimum amount of time
each patient will be followed for events combined with the required number of
events to achieve the desired power, the number of patients required for enrollment
can be calculated.
The hypothesis for evaluating a time-to-event outcome could be set up as follows,
where the null hypothesis assumes there is no difference in the event rates in the two
groups; the alternative in this example assumes that the event rate in group 2 (λ2) is
half as large as the event rate in group 1 (λ1). If this were a cancer treatment trial with
two treatments with survival time as the outcome, the alternative assumes that the
rate of death occurring in group 2 is half of that in group 1, meaning the treatment in
group 2 doubles the expected survival time:
H0: λ2/λ1 = 1
H1: λ2/λ1 = 0.5
Going back to the characteristics that affect sample size, if the hazard ratio were
0.75 instead of 0.5, the sample size would be required to be larger due to a smaller
difference in rates in the two treatments. And, knowing that the power is driven by
the number of events, the sample size for a trial with 2 events per month would need
to be larger than a trial with 20 events per month to be completed in a similar amount
of time. Sample size for trials based on time-to-event outcomes also depends on the
accrual rate and the planned length of follow-up. For example, based on the expected
accrual rate of ten patients per month, a trial with only 1 year of follow-up (after the
last patient has enrolled) will require more patients than a trial with 2 years of follow-
up because the latter trial will observe more events prior to stopping the trial.
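As a rough illustration of how strongly the hazard ratio drives the required number of events, Schoenfeld's approximation for 1:1 allocation can be computed directly (this standard formula is a sketch, not taken from the chapter):

# Required events via Schoenfeld's approximation, 1:1 allocation.
events_needed <- function(hr, alpha = 0.05, power = 0.90) {
  z <- qnorm(1 - alpha / 2) + qnorm(power)       # standard normal quantiles
  ceiling(4 * z^2 / log(hr)^2)
}
events_needed(0.5)    # about 88 events
events_needed(0.75)   # about 508 events: a smaller effect needs many more events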
Single-Treatment Studies
For a single-treatment trial with a binary response outcome, the hypotheses from the earlier example become

H0: p = 0.10
H1: p = 0.25
where p is the response rate and H0 represents a response rate too low for further
consideration and H1 a response rate that is sufficiently high for further study of the
treatment. The power calculation includes the same elements of alpha, beta, null, and
alternative hypothesis, but the calculation will have smaller sample sizes than the
comparative trial. This is due to lower variance – with only one treatment group
included, and a fixed comparator, there is only variance in the experimental
condition.
The same trade-offs between type I and II errors, clinical effect size, and sample
size are present in single-treatment studies as in randomized trials.
Single Stage
In the example in section “Single-Treatment Studies,” with a null response rate of
0.10 and an alternative response rate of 0.25, the sample size can be calculated using
an exact binomial test or a normal approximation to the binomial.
Multistage
As described in section “Interim Analyses and Early Stopping Rules” (Practical
Considerations), many trials are designed to have sufficient sample size to maintain
power and alpha when interim analyses or early stopping rules are incorporated into
the design. One common single-treatment trial design for binary outcomes is the
Simon two-stage design which includes two stages, where n1 patients are enrolled in
the first stage and n2 patients are enrolled in the second stage, and uses a one-sided
test (Simon 1989). After n1 patients are enrolled, the number of responses is
compared to a predefined threshold (r1). If there are r1 or fewer responses, the trial
stops for futility. That is, if there are r1 or fewer responses at the end of stage 1, it is
unlikely that sufficient responses could be seen by the end of stage 2 to reject the null
hypothesis, and so no more patients are enrolled. If more than r1 responses are seen at
the end of stage 1, the trial continues, enrolling an additional n2 patients (for a total of
n1+n2 patients at the end of stage 2). At the end of stage 2, the total number of
responses (from stage 1 and 2 combined) are counted up, and the null hypothesis is
rejected if there are sufficient responses.
Technically, there are many designs that can fit the criteria (due to the flexibility
induced by allowing the early look). Simon suggested a criterion that minimizes the
expected sample size of the trial if the null hypothesis is true. He referred to this
version of the design as the "optimal" two-stage design.
Because the trial has two “looks” at the data, the type I and II errors will differ
from that in a trial with only one look. In this type of trial, early stopping is only
allowed for futility, meaning there are two opportunities for a type II error and only
one opportunity for a type I error. To ensure that the errors are controlled, the sample
size is usually slightly larger than if the trial was performed in a single stage. For
example, a single-stage trial with H0: p = 0.20 versus H1: p = 0.40 requires a sample
size of 42 for a one-sided alpha of 0.05 to maintain a power of 0.90. Simon’s optimal
two-stage design requires 54 patients, with 19 patients in stage 1 and stopping early
for futility if fewer than 5 patients have responses. Other two-stage designs with alpha
of 0.05 and power of 90% could be selected that would allow a sample size closer to
42, but they would not meet the optimality criterion defined above.
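If the clinfun R package is available, its ph2simon function searches for such designs (the call below assumes that package's interface):

# Simon two-stage design search: pu = null response rate, pa = alternative,
# ep1 = type I error, ep2 = type II error. Requires the clinfun package.
library(clinfun)
ph2simon(pu = 0.20, pa = 0.40, ep1 = 0.05, ep2 = 0.10)
# The optimal design should match the text: stop if <= 4 of 19 respond; total n = 54.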
55 ng/mL, which is what the research team assumed would occur in a low-dose
(400 IU) setting; the researchers assumed that giving a high dose would lead to a
higher mean, a mean of 65 ng/mL. This would be set up as follows:
H0: μ = 55 ng/mL
H1: μ = 65 ng/mL
For a single-treatment trial with a time-to-event outcome, the hypotheses might instead be stated in terms of the median survival time m, for example:

H0: m = 6 months
H1: m = 9 months
Comparing event rates is more efficient but requires the assumption that the
event rate is constant over time. For example, the hypotheses below show a null
event rate (λ) of 0.30 and an alternative event rate of 0.20.
If the assumption of a constant event rate is untrue, the inferences
from the trial could be invalid. Power calculations in this setting can be based on a
test of the rate parameter of the exponential distribution.
H0: λ = 0.30
H1: λ = 0.20
Power Calculations for Non-inferiority Studies

Most trials are designed as superiority studies; that is, the goal is to show that a
treatment is better than another treatment (which might be an active treatment, a
placebo, or no treatment at all). However, some trials have the primary objective of
showing that a new treatment is not worse than an existing treatment by more than a
prespecified non-inferiority margin.
Approaches for Calculation of Power and Sample Size

Available Software and Websites

Although there are formulas for power and sample size calculations that can be
implemented by hand (or with a calculator), clinical trial power calculations are
performed using software packages specifically designed for power and sample size
estimation (e.g., NQuery, PASS, GPower, EAST), standard all-purpose statistical
software (e.g., SAS, Stata, R, or SPSS), or using online websites (e.g., www.crab.org,
http://powerandsamplesize.com).
Sample size estimation software packages provide many options and user-
friendly interfaces for standard trial designs and some more complex settings, such
as ANOVA, clustered designs, and early stopping rules. Many academic institutions
and companies involved in trial design have a license for at least one package, and
some are free to download (e.g., GPower). Statistical packages like SAS and Stata
include standard trial design options for power calculations (e.g., one- and two-
sample comparisons of means and proportions). While R has standard designs
available, it also has user-contributed libraries that allow users to perform power
calculations for more complex designs (e.g., cluster-randomized trials can be
designed using the "clusterPower" library).
Simulation Studies for Power Calculations

Not all trial designs will fall into the categories discussed above. For example, for a
trial with a longitudinal design where the primary objective includes comparing
trajectories (or slopes) based on repeated measures per individual in two or more
groups, there may not be a standard software package that will suit the trial design
needs. In that case, sample size could be determined based on simulation studies.
That is, just as above where assumptions were made regarding effect size, variance,
sample size, and alpha, assumptions are made for all the relevant parameters, and
trials are simulated under the set of assumptions. For each trial simulated under the
parameters consistent with the alternative hypothesis, the trial results are analyzed,
and it is determined if the null hypothesis is rejected or accepted (at the predefined
alpha level). The proportion of simulated trials for which the null hypothesis is
rejected provides an estimate of the power. To get a precise estimate of power, the
number of simulated trials has to be reasonably large. If the power is too low for a
given sample size, the sample size can be increased, and the simulations can be
performed using the larger sample size; this can be repeated until the desired power
level is reached.
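The sketch below illustrates this simulate-analyze-count loop for a simple two-arm comparison of means; every parameter value (effect size, standard deviation, sample size, number of simulations) is an assumption chosen for illustration.

```r
# Simulation-based power estimate: the proportion of simulated trials in
# which the null hypothesis is rejected estimates the power.
set.seed(1)
sim_power <- function(n_per_arm, delta, sd, alpha = 0.05, n_sims = 10000) {
  rejections <- replicate(n_sims, {
    x <- rnorm(n_per_arm, mean = 0,     sd = sd)  # control arm
    y <- rnorm(n_per_arm, mean = delta, sd = sd)  # treatment arm
    t.test(y, x)$p.value < alpha                  # two-sided test at level alpha
  })
  mean(rejections)
}
sim_power(n_per_arm = 86, delta = 5, sd = 10)  # roughly 0.90 for this effect size
```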
While simulations allow the research team to investigate trial properties for
complex designs, they have the drawback that they usually have quite a few
parameters for which the research team needs to make assumptions. However, it
is not always possible to have preliminary information on which to base these
assumptions, which leads the research team to consider a range of values for some parameters.
Additionally, simulations require skill in programming and depth of knowledge of
statistics and probability distributions. Simulations can be time-consuming: they
may take considerable time to develop and undertake (depending on the number of
simulations, the number of parameters, and the range of parameters considered) and
to summarize and present in graphical or tabular format.
In clinical trials, the sample size is planned prior to embarking on the trial, and so the
planning for the sample size incorporates the clinical effect size, alpha, power, and
variance. Once the sample size is determined, there may be additional analyses the
research team plans to perform to address secondary objectives of the trial. It is
unlikely that the sample size would be increased to ensure sufficient power for
secondary objectives, but it can be helpful to calculate power for secondary analyses
or to report the effect size that is detectable for secondary analyses based on a fixed
power and the sample size for the trial.
In some cases, after the trial is completed, the data from the trial may be used for
"secondary data analyses," which implies that the data were collected for another
purpose and were not originally intended for these analyses. The researchers may write a proposal for
these additional data analyses and the proposal should justify that the objectives of
the secondary data analyses can be achieved with the sample size of the dataset. This
can be a helpful perspective for the research team proposing the analyses as it will
determine the effect sizes that are detectable for the predetermined sample size for a
given power level. The research team may have an overpowered trial, in which case
they may want to use a relatively small alpha level to conclude “significance.” If the
trial is underpowered (i.e., the sample size is too small to detect clinically relevant
effect sizes), the research team may decide that the analysis should not be conducted,
or they may decide to perform the analysis, but will be cognizant of the low power
and take it into account when interpreting their results, and they may choose to avoid
significance testing altogether.
Alternatives to Power
Inferences for some trials are not based on traditional hypothesis tests, and therefore
power is not relevant.
Precision
Some trials have goals of estimation. For example, a single-treatment trial of a new
agent in a patient population defined by a biomarker could have the primary
objective of estimation of median progression-free survival. Instead of testing
that median survival is larger than a null value, the sample size can be motivated
by the precision with which the median survival is estimated. If the research team
wants to estimate the median survival with a 95% confidence interval with a width
no greater than 2 months, the team can use information regarding the expected
accrual rate and assumptions about possible observed values of the median
survival to determine how many patients to enroll to ensure that the width is
within 2 months.
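A hedged sketch of such a precision-driven calculation: simulate exponential survival times with an assumed true median of 6 months, estimate the median's 95% confidence interval with the survival package, and increase the sample size until the expected width meets the 2-month target. The starting sample size and the distributional assumptions are illustrative.

```r
# Expected width of the 95% CI for median survival under assumed
# exponential event times (true median 6 months) with no censoring.
library(survival)
set.seed(1)
n <- 150                                       # starting guess, to be adjusted
widths <- replicate(1000, {
  time <- rexp(n, rate = log(2) / 6)           # exponential with median 6 months
  fit  <- survfit(Surv(time, rep(1, n)) ~ 1)   # Kaplan-Meier, all events observed
  tab  <- summary(fit)$table
  unname(tab["0.95UCL"] - tab["0.95LCL"])      # width of the median's 95% CI
})
mean(widths)   # increase n until this falls below the 2-month target
```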
Although this is a sensible approach, there are no standard guidelines for
determining what a sufficiently precise confidence interval might be. Addi-
tionally, at the end of the trial, the 95% confidence interval still needs to be
interpreted, and a decision has to be made regarding whether or not further study
of the agent should be pursued in the patient population. The lower limit of the
confidence interval in this case could be used to determine if the treatment should
be pursued in additional research studies and for calculating effect sizes for future
studies.
Bayesian Designs
Bayesian trial designs differ in the way that they determine sample sizes and in the
concepts they use that are analogous to type I and II errors. Although some quantities can be
directly calculated, power and sample size for Bayesian trial designs
are typically determined using simulations because these designs tend to be adaptive.
See Lee and Chu for more details, including advantages to Bayesian trial designs
(Lee and Chu 2012).
Practical Considerations
Evaluability of Patients
Clinical trials must state a planned sample size for review and approval by institu-
tional review boards (IRBs) and by other scientific review committees. Technically,
studies should not enroll beyond the planned sample size, and doing so can result in
punitive actions for not following the trial protocol. But, in the practical implemen-
tation of trials, not all patients contribute information to address the primary objec-
tive; such patients are deemed "inevaluable" as per a definition in the protocol. For example, a
patient may enroll in the trial but drop out of the trial prior to receiving the trial
treatment, and thus the patient did not receive treatment and has no information on
the outcome of interest. Some trials may have approaches for how this patient may
contribute to the analysis, but many trials would deem this patient as inevaluable. In
trials in which there are expected to be patients who are inevaluable, the research
team should take this into account when planning the sample size. For example, if
the sample size calculation requires 60 patients for sufficient power and the team
anticipates 10% of patients will be inevaluable, the projected enrollment should be at
least 67 (67 × (1 − 0.10) = 60.3).
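In code, this inflation is a one-line adjustment; the 10% inevaluability rate is the assumption from the example above.

```r
# Inflate the required evaluable sample size for anticipated inevaluable patients.
n_evaluable <- 60
inevaluable <- 0.10
ceiling(n_evaluable / (1 - inevaluable))   # 67 patients to enroll
```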
The examples provided in previous sections (except for Simon's two-stage
design) have assumed single-stage designs, where the trial enrolls patients and
does not evaluate the data until the trial ends. In practice, many trials have early
stopping rules or are designed to have interim analyses, which may affect the trial’s
enrollment.
Power calculations for trials in which rules for early stopping are included will need to account for these looks at the data and increase
the sample size accordingly to ensure that the desired power and alpha can be
achieved. If the power calculations do not account for the early looks, the true type
I and II error rates will be higher than assumed.
Interim Analyses
In addition to early stopping rules, other analyses can be planned to take place in the
midst of a trial. For example, the research team may be required to make assumptions
regarding the trial, such as the accrual rate or the event rate, without much preliminary
data, and the team designs the trial to adjust the sample size to account for any
incorrect assumptions. This does not give the team free rein to make changes to
the trial based on interim results – this type of analysis must be planned carefully in
advance with clearly defined plans for how any changes would be made. In addition,
these mid-trial analyses are usually planned so that the research team is blinded from
knowing efficacy estimates. One type of interim analysis that could be planned is an
estimation of “conditional power” where the trial team determines the likelihood of
rejecting the null hypothesis based on the evidence in the data collected thus far. If the
conditional power is very high, the trial may continue without any revisions to the
sample size; if the conditional power is moderate (e.g., between 50% and 80%), the
sample size may be increased to bring the power up to a level of 80% or higher; if the
conditional power is low, the trial may be discontinued due to futility (i.e., with a low
conditional power, a very large sample size increase would be required, suggesting
that the detectable effect size would be considered too small to be clinically relevant).
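One common formulation of conditional power uses the B-value (Brownian motion) framework under the assumption that the current trend continues; the sketch below takes the interim z-statistic and the information fraction as inputs, and the numbers in the example call are hypothetical.

```r
# Conditional power under the current-trend assumption (B-value formulation);
# t is the information fraction (fraction of planned information observed).
conditional_power <- function(z_interim, t, alpha = 0.025) {
  drift <- z_interim / sqrt(t)          # drift estimated from the interim data
  b     <- z_interim * sqrt(t)          # B-value at information time t
  z_a   <- qnorm(1 - alpha)             # final critical value
  pnorm((z_a - b - drift * (1 - t)) / sqrt(1 - t), lower.tail = FALSE)
}
conditional_power(z_interim = 1.5, t = 0.5)  # about 0.59, a "moderate" value
```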
Summary
Power and sample size calculations are an important part of clinical trial planning.
There are standard approaches for many traditional designs, including those for
continuous, binary, or time-to-event outcomes, and for single-treatment and for
randomized trials. For more complex designs, including those with adaptive designs,
interim analyses, or early stopping, more sophisticated calculations need to be
performed to ensure that type I and type II errors are controlled. While some
power calculations can be conducted using standard software or online tools, it is
wise to engage a biostatistician to ensure the calculations are performed properly and
any nuances of the design have been accounted for.
Key Facts
• Type I and II errors must be controlled when designing trials to ensure that
inferences lead the research team to correct conclusions with a high probability.
• There are many types of trial designs, and power and sample size calculations can
be performed using simple approaches (which are available in software or online)
or using complex simulation studies.
• Practical considerations must be taken into account in addition to the output from
formulas for power or sample size.
References
Brookmeyer R, Crowley JJ (1982) A confidence interval for the median survival time. Biometrics
38:29–41
Ellis LM, Bernstein DS, Voest EE, Berlin JD, Sargent D, Cortazar P, Garrett-Mayer E, Herbst RS,
Lilenbaum RC, Sima C, Venook AP, Gonen M, Schilsky RL, Meropol NJ, Schnipper LE (2014)
American Society of Clinical Oncology perspective: raising the bar for clinical trials by defining
clinically meaningful outcomes. J Clin Oncol 32(12):1277–1280
Lee JJ, Chu CT (2012) Bayesian clinical trials in action. Stat Med 31(25):2955–2972
Simon R (1989) Optimal two-stage designs. Control Clin Trials 10:1–10
Sobrero A, Bruzzi P (2009) Incremental advance or seismic shift? The need to raise the bar of
efficacy for drug approval. J Clin Oncol 27(35):5868–5873
Controlling Bias in Randomized Clinical
Trials 42
Bruce A. Barton
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
Sources of Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 789
Selection Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 789
Cautionary Note . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 791
Performance Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 793
Detection Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 794
Attrition Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 795
Reporting/Publication Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 797
Other Sources of Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 799
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 800
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 800
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 800
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
Abstract
Clinical trials are considered to be the gold standard of research designs at the top
of the evidence chain. This reputation is due to the ability to randomly allocate
subjects to treatments and to mask the treatment assignment at various levels,
including subject, observers taking measurements or administering question-
naires, and investigators who are overseeing the performance of the study. This
chapter deals with the five major causes of bias in clinical trials: (1)
selection bias, or the biased assignment of subjects to treatment groups; (2)
performance bias, or the collection of data in a way that favors one treatment
group over another; (3) detection bias, or the biased detection of study outcomes
(including both safety and efficacy) to favor one treatment group over another;
B. A. Barton (*)
Department of Population and Quantitative Health Sciences, University of Massachusetts Medical
School, Worcester, MA, USA
e-mail: [email protected]
(4) attrition bias, or differential dropout from the study in one treatment group
compared to the other; and (5) reporting and publication bias, or the tendency of
investigators to include only the positive results in the main results paper (regard-
less of what is specified in the study protocol) and the tendency of journals to
publish only papers with positive results. While other biases can (and do) occur
and are also described here, they tend to have lower impact on the integrity of the
study. The definitions of these biases will be presented, along with how to
proactively prevent them through study design and procedures.
Keywords
Treatment randomization · Treatment masking · Selection bias · Performance
bias · Detection bias · Attrition bias · Reporting bias
Introduction
Clinical trials are generally considered the least biased of any research study design
and are widely regarded as the gold standard of research designs (Doll 1998).
The two major factors credited for the lower risk of bias in clinical trials are the use
of random treatment assignment for subjects and masking of the assigned treat-
ment. No matter how it is performed, treatment randomization “levels the playing
field” so that the treatment groups are typically similar in terms of baseline patient
characteristics, medical history, etc. Assignment to treatment is not influenced by
factors of any kind – for example, severity of disease, previous history, gender, age,
or location do not affect the probability of randomization for any subject.
Masking of the assigned treatment alleviates any bias of the observer toward
recording data in a way that is preferential to one treatment group over the other.
However, clinical trials are not impervious to bias, although the risk is cer-
tainly reduced compared to other designs (Lewis and Warlow 2004). Perhaps the
best compendium of biases in clinical trials is the Cochrane Handbook for
Systematic Reviews of Interventions (Higgins et al. 2008), which lists the poten-
tial biases in clinical trials: (1) selection bias, (2) performance bias, (3) detection
bias, (4) attrition bias, (5) reporting/publication bias, and (6) other sources of bias.
The Cochrane Collaboration has developed a tool for assessing the risk of bias,
which is recommended for the evaluation of studies to include in each systematic
review submitted to the Cochrane Reviews, but is also useful to help investigators
reduce, if not eliminate, bias in their studies. In addition, the CONSORT State-
ment and, in particular, the Elaboration and Extension information can help
investigators avoid inadvertent biases in their study design, protocol, and
reporting.
The following sections describe the types of problems that can result in each type
of bias, how to detect these problems, how serious they are, and how a researcher can
design a study that avoids them.
Sources of Bias
Selection Bias
Selection bias refers to how subjects are allocated to treatment groups. There
are many examples in the literature of nonrandomized studies yielding positive
results for a treatment only to have a subsequent randomized study overturn those
results (e.g., Wangensteen et al. 1962; Ruffin et al. 1969). The underlying reason for
the positive results in a nonrandomized setting is that the investigators give the
treatment to patients preselected to be responsive. Without a randomized
control group, it is not possible to estimate the treatment effect. Bias can also arise
if patient outcomes are compared to historical or other non-contemporaneous (or
nonrandomly chosen) controls. The solution to this source of bias is the randomiza-
tion of subjects between (or among) treatment/control groups.
In a randomized clinical trial, subjects are randomly allocated to treatment/control
groups according to a masked allocation sequence, either static or dynamic. It is
important to understand that the randomization must be masked so that the investi-
gators are not able to determine what the next treatment assignment might be. For
example, if the “randomization” is based on the last digit of the patient’s medical
record number, investigators will be able to determine what treatment the current
patient will receive – and advise the patient (either directly or indirectly) on whether
to enroll in the study based on that knowledge and their own bias related to treatment
efficacy. In RCTs, randomization means masked randomization with barriers built
into the sequence to prevent determining the next treatment allocation. The section
below discusses approaches to do this.
Typically, this is done through a randomization process as part of the allocation;
for example, subjects are assigned to treatment groups based on a sequence of
random numbers. The complete process is as follows.
Before the study is started, a sequence of random numbers is generated, either
from a table of random numbers or from a computer program/web site. That
sequence can take a number of forms, ranging from single digits (e.g., 1, 2, 3, ...)
to multiple digits (e.g., 001, 002, 003, ...) or (especially from computer programs)
any decimal sequence between 0.0 and 1.0. Once the full range of the random
numbers is known, subjects who receive numbers in the lower half of the sequence
of random numbers are assigned to one treatment group, while subjects who receive
numbers in the top half are assigned to the other treatment group. Once the treatment
assignments have been determined for each random number, the numbers are put
back into the original order to achieve a fully randomized list (Lachin 1988).
This simple approach, however, does not assure balance of the numbers of
subjects between the treatment groups. It also does not avoid a lengthy string of
the same treatment. The usual approach to assure the balance is through a permuted
block randomization (Matts and Lachin 1988). In this approach to random alloca-
tion, random sequences are generated in groups of numbers, known as “blocks.”
Each block of, say, four random numbers is sorted by the random numbers in the
block with the lowest two numbers assigned to one treatment group and the other
two numbers assigned to the other treatment group. The random numbers are then
put back into the original sequence, forcing a balance between the numbers of
subjects assigned to each treatment group while generating a truly random sequence.
The downside to this approach is that, if it becomes known what the block size is,
the final treatment assignment in the block is easy to determine. The way to eliminate
this potential bias is to randomly set the block size (e.g., for a particular study,
the block sizes could be 2, 4, or 6) so that it is not possible to determine the next
treatment assignment without knowledge of the block size that is currently being
filled. This information is, of course, hidden from the investigators as only the
random sequence of treatment assignments is available to the randomization system.
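A minimal sketch of generating such a permuted-block allocation list in R, with block sizes drawn at random from 2, 4, and 6 as described above:

```r
# Permuted-block randomization list with randomly chosen block sizes.
set.seed(42)
permuted_block_list <- function(n_target, treatments = c("A", "B"),
                                block_sizes = c(2, 4, 6)) {
  assignments <- character(0)
  while (length(assignments) < n_target) {
    b     <- sample(block_sizes, 1)               # random block size
    block <- rep(treatments, each = b / 2)        # equal allocation within the block
    assignments <- c(assignments, sample(block))  # random order within the block
  }
  assignments[seq_len(n_target)]
}
permuted_block_list(20)
```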
In addition to the allocation bias described above, there is a potential bias related
to an imbalance of factors that are highly predictive of outcomes. For example, in
studies of diabetes in which HbA1c is the outcome, body mass index (BMI) could be
an important predictor of HbA1c. To assure a balance of BMI in the two treatment
groups so that a biased outcome is not generated by a BMI imbalance, stratification
can be used. Patients’ BMI could be classified as “normal” (BMI 18–25) or “ele-
vated” (BMI >25.0). The patients with normal BMI would have a separate random-
ization to assure an equal distribution of normal BMI patients between the treatment
groups, as would the patients with elevated BMI. It should be noted that typically
there is a separate randomization within the clinical center in a multicenter clinical
trial so that the treatments are balanced within each clinic, eliminating any treatment
bias within center.
Typically, a formal stratification using this approach would be limited to three
variables because of the complexity of maintaining the randomizations across multiple
strata. For situations where it is important to balance the randomization for more than
three important factors, the alternate strategy is a minimization or dynamic allocation
strategy. This strategy keeps track of the treatment allocations for each factor of
interest. As patients are randomized using a standard approach, an imbalance may
develop with, say, females so that more females are assigned to one treatment group
(Treatment A) compared to the other (Treatment B). The usual approach, to preserve
an element of randomization, is to reduce the probability of being assigned to
Treatment A so that Treatment B has a higher probability of assignment for the
next female. This reduction in the probability of assignment to Treatment A is a
function of the imbalance, so that the usual probability of assignment to Treatment A
of 0.50 is reduced to 0.40 or even 0.25 depending on the decisions during the study
design phase related to how much tolerance to imbalance is possible. The adjustment
of this probability also implies that an a priori randomization scheme is not possible
since the probability of assignment will vary for each new patient depending on the
balance across the important factors. For each new patient ready for randomization,
this recalculation of probability of assignment to Treatment A (and the associated
probability of assignment to Treatment B) must take into account any imbalance for
each important factor, so that an overall probability is calculated. Thus, the random-
ization system must be available at all times with access to the previous randomiza-
tions so that these calculations can be made in real time.
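The sketch below conveys the flavor of such a dynamic allocation for a single factor; the favoring probability of 0.75 is an assumed design choice, and a real minimization algorithm would combine the imbalances across all of the important factors before computing the assignment probability.

```r
# Biased-coin minimization on one factor (sex): the under-represented arm
# within the new patient's stratum is favored with probability 0.75.
set.seed(7)
minimize_assign <- function(counts, stratum, p_favor = 0.75) {
  imbalance <- counts[stratum, "A"] - counts[stratum, "B"]
  prob_A <- if (imbalance == 0) 0.5 else if (imbalance > 0) 1 - p_favor else p_favor
  if (runif(1) < prob_A) "A" else "B"
}
counts <- matrix(0, nrow = 2, ncol = 2,
                 dimnames = list(c("F", "M"), c("A", "B")))
counts["F", "A"] <- 6; counts["F", "B"] <- 4   # current imbalance among females
minimize_assign(counts, "F")                   # next female is likely assigned to B
```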
Cautionary Note
If a clinical site temporarily loses its connection to the central REDCap database, the next randomization remains available locally – and the individual
can still be checked for eligibility prior to the randomization. Once contact with
REDCap is restored, the information is transmitted to the main study database, and
the next randomization is downloaded.
If the randomization cannot be done electronically, then an opaque sealed enve-
lope system can be used, but it is much easier to get around the masked treatment
assignments and “game” the masking. Various methods are used to maintain the
mask of the next randomization in the nonelectronic randomization. For example,
the person obtaining the randomization must sign the opaque envelope (along with
the date and time) on the seal on the back of the envelope as well as signing the
actual card with the treatment assignment. A third party, not otherwise associated
with the study, could be in charge of the randomization envelopes and log the date,
time, and requestor for each envelope. Once the envelope is opened, that subject is in
the study.
Berger and Christophi (2003) have made the point that allocation
concealment (i.e., no ability to predict the next random treatment allocation) is
critical to the successful randomization of patients to form equivalent treatment
groups. The impact of selection bias is hard to estimate, although a recent review
of oral health intervention studies (Saltaji et al. 2018) indicated that studies with
inadequate or unknown methods of sequence generation/masking had larger effect
sizes (difference in effect size = 0.13; 95% CI: 0.01–0.25) than studies where the
generation/masking of treatment allocation was adequate.
Finally, some studies use a “run-in” period to assess a subject’s ability to adhere to
the study requirements and to the intervention as well as to detect any early adverse
effects (Laursen et al. 2019). There is some concern that a run-in period can create a
selection bias because the run-in explicitly tests the ability of subjects to adhere to the
intervention, whereas, in studies without a run-in, the ability of people to adhere to the
intervention becomes part of the study post-randomization. Because of this selection
bias, the results of these studies could be less generalizable than otherwise. However,
the counterargument is that, with an intervention that requires the subject to perform
a task in some way, medical personnel would not prescribe the intervention without
knowing if the subject could (and would) perform the task. In the HIPPRO study, for
example, the intervention was the wearing of hip pads to prevent hip fractures in
nursing home residents. If the resident was not able (or willing) to wear the pads daily
during the run-in period, he/she was not included in the main study. In reality, medical
personnel would not ask a nursing home resident to wear hip pads if the resident was
not capable of wearing them all day every day. So, in some cases, a run-in period
makes sense. In a drug study, however, a run-in period could eliminate people who
experience a specific side effect, while, in reality, these people could be prescribed the
drug. So the question of whether to include a run-in period should be considered
carefully to be sure that generalizability is not compromised. In a systematic review by
Laursen et al. (2019) of 470 clinical trials reported in Medline in 2014, 5% (25 studies)
had a run-in period of varying design and duration. Of the 25 studies, 23 had
incomplete reports of the run-in period in the study results paper. Industry-sponsored
studies were more likely to have run-in periods than studies funded by other sources.
Performance Bias
Performance bias refers to the collection of data from subjects in a way that does not
accurately reflect subject responses, i.e., the collection of data in a way that is
favorable to the data collector’s treatment of choice. If the data collection staff are
not masked, a number of biases can enter the data, including exaggerating (or
diminishing) outcomes, failing to record adverse events, failing to administer
data collection forms in a neutral way, and misinterpreting laboratory results. For
example, a data collector/interviewer could record “headaches” as an expected and
nonserious adverse event for study participants in one treatment group but the same
symptoms as unexpected and serious for the other group. To avoid this bias, all clinic
staff responsible for any form of data collection must be masked to the treatment
assignment. This includes all data collection staff and anyone who can influence
the staff, such as the principal investigator of the study or of the clinical site. This
implies that only the person obtaining the randomization would be unmasked – such
as pharmacy staff who are packaging or providing the study treatments in a study of
medications. In these clinical trials, the clinic staff and even the laboratory staff
should be masked to study treatment.
Care must be taken if there are laboratory results that could unmask the study
physicians or nurses. For example, in studies of a new treatment in type 2 diabetics
with uncontrolled diabetes, HbA1c or fasting glucose values could unmask the
physician reviewing the results. Similarly, in patients with sickle cell disease,
patients taking hydroxyurea will have higher fetal hemoglobin levels than those
not taking it. In these situations, the laboratory results may need to be first sent to the
data center to be masked (especially placebo results) prior to forwarding to the clinic
staff, but the true results always need to be recorded in the patient’s medical records.
In the WARCEF study (Pullicino et al. 2006), in which patients were randomized to
aspirin or warfarin (coumadin), the INRs for the warfarin patients were put into the
study/clinical records without modification, but the data center generated INRs for
the aspirin patients to avoid unmasking those patients. The process of masking
laboratory results depends on the condition being studied, but generally involves
imputing a random level in one (or both) groups that will not distinguish between the
treatments. This will need to be stated explicitly in the protocol and to the IRB along
with the explanation of how these results will be handled in the clinical center. It is
important that the actual results be recorded in a fashion that will not
jeopardize patient care but will mask the information for study personnel. Depending
on the electronic health record system used at each site, this may need specialized
programming to be done effectively.
In studies of behavioral interventions, the same is true, although it may be
difficult to maintain masking if there is a difference in the level of “attention” in
the intervention group compared to the control group. Many behavioral clinical trials
now include an equivalent number of sessions for both groups to avoid a bias created
by this additional attention in the intervention group.
Subjects should also be masked to their treatment group (if possible) so that all
reporting by the subjects is accurate and not related to the perceived effects of the assigned treatment.
Detection Bias
In one well-known example, assessments by unmasked neurologists in a multiple sclerosis trial suggested a benefit of the intervention over the standard of care treatment. However, when the same assessment was conducted by masked neurologists, there was no advantage shown by the analysis of the masked results (Noseworthy et al. 1994). In general, masked
observers produce smaller treatment effect sizes that are also more reproducible
(Hróbjartsson et al. 2012). A recent state-of-the-art review by Kahan et al. (2017)
indicated that the best approach to adjudication of events and outcomes in a clinical
trial is dependent on the nature of the study and on the event/outcomes that are subject
to adjudication.
The approach that can typically minimize or even eliminate detection bias is to
engage a group of clinicians, unrelated to the study, to determine the outcome based
on prespecified criteria listed in the final study protocol. It is important that these
criteria are finalized before any outcome data are reviewed. This Outcomes Adjudi-
cation Committee would receive only data, reports, and notes that are de-identified
and on which any reference to the study (and any potentially unmasking informa-
tion) is redacted. The minutes should be taken at Committee meetings and become
part of the study documents. Decisions by the committee should be clearly indicated
in the Committee meeting minutes and should be entered into the study database by
the Committee secretary and verified by the Committee chair or designee. The
adjudications should be changed only by the Committee chair, with such changes
captured in the database audit trail and noted in subsequent Committee meeting minutes.
The independent adjudication could also include a review of unexpected serious
adverse events (SAEs) to verify the relatedness of the SAE to the study therapy (i.e.,
drug, behavior modification, device, or biologics). The same basic approach as for
clinical outcomes should be taken with SAEs.
An additional detection bias is the inability to determine if an outcome, especially
a soft or subjective outcome, has occurred due to subject recall bias. For example, in
studies of sickle cell disease, if a patient is feeling better in general, he/she may
forget about the pain episode 2 weeks ago, while a patient who feels miserable may
not. Electronic daily pain diaries (especially cell phone apps) have been very useful
in capturing transient subject outcomes with corroborative information, such as
prescription use, in a number of disease areas, such as sickle cell disease, atrial
fibrillation, and diabetes glucose monitoring. A number of these apps are being
paired with sensors to help determine if, for example, a sickle cell pain crisis is about
to start, if an episode of atrial fibrillation has started or is imminent, or if a subject's
continuous glucose monitor is indicating out-of-control blood sugar levels. With the
availability of wearable devices, including those that can run apps, it is practical to
design studies that collect daily information on these types of outcomes or adherence
information (such as length of Transcendental Meditation or yoga practice sessions)
to avoid recall bias.
Attrition Bias
Attrition bias can have two causes. First, some outcomes, although recorded in the
database, may be excluded from analysis for a variety of reasons. Some reasons may
be technical in nature (e.g., outcome not assessed within prespecified time window
or not assessed using protocol specified lab test). Others may be more logistic (e.g.,
subject did not receive protocol mandated intervention, patient found to be ineligible
after randomization). Second, subjects may have dropped out of the study or can no
longer be located for follow-up. These subjects may have dropped out of the study
for reasons related to treatment (Hewitt et al. 2010), so that it is critical to keep this
“missingness” to a minimum and preferably less than 5%. Differential attrition
between the two treatment groups may be an indication that side effects (or even
treatment effects) are not acceptable to subjects, and, rather than confront clinic staff
with that decision, the subjects are walking away quietly. It is important that the
Informed Consent Form (ICF) be written in such a way as to allow at least indirect
searching for subject information, including vital status (through the National Death
Index). If the subject has died, the causes of death (through the NDI) can be obtained
and would be important to complete the mortality information.
A systematic review (Akl et al. 2012) assessed the reporting, extent, and handling
of loss to follow-up and its potential impact on treatment effects in randomized
controlled trials published in the five top medical journals. The authors calculated the
percentage of trials in which the relative risk would no longer be significant under
varying assumptions about participants lost to follow-up. In 160 trials, with an average loss to follow-up
of 6%, and assuming different event rates in the intervention groups relative to the
control groups, between 0% and 33% of trials were no longer significant.
The least biased approach to analysis of a clinical trial in general is the intention-
to-treat (ITT) approach. This approach, devised at the time of the Anturane Study
controversy (The Anturane Reinfarction Trial Research Group 1978; Temple and
Pledger 1980), has three principles: (1) all patients are analyzed in the treatment
group to which they were assigned; (2) outcomes for each subject must be recorded;
and (3) all randomized subjects should be included in the analysis. The problem is
that it is rare that a study has the outcome(s) for all subjects. So, some form of
imputation is usually required to satisfy all three of the ITT principles. The rule of
thumb is that, if the level of missing outcome data is 5% or less, it will not affect the
overall study results and imputation is not critical. If the level of missingness is 10%
or more, multiple imputation is a good clinical practice and should be performed. If
the level of missingness is 20% or more, imputation can overly influence the results
and, in a sense, drive the results. In these situations, other approaches for dealing
with the missingness would be necessary. These approaches will depend on the
study, but could include checking other sources for information (NDI, Social
Security, Medicare, all-payer claims databases, or contacting other family members).
Briefly, multiple imputation is a strategy that generates expected outcomes for
patients missing them (Sterne et al. 2009). This is usually done using a model-based
approach based on observed data. However, because even model-based imputation
can produce the same expected outcome for multiple patients, the end result of a
single imputation is likely to yield a smaller standard deviation (and, thus, standard
error of the regression coefficients), making it easier to reject the null hypothesis than
in reality. The solution is to produce multiple sets of imputed data, each with a
random variation of the imputed values designed to restore the full variability of the
outcome. The same strategy can be used to generate expected predictors when
important predictors are missing. The multiply imputed data sets are then analyzed
using an analysis stratified by imputation data set and the results combined to
produce a single analytic result.
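As an illustration, the widely used mice package in R implements this impute-analyze-pool workflow; the data frame and variable names below (trial_data, outcome, treatment, baseline) are hypothetical placeholders.

```r
# Multiple imputation with 'mice': impute m = 5 data sets, fit the analysis
# model in each, and combine the estimates with Rubin's rules.
library(mice)
set.seed(123)
imp  <- mice(trial_data, m = 5, printFlag = FALSE)     # hypothetical data frame
fits <- with(imp, lm(outcome ~ treatment + baseline))  # analysis per imputed set
summary(pool(fits))                                    # pooled (combined) results
```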
A second approach to the analysis, sometimes called the “modified intention to
treat,” excludes subjects who have not received a protocol-specified minimum
“dose” of the intervention. The problem with the modified ITT approach is that
people who drop out early due to immediate adverse events are excluded from the
analysis – the exact problem that ITT was designed to prevent. This approach is used
frequently in oncology studies that enroll patients with advanced disease. The
rationale is that a number of these patients do not live long enough after randomi-
zation to receive the minimum dose – or potentially any dose. Thus, using ITT for
efficacy analysis could artificially reduce the success rate in those studies.
A third approach, the “per-protocol” approach, excludes subjects who do not
receive the protocol-specified complete dose for the treatment to which they were
assigned. It is typical that studies report the ITT as the primary analysis and the per-
protocol as the secondary analysis of the primary outcome. Thus, the readers see the
most unbiased result as well as the “full dose” result. An additional concept that is
sometimes included under the per-protocol approach is to analyze subjects according
to the treatment received, not as randomized. The rationale for this is that this is a
“cleaner” approach to estimating treatment effect, rather than keeping people in their
original groups, regardless of what they received. The ITT analysis tends to mini-
mize the treatment effect, whereas the “as-treated” approach tends to report the
observed treatment effect.
The study may also inadvertently cause differential attrition. In a study of growth
in children (not a clinical trial, but the lesson is valuable), the blood pressure of the
children was measured at one of the visits, and a note was given to the children with
elevated blood pressure to take home to their parents. A higher proportion of the
children who received the note did not return for future visits compared to those who
did not receive the notes. Actions may have unintended consequences and need to be
pilot tested for acceptance by subjects.
Attrition bias can lead to nonrandom missingness (as in the example above), so
that study results could be compromised, even if the nonrandomness is recognized. If
follow-up (and, thus, outcome) data are missing not at random, interpretation of the
study is not straightforward and could be curtailed to certain subgroups. In conjunc-
tion with the Data and Safety Monitoring Board and QA/QC reports discussed
above, analyses of missingness should be included so that the DSMB can identify
early any nonrandom missingness and potential causes.
Reporting/Publication Bias
Reporting bias relates to the reporting of significant treatment comparisons and the
underreporting of nonsignificant comparisons. As Chan and Altman (2005) note,
this could be the most substantial of the biases that can affect clinical trials. The
Catalog of Bias lists several types of selective reporting of outcomes: (1) reporting
only those outcomes that are statistically significant, (2) adding new outcomes after
reviewing the data that are statistically significant, (3) failing to report the safety data
(i.e., adverse events) from the trial, and (4) changing outcomes of interest to include
only those that are statistically significant (Catalog of Bias Collaboration 2019). The
CONSORT statement and associated checklist (Consolidated Standards of Reporting
Trials; http://www.consort-statement.org) provide a comprehensive list of items that
should be included in the reporting of clinical trials (Schulz et al. 2010; Moher
et al. 2010). Most medical journals now subscribe to the CONSORT principles,
including the principles that all primary and secondary outcomes should be reported
(Item 17a) and all adverse events reported (Item 19). With the enforced use of
CONSORT by the ICMJE (International Committee of Medical Journal Editors),
which requires that all outcomes and adverse events be reported for a clinical trial,
the reporting bias should be minimized (Thomas and Heneghan 2017). This does
assume that studies will follow CONSORT and that journals verify compliance.
It should also be noted that outcome data must be posted on ClinicalTrials.gov,
along with the study protocol and adverse event information. There are strict
timelines for posting the outcome results from the study, with financial penalties
for failure to comply. This requirement will also tend to diminish this bias in the
future. A number of systematic reviews of publications versus protocols filed on
ClinicalTrials.gov or other publicly accessible data sources have documented that
between 6% and 12% of reported studies have different primary outcomes than
specified in the protocol or a different analytic approach (Dwan et al. 2008, 2013,
2014; Zhang et al. 2017; Perlmutter et al. 2017). This is complicated by the
possibility that protocols were updated after data were viewed (or even analyzed),
and it is not possible to review previous protocol versions, indicating that this may be
a substantial underestimate of the problem. So even the manuscript statement
“Analyses were conducted according to the protocol” is not necessarily meaningful.
There is no clear way to determine if the analyses were conducted using prespecified
analytic techniques or if a number of analytic approaches were used until one that
produced a significant result was found and the statistical section of the protocol (or
the SAP) was changed to reflect the new approach. One indicator of this possibility is
if an “esoteric” statistical technique is used without a clear explanation why.
The second aspect of this type of bias, publication bias, is the tendency of
journals to publish studies with significant results. This is much more difficult for an
investigator or research group to counter. This bias can have a wide-reaching impact
since meta-analyses use predominately published results, although more are starting
to include results posted on ClinicalTrials.gov. Because meta-analyses are frequently
used in reports to policy makers regarding health care, this bias can lead to the
exaggeration of the efficacy of a new medication or procedure and, potentially, the
underestimation of safety issues. Because the sample of patients in a clinical trial
represents only a single sample in meta-analytic terms, investigators working on
meta-analyses need to be mindful of publication bias in RCTs. To help
counter this bias is not easy, since the journal editors have control over what gets
published; investigators involved in negative studies should argue in the article
Discussion section (as well as in the cover letter to the editor) that publication of the
negative results is important to keep the literature balanced. Statisticians and epide-
miologists who develop meta-analyses for treatments of specific conditions should
be careful to search ClinicalTrials.gov for negative, non-published results to enhance
the balanced inclusion of studies in the meta-analysis.
Other Sources of Bias
Other sources of bias include statistical programming quality control concerns. The
data collected by a clinical trial will be recorded in a database, such as REDCap. The
data is typically longitudinal with repeated drug administration, clinical visits,
laboratory tests, adverse events, and outcome adjudication results. Assembling
these data into an analytic database can be challenging, requiring the merging of
multiple data sets into a set of longitudinal records for each patient. This aspect of
each study is a high-risk programming task that must be closely checked through a
documented quality control process. The programs that determine the outcomes
for each patient are also high risk. All of these programs need to be
subjected to multiple layers of quality control to verify the accuracy of the prepara-
tion of the data for analysis. The actual analysis programming is less risky because
those programs are using the prepared analytic data sets. However, with today’s
statistical software, a mistake in one line of computer code can reverse the treatment
groups for efficacy and for safety outcomes.
Another source of bias occurs in cluster-randomized studies where it is known
that recruitment of subjects in the control facility is more difficult than in the
intervention facility. It will likely be necessary to allow a longer recruitment period
in control facilities to achieve the appropriate sample size. There is also concern that
the characteristics of the subjects recruited in control facilities could be different than
those in intervention facilities. These characteristics should be monitored during
recruitment to verify that the two cluster-randomized treatment groups are similar.
The DSMB reports should contain information on patient characteristics in these
studies. In addition, the characteristics of the facilities should be compared as well.
In the event of an imbalance in patient characteristics between treatment groups, if
discovered early in the study, an adaptive randomization plan could be implemented
so that the patient characteristics would balance themselves prior to the end of
recruitment.
In studies that are not FDA monitored, a statistical analysis plan (SAP) either does
not exist or is vague and not followed very well. FDA typically requires a SAP to be
filed before the final analysis can be conducted. The reason is simple – if the analytic
techniques are not prespecified, the investigators are able to select the analytic
technique that supports their research without concern for what was stated in a
SAP. While a preliminary analysis approach was likely included in an NIH applica-
tion, that can easily be dismissed as preliminary and not even reported. If a reduced
version of a SAP is included in the protocol (and, thus, on ClinicalTrials.gov), it is
much harder to ignore it. But few protocols include much more than a cursory
explanation of the analysis plan unless there is FDA oversight, in which case the
major elements of the SAP are included in the protocol.
Summary
This section describes the major sources of bias in RCTs and possible solutions. In most
cases, depending on the nature of the study, other solutions can be found in addition to
those described here. There is no “push button” approach to safeguarding a trial from
bias – every RCT is different and will require different approaches to controlling and,
hopefully, eliminating bias. New sources of bias can arise through social media.
For example, a slight difference in the appearance of a placebo (or the inherent
differences in behavioral study treatments/conditions) can give study participants
enough information to unmask the study when shared in a study-related social media
group, resulting in misleading results based on patient-reported outcomes.
Investigators, therefore, must be constantly watchful, proactively preventing bias
from occurring and retroactively correcting existing problems.
Key Facts
1. Biases can still occur in randomized clinical trials, the study design that is
considered to be the gold standard.
2. Random treatment group assignment and data collection masked to subjects’
treatment group assignment will prevent most biases.
3. Random treatment assignments are always required for an RCT; treatment group
masking can be more challenging.
4. Other types of bias, such as reporting and publication bias, are unrelated to
randomization and masking. While reporting bias can be avoided by following
the CONSORT statement and checklist, publication bias is in the hands of the
journal editors.
Cross-References
References
Akl AE, Briel M, You JJ, Sun X, Johnston BC, Busse JW, Mulla S, Lamontagne F, Bassler D, Vera
C, Alshurafa M, Katsios CM, Zhou Q, Cukierman-Yaffe T, Gangji A, Mills EJ, Walter SD, Cook
DJ, Schünemann HJ, Altman DG, Guyatt GH (2012) Potential impact on estimated treatment
effects of information lost to follow-up in randomized controlled trials (LOST-IT): systematic
review. BMJ 344:e2809
Berger VW, Christophi CA (2003) Randomization technique, allocation concealment, masking, and
susceptibility of trials to selection bias. J Mod Appl Stat Methods 2(1):80–86
Catalog of Bias Collaboration (2019) Catalog of Bias, November 19. Retrieved from catalogofbias.org
Chan A-W, Altman DG (2005) Identifying outcome reporting bias in randomised trials on PubMed:
review of publications and survey of authors. BMJ 330(7494):753
Doll R (1998) Controlled trials: the 1948 watershed. BMJ 317:1217
Dunn K (2019) Shewhart charts, July 17. Retrieved from https://learnche.org/pid/process-monitoring/shewhart-charts
Dusingize JC, Olsen CM, Pandeya NP, Subramaniam P, Thompson BS, Neale RE, Green AC,
Whiteman DC, Study QS (2017) Cigarette smoking and the risks of basal cell carcinoma and
squamous cell carcinoma. J Invest Dermatol 137(8):1700–1708
Dwan K, Altman DG, Arnaiz JA, Bloom J, Chan AW, Cronin E, Decullier E, Easterbrook PJ, Von
Elm E, Gamble C, Ghersi D, Ioannidis JP, Simes J, Williamson PR (2008) Systematic review of
the empirical evidence of study publication bias and outcome reporting bias. PLoS One 3(8):
e3081
Dwan K, Gamble C, Williamson PR, Kirkham JJ, Reporting Bias Group (2013) Systematic review
of the empirical evidence of study publication bias and outcome reporting bias – an updated
review. PLoS One 8(7):e66844
Dwan K, Altman DG, Clarke M, Gamble C, Higgins JP, Sterne JA, Williamson PR, Kirkham JJ
(2014) Evidence for the selective reporting of analyses and discrepancies in clinical trials: a
systematic review of cohort studies of clinical trials. PLoS Med 11(6):e1001666
Editorial (2014) #Trial: clinical research in the age of social media. Lancet Oncol 15(6):539
Hewitt CE, Kumaravel B, Dumville JC, Torgerson DJ, Trial Attrition Study Group (2010)
Assessing the impact of attrition in randomized controlled trials. J Clin Epidemiol 63(11):
1264–1270
Higgins JPT, Altman DG, Behalf of the Cochrane Statistical Methods Group and the Cochrane Bias
Methods Group (2008) In: JPT H, Green S (eds) Cochrane handbook for systematic reviews of
interventions. Wiley, Chichester
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 806
Why Mask Randomized Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 806
How to Mask Investigators in Randomized Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 810
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 813
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 813
Abstract
The substantial investment of both time and money to mount a clinical trial would
not be made without an underlying belief that the new treatment is likely
beneficial. While a lack of definitive evidence can underpin the equipoise of
investigators that is necessary to mount a new trial, the success in previous early
phase trials (or even animal models) provides a natural foundation for an expected
benefit in subsequent phase trials. Both investigators and patients can share this
belief, and these expectations of treatment efficacy for new therapies introduce
the potential for bias in clinical trials. The benefits, completeness, and reporting of
masking in clinical trials are described, as are approaches for implementing
and maintaining the mask.
Keywords
Masking · Blinding · Assessment of outcomes
G. Howard (*)
Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
e-mail: [email protected]
J. H. Voeks
Department of Neurology, Medical University of South Carolina, Charleston, SC, USA
e-mail: [email protected]
Introduction
The substantial investment of both time and money to mount a clinical trial would
not be made without an underlying belief that the new treatment is likely beneficial.
While a lack of definitive evidence can underpin the equipoise of investigators that is
necessary to mount a new trial, the success in previous early phase trials (or even
animal models) provides a natural foundation for an expected benefit in subsequent
phase trials. Such an underlying belief in efficacy was demonstrated when investiga-
tors in a trial treating depression were asked to guess whether patients were assigned
to active treatment or placebo: they were more likely to guess active treatment for
patients who had better clinical outcomes (and also for patients with more
adverse events) (Chen et al. 2015). Likewise, patients either have a predisposition or
are transferred a confidence that active treatment is superior, with even patients in
early-phase cancer trials being optimistic that new therapies will be beneficial (Sulmasy
et al. 2010; Jansen et al. 2016). These expectations of treatment efficacy for new
therapies introduce the potential for bias in clinical trials (see Fig. 1).
Masking (or blinding) of the treatment assignment stands as one of the pillars to
protect the study from potential biases introduced through these expectations. These
expectations could consciously or unconsciously influence the messaging by investi-
gators to subjects in the description of expected outcomes and adverse events.
Table 1 Potential benefits accruing depending on those individuals successfully masked. (From Schulz and Grimes (2002))

Individual masked | Potential benefit
Participants | Less likely to have biased psychological or physical responses to intervention; more likely to comply with trial regimens; less likely to seek additional adjunct interventions; less likely to leave trial without providing outcome data, including loss to follow-up
Trial investigators | Less likely to transfer their inclinations or attitudes to participants; less likely to differentially administer co-interventions; less likely to differentially adjust dose; less likely to differentially withdraw participants; less likely to differentially encourage or discourage participants to continue trial
Assessors | Less likely to have biases affect their outcome assessments, especially with subjective outcomes of interest
The success of masking can be formally assessed by asking participants or investigators to guess the treatment assignment and summarizing those guesses with a masking (blinding) index (James et al. 1996; Bang et al. 2004). However, such assessments are rarely reported: the success of efforts to mask was reported in only 31 of 1,599 (2%) trials described as masked, included in the Cochrane Central Register of Controlled Trials, and published in 2001 (Hrobjartsson et al. 2007).
The guideline for reporting the success of masking included in the 2001 CONSORT
guidelines was removed from the 2010 CONSORT guidelines because of a lack of
empirical evidence supporting the practice and concerns regarding the validity of
assessments (Schulz et al. 2010). Reports that do assess the success of masking are inconsistent: for example, a meta-analysis of psychiatric trials showed masking success in both arms (Freed et al. 2014), while masking was not successful in pain trials (Colagiuri et al. 2019). Success in masking appears higher in trials with smaller effect sizes (Freed et al. 2014). However, as many as half of masked trials conducted some assessment of the quality of masking without reporting it in the literature (Bello et al. 2017), and 20% or fewer of the trials with formal assessments of masking reported these results in publications (Hrobjartsson et al. 2007; Bello et al. 2017). Hence, reporting bias and other factors challenge efforts to describe whether masking can be successfully implemented and to identify factors associated with masking success.
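For reference, the masking (blinding) indexes used in this literature (James et al. 1996; Bang et al. 2004) summarize participants' or investigators' guesses of the treatment assignment. A minimal sketch of the arm-specific index of Bang et al., following its usual presentation:

\[
\widehat{\mathrm{BI}}_j = \frac{n_{j,\mathrm{correct}} - n_{j,\mathrm{incorrect}}}{n_j}, \qquad -1 \le \widehat{\mathrm{BI}}_j \le 1,
\]

where, within treatment arm \(j\), \(n_{j,\mathrm{correct}}\) and \(n_{j,\mathrm{incorrect}}\) count correct and incorrect guesses and \(n_j\) is the total number of respondents ("don't know" responses contribute only to \(n_j\)). A value near 0 is consistent with successful masking (random guessing), a value near 1 indicates substantial unmasking, and a negative value suggests systematic guessing opposite to the actual assignment.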
Among the mechanisms through which the lack of masking could introduce bias is the possibility that knowledge of the treatment assignment may affect the vigilance of investigators in their surveillance of potential study outcomes. For
example, an unmasked investigator may consciously or unconsciously more aggres-
sively probe potential symptoms for patients in the placebo arm, “ensuring” that the
events presumed to be more common are not missed. Conversely, the investigator may
have a higher threshold to declare an outcome event in the actively treated arm. But, as
outcomes become more objective, there is less judgment in probing for potential
events and less leeway for judgment by the investigator assessing outcomes. Hence,
with objective outcomes, there is less opportunity for investigators (or subjects) to
introduce bias, with little contribution for very objective outcomes such as mortality or
outcomes determined through direct measurement (e.g., weight loss, blood pressure,
lipid levels, etc.). This possibility that masking is less important with increasing
objectivity of the outcome is supported by empirical evidence. For example, in an analysis of the difference in treatment effect between 532 trials with adequate masking and 272 with inadequate masking, there was no evidence of a difference for all-cause mortality (ratio of odds ratios = 1.04; 95% CI, 0.95 to 1.14), while for other outcomes the difference between masked and unmasked trials was significant (ratio of odds ratios = 0.83; 95% CI, 0.70 to 0.98). When outcomes were instead classified as “objective” versus “subjective,” the ratio of odds ratios was 1.01 (95% CI, 0.92 to 1.10) for objective outcomes and 0.75 (95% CI, 0.61 to 0.82) for subjective outcomes (Wood et al. 2008). However,
while such evidence does support a lower importance of masking with outcomes that
are more objective, it is important to remember there are other pathways through
which unmasked investigators could introduce bias. For example, in the setting of
intensive care units (ICUs), knowledge of the treatment assignment could affect the
decision to provide or withhold life support therapy and thereby have an effect on
mortality as an outcome (Anthon et al. 2017). However, a systematic review with a
published a priori protocol (Anthon et al. 2017) provided little support that this theoretical possibility is a major concern. The authors considered published systematic reviews and reanalyzed the data, clustering the included studies into those with and without masking. The results of this effort showed that for the primary
outcome of death (at the longest follow-up time), only 1 of 22 studies showed a larger
treatment effect for those unmasked than masked (odds ratio = 0.58; 95% CI,
0.35–0.98 versus odds ratio = 1.00; 95% CI, 0.87–1.16) (Anthon et al. 2018). With
22 assessments (and testing interaction at α = 0.10), this is nominally fewer trials showing heterogeneity of effect than would be expected by chance. Similar findings were shown for other
outcomes including in-hospital and in-ICU mortality (Anthon et al. 2018). Still, even
in studies with very objective outcomes, the possibility that alternative pathways could
introduce bias should be carefully considered before abandoning masking.
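For clarity, the meta-epidemiological comparisons above (e.g., Wood et al. 2008) are based on the ratio of odds ratios; stated in one common form (conventions vary across studies):

\[
\mathrm{ROR} = \frac{\mathrm{OR}_{\text{trials with inadequate masking}}}{\mathrm{OR}_{\text{trials with adequate masking}}}.
\]

When an odds ratio below 1 denotes treatment benefit, a ROR below 1 (as with the 0.75 observed for subjective outcomes) indicates that inadequately masked trials report systematically more favorable treatment effects, while a ROR near 1 (as with the 1.01 for objective outcomes) indicates no such exaggeration.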
There is a growing literature supporting the position that estimation of treatment
effects based on events as reported by site investigators is quite similar to results
when information is centrally retrieved and processed by adjudication committees
(Ndounga Diakou et al. 2016). However, it is important to recognize that the decision to
use or not use adjudication committees differs fundamentally from the decision to
mask or not mask studies. That is, it is straightforward to maintain a mask within the
clinics for pharmacological treatments with active versus placebo treatments, and
hence the decision for the use of adjudication becomes a comparison of masked local
determination of outcomes versus the masked central adjudication of outcomes.
However, for other trials where maintaining a mask within the clinical center is
problematic (such as surgical trials), providing adequate masking for the determina-
tion of outcomes may require either additional staff in the clinical centers who are
masked to treatment allocation (and trust that a “wall” is maintained between the masked and unmasked clinical center staff) or the use of a central adjudication committee that can be masked (discussed below).
It does seem intuitive that investigators (and subjects) could be influenced by the
knowledge of the treatment allocation, and this intuition is supported by empirical
data showing a bias for larger treatment effects in unmasked studies. While a large
number of studies are masked, there appears to be substantial room for improvement
in the reporting of the methods for implementing the mask, and the methods and
benefit for assessing the success of masking remain questionable.
While double-masked active versus placebo treatments in pharmacologic trials are frequently the first thing thought of when discussing masking in clinical trials, the masking of treatment assignment is much more complex for many trials. Different treatments give rise to a spectrum of challenges to the masking of investigators, with perhaps surgical trials giving rise to the largest number of issues. Here, those providing the therapy clearly cannot be masked, nor can patients generally be masked (without the use of sham surgery, which is frequently considered unethical (Macklin 1999)), nor can many
assessors be masked from the scars associated with procedures. Implementation of
masking in lifestyle treatments (e.g., diet, exercise, etc.) is similarly difficult to
implement. However, as noted above, the lack of masking can give rise to biased estimates of treatment effect, and the effort to provide the most complete masking feasible is therefore central to the good conduct of studies. Accordingly, an array of tools and approaches has been developed to reduce bias. Karanicolas and colleagues have proposed three considerations for the implementation of these approaches:
they should successfully conceal the group allocation, they should not impair the
ability to successfully assess outcomes, and they must be acceptable to the individuals
assessing the outcome (Karanicolas et al. 2008).
Boutron and colleagues’ outstanding systematic review of methods for masking
studies among 819 trials offered an effective strategy for classifying masking tools
and approaches, specifically whether they primarily (1) mask patients and healthcare
providers, (2) maintain the mask of patients and healthcare providers, or (3) support
the masking of assessors of outcomes (Boutron et al. 2006). We will follow this
structure in the review of these methods.
Among approaches to support the masking of patients and healthcare providers,
by far, the most common technique is the central preparation of oral/topical active
treatments with masked alternative treatments, an approach employed by 193 of 336 (57%) studies reporting approaches to mask the patient and healthcare provider
(Boutron et al. 2006). This approach for masking is nearly ubiquitous in pharmaco-
logical clinical trials, and use of a central pharmacy effectively masks the treatment
assignment from the investigators. While this approach is common, the effort and
cost to identify and contract with a central pharmacy partner for the trial are
considerable, and the time line for implementation and production of active and
placebo treatment should not be underestimated. In addition, investigators in the
central pharmacy have unique experience and insights that are often remarkably
useful for the trial; it is critical to identify and involve these scientists as early as
possible in the trial planning process. Specifically, while the active drug may be
readily available, the creation of a placebo treatment with similar characteristics
sometimes requires encapsulation to conceal the active drug, or the addition of
flavors to mask the taste. Care and due diligence are still required: although never published in the reports from the trial, the investigators in the Vitamin Intervention for Stroke Prevention (VISP) trial fortunately had the foresight to bioassay the first batch of active and placebo medications provided by the central pharmacy, finding the placebo to have levels of the treatment medications (folate, B6, and B12) nearly indistinguishable from the active medication (a problem resolved prior to the start of the trial). Trials with an active alternative treatment (e.g., a trial of teriparatide
versus risedronate for new fractures in postmenopausal women (Kendler et al.
2018)) offer additional challenges, as it can be difficult to produce treatments that appear similar even with encapsulation. Such a situation may call for a placebo to be created for each of the two active treatments, a “double-dummy” design.
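To make the double-dummy mechanics concrete, the following minimal sketch (the arm names, agents, and schedule are hypothetical, not taken from any specific trial) shows how each arm pairs one active formulation with a matching placebo so that all participants receive identical-appearing regimens:

```python
# Hypothetical double-dummy schedule: every participant takes both a
# daily tablet and an injection every 2 weeks, so neither the route nor
# the frequency of administration reveals the assigned treatment.
DOUBLE_DUMMY_SCHEDULE = {
    "arm_1": {"daily_tablet": "active drug A", "injection_q2wk": "placebo"},
    "arm_2": {"daily_tablet": "placebo", "injection_q2wk": "active drug B"},
}

def regimen_for(arm: str) -> dict:
    """Return the full (masked) regimen dispensed to a participant."""
    return DOUBLE_DUMMY_SCHEDULE[arm]
```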
Once masking of treatment assignment is established, efforts need to focus on
maintaining that mask during patient follow-up during which several factors work to
potentially unmask the treatment assignment. A particular challenge to maintaining the masking of investigators is posed by pharmacological trials that require dose adjustments, where the mask can be maintained by having a centralized office create the adjustment orders, with the inclusion of sham adjustments for those patients on
placebo. The investigators can also be partially or completely unmasked by the
availability of results of laboratory or other assessments at the clinical site. This
possibility can be reduced by the use of a central laboratory or reading facility with
only selected information required for safety being returned to the clinical site. The
masking of investigators is also challenged by the occurrence of specific adverse events; again, the use of a central facility to process and report adverse events can reduce this possibility, as can systematic treatments to prevent adverse events that are applied equally in both treatment groups. Finally, to maintain the mask, it is critical for investigators to avoid “messaging” to the patient about the therapeutic effect and the expected side effects.
However, there are treatments where maintaining the mask in the clinical center is
quite difficult or even impossible. Examples would include randomization to surgery
versus medical management for the management of asymptomatic carotid stenosis
(Howard et al. 2017), randomization to Mediterranean diet versus alternative diets
(Estruch et al. 2013), or randomization to different treatment algorithms (such as
different blood pressure levels (SPRINT Research Group et al. 2015)). In such cases, a
first-line approach to provide masking is to have independent clinic staff who are not
involved in providing treatment be masked to the treatment allocation and assess the
trial outcomes; however, such an approach requires faith that the clinic staff will
maintain a “wall” between staff who may know each other well. Alternatively,
outcomes can be centrally processed by trial staff who are masked to treatment
allocation, an approach referred to as a prospective randomized open-blinded end-
point (PROBE) design (Hansson et al. 1992). Examples of the approach include the
video recording of a neurological examination with centralized scoring of the
modified Rankin score that serves as the primary study outcome (Reinink et al.
2018) or the retrieval of medical records for suspected stroke events that can be
redacted to mask treatment allocation and adjudicated by clinicians who are masked
to treatment allocation (Howard et al. 2017). Even with the use of PROBE designs,
investigators must be careful to not let the actions of the unmasked clinic staff
introduce bias. For example, the clinic staff could be more sensitive to the detection
of potential events in the medically managed group and be more likely to report these
events for the central adjudication. This can be partially overcome by the introduc-
tion of triggers, such as a 2-point increase in a clinical stroke scale, and requiring
records to be provided each time the trigger occurs. This potential bias can also be
reduced by setting a very low threshold for suspected events, so that many more
records are centrally reviewed with a relatively small proportion being adjudicated as
a study outcome. That PROBE approaches could reduce but not eliminate bias is
supported by a meta-analysis of oral anticoagulants to reduce stroke risk estimating
the treatment difference between trials using a double mask approach (4 trials)
versus a PROBE design (9 trials). This analysis observed a nonsignificantly
(p = 0.16) larger effect for stroke prevention in the PROBE studies (relative risk
= 0.76; 95% CI, 0.65–0.89) than for the double mask studies (relative risk = 0.88;
95% CI, 0.78–0.98) and a significantly larger effect (p = 0.05) for the prevention of hemorrhagic stroke in the PROBE trials (relative risk = 0.33; 95% CI, 0.21–0.50) than in the double mask studies (relative risk = 0.55; 95% CI, 0.41–0.73) (Lega et al. 2013).
While the use of PROBE methods likely reduces bias in outcome ascertainment, it is
not clear that these methods are as widely used as possible. For example, in a review of
171 orthopedic trials, masking of clinical assessors was considered feasible in 89% of
studies and masking of radiographic assessors in 83% of trials; however, fewer than 10% of these trials used masked assessors (Karanicolas et al. 2008).
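To make the trigger-based ascertainment described above concrete, a minimal sketch follows (the scale, threshold, and names are hypothetical illustrations of the 2-point trigger mentioned in the text):

```python
def record_retrieval_triggered(baseline_score: int,
                               current_score: int,
                               threshold: int = 2) -> bool:
    """Flag a participant for central, masked record review whenever a
    clinical stroke scale worsens by at least `threshold` points.

    Applying the same low threshold in every arm means many records are
    reviewed centrally while only a small proportion are adjudicated as
    outcomes, limiting differential reporting by unmasked site staff.
    """
    return (current_score - baseline_score) >= threshold
```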
While simple active/placebo masking is possible for some treatments, many trials
will require creativity and determination to implement masking of investigators.
Additionally, once masking is in place, efforts need to be directed to maintain the
mask.
Conclusions
Masking stands as one of the pillars to reduce or eliminate bias in the conduct of
clinical trials. Without masking, intentional or unintentional prejudice can influence
the outcome of the trial, and as such ignorance is truly bliss.
References
Anthon CT, Granholm A, Perner A, Laake JH, Moller MH (2017) The effect of blinding on
estimates of mortality in randomised clinical trials of intensive care interventions: protocol for
a systematic review and meta-analysis. BMJ Open 7(7):e016187
Anthon CT, Granholm A, Perner A, Laake JH, Moller MH (2018) No firm evidence that lack of
blinding affects estimates of mortality in randomized clinical trials of intensive care interven-
tions: a systematic review and meta-analysis. J Clin Epidemiol 100:71–81
Armijo-Olivo S, Fuentes J, da Costa BR, Saltaji H, Ha C, Cummings GG (2017) Blinding in
physical therapy trials and its association with treatment effects: a meta-epidemiological study.
Am J Phys Med Rehabil 96(1):34–44
Bang H, Ni L, Davis CE (2004) Assessment of blinding in clinical trials. Control Clin Trials 25
(2):143–156
Bello S, Moustgaard H, Hrobjartsson A (2017) Unreported formal assessment of unblinding
occurred in 4 of 10 randomized clinical trials, unreported loss of blinding in 1 of 10 trials. J
Clin Epidemiol 81:42–50
Boutron I, Estellat C, Guittet L et al (2006) Methods of blinding in reports of randomized controlled
trials assessing pharmacologic treatments: a systematic review. PLoS Med 3(10):e425
Chen JA, Vijapura S, Papakostas GI et al (2015) Association between physician beliefs regarding
assigned treatment and clinical response: re-analysis of data from the Hypericum Depression
Trial Study Group. Asian J Psychiatr 13:23–29
Colagiuri B, Sharpe L, Scott A (2019) The blind leading the not-so-blind: a meta-analysis of
blinding in pharmacological trials for chronic pain. J Pain 20:489–500
Estruch R, Ros E, Salas-Salvado J et al (2013) Primary prevention of cardiovascular disease with a
Mediterranean diet. N Engl J Med 368(14):1279–1290
Freed B, Assall OP, Panagiotakis G et al (2014) Assessing blinding in trials of psychiatric disorders:
a meta-analysis based on blinding index. Psychiatry Res 219(2):241–247
Hansson L, Hedner T, Dahlof B (1992) Prospective randomized open blinded end-point (PROBE)
study. A novel design for intervention trials. Prospective Randomized Open Blinded End-Point.
Blood Press 1(2):113–119
Howard VJ, Meschia JF, Lal BK et al (2017) Carotid revascularization and medical management for
asymptomatic carotid stenosis: protocol of the CREST-2 clinical trials. Int J Stroke
12(7):770–778
Hrobjartsson A, Forfang E, Haahr MT, Als-Nielsen B, Brorson S (2007) Blinded trials taken to the
test: an analysis of randomized clinical trials that report tests for the success of blinding. Int J
Epidemiol 36(3):654–663
James KE, Bloch DA, Lee KK, Kraemer HC, Fuller RK (1996) An index for assessing blindness in
a multi-centre clinical trial: disulfiram for alcohol cessation – a VA cooperative study. Stat Med
15(13):1421–1434
Jansen LA, Mahadevan D, Appelbaum PS et al (2016) Dispositional optimism and therapeutic
expectations in early-phase oncology trials. Cancer 122(8):1238–1246
Karanicolas PJ, Bhandari M, Taromi B et al (2008) Blinding of outcomes in trials of orthopaedic
trauma: an opportunity to enhance the validity of clinical trials. J Bone Joint Surg Am
90(5):1026–1033
Kendler DL, Marin F, Zerbini CAF et al (2018) Effects of teriparatide and risedronate on new
fractures in post-menopausal women with severe osteoporosis (VERO): a multicentre, double-
blind, double-dummy, randomised controlled trial. Lancet 391(10117):230–240
Lega JC, Mismetti P, Cucherat M et al (2013) Impact of double-blind vs. open study design on the
observed treatment effects of new oral anticoagulants in atrial fibrillation: a meta-analysis. J
Thromb Haemost 11(7):1240–1250
Macklin R (1999) The ethical problems with sham surgery in clinical research. N Engl J Med 341
(13):992–996
Ndounga Diakou LA, Trinquart L, Hrobjartsson A et al (2016) Comparison of central adjudication
of outcomes and onsite outcome assessment on treatment effect estimates. Cochrane Database
Syst Rev 3:MR000043
Nuesch E, Reichenbach S, Trelle S et al (2009) The importance of allocation concealment and
patient blinding in osteoarthritis trials: a meta-epidemiologic study. Arthritis Rheum
61(12):1633–1641
Page MJ, Higgins JP, Clayton G, Sterne JA, Hrobjartsson A, Savovic J (2016) Empirical evidence
of study design biases in randomized trials: systematic review of meta-epidemiological studies.
PLoS ONE 11(7):e0159267
Reinink H, de Jonge JC, Bath PM et al (2018) PRECIOUS: PREvention of Complications to
Improve OUtcome in elderly patients with acute Stroke. Rationale and design of a randomised,
open, phase III, clinical trial with blinded outcome assessment. Eur Stroke J 3(3):291–298
Saltaji H, Armijo-Olivo S, Cummings GG, Amin M, da Costa BR, Flores-Mir C (2018) Influence of
blinding on treatment effect size estimate in randomized controlled trials of oral health inter-
ventions. BMC Med Res Methodol 18(1):42
Savovic J, Jones H, Altman D et al (2012) Influence of reported study design characteristics on
intervention effect estimates from randomised controlled trials: combined analysis of meta-
epidemiological studies. Health Technol Assess 16(35):1–82
Schulz KF, Grimes DA (2002) Blinding in randomised trials: hiding who got what. Lancet 359
(9307):696–700
Schulz KF, Altman DG, Moher D (2010) CONSORT 2010 statement: updated guidelines for
reporting parallel group randomised trials. J Pharmacol Pharmacother 1(2):100–107
SPRINT Research Group, Wright JT Jr, Williamson JD et al (2015) A Randomized Trial of
Intensive versus Standard Blood-Pressure Control. N Engl J Med 373(22):2103–2116
Sulmasy DP, Astrow AB, He MK et al (2010) The culture of faith and hope: patients’ justifications
for their high estimations of expected therapeutic benefit when enrolling in early phase oncology
trials. Cancer 116(15):3702–3711
Wood L, Egger M, Gluud LL et al (2008) Empirical evidence of bias in treatment effect estimates in
controlled trials with different interventions and outcomes: meta-epidemiological study. BMJ
336(7644):601–605
Masking Study Participants
44
Lea Drye
Contents
Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 816
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 816
Goals of Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 817
Alerting Participants That Masking Will Be Used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 817
Operationalizing Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
Placebos and Shams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
Mechanics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
Unmasking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 820
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 820
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 820
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
Abstract
Masking or blinding in clinical trials refers to the process of keeping the identity
of the assigned treatment hidden from specific groups of individuals such as
participants, study staff, or outcome assessors. The purpose of masking is to
minimize conscious and unconscious bias in the conduct and interpretation of a
trial. Masking participants in clinical trials is a key methodological procedure
since patient expectations can introduce bias directly through how a participant
reports patient-reported outcomes but also indirectly through his or her willing-
ness to participate in and adhere to study activities.
The complexity of operational aspects of masking participants is often
underestimated. Masking is facilitated by placebos, dummies, sham devices, or
sham procedures/surgeries. The success of masking depends on how closely the placebo or sham matches the active treatment.
L. Drye (*)
Office of Clinical Affairs, Blue Cross Blue Shield Association, Chicago, IL, USA
e-mail: [email protected]
Keywords
Blind · Mask · Single mask · Double mask · Unmask · Placebo · Sham
Definitions
Introduction
Masking in clinical trials is the process of keeping one or more parties (e.g.,
participants, study staff, outcome assessors, data analysts) unaware of the identity
of the treatment assignment during the conduct of the trial. The purpose of masking
is to prevent conscious or subconscious notions and expectations regarding the
treatment effects from affecting outcomes. To minimize behavior that can lead to
differential effects on outcomes, the preferred design strategy is to mask as many
individuals as is practically possible while maintaining safety. Blinding is a frequently used term synonymous with masking.
The term single masking is usually used to refer to the masking of study partici-
pants, while double masking is usually used to refer to masking of study participants
and study staff. Additional levels of masking, such as masking of data analysts or
treatment effects monitoring committees (also known as data and safety monitoring
boards), may also be used. It is important to note that the terms single, double, and
triple masked, which are used to describe the level of masking, are not universally
standardized. Readers must evaluate the description of masking in trial publications or
study documentation such as the protocol to understand which groups were masked.
Goals of Masking
Study staff should explain to study participants that treatments will be masked and
the reasons for masking. Masking should not be attempted if accomplishing it
requires lying to participants or deception.
Both the Common Rule (45 CFR 46.116) and FDA clinical trial regulations (21 CFR 50.25) require that descriptions of procedures related to the research, such as masking, be included in the informed consent process. Participants should
be informed that they will be kept unaware of their treatment assignments in masked
studies as well as whether the study staff or physician will be aware of the treatment
assignment.
Operationalizing Masking
Mechanics
The success of masking depends on how closely the placebo or sham matches the
active treatment. The brief statements regarding masking in most published reports
of clinical trials belie how complicated the task of masking truly is.
Masking participants in drug trials depends on whether pills, tablets, patches,
injections, liquid formulations, etc. can be made to match the active treatment with
respect to obvious characteristics such as size, shape, and color but also with respect
to smell and taste, particularly if the drug is known to have a characteristic feature. In
practice, this is generally possible only when the active drug and matching placebo are provided by the manufacturer, since it is not legal to formulate a product identical to one that is marketed. In the Alzheimer’s Disease Anti-inflammatory Prevention
Trial (ADAPT), Bayer and Pfizer supplied investigators with identical placebos
matching their marketed products (Martin et al. 2002).
When the manufacturer does not provide matching placebo, overencapsulation is
a technique that can be used to produce identical active and placebo capsules for a
trial of treatments that are in pill, tablet, or capsule formulation. This process is
expensive and may require lab testing to confirm bioavailability. Depending on how
large the overencapsulated study drug becomes, it can add an additional eligibility criterion for participation, i.e., participants must be able to swallow the capsule to enroll.
Masking of drugs goes beyond creating an identical product: the packaging must also be identical. This will require repackaging of the drug in some situations, which adds cost and time and may also have implications for product stability.
Masking of participants in drug trials becomes even more complicated if there are
more than two treatment groups or an active control, or if the treatments are taken at
different intervals or via different routes. In these cases, placebos will need to be
created so that participants take active and placebo treatments at the appropriate
treatment intervals and routes for all treatments. If one treatment requires adminis-
tration twice as often as another, all participants must take study treatment at the
most frequent interval to maintain masking. Similarly, if one treatment is a tablet and
another is an injection, participants in both groups must receive both tablets and
injections. For example, the Oral Psoriatic Arthritis Trial (OPAL) Broaden phase
3 trial compared tofacitinib to adalimumab for psoriatic arthritis in patients with
inadequate response to previous disease-modifying antirheumatic drugs (Mease
et al. 2017). Patients were randomized in a 2:2:2:1:1 ratio to receive tofacitinib
5 mg twice daily, tofacitinib 10 mg twice daily, adalimumab 40 mg administered
subcutaneously once every 2 weeks, placebo with switch to the 5 mg tofacitinib at
3 months, or placebo with switch to the 10 mg tofacitinib at 3 months. In order to mask the trial, all patients had to take two tablets twice daily and receive injections once every 2 weeks. The content of the tablets and injections varied according to treatment group.
Masking of participants in device or surgery trials is difficult but not always
impossible. Sham devices are manufactured to appear the same as active devices
but are manipulated so that they do not function as required to administer the
treatment. The Escitalopram versus Electrical Current Therapy for Treating
Depression Clinical Study (ELECT-TDCS) compared transcranial direct-current
stimulation with escitalopram in patients with major depressive disorder (Brunoni
et al. 2017). Patients received active or placebo escitalopram and active or sham
transcranial direct-current stimulation. Sham transcranial direct-current stimula-
tion was accomplished using fully automated devices that were programmed to
turn off the current automatically after 30 s. In a sham-controlled trial of 5 cm H2O
and 10 cm H2O of continuous positive airway pressure (CPAP) in patients with
asthma, “sham” CPAP was delivered via identical devices calibrated by the
manufacturer to deliver pressure at less than 1 cm H2O with masked display of
pressure level and intake flow rates and noise levels similar to the active devices
(Holbrook et al. 2016).
In a sham surgery, an imitation procedure is performed to mimic the active
surgery. This might include patients receiving anesthesia, having scopes inserted,
having incisions, etc. Therefore, sham surgeries do carry risks that are more difficult
to justify ethically and have been used less often than placebos. If patients are not
under general anesthesia, then the surgical team may also have to mimic sounds,
smells, and dialogue of surgery so that patients cannot distinguish whether or not
they underwent the actual surgery. Any imaging to check success of surgery or
medication to prevent infection must also be mimicked in patients assigned to sham.
While difficult, sham surgeries are not impossible to perform if risks to patients can
be minimized. In a trial investigating transplantation of retinal pigment epithelial
cells as a treatment for Parkinson’s disease, surgeons performed not only skin
incision but also burr holes in the skull. In the sham group, the burr holes did not penetrate the dura mater (Gross et al. 2011).
Unmasking
Unmasking of a participant’s treatment assignment may nonetheless be warranted in certain circumstances, for example:
• The treatment assignment is needed to care for the participant because decisions on how to proceed depend on which treatment the participant has received, particularly in an emergency setting.
• Potential allergic reaction.
• Potential overdose of the participant or another person.
Conclusion
Masking minimizes conscious and unconscious bias in the conduct and interpretation of
a trial and as such is a key methodological procedure. While the importance of
participant masking is well understood, the complexity of its implementation is often
underestimated. It is rarely possible to create or purchase a completely identical placebo
or sham. Masking of participants is difficult if there are more than two treatment groups
or an active control, if the treatments are taken at different intervals or via different routes,
or if sham procedures are required. Investigators should have procedures in place for
routine unmasking of participants after treatment and follow-up are complete as well as
procedures to immediately unmask at any hour of the day in the event of an emergency.
Key Facts
Cross-References
References
Brunoni AR, Moffa AH, Sampaio-Junior B, Borrione L, Moreno ML, Fernandes RA, Veronezi BP,
Nogueira BS, Aparicio LVM, Razza LB, Chamorro R, Tort LC, Fraguas R, Lotufo PA, Gattaz
WF, Fregni F, Bensenor IM (2017) Trial of electrical direct-current therapy versus Escitalopram
for depression. N Engl J Med 376(26):2523–2533
Gross RE, Watts RL, Hauser RA, Bakay RA, Reichmann H, von Kummer R, Ondo WG, Reissig E,
Eisner W, Steiner-Schulze H, Siedentop H, Fichte K, Hong W, Cornfeldt M, Beebe K,
Sandbrink R (2011) Intrastriatal transplantation of microcarrier-bound human retinal pigment
epithelial cells versus sham surgery in patients with advanced Parkinson's disease: a double-
blind, randomised, controlled trial. Lancet Neurol 10(6):509–519
Holbrook JT, Sugar EA, Brown RH, Drye LT, Irvin CG, Schwartz AR, Tepper RS, Wise RA, Yasin
RZ, Busk MF (2016) Effect of continuous positive airway pressure on airway reactivity in
asthma. A randomized, sham-controlled clinical trial. Ann Am Thorac Soc 13(11):1940–1950
Hrobjartsson A, Emanuelsson F, Skou Thomsen AS, Hilden J, Brorson S (2014) Bias due to lack of
patient blinding in clinical trials. A systematic review of trials randomizing patients to blind and
nonblind sub-studies. Int J Epidemiol 43(4):1272–1283
Martin BK, Meinert CL, Breitner JC (2002) Double placebo design in a prevention trial for
Alzheimer's disease. Control Clin Trials 23(1):93–99
Mease P, Hall S, FitzGerald O, van der Heijde D, Merola JF, Avila-Zapata F, Cieslak D, Graham D,
Wang C, Menon S, Hendrikx T, Kanik KS (2017) Tofacitinib or Adalimumab versus placebo for
psoriatic arthritis. N Engl J Med 377(16):1537–1550
Issues for Masked Data Monitoring
45
O. Dale Williams and Katrina Epnere
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 824
Main Focus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 825
Some Current Guidelines and Opinions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 826
Implications and Suggestions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 829
Key Suggestions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 830
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 831
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 831
Abstract
The essential, primary purpose of a clinical trial is to provide a fair test for the
comparison of treatments, drugs, strategies, etc. A challenge to this fairness is the
appropriate utilization, or lack thereof, of masking or blinding. Masking generally
refers to restricting knowledge as to the treatment group assignment for the
individual or, in the case of a Data and Safety Monitoring Board (DSMB), to
the summary of information comparing treatment groups. Fundamentally,
masking is important to consider for those situations wherein knowledge of the
treatment assignment could alter behavior or otherwise impact inappropriately on
trial results. Masking may, however, while protecting against this bias, make it
more difficult for the DSMB properly to protect trial participants from undue risk
of adverse or serious adverse events. While there are several dimensions to this
O. D. Williams (*)
Department of Biostatistics, University of North Carolina, Chapel Hill, NC, USA
Department of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
e-mail: [email protected]
K. Epnere
WCG Statistics Collaborative, Washington, DC, USA
e-mail: [email protected]
overall situation, this chapter addresses the important issue as to whether a trial’s
DSMB should be fully aware of which treatment group is which as it reviews data
summaries for an ongoing trial.
Keywords
Data and safety monitoring board · Data monitoring committee · Masking ·
Blinding · Open report · Closed report · Interim analysis · Risk/benefit
Introduction
restricted to Board members and those responsible for preparing the closed reports.
For publicly funded trials, representatives of the funding agency may attend (Anand et al. 2011; Bierer et al. 2016; DeMets et al. 2004; Wittes et al. 2007).
Main Focus
This chapter focuses on how these treatment groups are identified in the closed reports
and during the consequent Board discussions.
Context is a critical issue, elements of which include:
1. No masking – Also sometimes called “open label,” generally implies that none of the participants enrolled in the trial, the staff conducting the trial, or the DSMB is masked to treatment assignment. Even for this situation, trial site staff and potential
enrolled participants typically – and importantly – are masked with respect to the
treatment assignment for participants in line to be randomized.
2. Almost no masking – A special case of the “no masking” situation whereby the
process of assessing the trial’s primary outcome is done in a masked fashion. That
is, the individuals or panels assessing or measuring trial outcomes for individual
enrolled participants do so without knowledge as to the participant’s treatment
group. This is often considered a critically important need for any trial.
3. Single-masked trial – Generally refers to the situation where the participants
enrolled in the trial do not know which treatment they are receiving, but the
trial site staff and others are aware of treatment assignment (▶ Chap. 44,
“Masking Study Participants”).
4. Double masked – Generally refers to the situation where neither the enrolled
participants nor the trial site staff are aware of the treatment assignments.
5. Triple masked – Generally refers to the double-masked situation plus the masking
of the DSMB when reviewing ongoing results by treatment group.
6. Only DSMB unmasked – The critical question herein is whether the DSMB should be masked or should be the only operational entity (other than the staff preparing reports for the DSMB) that is unmasked.
There is an important difference between the masking issues for a DSMB relative to
those for the other components mentioned above. For the others, the frame of reference
is possible bias for data items for individual enrolled participants. For the DSMB, the
judgment is whether differences between the treatment groups as represented by
summary data merit some action. This is, by its very nature, a broader, more important
assessment of benefit and risk, often with societal implications (▶ Chap. 99, “Safety and Risk Benefit Analyses”).
In this context, the operative question is whether the Board should be masked as to the identity of the treatment groups and how this operational decision should be made and implemented. Some options are listed below:
1. Masked mandated – The implication is that during the ongoing trial, the Board
would review differences between treatment groups with the treatment groups
identified by codes (e.g., group A and B). They would learn the actual identity of
said groups only at the end of the trial. This strategy gives considerable weight to
the concern that the knowledge of treatment group could bias the interpretation of
interim results and lead to perhaps inappropriate action. It gives lesser weight to
the concern that the strategy would perhaps impair the Board’s ability to address
adverse event issues in a timely and appropriate manner.
2. Unmasked mandated – The implication is that the likelihood of the Board not
properly being able to assess harm or benefit without knowing the identity of the
treatment groups outweighs the concern about possible bias.
3. Something in between – If so, who decides, and how will it be structured?
Although Boards typically have purview over additional issues, the two most central issues tend to be:
• The review of adverse events and the associated risk to enrolled participants
• The comparison of treatment groups with respect to trial outcomes
Operational implications can be quite different for these two issues: a masking
policy for one may well not be appropriate for the other. The adverse event situation
typically involves the Board’s review of individual events in addition to summaries
comparing treatment groups. Further, the review of adverse events may be under-
taken on the occurrence of each individual event along with a careful investigation of
differences between treatment groups subsequently at a Board’s formal meeting. The
fundamental issue is risk to the enrolled individuals. A careful assessment of this risk
may require that the Board be informed as to treatment assignment.
Outcome comparisons may be made at each of the Board’s formal meetings. These
comparisons are critical to the purpose of the trial, and any factors that could impact on
the fairness of the comparison need careful attention. In this context, some Board
members may prefer to be masked and others may not. A special case is the issue of
interim analyses. There is often pressure for the Board not to formally review interim
analyses dealing with primary outcomes. Further, if the Board also is not able to
conduct “informal” looks at such efficacy data, there is an interesting situation with
respect to masking. If there are no such interim analyses or looks, then the Board is
inherently totally masked as to the assessment of outcomes prior to the end of the trial
(▶ Chap. 59, “Interim Analysis in Clinical Trials”). In this case, arguments for or
against the Board being masked with respect to primary outcomes are moot.
The critical aspects of assessing the pros and cons of masking the Board are the expectations and opinions of funding agencies, of governmental agencies, of investigators, and of experts in the field. Highlighted below are excerpts from relevant documents and publications:
Note that the EMA does not mandate unblinded reports; rather it indicates that the
DMC “may review unblinded study information.”
2. From the US Food and Drug Administration (FDA) (US DHHS FDA CBER
CDER CDRH 2016):
We recommend that a DMC have access to the actual treatment assignments for each study
group. Some have argued that DMCs should be provided only coded assignment information
that permits the DMC to compare data between study arms, but does not reveal which group
received which intervention, thereby protecting against inadvertent release of unblinded
interim data and ensuring a greater objectivity of interim review. This approach, however,
could lead to problems in balancing risks against potential benefits in some cases.
Note that this report references possible problems in some cases in “balancing risk
against potential benefits” (▶ Chap. 99, “Safety and Risk Benefit Analyses”) and
thus recommends that the Board be unblinded.
3. From the National Heart, Lung, and Blood Institute (NHLBI) (National Heart,
Lung, and Blood Institute National Institutes of Health 2014):
• Are convened to protect the interests of research subjects and ensure that they are not
exposed to undue risk.
• Operate without undue influence from any interested party, including study investigators
or NHLBI staff.
• Are encouraged to review interim analysis of study data in an unmasked fashion.
DMCs must periodically review the accumulating unmasked safety and efficacy data by
treatment group, and advise the trial sponsor on whether to continue, modify, or terminate a
trial based on benefit-risk assessment, as specified in the DMC Charter, protocol, and/or
statistical analysis plan. During conduct of the trial, DMCs should periodically review by
treatment group and in an unmasked fashion: primary and secondary outcome measures,
deaths, other serious and non-serious adverse events, benefit-risk assessment, consistency of
efficacy and safety outcomes across key risk factor sub-groups.
These guidelines require periodic reviews during the conduct of the trial by treatment
group in an unmasked fashion.
The ability of DSMBs to monitor trial progress and ensure the safety of patients may be
compromised without access to unmasked data.
2. Chen-Mok et al. described the experiences and challenges in data monitoring for
clinical trials within an international tropical disease research network:
The interim reports discussed during closed sessions were presented using treatment
codes (eg, A and B), with any needed unblinding done in an executive session of voting
members only. The executive secretary kept sealed envelopes containing treatment
decoding information [..] These envelopes were available to members for each study
being reviewed at a meeting. DSMB members began to consider the arguments for fully
unblinded reviews and began to move toward more easily unblinding reports. However,
members did not achieve a clear position regarding automatic unblinding of reports.
(Chen-Mok et al. 2006)
3. Holubkov et al. described the experience of the data and safety monitoring board in the CRISIS trial:
It is difficult to conjecture whether the DSMB being unmasked at time of the first interim
analysis [..] would have led to different decisions regarding study continuation and timing of
subsequent data reviews. Blinded review requires simultaneous consideration of different
possible scenarios, and the CRISIS DSMB members were sufficiently comfortable with the
two possibilities to maintain masking until the second data review. (Holubkov et al. 2013)
4. Fleming et al. summarized the consensus of an expert panel on data monitoring committees:
. . . DMCs should have full access to unblinded accumulating data on safety and efficacy
throughout the clinical trial. Some believe a DMC should receive only safety data or that a
DMC that receives efficacy data only by blinded codes (e.g., Group A versus Group B) will
be more objective in assessing interim data. The consensus of the expert panel was that such
blinding was counterproductive, even potentially dangerous to the safety of the study
participants. By having access to unblinded data on all relevant treatment outcomes, the
DMC can develop timely insights about safety in the context of a benefit-to-risk assessment,
as well as about irregularities in trial conduct or in the generation of the DMC reports.
(Fleming et al. 2017)
Similar positions have been expressed elsewhere (Fleming et al. 2018; DeMets and Ellenberg 2016).
5. In 1998, the New England Journal of Medicine published Meinert’s opinion regard-
ing masked DSMB reporting:
Masked monitoring is thought to increase the objectivity of monitors by making them less
prone to bias. What is overlooked is what masking does to degrade the competency of the
monitors. The assumption underlying masked monitoring is that recommendations for a
change in the study protocol can be made independently of the direction of a treatment
difference, but this assumption is false. Usually, more evidence is required to stop a trial
because of a benefit than because of harm. Trials are performed to assess safety and efficacy,
not to “prove” harm. Therefore, it is unreasonable to make the monitors behave as if they
were indifferent to the direction of a treatment difference. (Meinert 1998)
Implications and Suggestions
First and foremost, the distinction between the three options for masking listed above can be described simply as differences as to when unmasking occurs. For Option 1 Masked Mandated, the unmasking would occur only at the end of the trial. For Option 2 Unmasked Mandated, the unmasking would occur at the outset. For Option 3 Something in between, there may be masking at the outset but unmasking later in accordance with decisions made jointly by the funding entity, the Board itself, and others as appropriate.
The guidelines and opinions expressed above appear to be more in the context of
clinical research for which adverse events and serious adverse events are important.
In this context and in view of the information above, a conclusion is that Option
1 Masked Mandated above is neither practical nor tenable for many of these trials.
For those investigative issues for which there is limited concern about risk and some
concern about judgment bias, the strategy may be more acceptable. Nevertheless,
Option 3 is likely to be a better alternative than mandating masking at the outset and
maintaining it until the trial’s end.
For Option 2, some language in relevant guidelines and used in discussions seems
to assume that the only options are Option 1 and Option 2, that is, the assumption
appears to be that masking would be mandated in such a manner that unmasking
would occur only at the end of the trial. Nevertheless, there is apparent considerable
force behind the recommendation that Option 2 be the operative strategy.
There are, however, some concerns with Option 2 Unmasked Mandated that deserve
consideration. The most compelling is the opinion of members of the Board, the funding
entity, and any other entity with an explicit role in the Board’s deliberations. This
collection necessarily has a clear understanding of the needs of the trial and should be well positioned to judge issues critical to its success. There certainly have been instances
where the masking strategy was discussed at the outset and the decision was for the
Board to be masked. In this circumstance there is a clear understanding that the Board
can chose to be unmasked at any point for which it seems appropriate to do so. When
considering unmasking, the discussions tend to focus, as they should, on assessing the
joint issue of adverse events and primary and other outcomes.
An important issue is how the masking strategy utilized is reflected in the reports of
analyses summarizing adverse events and outcomes. If Option 1 is utilized, the reports
would necessarily have the treatment groups coded in some way, say Treatment A and
Treatment B. For Option 2, this would be unnecessary, and the treatment groups could
be clearly indicated. For Option 3, it may be necessary to code as per Option 1 and
then unmask this coding scheme when the decision to unmask the Board is made.
However, it may well be prudent to use coded labels (Buhr et al. 2018) in the reports in
any case as this may help prevent the identification of the treatment groups should the
reports be accessed inappropriately. If the Board is operating unmasked, then it would
simply need access to the interpretation of the codes.
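As an illustration of coded treatment labels in closed reports, a minimal, hypothetical sketch (the function and arm names are illustrative; nothing here is prescribed by the guidelines cited above):

```python
import secrets

def code_treatment_labels(arms):
    """Map the actual treatment arms to neutral labels ('Group A', 'Group B', ...)
    for use in closed DSMB reports; the key is held by the unmasked
    statistician(s) preparing the reports."""
    labels = ["A", "B", "C", "D"]
    shuffled = list(arms)
    secrets.SystemRandom().shuffle(shuffled)  # label order must not reveal the arm
    return {f"Group {labels[i]}": arm for i, arm in enumerate(shuffled)}

key = code_treatment_labels(["active", "placebo"])
# Closed reports display only 'Group A'/'Group B'; an unmasked Board is
# simply given `key`, while a masked Board receives it upon unmasking.
```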
Key Suggestions
A simple strategy consistent with the apparent purpose of the guidelines and
opinions above is listed below. This strategy accommodates the wide variety of
trials and questions they address and the opinions of the Board members, funding
entity, study leaders, and others as appropriate:
Step 1. At the outset of the functioning of the Board, have a clear discussion of the
masking strategy that seems most appropriate for the trial in question. This discus-
sion would necessarily involve the funding entity and others as appropriate.
Step 2. If all agree that an unmasked approach is most appropriate and should be
used from the outset, then proceed accordingly. However, it still may be prudent to
code the labels for the treatment groups in reports, with the code readily available to
the Board, as a strategy to diminish the likelihood of inadvertent knowledge of trial
status by someone who otherwise would not have access to this information.
Step 3. If all agree that beginning with a masked approach is preferable, then a
prudent strategy would be to reconsider this decision at each subsequent meeting so
that the mask can be readily lifted if appropriate.
It should be noted that this strategy is not novel (Buhr et al. 2018). It has been used
and is being used for clinical trials both recent and underway. It puts the critical
decision as to the most appropriate masking strategy in the hands of those responsible
for the Board’s operation for a specific study. Thus, it takes into account the specific
characteristics of both the study in question and the concerns of the appointed Board.
Cross-References
References
Anand SS, Wittes J, Yusuf S (2011) What information should a sponsor of a randomized trial
receive during its conduct? Clin Trials 8(6):716–719
Bierer BE, Li R, Seltzer J, Sleeper LA, Frank E, Knirsch C, Aldinger CE, Lavine RJ, Massaro J,
Shah A, Barnes M, Snapinn S, Wittes J (2016) Responsibilities of data monitoring committees:
consensus recommendations. Ther Innov Regul Sci 50(5):648–659
Buhr KA, Downs M, Rhorer J, Bechhofer R, Wittes J (2018) Reports to independent data
monitoring committees: an appeal for clarity, completeness, and comprehensibility. Ther
Innov Regul Sci 52(4):459–468
Calis KA, Archdeacon P, Bain RP, Forrest A, Perlmutter J, DeMets DL (2017) Understanding the
functions and operations of data monitoring committees: survey and focus group findings. Clin
Trials 14(1):59–66
Chen-Mok M, VanRaden MJ, Higgs ES, Dominik R (2006) Experiences and challenges in data
monitoring for clinical trials within an international tropical disease research network. Clin
Trials 3(5):469–477
DeMets DL, Ellenberg SS (2016) Data monitoring committees—expect the unexpected. N Engl J
Med 375(14):1365–1371
DeMets D, Califf R, Dixon D, Ellenberg S, Fleming T, Held P, Packer M (2004) Issues in regulatory
guidelines for data monitoring committees. Clin Trials 1(2):162–169
Fleming TR, DeMets DL, Roe MT, Wittes J, Calis KA, Vora AN, Gordon DJ (2017) Data
monitoring committees: promoting best practices to address emerging challenges. Clin Trials
14(2):115–123
Fleming TR, Ellenberg SS, DeMets DL (2018) Data monitoring committees: current issues. Clin
Trials 15(4):321–328
Gordon VM, Sugarman J, Kass N (1998) Toward a more comprehensive approach to protecting
human subjects. IRB: A Review of Human Subjects Research 20(1):1–5
Holubkov R, Casper TC, Dean JM, Anand KJS, Zimmerman J, Meert KL, Nicholson C (2013) The
role of the data and safety monitoring board in a clinical trial: the CRISIS study. Pediatr Crit
Care Med J Soc Crit Care Med World Fed Pediatr Intensive Crit Care Soc 14(4):374
Lewis RJ, Calis KA, DeMets DL (2016) Enhancing the scientific integrity and safety of clinical
trials: recommendations for data monitoring committees. JAMA 316(22):2359–2360
Online Documents
Department of Health and Human Services, Office of Inspector General. (2013) Data and safety
monitoring boards in NIH clinical trials: meeting guidance, but facing some issues. https://oig.
hhs.gov/oei/reports/oei-12-11-00070.pdf
U.S. Department of Health and Human Services Food and Drug Administration, Center for Biologics Evaluation and Research (CBER), Center for Drug Evaluation and Research (CDER), Center for Devices and Radiological Health (CDRH) (2016) Guidance for clinical trial sponsors: establishment and operation of clinical trial data monitoring committees. https://
www.fda.gov/downloads/regulatoryinformation/guidances/ucm127073.pdf
European Medicines Agency, Committee for Medicinal Products for Human Use (2005) Guideline on
data monitoring committees. https://www.ema.europa.eu/documents/scientific-guideline/guideline-data-monitoring-committees_en.pdf
National Heart, Lung, and Blood Institute, National Institutes of Health (2014) NHLBI policy for data
and safety monitoring of extramural clinical studies. https://www.nhlbi.nih.gov/grants-and-training/policies-and-guidelines/nhlbi-policy-data-and-safety-monitoring-extramural-clinical-studies
Clinical Trials Transformation Initiative (CTTI) (2016) CTTI recommendations: data monitoring
committees. https://www.ctti-clinicaltrials.org/files/recommendations/dmc-recommendations.pdf
Variance Control Procedures
46
Heidi L. Weiss, Jianrong Wu, Katrina Epnere, and O. Dale Williams
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834
What Is Variance? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834
What Are the Main Sources of Variance in a Clinical Trial? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 834
Why Does Variance in a Clinical Trial Matter? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 835
When Is Variance Uncomfortably Large? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 835
How to Control Variance Through Clinical Trial Design and Data Collection and
Analysis? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 836
Control Variance Through Clinical Trial Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 836
Control Variance Through Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 837
Control Variance Through Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 838
Variance As a Data Quality Assessment Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 838
Conclusion/Key Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 840
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 840
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 840
Abstract
This chapter covers the concepts of variance and sources of variation for clinical
trial data. Common metrics to quantify the extent of variability in relation to the
mean are introduced as are clinical trial design techniques and statistical analysis
methods to control and reduce this variation. The uses of variance as a data quality
assessment tool in large-scale, long-term multicenter clinical trials are highlighted.
H. L. Weiss · J. Wu
Biostatistics and Bioinformatics Shared Resource Facility, Markey Cancer Center, University of
Kentucky, Lexington, KY, USA
e-mail: [email protected]; [email protected]
K. Epnere (*)
WCG Statistics Collaborative, Washington, DC, USA
O. D. Williams
Department of Biostatistics, University of North Carolina, Chapel Hill, NC, USA
Department of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
Keywords
Clinical trial · Variance · Systematic errors · Measurement errors · Random error ·
Coefficient of variation · Technical error · Matched design · Crossover design ·
Repeated measures design · Power · Sample size · Analysis of covariance ·
Multiple regression analysis · Data quality assessment
Introduction
Variance is one of the first topics presented during any basic statistics course,
typically occurring shortly after discussions of the mean and other measures of
central tendency. Subsequent presentations tend to portray variance in the context
of making judgments about the mean and as the denominator in equations that
include the mean or other point estimates in the numerator. This tends to indicate
that variance is a necessary evil in the computations required for other measures, but
otherwise of limited value. This is certainly not true for experimental research and
especially not true for large-scale, long-term multicenter clinical trials.
What Is Variance?
Variance is something that can be measured, so what is this something? One simple
and somewhat intuitive way to express this is to consider variance as being a
consequence of the distances between all the numbers in a dataset. Thus, the closer
together these numbers are, the lower the variance and the further apart they are the
higher the variance. Mathematically, the sample variance is the sum of squared
deviations of each value from the mean value, divided by the sample size minus one.
$$\text{Sample variance: } s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \qquad \text{and standard deviation: } s = \sqrt{s^2},$$

where $x_i$ is the value of the $i$th element, $\bar{x}$ is the sample mean, and $n$ is the sample size.
Thus, the units for the variance are the square of the units for the numbers used for
the calculation. To get back to the units of the initial numbers, the square root of the
variance, the standard deviation, is used. The magnitude of the variance, per se, is
not very informative. However, it can be highly informative relative to the mean or
other relevant point estimates, or when comparing the variability of one set of
numbers to that of another. Such a comparison can serve as a critically important
data quality assessment and monitoring tool.
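As a quick illustration of these definitions, here is a minimal Python sketch; the data values are hypothetical:

# Sample variance and standard deviation as defined above.
values = [98, 102, 101, 97, 104, 99]   # hypothetical measurements

n = len(values)
mean = sum(values) / n
s2 = sum((x - mean) ** 2 for x in values) / (n - 1)  # sum of squared deviations / (n - 1)
s = s2 ** 0.5                                        # back on the original scale

print(f"mean = {mean:.2f}, variance = {s2:.2f}, SD = {s:.2f}")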
What Are the Main Sources of Variance in a Clinical Trial?

Measurements made during a clinical trial provide data that represent the underlying
true value at the point in time the measurement was made, as well as any deviation
from this true value due to the measurement process. Measurement error is typically
unavoidable, so the critical issue is understanding its magnitude and taking steps to
reduce it to more comfortable levels should doing so be warranted.
More broadly, in a parallel two group clinical trial, the variances can be divided
into two sources: the variance between the treatment groups and the variances within
the treatment groups. In general, the variance between the groups should be a
consequence of the treatment effects. The variances within the groups, however,
reflect the inherent differences among the individuals within the groups plus the
errors that occur in the process of making the data measurements. More specifically,
measurement errors occurring during data collection can often be due to data
collection instrument or process variability, data transfer or transcription errors,
simple calculation errors or carelessness. It is good practice for a clinical trial to
maximize the treatment effect and to reduce measurement error by using appropriate
methods for the study design, data collection, and data analyses.
Why Does Variance in a Clinical Trial Matter?

Data variation is unavoidable in clinical research, and such variation due to system-
atic or random errors can cause unwanted effects and biases. Systematic error (bias)
is associated with study design and execution. When bias occurs, the results or
conclusions of a trial may be systematically distorted especially should the biases
affect one treatment group more or less than the other. These can be quantified and
avoided. On the other hand, variances due to random error occur by chance and add
noise to the system, so to speak, and thus reduce the likelihood of finding a
significant difference between treatment groups (FDA 2019). The publication by Barraza
et al. (2019) discusses these two concepts in more detail. Furthermore, variances
have great impact on the sample size estimation and precision of outcome measure-
ments of a trial. Underestimation of the variance could result in lower statistical
power to detect treatment differences than would otherwise be the case. It can also
reduce the ability to comfortably compare the results of one trial to those of other
studies.
When Is Variance Uncomfortably Large?

There are two aspects to assessing the magnitude of variance. One is the variance of
a set of numbers in relation to the average for that set. This is often assessed by
the coefficient of variation (CV),
$$\text{Coefficient of variation: } CV = \frac{s}{\bar{x}} \times 100\%,$$

where $s$ is the standard deviation and $\bar{x}$ is the sample mean.
The CV is simply the standard deviation expressed as a percentage of the average.
Generally, if the standard deviation is more than 30% of the average, the data may be
too highly variable to be fully useful.
The other aspect is technical error (TE), which can be based on the differences
between two independent measures of a variable. For example, a clinical chemistry
laboratory may be sent two vials of material for analyses, which represent one
sample that has been split. The identity of the two samples would not be known
by the laboratory. This process could be repeated for several samples so that a dataset
based on the assays of these paired observations can be created. These data can be
used to calculate the Technical Error for this measurement process, where
$$\text{Technical error of measurement: } TE = \sqrt{\frac{\sum_{i=1}^{n} d_i^2}{2n}},$$
where di is the difference between measurements made on a given object on two
occasions (or by two workers) and n is the sample size.
Detailed instructions for calculating TE are described by Perini et al. (2005). For
this situation, data quality is classified by the relative TE (RTE, the TE expressed
as a percentage of the mean) as very good if RTE < 10%, good if 10% ≤ RTE < 20%,
acceptable if 20% ≤ RTE < 30%, and not acceptable if RTE ≥ 30%. The target for key
outcome measures in a clinical trial should be <10%; however, there are no
universally accepted cut-off levels.
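Both metrics can be computed directly from their definitions. The following Python sketch uses hypothetical duplicate assays of split serum samples; the data and helper names are illustrative only:

import math

def cv_percent(values):
    # Coefficient of variation: SD as a percentage of the mean.
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return 100 * s / mean

def technical_error(pairs):
    # TE from paired measurements of split samples: sqrt(sum(d_i^2) / (2n)).
    n = len(pairs)
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / (2 * n))

pairs = [(5.1, 5.3), (4.8, 4.7), (5.5, 5.4), (5.0, 5.2)]  # hypothetical split samples
te = technical_error(pairs)
grand_mean = sum(a + b for a, b in pairs) / (2 * len(pairs))
rte = 100 * te / grand_mean  # relative TE, judged against the 10/20/30% cut-offs
flat = [x for pair in pairs for x in pair]
print(f"CV = {cv_percent(flat):.1f}%, TE = {te:.3f}, relative TE = {rte:.1f}%")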
How to Control Variance Through Clinical Trial Design and Data Collection and Analysis?

Some of the tools that can be used to control or minimize the impact of larger than
perhaps desirable variances are described below. These will be helpful in some, but
not all, situations.
Control Variance Through Data Collection

For data collection, the use of electronic case report forms (eCRFs), specification of
the types of variables within eCRFs, and use of clinical trial management database
systems are important.
Control Variance Through Data Analysis

Several statistical techniques and methods can be used in the analysis stage of a
clinical trial to control the variance.
2. Analysis of change from baseline: Suppose the outcome $X$ and its baseline
measurement $B$ each have variance $\sigma^2$ and correlation

$$\text{corr}(X, B) = r.$$

The analysis ignoring baseline is based on the outcome $X$ with variance $\sigma^2$,
whereas the analysis based on change from baseline has variance

$$\text{var}(X - B) = 2\sigma^2 (1 - r).$$

Thus, the analysis of the difference $X - B$ (change from baseline) will have a
smaller variance if $r > 0.5$; a small simulation sketch after this list illustrates
this rule. This is because of the typically marked positive correlation between the
baseline and outcome levels. If the correlation is less than 0.5, then using change
from baseline introduces extra noise into the analysis and is not recommended
(Matthews 2006; EMA 2015).
3. Multiple regression analysis: Using regression analysis, one can separate the covariate
variance from the error variance, thus reducing the error variance for the treatment
assessment. Typical covariates to consider include different sites/centers, demo-
graphic and baseline clinical characteristics associated with the trial outcome that
were not controlled for in the trial design (EMA 2015).
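The change-from-baseline rule in item 2 can be checked with a small simulation. This is a sketch under assumed parameters (bivariate normal baseline and outcome, sigma = 1); it is illustrative, not part of the cited guidance:

import numpy as np

rng = np.random.default_rng(1)
sigma, n = 1.0, 100_000

for r in (0.2, 0.5, 0.8):
    cov = [[sigma**2, r * sigma**2], [r * sigma**2, sigma**2]]
    b, x = rng.multivariate_normal([0.0, 0.0], cov, size=n).T  # baseline B, outcome X
    print(f"r = {r}: var(X) = {x.var():.3f}, var(X - B) = {(x - b).var():.3f} "
          f"(theory: {2 * sigma**2 * (1 - r):.3f})")

For r = 0.8 the change-from-baseline variance is well below var(X); for r = 0.2 it is larger, matching the r > 0.5 rule.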
Conclusion/Key Recommendations
• Variance is used to measure the deviation from (or the spread around) the mean in
each dataset and it allows us to compare different data sets.
• The coefficient of variation and the technical error can be used to assess the amount of
variance in each dataset.
• There are several clinical trial design tools as well as data collection and
analysis methods that can be used to control variance. Underestimation of the
variance could result in lower statistical power to detect treatment differences
and it can also reduce the ability to comfortably compare the results between
studies.
• Variance can serve as a very useful data quality assessment tool in clinical trials.
Cross-References
References
Barraza F, Arancibia M, Madrid E, Papuzinski C (2019) General concepts in biostatistics and
clinical epidemiology: random error and systematic error. Medwave 19(7):e7687
Biau DJ, Kernéis S, Porcher R (2008) Statistics in brief: the importance of sample size in the
planning and interpretation of medical research. Clin Orthop Relat Res 466(9):2282–2288
Chow SC, Liu JP (2014) Design and analysis of clinical trials: concepts and methodologies, 3rd edn. Wiley, Hoboken. https://www.wiley.com/en-ug/Design+and+Analysis+of+Clinical+Trials%3A+Concepts+and+Methodologies%2C+3rd+Edition-p-9780470887653
European Medicines Agency (EMA) (2015) Guideline on adjustment for baseline covariates in
clinical trials. https://www.ema.europa.eu/en/documents/scientific-guideline/guideline-adjustment-baseline-covariates-clinical-trials_en.pdf. Accessed 23 Mar 2021
Matthews JNS (2006) Introduction to randomized controlled clinical trials, 2nd edn. Chapman &
Hall/CRC Texts in Statistical Science, chapter 6, p 78. https://www.routledge.com/Introduction-to-Randomized-Controlled-Clinical-Trials/Matthews/p/book/9781584886242
Perini TA, de Oliveira GL, Ornellas JdS, de Oliveira FP (2005) Technical error of measurement in
anthropometry. Rev Bras Med Esporte 11(1):81–85. https://doi.org/10.1590/S1517-86922005000100009
Simon LJ, Chinchilli VM (2007) A matched crossover design for clinical trials. Contemp Clin
Trials 28(5):638–646. https://doi.org/10.1016/j.cct.2007.02.003
Ascertainment and Classification of Outcomes
47
Wayne Rosamond and David Couper
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 844
Masking to Treatment Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 845
Competing Risks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 845
Types of Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 845
Major Clinically Recognized Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 845
Asymptomatic Subclinical Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 846
Patient-Reported Outcomes (PROs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 847
Time to Event . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 847
Models of Event Ascertainment and Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 848
Outcome Event Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 849
Obtaining Diagnostic Data Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 850
Development of Data Capture Instruments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 850
Training in Data Capture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 851
Classification of Clinical Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 851
Ethics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 852
Administrative Oversight of Outcome Ascertainment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 853
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 853
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 853
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 853
W. Rosamond (*)
Department of Epidemiology, Gillings School of Global Public Health, University of North
Carolina, Chapel Hill, NC, USA
e-mail: [email protected]
D. Couper
Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina,
Chapel Hill, NC, USA
e-mail: [email protected]
Abstract
Successful completion and valid conclusions from a clinical trial rely on having
complete and accurate outcome data. Misclassification of outcome events can
introduce systematic error and bias as well as reduce the statistical power of the
trial. Outcomes of interest in clinical trials vary and can include major clinically
recognized events; asymptomatic subclinical measurements; and/or patient-cen-
tered reported outcomes. Final classification of study outcomes often involves use
of standardized computer algorithms, processing of materials and review with
outcome classification committees, and linkage with electronic data sources.
Processes for accomplishing the goals of outcome ascertainment and classifica-
tion can be designed as centralized systems, de-centralized networks of investi-
gators, or a hybrid of these two methods. Challenges to obtaining valid outcomes
include ensuring complete follow-up of study participants, use of standardized
event definitions, capture of relevant diagnostic information, establishing pro-
tocols for review of potential events, training of clinical review teams, linkage to
data sources across various platforms, quality control, and administrative oversight
of the process. Designers of clinical trials need to consider carefully their
approach for event identification, capture of diagnostic data, utilization of stan-
dardized diagnostic algorithms and/or clinical review committees, and mecha-
nisms for maintaining data quality.
Keywords
Event ascertainment · Outcome classification · Adjudication · Bias
Introduction
In the process of conducting clinical trials and observational studies, there is often
a heavy focus on treatment and exposure assessment. Although important, this
attention can occur at the cost of less consideration of the complexities of complete
identification and valid classification of outcomes. Successful completion and valid
conclusions of a clinical trial rely on having complete and accurate outcome
data, particularly for the primary outcome. In clinical trials, if missingness or
misclassification of outcome events is unrelated to treatment group assignment,
this may merely reduce the statistical power of the trial and bias results toward the
null. However, if the missingness or misclassification varies across treatment or
exposure groups, this introduces systematic error in the results of the trial and may
bias findings in either direction. There are many challenges to obtaining valid
outcomes in clinical trials. These include ensuring complete follow-up of study
participants, use of standardized event definitions, capture of relevant diagnostic
information, establishing protocols for review of potential events, training of clinical
review teams, linkage to data sources across various platforms, quality control, and
administrative oversight of the process. This chapter focuses on methods to obtain
information needed for the full assessment of trial outcomes and the process of using
that information to determine the outcomes for all participants.
Masking to Treatment Assignment

The best designed clinical trials take particular care to reduce the potential for
differential misclassification to occur across treatment groups. Even if participants
and the investigators and staff involved in treatment provision cannot be masked,
it is desirable that those involved in any aspect of the outcome ascertainment
and classification be unaware of the participants’ treatment group assignments.
Otherwise, classifications may be either consciously or subconsciously influenced
by knowledge of the treatment group.
Competing Risks
The definition of outcome or the statistical methods for analyzing them need to
account for the potential for the outcome to be missing because of competing risks.
For instance, in a trial in the elderly of a method to reduce the risk of decline in
cognitive function, some participants may be missing information about change
in cognitive function because they die before having a follow-up assessment of
cognitive function. There are accepted statistical approaches to address competing
risks, such as the Fine and Gray model for competing risks in time-to-event analyses
(Fine and Gray 1999). Details of these methods are addressed elsewhere in this
monograph.
Types of Outcomes
Major Clinically Recognized Events

Events that generally come to the attention of medical care services are often primary
outcomes in clinical trials. Examples of major clinically recognized outcomes
include acute myocardial infarction, acute decompensated heart failure, stroke, all-
cause or cause-specific mortality, exacerbations of chronic obstructive pulmonary
disease and asthma, venous thromboembolism, gestational diabetes, diabetes
mellitus, major infections, trauma, and injury. Events such as acute myocardial
infarction generally have a well-defined time of occurrence. Onset dates of other
events such as diabetes or heart failure are less well identified. By definition,
clinically recognized outcomes involve contact with medical personnel, though
often not with staff involved in the clinical trial. For instance, there is generally no
expectation that a participant who has an acute myocardial infarction will be treated
in hospitals in connection with investigators in the clinical trial. The ease with which
such potential events can be identified depends on the type of health-care system in
the country in which the trial is conducted. In a country with a single-payer health-
care system, information about hospitalizations is collected centrally, and with the
appropriate permissions, it is relatively straightforward to obtain the medical records
needed for event classification and adjudication. In the USA, if a trial is done using
participants from a managed care consortium such as Kaiser Permanente, the
situation is similar to a country with a single-payer health system, except that
participants may move to a different health-care system during follow-up. When
participants are not all in a single managed care consortium, identification of
potential events and obtaining the medical records needed for classification are
much more complex.
Time to Event
The absence of information about an event does not automatically imply the participant did not have
the event. Other outcomes such as heart failure can be thought of as clinical
syndromes with a diffuse event onset time. Occurrence time may be defined as
first onset of symptoms, which in the case of heart failure could be progressive over
an extended period of time. Some trials may choose to define onset of these types of
events as the time the condition requires hospitalization or the time it is diagnosed in
the outpatient setting. In the case of heart failure, the actual condition may have been
present for some time prior to this defined start date.
Models of Event Ascertainment and Classification

The goal of the event ascertainment and classification component of clinical trials is
to completely identify all events and establish valid event classification for each.
There are several models to identify and classify events. The best choice to employ
depends partially on the type of events targeted by the trial and on the size and
duration of the clinical trial. Operational structure and resources of the trial also
influence the methods used. Models for accomplishing the goals of event ascertainment
and classification can be grouped into three types: centralized systems, decentralized
systems, and hybrids of these two models.
A centralized model is one that establishes special clinics where study partici-
pants return for asymptomatic subclinical outcome assessment through a clinic
examination, biomarker measurement, and/or questionnaire evaluation. Clinical
outcomes may also be determined at central special clinics but would most likely
come to attention of other health-care providers within and outside the sphere of the
clinical trial investigators. Even though events may be identified in hospitals
and clinics outside of special centralized centers, medical records and diagnostic
elements are often sent to centralized reading (e.g., electrocardiograms) and/or
abstraction centers that employ specially trained medical record abstractors. Using
centralized reading centers or review of diagnostic elements helps reduce variation
arising from local clinical practice and increases standardization.
Decentralized outcome assessments are also common. Studies that employ a
decentralized system rely on identifying and obtaining medical records from all
facilities utilized by participants. These facilities could be anywhere around the
world. A considerable effort is required to identify and to obtain complete sets of
diagnostic information from the various medical systems. These decentralized
systems may also incorporate home visits with participants rather than having
participants return to one or more specially created central clinical sites.
An example of this method in a major observational study is the REGARDS study
(Howard et al. 2005). In this study of approximately 30,000 participants, mobile
units were sent to the homes of participants to capture information on subclinical and
patient-reported conditions. Clinical events were identified through participant self-
report. Records for reported hospitalizations were then sought and abstracted
centrally.
Outcome Event Identification

Outcome ascertainment systems in clinical trials strive for complete capture of the
outcomes of interest. This often involves identification of a wide net of events from
which outcomes of interest are further evaluated and classified. The type of approach
depends on the type of outcomes (i.e., clinical, subclinical, patient-reported) that are
of most interest to the study. Studies often apply highly sensitive selection criteria of
potential events in order to ensure complete and comprehensive case ascertainment.
Clinical trials use a variety of methods that can include participant self-report of
potential events, searches of electronic medical records (EMRs) lists obtained
from selected health-care facilities and clinics, utilization of wearable devices by
participants (e.g., transdermal patch electrocardiographic (ECG) monitors permit
extended noninvasive ambulatory monitoring for atrial fibrillation and other cardiac
conditions (Heckbert et al. 2018)), and periodic participant examinations (e.g.,
sequential ECG evaluations to identify silent myocardial infarction).
An example of using participant self-report to obtain comprehensive ascertain-
ment of outcomes in a large observational cohort study is the Atherosclerosis
Risk in Communities (ARIC) study (The ARIC investigators 1989). Briefly,
study participants are contacted by phone twice annually to obtain self-reported
hospitalizations for any reason. Medical records of all reported hospitalizations are
sought. In addition to identifying potential events from patient self-report, electronic
files of hospital discharges are obtained from hospitals in the regions from which the
cohort was drawn. These files are searched using participants’ information to
identify hospitalizations for study participants. Approximately 10% of total study
outcomes are identified from searching electronic files of discharges from selected
hospitals that were not otherwise identified from participant self-report. The result of
this case identification approach is a comprehensive list of hospitalized outcomes
from which event classification and validation can proceed. In the Hispanic
Community Health Study/Study of Latinos (HCHS/SOL), a similar approach is
used but includes participant self-report of all visits to an emergency department
not leading to hospitalization (Sorlie et al. 2010).
It is important that clinical trials establish a clear and detailed description of the
outcome of interest. This is key to the rigor and reproducibility of findings in
the context of other studies and patient populations. An example of this level of
outcome description is the RIVUR (Randomized Intervention for Children with
Vesicoureteral Reflux) trial, a double-masked, placebo-controlled trial of antimicrobial
prophylaxis; the primary outcome used to evaluate treatment efficacy was recurrence of
febrile or symptomatic urinary tract infection (F/SUTI) (RIVUR Trial Investigators
et al. 2014). Suspected recurrent UTI events were reviewed and adjudicated to
determine if they met the RIVUR criteria for a primary outcome. The definition of
recurrent F/SUTI required the presence of fever or urinary tract symptoms, pyuria
based on urinalysis, and culture-proven infection with a single organism. A UTI was
defined as recurrent only if its onset occurred more than 2 weeks from the last day of
appropriate treatment for the preceding UTI or following a negative urine culture or
it was an infection with a new organism. The study had a UTI Classification
Committee (UCC). All reported medical care visits required data collection using
standardized study procedures, with data entered into the central data management
system (DMS). Those visits where a potential UTI was identified were reviewed and
classified by the UCC, using standardized criteria to adjudicate each event according
to the study definitions. When an algorithm in the data management system identi-
fied a potential outcome, relevant data were sent to two randomly selected members
of the UCC. Each of the two UCC members classified the event and entered their
responses into the DMS. If the classifications by the two reviewers disagreed, the
UCC met in person or by conference call to come to a final decision.
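The two-reviewer flow described above can be sketched schematically in Python. The roster, criteria, and function names are hypothetical stand-ins, not the RIVUR implementation:

import random

rng = random.Random(1)
committee = ["reviewer_A", "reviewer_B", "reviewer_C", "reviewer_D"]

def classify(reviewer, event):
    # Stand-in for a member applying the standardized study criteria.
    return "recurrent F/SUTI" if event["fever"] and event["positive_culture"] else "not an outcome"

def adjudicate(event):
    r1, r2 = rng.sample(committee, 2)        # two randomly selected UCC members
    c1, c2 = classify(r1, event), classify(r2, event)
    if c1 == c2:
        return c1                            # independent reviews agree
    return "full committee review required"  # otherwise escalate to the committee

print(adjudicate({"fever": True, "positive_culture": True}))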
Obtaining Diagnostic Data Elements

Once possible outcome events are identified, clinical trials must have standardized
approaches to obtain the relevant diagnostic information needed for event validation
and classification. Traditional methods of manual abstraction by trained medical
records abstractors have been widely used and are successful at reliable collection of
diagnostic elements from medical records. More recent approaches employ natural
language processing programs to capture information from EMR text fields on
symptom presentation, disease course, and other relevant diagnosis elements (e.g.,
presence of cardiac chest pain, worsening of difficulty breathing). Electronic medical
records can also be an efficient method to capture structured data elements (e.g.,
laboratory values, test results, medications) needed to validate and classify study
outcomes. Although the capture of diagnostic elements from EMR relying solely on
computer-based methods has great potential for efficiency, challenges remain in the
area of interoperability across EMR platforms and establishing and maintaining
acceptable sensitivity and specificity compared to traditional medical record review.
Once highly reliable and valid data are obtained (either by electronic or manual
approach), these data can then be used in computerized standard event classification
algorithms. Hybrid approaches to diagnostic data capture are also used. Structured
data elements captured from EMR combined with manual abstraction guided by
natural language processing are an example of a hybrid data collection approach.
Computer systems can be used to search text fields and locate and underscore
location of key diagnostic information that can be confirmed by manual overread
by trained study personnel.
Development of Data Capture Instruments

Studies typically develop online computer systems for reviewers to use in event
classification. After decisions have been made about the information needed to be
captured for event classification, case report forms need to be developed and
programmed to be used for data entry by abstractors. These systems usually need
to include not only fields for capturing specified data elements but also to allow
inclusion of narrative sections of the medical record and uploading of images, such
as MRIs and other components in electronic formats, such as ECGs. Once data entry
for a participant is complete, an algorithm and/or clinician uses the information to
decide the event type. When two clinicians or a clinician and the algorithm have
reviewed an event, the system compares the reviews. If there are discrepancies, they
are resolved either by mutual agreement or adjudication by an additional reviewer.
The system needs to be able to incorporate such resolutions or adjudications and
record the final decision as to the nature of the event (see section “Classification of
Clinical Events”).
Classification of Clinical Events

Once the relevant diagnostic data elements captured for potential study outcomes
are available, classification of events can proceed. Methods for determining final
classification of study outcomes vary and include use of standardized computer
algorithms, processing for review with outcome classification committees, and
linkage with electronic data sources (e.g., clinical registries, administrative claims,
mortality registries, and death indexes).
Computer diagnostic algorithms for determining final study events exist for many
major clinical outcomes. For example, a widely used algorithm for classifying acute
myocardial infarction utilized data on cardiac pain symptoms, biomarker evidence,
and electrocardiographic evidence (Luepker et al. 2003). The results of this and
similar algorithms are a spectrum of certainty of classification such as definite,
probable, suspect, or no acute myocardial infarction. Input from additional reading
of electrocardiograms can be incorporated to identify subclasses of myocardial
infarction, namely, ST segment elevation MI (STEMI) or non-ST segment elevation MI
(NSTEMI). While widely used in trials, a limitation of these types of algorithms is
that they are not specific enough to classify subcategories of events based on newer
universal definitions of myocardial infarction (i.e., acute myocardial infarction
subtype 1 through subtype 5 (Thygesen et al. 2018)). Clinical overread of diagnostic
information is required to produce valid subtyping of these events.
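As a toy illustration of how such a diagnostic algorithm is structured (the rules and codings here are purely hypothetical, not the Luepker et al. 2003 criteria):

def classify_mi(cardiac_pain, biomarker_abnormal, ecg_evidence):
    # ecg_evidence: "diagnostic", "equivocal", or "absent" (hypothetical coding)
    if biomarker_abnormal and ecg_evidence == "diagnostic":
        return "definite MI"
    if biomarker_abnormal and (cardiac_pain or ecg_evidence == "equivocal"):
        return "probable MI"
    if cardiac_pain and ecg_evidence == "equivocal":
        return "suspect MI"
    return "no MI"

print(classify_mi(cardiac_pain=True, biomarker_abnormal=True, ecg_evidence="diagnostic"))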
Other major outcomes such as stroke, heart failure, and respiratory disease are
less well suited for reliance on diagnostic algorithms and may require processing for
final diagnostic classification with outcome review committees. Outcome review
committees are commonly used by clinical trials to determine the final diagnostic
classification of such events.
Ethics
Clinical trials usually require all participants to provide informed consent at the time
of entry into the study. There are some types of trials for which a waiver of consent
may be granted, such as a trial of a new method of CPR for treating out-of-hospital
cardiac arrest (Aufderheide et al. 2011). If medical records are needed for outcome
identification and classification, participants also need to sign an agreement allowing
their medical records to be obtained from physicians and hospitals.
Conclusion
Key Facts
References
Anton RF, O'Malley SS, Ciraulo DA, Cisler RA, Couper D, Donovan DM, Gastfriend DR,
Hosking JD, Johnson BA, LoCastro JS, Longabaugh R, Mason BJ, Mattson ME, Miller WR,
Pettinati HM, Randall CL, Swift R, Weiss RD, Williams LD, Zweben A, for the COMBINE
Study Research Group (2006) Combined pharmacotherapies and behavioral interventions for alcohol
dependence: the COMBINE study: a randomized controlled trial. JAMA 295(17):2003–2017
Aufderheide TP, Frascone RJ, Wayne MA, Mahoney BD, Swor RA, Domeier RM, Olinger ML,
Holcomb RG, Tupper DE, Yannopoulos D, Lurie KG (2011) Standard cardiopulmonary
resuscitation versus active compression-decompression cardiopulmonary resuscitation with
augmentation of negative intrathoracic pressure for out-of-hospital cardiac arrest: a randomised
trial. Lancet 377(9762):301–311
Bias Control in Randomized Controlled Clinical Trials
48
Diane Uschner and William F. Rosenberger
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 856
Restricted Randomization in Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 857
Covariate Imbalances and Predictability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 861
Correct Guesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 861
Conditional Allocation Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 862
Type I Error Probability and Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 863
Multi-arm Trials (Generalizations) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 864
Chronological Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 866
Impact on Type I Error Probability and Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 867
Planning for Bias at the Design Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 869
Robust Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 869
Randomization Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 870
Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 871
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 872
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 872
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 873
Abstract
In clinical trials, randomization is used to allocate patients to treatment
groups, because this design technique tends to produce comparability across
treatment groups. However, even randomized clinical trials are still susceptible
to bias. Bias is a systematic distortion of the treatment effect estimate. This
chapter introduces two types of bias that may occur in clinical trials, selection
D. Uschner
Department of Statistics, George Mason University, Fairfax, VA, USA
e-mail: [email protected]
W. F. Rosenberger (*)
Biostatistics Center, The George Washington University, Rockville, MD, USA
e-mail: [email protected]
bias and chronological bias. Selection bias may arise from predictability of the
randomization sequence, and different models for predictability are presented.
Chronological bias occurs due to unobserved time trends that influence patients’
responses, and its effect on the rejection rate of parametric hypothesis tests for the
treatment effect will be revealed. It will be seen that different randomization
procedures differ in their susceptibility to bias. A method to reduce bias at the
design stage of the trial and robust testing strategies to adjust for bias at the
analysis stage are presented to help to mitigate the potential for bias in random-
ized controlled clinical trials.
Keywords
Selection bias · Chronological bias · Restricted randomization · Type I error ·
Power
Introduction
Clinical trials aim at comparing the efficacy and safety of therapeutic agents across
treatment groups. It is crucial that the groups are comparable with respect to the
demographic features and other prognostic variables. In practice, it is not possible
to create comparability deterministically across the groups, particularly, as some
underlying prognostic variables, such as pharmacological properties, may still be
unknown.
Randomization tends to balance groups with respect to known and unknown
covariates and is therefore commonly regarded as the key component of clinical
trials that provides comparability of treatment groups (Armitage 1982). In addition,
randomized treatment allocation allows the effective concealment of treatments from
patients and investigators. When the treatment assignment is deterministic, the
concealment of allocations is inherently difficult. Allocation concealment is often
referred to as double-blinding, while a loss of concealment is called unblinding.
Double-blinding is important, if possible, to achieve an unbiased assessment of the
outcomes of the trial.
Despite the favorable properties of randomization, a randomized clinical trial can
still suffer from a lack of comparability among the treatment groups. Biases may
arise from different sources. For example, long recruitment times, changes in study
personnel, or learning curves during surgical procedures may cause time trends that
affect the outcomes of patients in the trial. It is intuitively clear that time trends will
lead to a bias of the treatment effect, when patients that arrive early in the allocation
process are allocated to one group and those that arrive later are allocated to the other
group. This bias has been termed chronological bias (Matts and McHugh 1978).
Randomizing patients in blocks has been recommended as a means to create more
similar groups in the course of the trial (ICH 1998). However, blocking introduces
predictability of the upcoming treatment assignments. In particular, the pharmaco-
logical effects or the side effects of an intervention may be easily distinguishable
from those of a standard or placebo intervention. These effects will cause unblinding
of the past treatment allocations and may in turn make future allocations more
predictable. Predictability can introduce selection bias (Rosenberger and Lachin
2015), a bias caused by a systematic covariate imbalance of the treatment groups.
Several randomization procedures have been developed in the literature to miti-
gate the effects of chronological bias and selection bias. Each randomization proce-
dure represents a trade-off between balance and randomness: greater balance leads to
higher predictability, while greater randomness leads to more susceptibility to
time trends. The extreme of complete randomness is the toss of a fair coin, also
called complete randomization or unrestricted randomization. The other extreme
is small blocks of two, where after a random allocation to one group, the next patient
will be deterministically allocated to the other group. When the allocation
of a patient depends on the treatment assignments of the previous patients, a
randomization procedure is called restricted. Section “Restricted Randomization in
Clinical Trials” reviews restricted randomization procedures that are used to mitigate
bias in randomized trials.
Section “Covariate Imbalances and Predictability” shows how predictability and
covariate imbalances can be measured in clinical trials. Chronological bias is the
focus of section “Chronological Bias.” Section “Planning for Bias at the Design
Stage” presents an approach to minimize the susceptibility to bias at the design stage.
Section “Robust Hypothesis Tests” introduces hypothesis tests that are robust to
bias. The chapter closes with a Summary in section “Summary and Conclusions.”
Restricted Randomization in Clinical Trials

Consider a randomized clinical trial with an experimental agent E and a control agent
C. Patients enter the trial sequentially and are allocated randomly into one of the two
groups. When the groups are expected to be balanced at the end of the trial, the
random allocation can be achieved by the toss of a fair coin, and a patient will be
allocated to the experimental group when the coin shows heads and to the control
group when it shows tails. Let an even $n \in \mathbb{Z}_{>0}$ be the total sample size of the
trial. Then the allocation of patient $i$, for $i \in \{1, \dots, n\}$, is denoted by

$$t_i = \begin{cases} 1 & \text{if patient } i \text{ is allocated to group } E \\ 0 & \text{if patient } i \text{ is allocated to group } C. \end{cases}$$
The imbalance between the groups after $i$ allocations is

$$D_i = N_E(i) - N_C(i) = 2 \sum_{j=1}^{i} t_j - i,$$

where $N_E(i)$ and $N_C(i)$ denote the numbers of the first $i$ patients allocated to groups $E$ and $C$, respectively.
$D_i$ is a random variable that describes a random walk $(i, D_i)$ for $i = 1, \dots, n$. Figure 1 shows a realization of the random walk in heavy black and all
the possible realizations in light gray.
There is a one-to-one correspondence between the set of randomization
sequences and realizations of the random walk: each random walk corresponds to
a randomization sequence, and each randomization sequence describes a unique
random walk. Using a fair coin toss, each randomization sequence has the same
probability

$$P(T = t) = \frac{1}{|\Omega_n|} = \frac{1}{2^n},$$
where |Ωn| is the cardinality of the set Ωn. In other words, for each patient i, the
random walk has the same probability to go either up or down. As there are no
restrictions to the randomization process, this randomization procedure is usu-
ally called unrestricted randomization or complete randomization. In particular,
complete randomization only maintains the allocation ratio in expectation but
may result in high imbalances in the course of the trial, as well as in the end of the
trial. As Fig. 1 shows, complete randomization (CR) allows imbalances as high
as n.
Several randomization procedures have been proposed to achieve better balance
in the clinical trial. A randomization procedure is a (discrete) probability distribution
on the set of randomization sequences. Randomization procedures other than the
uniform distribution are called restricted randomization procedures.
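A minimal Python sketch simulating complete randomization and its imbalance walk may help fix ideas; the chapter's figures use the randomizeR R package, so this sketch is purely illustrative:

import random

random.seed(42)
n = 8
t = [random.randint(0, 1) for _ in range(n)]  # 1 = group E, 0 = group C

D, walk = 0, []
for ti in t:
    D += 1 if ti == 1 else -1  # running imbalance D_i = 2 * sum(t_j) - i
    walk.append(D)

print("sequence:", t)
print("imbalance walk:", walk)  # |D_n| can be as large as n under CR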
Fig. 1 Random walk of the randomization sequence t = (1, 1, 1, 1, 0, 0, 0, 0), figure generated
using the randomizeR package (Uschner et al. 2018b)
The most commonly used randomization procedures are the random allocation rule
(RAR) and permuted block randomization (PBR). The random allocation rule forces
randomization sequences to be balanced at the end of the trial by giving zero weight
to unbalanced sequences and equal probability to balanced sequences. As there are
$\binom{n}{n/2}$ balanced sequences, the probability of a sequence $t$ is

$$P(T = t) = \begin{cases} \binom{n}{n/2}^{-1} & \text{if } D_n(t) = 0 \\ 0 & \text{otherwise.} \end{cases}$$
Permuted block randomization forces balance not only at the end of a trial but at
M points in the trial. The interval between two consecutive balanced points in the
trial is called a block. Let every block contain m = n/M patients, where M and m are
positive integers, and let b = m/2 be the number of patients allocated to E and C in
each block. Using PBR, the probability of a sequence is
$$P(T = t) = \begin{cases} \binom{2b}{b}^{-n/(2b)} & \text{if } D_{jm}(t) = 0 \text{ for } j = 1, \dots, M \\ 0 & \text{otherwise.} \end{cases}$$
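A sketch of a permuted block randomization generator under the notation above (illustrative, not the randomizeR implementation):

import random

def pbr_sequence(n, b, seed=None):
    rng = random.Random(seed)
    assert n % (2 * b) == 0, "n must be a multiple of the block length 2b"
    seq = []
    for _ in range(n // (2 * b)):
        block = [1] * b + [0] * b  # b allocations to E and b to C per block
        rng.shuffle(block)         # each of the (2b choose b) patterns is equally likely
        seq.extend(block)
    return seq

print(pbr_sequence(n=12, b=2, seed=7))  # balanced after every 4 patients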
A different way to achieve balance was suggested by Berger et al. (2003). They
promote the maximal procedure (MP), a randomization procedure that achieves
final balance and does not exceed a maximum tolerated imbalance b = maxi |Di|.
All remaining sequences Ωn,MP are realized with equal probability.
$$P(T = t) = \begin{cases} \dfrac{1}{|\Omega_{n,MP}|} & \text{if } \max_i |D_i(t)| \le b \text{ and } D_n = 0 \\ 0 & \text{otherwise.} \end{cases}$$
Figure 2 illustrates the set of sequences. The cardinality of the set of sequences
of Ωn,MP depends on n and the imbalance boundary b. There is no closed form,
and the generation of the randomization sequences requires an ingenious algo-
rithm proposed by Salama et al. (2008) and implemented in Uschner et al.
(2018b).
Another approach that does not force balance at the end of the trial is Efron's
biased coin design (EBCD). Here, the probability of the next treatment assignment
is based on the current imbalance of the random walk. Let $\frac{1}{2} < p \le 1$. Then the
probability to assign the next patient to group $E$ is given by

$$P(T_{i+1} = 1 \mid T_1, \dots, T_i) = \begin{cases} p & \text{if } D_i < 0 \\ \frac{1}{2} & \text{if } D_i = 0 \\ 1 - p & \text{if } D_i > 0. \end{cases}$$
Fig. 2 Set of sequences of the maximal procedure for sample size n = 8 with imbalance tolerance
b = 2, figure generated using the randomizeR package (Uschner et al. 2018b)
The set of all sequences of EBCD is Ωn, but the probability distribution
is different. Sequences with high imbalances have a lower probability. In
other words, the probability mass is concentrated about the center of the random
walk.
Chen’s design and its special case the big stick design (BSD) were developed to
avoid the high imbalances still possible in EBCD. Here, an imbalance boundary b is
introduced for the random walk, and a deterministic allocation is made to the other
treatment group once the random walk attains the imbalance boundary on one side of
the random walk. Using Chen’s design, the probability to allocate the next patient in
group E is
$$P(T_{i+1} = 1 \mid T_1, \dots, T_i) = \begin{cases} 1 & \text{if } D_i = -b \\ p & \text{if } -b < D_i < 0 \\ \frac{1}{2} & \text{if } D_i = 0 \\ 1 - p & \text{if } 0 < D_i < b \\ 0 & \text{if } D_i = b. \end{cases}$$
The special (and more common) case of the big stick design results if $p = \frac{1}{2}$.
Note that despite the similar set of sequences of the big stick design and the
maximal procedure, their probability distributions are very different. Maximal
procedure gives equal probability to all sequences. The big stick design, however,
introduces deterministic allocations (i.e., allocations with probability one) every time
the imbalance boundary is hit. As a consequence, sequences that run along the
imbalance boundary have higher probability than those in the middle of the alloca-
tion tunnel.
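A sketch of the big stick design (the p = 1/2 case of Chen's design) under the notation above; parameters are illustrative:

import random

def big_stick_sequence(n, b, seed=None):
    rng = random.Random(seed)
    seq, D = [], 0
    for _ in range(n):
        if D == -b:
            t = 1                  # boundary hit: forced allocation to E
        elif D == b:
            t = 0                  # boundary hit: forced allocation to C
        else:
            t = rng.randint(0, 1)  # fair coin while inside the tunnel
        seq.append(t)
        D += 1 if t == 1 else -1
    return seq

print(big_stick_sequence(n=20, b=2, seed=3))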
Covariate Imbalances and Predictability

Correct Guesses
The first to propose a measure of selection bias were Blackwell and Hodges (1957).
Under the assumption that the investigator knows the target allocation ratio as well as
past treatment assignments, they investigate the influence of an investigator who
consciously seeks to make one treatment appear better than the other irrespective of
the presence of a treatment effect. Assuming that the investigator favors the experi-
mental treatment, he might include a patient with better expected response in the trial
when he expects the experimental treatment to be allocated next. Conversely, he would
include a patient with worse expected response, when he expects the next treatment
assignment to be to the control group. Blackwell and Hodges propose two models for
the guess of the investigator. The first model, coined the convergence strategy (CS),
assumes that the investigator guesses the treatment that has so far been allocated less.
Let gCS(i, t) denote the guess for allocation i using the convergence strategy, and let R ~
Ber(0.5) be a Bernoulli random variable. Using the convergence strategy, the inves-
tigator’s guess for the ith allocation is given by
$$g_{CS}(i, t) = \begin{cases} 1 & \text{if } N_E(i-1, t) < N_C(i-1, t) \\ 0 & \text{if } N_E(i-1, t) > N_C(i-1, t) \\ R & \text{if } N_E(i-1, t) = N_C(i-1, t). \end{cases}$$
A correct guess is the event that the investigator guesses the treatment that will in
fact be allocated next, i.e., $g(i, t) = t_i$ for $g \in \{g_{CS}, g_{DS}\}$ (the guesses under the
convergence and divergence strategies, respectively). The number of correct
guesses of a randomization sequence is then defined as

$$G(t) = \sum_{i=1}^{n} I(g(i, t) = t_i).$$
With this notation, the expected number of correct guesses $E(G)$ is given by

$$E(G) = \sum_{t \in \Omega_n} P(T = t)\, G(t).$$
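A Monte Carlo sketch estimating E(G) under the convergence strategy, comparing complete randomization with permuted blocks; sample size and block size are illustrative:

import random

rng = random.Random(0)

def cr_sequence(n):
    return [rng.randint(0, 1) for _ in range(n)]

def pbr_sequence(n, b=2):
    seq = []
    for _ in range(n // (2 * b)):
        block = [1] * b + [0] * b
        rng.shuffle(block)
        seq.extend(block)
    return seq

def correct_guesses(t):
    nE = nC = hits = 0
    for ti in t:
        if nE < nC:
            guess = 1              # guess the under-represented arm
        elif nE > nC:
            guess = 0
        else:
            guess = rng.randint(0, 1)
        hits += (guess == ti)
        nE += ti
        nC += 1 - ti
    return hits

n, reps = 20, 20_000
for name, gen in [("CR", cr_sequence), ("PBR, block length 4", pbr_sequence)]:
    avg = sum(correct_guesses(gen(n)) for _ in range(reps)) / reps
    print(f"{name}: E(G) is about {avg:.2f} (pure chance would give {n / 2})")

For complete randomization E(G) stays at n/2, whereas blocking pushes it above n/2, reflecting its predictability.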
Conditional Allocation Probability

Predictability can also be summarized through the deviations of the conditional
allocation probabilities $\phi_i(t)$ from the target allocation probability $p$:

$$\rho_{PRED}(t) = \sum_{i=1}^{n} (\phi_i(t) - p)^2.$$
Type I Error Probability and Power

Proschan (1994) was the first to propose and investigate the influence of the
convergence strategy on the type I error rate of a hypothesis test of the treatment
effect. Let the primary outcome $Y$ follow a normal distribution
$Y \sim N(\mu_E T + \mu_C (1 - T), \sigma^2)$. If the variance $\sigma^2$ is known, the null hypothesis
$H_0: \mu_E = \mu_C$ can be tested using a Z-test, and, under the assumption of independent
and identically distributed responses, the test statistic $D = \frac{\bar{Y}_E - \bar{Y}_C}{\sqrt{2}\,\sigma}$ follows a standard
normal distribution.
Assume that a higher outcome $Y$ can be regarded as better and that the
investigator favors the experimental group, although the null hypothesis is true,
i.e., $\mu = \mu_E = \mu_C$. Then the influence of the convergence strategy on the responses can
be modeled as follows:

$$E(Y_i) = \begin{cases} \mu + \eta & \text{if } N_E(i-1, t) < N_C(i-1, t) \\ \mu - \eta & \text{if } N_E(i-1, t) > N_C(i-1, t) \\ \mu & \text{if } N_E(i-1, t) = N_C(i-1, t), \end{cases}$$
where η > 0 denotes the selection effect, the extent of bias introduced by the
investigator. It is assumed that η > 0 to account for the fact that the treatment E is
preferred and higher outcomes are regarded as better.
Under this assumption, the responses $Y_1, \dots, Y_n$ are no longer identically distributed
but are still independent. Proschan gave an asymptotic formula for the type
I error probability when the random allocation rule is used and investigated the rejection
rate in simulations for various values of $n$ and $\eta$. It turns out that the rejection
probability exceeds the planned significance level even for small values of $\eta$.
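A Monte Carlo sketch of this biasing policy (not Proschan's asymptotic formula): random allocation rule sequences, responses shifted by plus or minus eta according to the convergence-strategy guess, and a two-sample t-test at the 5% level; all parameters are illustrative:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, eta, reps = 20, 1.0, 5_000
rejections = 0

for _ in range(reps):
    t = rng.permutation([1] * (n // 2) + [0] * (n // 2))  # RAR sequence
    y = np.empty(n)
    nE = nC = 0
    for i, ti in enumerate(t):
        if nE < nC:
            shift = eta    # investigator expects E next: enrolls a better responder
        elif nE > nC:
            shift = -eta   # expects C next: enrolls a worse responder
        else:
            shift = 0.0
        y[i] = rng.normal(shift, 1.0)  # mu_E = mu_C = 0 under H0
        nE += ti
        nC += 1 - ti
    _, p = stats.ttest_ind(y[t == 1], y[t == 0])
    rejections += (p < 0.05)

print(f"empirical type I error: {rejections / reps:.3f} (nominal 0.05)")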
Kennes et al. (2011) extended the approach of Proschan to permuted block
randomization. As expected, the type I error inflation increases with smaller block
sizes. Ivanova et al. (2005) adapted the approach for binary outcomes and introduced
a guessing threshold to reflect a possibly conservative investigator. The influence
of various guessing thresholds was also investigated by Tamm and Hilgers (2014),
who further generalized the approach to investigate the influence of predictability
on the t test. Rückbeil et al. (2017) investigated the impact of selection bias on time-
to-event outcomes.
Langer (2014) gave an exact formula for the rejection rate of the t test conditional
on the randomization sequence, when the convergence strategy is used. The
approach was published in Hilgers et al. (2017) and implemented in the randomizeR
R package in Uschner et al. (2018b). The rejection probability conditional on the
randomization sequence t is given by
$$r(t) = P\left( |S| > t_{n-2}\left(1 - \tfrac{\alpha}{2}\right) \,\middle|\, t \right) = F\left(t_{n-2}\left(\tfrac{\alpha}{2}\right), n-2, \delta, \lambda\right) + 1 - F\left(t_{n-2}\left(1 - \tfrac{\alpha}{2}\right), n-2, \delta, \lambda\right),$$

where $S$ is the test statistic of the t-test, $t_{n-2}(\gamma)$ is the $\gamma$-quantile of the t-distribution
with $n - 2$ degrees of freedom, and $F(\cdot, n - 2, \delta, \lambda)$ is the distribution function of the
doubly noncentral t-distribution with $n - 2$ degrees of freedom and non-centrality
parameters $\delta, \lambda$ that both depend on the randomization sequence $t$. Figure 3 shows
the distribution of the type I error probability for the maximal procedure and the big
stick design, both with imbalance tolerance b = 2, and for the random allocation
rule. All are based on the total sample size n = 20 and normally distributed outcomes
with group means μE = μC = 2 and equal variance σ 2 = 1.
Notably, all randomization procedures contain sequences with rejection proba-
bilities as high as 100%. These are the alternating sequences. The big stick design
has most sequences concentrated around the 5% significance level. The random
allocation rule is similar to the big stick design but introduces more variability.
The maximal procedure, despite having a similar set of sequences as the big stick
design, has a higher probability for sequences that exceed the significance level
substantially.
Fig. 3 Distribution of the type I error probability under the convergence strategy with η = 4 for
three randomization procedures, figure generated using the randomizeR package (Uschner et al.
2018b)

Multi-arm Trials (Generalizations)

The approach to assess susceptibility based on the rejection probability (see section
"Type I Error Probability and Power") was generalized to multi-arm trials by
Uschner et al. (2018a). They proposed models for selection bias in multi-arm trials
that generalize the convergent guessing strategy of Blackwell and Hodges (1957);
see section “Correct Guesses.” Let K ≥ 2 denote the number of treatment groups,
and assume that a randomization procedure with equal allocation ratio is used for the
allocation of patients to the K groups. In the two-arm case, it is assumed that the
investigator favors one treatment over the other. In the multi-arm case (K > 2), it is
assumed that the investigator favors a subset of the K treatment groups and
dislikes the rest. Similarly to the two-arm case, it is assumed that the investigator
would like to make his favored groups appear better than the disliked groups,
despite the null hypothesis H0: μ1 = . . . = μK being true. Under this assumption,
the investigator would thus try to include a patient with better expected
response when he guesses that one of his favored groups will be allocated next.
Let F ⊂ {1, …, K} denote the subset of favored treatment groups, and let the complement FC = {1, …, K} \ F denote the treatment groups that are not favored
by the investigator. A reasonable strategy for the investigator would be to guess that
one of his favored groups will be allocated next, when all of the groups in F have
fewer patients than the remaining groups. Under this assumption, the expected
response is given by
E(Yi) = μ + η · bi,

where

bi = 1   if max{Nj(i − 1) : j ∈ F} < min{Nk(i − 1) : k ∈ FC},
bi = −1  if min{Nj(i − 1) : j ∈ F} > max{Nk(i − 1) : k ∈ FC},
bi = 0   otherwise.
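This guessing rule translates directly into code. A sketch (the function and variable names are mine, not from the authors' implementation):

```python
def bias_indicator(counts, favored):
    """b_i for the multi-arm convergence strategy: counts[k] holds N_k(i-1),
    and favored is the set F of treatment groups the investigator prefers."""
    fav = [counts[k] for k in counts if k in favored]
    rest = [counts[k] for k in counts if k not in favored]
    if max(fav) < min(rest):  # every favored group trails: guess a favored group next
        return 1
    if min(fav) > max(rest):  # every favored group leads: guess a disliked group next
        return -1
    return 0

# Example: K = 3 groups with the investigator favoring group 1
print(bias_indicator({1: 2, 2: 3, 3: 4}, favored={1}))  # 1
```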
Chronological Bias
Chronological bias arises when the treatment effect is distorted due to a time trend
that affects the patients’ responses, i.e., when later observations tend to be
systematically higher or lower than previous observations. According to Matts and
McHugh (1978), who coined the term chronological bias, clinical trials with a long
recruitment phase are particularly prone to suffer from the hidden effects of time.
The idea of investigating the effect of an unobserved covariate, such as time, on the
estimation of the treatment effect is due to Efron (1971), who termed the resulting
systematic distortion of the treatment effect in a linear model accidental bias.
The susceptibility of a randomization procedure to chronological bias may be
measured by the degree of balance it yields (Atkinson 2014). In a trial with two
treatment arms, a randomization sequence t is said to attain final balance when the achieved allocation ratio at the end of the trial equals the target allocation ratio p,

NE(n, t) = n · p.
When the target allocation is p = 0.5, the maximum difference in group sizes yields a measure of imbalance throughout the trial,

MI(t) = max{|Dk(t)| : k = 1, …, n},

where the difference at time k is given by Dk(t) = NE(k, t) − NC(k, t). A generali-
zation of this approach to multiple treatment groups was presented by Ryeznik and
Sverdlov (2018).
The influence of an unobserved time trend τ(i) on the responses can be modeled as

E(Yi) = μE Ti + μC (1 − Ti) + τ(i),

where Ti indicates allocation of patient i to treatment E. Three different shapes of trend have been proposed: linear, logarithmic, and stepwise.
Under a linear time trend, the expected response of the patients increases evenly with every patient included in the trial, until reaching θ after n patients. A linear time trend may occur as a result of gradually relaxing inclusion or exclusion criteria throughout the trial. The shift of patient i, where i = 1, …, n, is given by the formula

τ(i) = (i/n) · θ.
Under a logarithmic time trend, the expected response of the patients increases logarithmically with every patient included in the study. A logarithmic time trend may occur as a result of a learning curve, e.g., in a surgical trial. Under a logarithmic trend, the shift of patient i, where i = 1, …, n, is given by the formula

τ(i) = log(i/n) · θ.
Under a step trend, the expected response of the patients increases by θ after a given point n0 in the allocation process. A step trend may occur if a new device is used after the point n0 or if the medical personnel changes at this point. Under a step trend, the shift of patient i, where i = 1, …, n, is given by the formula

τ(i) = θ · 1(i > n0),

where 1(·) denotes the indicator function.
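The three trend shapes translate directly into code; a minimal sketch mirroring the formulas above (the function names are assumptions):

```python
import numpy as np

def tau_linear(i, n, theta):
    """Linear trend: grows evenly, reaching theta at the last of the n patients."""
    return theta * i / n

def tau_log(i, n, theta):
    """Logarithmic trend: steep early change that flattens as the trial proceeds."""
    return theta * np.log(i / n)

def tau_step(i, n0, theta):
    """Step trend: jumps by theta once patient n0 has been passed."""
    return theta * (i > n0)

n = 20
print([round(tau_linear(i, n, 1.0), 2) for i in (1, 10, 20)])  # [0.05, 0.5, 1.0]
```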
Rosenberger and Lachin (2015) present the results of a simulation study in which
they investigate the average type I error rate and power of the t-test under a linear
time trend for various designs. They find that the mean type I error rate of the designs
does not suffer from chronological bias, but power can be deflated substantially.
Moreover, more balanced designs lead to better control of power. Tamm and Hilgers (2014) investigate the permuted block design with various block sizes with respect to its susceptibility to chronological bias. They find that strong time trends can lead to a deflation of the type I error rate if large block sizes are used.
It is, however, not necessary to rely on simulation. As in the case of selection
bias (see section “Type I Error Probability and Power”), the impact of chronological
bias on the rejection probability of the t-test, conditional on the randomization
sequence, can be calculated using the doubly noncentral t distribution. Figure 4
shows the exact distribution of the type I error probability under a linear time trend
with θ = 1 for the random allocation rule, the maximal procedure, and the big stick design, the latter two with maximum tolerated imbalance b = 2, for sample size n = 20.
For all three designs, most randomization sequences yield a rejection probability that is below the nominal significance level of 5%. The variance of the random allocation rule is higher than for the other designs. The big stick design seems to attain the significance level best.

Fig. 4 Distribution of the type I error probability under a linear time trend, figure generated using the randomizeR package (Uschner et al. 2018b)
Fig. 5 Power of the F-test. Panel A, adjusted for selection bias; panel B, unadjusted for selection bias. Both panels assume total sample size n = 48, K = 3 treatment groups, and selection effect η = ρ · f16,3 with ρ ∈ {0, 0.5, 1, 2}. (Originally published in Uschner et al. (2018a), under Creative Commons Attribution (CC BY 4.0) license)

A total number of 10,000 trials was simulated under the assumption μ1 = c · f, μ2 = c · f, μ3 = 0, where c is chosen such that the effect size of the comparison results in 80% power of the F-test (c = 3/√2), and the estimated power is given by the proportion of trials that led to a rejection of the null hypothesis.
Panel A shows that the power of the adjusted test is close to the nominal power
when the block size is large but reduces to about 67% when a small block size is
used. The magnitude of the selection effect does not have an impact on the power of
the adjusted test. Panel B shows the power of the unadjusted test. As expected, the
power increases with increasing selection effect, as a result of an overestimation of
the treatment effect. Small block sizes lead to the heaviest inflation, reflecting the
higher susceptibility to selection bias.
Randomization Tests
In a simulation study of the randomization test under a time trend, responses were generated according to

Yi ∼ N(Δ Ti + 4τ(i) − 2, 1),

where Δ ∈ {0, 1} and τ(i) is a linear time trend as in section “Chronological Bias.” The results show that the randomization test maintains the 5% significance level
for all randomization procedures. The power is decreased for all randomization
procedures except permuted blocks with small block sizes. The results indicate that a lower degree of balance leads to a greater loss of power.
Randomization tests require little effort to compute, can accommodate heterogeneity in the data that may arise from bias, and do not rely on the assumption of random sampling from a distribution, as parametric hypothesis tests do. They are therefore a natural choice for hypothesis tests when bias is anticipated in the data.
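As an illustration, a minimal randomization test for a two-arm trial might look as follows. The mean difference as test statistic and re-randomization by permuting the observed sequence are simplifying assumptions for the sketch; in practice, the reference set should be generated by the trial's actual randomization procedure:

```python
import numpy as np

rng = np.random.default_rng(7)

def randomization_test(y, seq, reps=10000):
    """Two-sided randomization test of the null hypothesis of no treatment effect.
    y: observed responses; seq: the 0/1 allocation sequence actually used."""
    y, seq = np.asarray(y, float), np.asarray(seq)
    observed = y[seq == 1].mean() - y[seq == 0].mean()
    count = 0
    for _ in range(reps):
        perm = rng.permutation(seq)  # re-randomize under the null
        diff = y[perm == 1].mean() - y[perm == 0].mean()
        count += abs(diff) >= abs(observed)
    return count / reps

y = [2.1, 1.9, 2.5, 2.8, 2.0, 3.1, 1.7, 2.6]
seq = [0, 0, 1, 1, 0, 1, 0, 1]
print(randomization_test(y, seq))
```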
If selection bias is anticipated, the candidate randomization procedures should be assessed with respect to their potential for selection bias, using a variety of selection bias parameters. Again, a
procedure that performs well in all investigated scenarios should be chosen. Lastly, if
little is known about the nature of the trial, a combination of chronological and
selection bias, as proposed by Hilgers et al. (2017), can be used as a basis for the
assessment that will determine the choice of the design. The combination approach
will ensure that a trial is protected if both biases occur during the trial.
By choosing a randomization procedure for a particular clinical trial that reflects
anticipated bias, the susceptibility to bias can be substantially reduced. While it is
recommended to include all available randomization procedures in the assessment, randomization procedures that promote balance, such as permuted block randomization or the big stick design, should particularly be taken into account when
chronological bias is anticipated. Procedures that support randomness, such as com-
plete randomization, Efron’s biased coin design, or the big stick design with larger
imbalance tolerance, are especially recommended when predictability is an issue.
At the analysis stage, when researchers suspect that bias may have affected the
results of their trial, they can use testing strategies to detect and adjust for potential
bias. The Berger-Exner test (Berger and Exner 1999) can be applied to detect the
presence of selection bias with a high accuracy, if all assumptions are met
(Mickenautsch et al. 2014). Altman and Royston (1988) recommend using cumulative sums of the outcomes to detect the presence of time trends. A more general
approach to control bias is to use methods that are robust to bias. When a parametric
test is used for the treatment effect, it is possible to control for a specific bias by
estimating its effect from the data and thus adjusting the treatment effect for the bias.
When the focus is not estimation, but testing of the null hypothesis of no treatment
effect, randomization tests are recommended to control the effect of bias, particularly
chronological bias, on the type I error probability. As randomization tests do not
rely on parametric assumptions, their results are robust to biases that arise from
heterogeneity in the patient stream, e.g., due to chronological bias.
Key Facts
Cross-References
▶ Cross-over Trials
▶ Evolution of Clinical Trials Science
▶ Factorial Trials
References
Altman DG, Royston JP (1988) The hidden effect of time. Stat Med 7(6):629–637. https://doi.org/
10.1002/sim.4780070602
Armitage P (1982) The role of randomization in clinical trials. Stat Med 1:345–353
Atkinson AC (2014) Selecting a biased-coin design. Stat Sci 29(1):144–163. https://doi.org/
10.1214/13-STS449
Berger VW (2005) Quantifying the magnitude of baseline covariate imbalances resulting from
selection bias in randomized clinical trials. Biom J 47(2):119–127. https://doi.org/10.1002/
bimj.200410106
Berger VW, Exner DV (1999) Detecting selection bias in randomized clinical trials. Control Clin
Trials 20(4):319–327. https://doi.org/10.1016/S0197-2456(99)00014-8
Berger VW, Ivanova A, Deloria Knoll M (2003) Minimizing predictability while retaining balance
through the use of less restrictive randomization procedures. Stat Med 22(19):3017–3028.
https://doi.org/10.1002/sim.1538
Blackwell D, Hodges JL (1957) Design for the control of selection bias. Ann Math Statist
28(2):449–460. https://doi.org/10.1214/aoms/1177706973
Efron B (1971) Forcing a sequential experiment to be balanced. Biometrika 58(3):403–417
Hilgers RD, Uschner D, Rosenberger WF, Heussen N (2017) ERDO – a framework to select an
appropriate randomization procedure for clinical trials. BMC Med Res Methodol 17(1):159.
https://doi.org/10.1186/s12874-017-0428-z
ICH (1998) International conference on harmonisation of technical requirements for registration of
pharmaceuticals for human use. ICH harmonised tripartite guideline: statistical principles for
clinical trials E9
Ivanova A, Barrier RC Jr, Berger VW (2005) Adjusting for observable selection bias in block
randomized trials. Stat Med 24(10):1537–1546. https://doi.org/10.1002/sim.2058
Kennes LN, Cramer E, Hilgers RD, Heussen N (2011) The impact of selection bias on test decisions
in randomized clinical trials. Stat Med 30(21):2573–2581. https://doi.org/10.1002/sim.4279
Kennes LN, Rosenberger WF, Hilgers RD (2015) Inference for blocked randomization under
a selection bias model. Biometrics 71(4):979–984. https://doi.org/10.1111/biom.12334
Langer S (2014) The modified distribution of the t-test statistic under the influence of selection
bias based on random allocation rule. Master’s thesis, RWTH Aachen University, Germany
Matts JP, McHugh RB (1978) Analysis of accrual randomized clinical trials with balanced groups
in strata. J Chronic Dis 31(12):725–740. https://doi.org/10.1016/0021-9681(78)90057-7
Mickenautsch S, Fu B, Gudehithlu S, Berger VW (2014) Accuracy of the Berger-Exner test
for detecting third-order selection bias in randomised controlled trials: a simulation-based
investigation. BMC Med Res Methodol 14(1):114. https://doi.org/10.1186/1471-2288-14-114
Proschan M (1994) Influence of selection bias on type I error rate under random permuted block designs. Stat Sin 4(1):219–231
Rosenberger W, Lachin J (2015) Randomization in clinical trials: theory and practice. Wiley series
in probability and statistics. Wiley, Hoboken
Rückbeil MV, Hilgers RD, Heussen N (2017) Assessing the impact of selection bias on test
decisions in trials with a time-to-event outcome. Stat Med 36(17):2656–2668
Use of Historical Data in Design

Contents
  Introduction
  Study Design
    Data Sources
    Defining Variables
  Bias
    Selection Bias
    Information Bias
  Analytic Methods
  Examples of the Use of Historical Comparators to Test Efficacy
  Summary and Conclusion
  Key Facts
  Cross-References
  References
Abstract
The goal of clinical research of new disease treatments is to evaluate the potential
benefits/risks of a new treatment, which generally requires comparisons to a
control group. The control group is selected to characterize what would have
happened to a patient if they had not received the new therapy. Although a
randomized controlled trial provides the most robust clinical evidence of treat-
ment effects, there may be situations where such a trial is not feasible or ethical,
and the use of external or historical controls can provide the needed clinical
evidence for granting conditional or accelerated regulatory approvals for novel
drugs. Data from previous clinical trials and real-world clinical studies can
provide evidence of the outcomes for patients with the disease of interest.
However, many methodologic issues must be considered, such as the appropriateness of data sources, the specific data that need to be collected, application of appropriate inclusion/exclusion criteria, accounting for bias, and statistical adjustment for confounding.
C. Kim · V. Chia (*) · M. Kelsh
Amgen Inc., Thousand Oaks, CA, USA
e-mail: [email protected]; [email protected]; [email protected]
Keywords
Real-world comparators · Real-world evidence · Clinical trials · Controls ·
Historical comparator · Statistical methods · Propensity score
Introduction
The goal of clinical research of new disease treatments is to evaluate the potential
benefits/risks of a new treatment, which generally requires comparisons to a control
group. The control group is selected to characterize what would have happened to
a patient if they had not received the new therapy. To minimize bias in this compar-
ison, clinical trial study designs typically involve randomization of eligible
patients to treatment and control groups and, where feasible, blinding investigators
to patient treatment status. This approach provides the most robust clinical evidence
of treatment effects. However, for a variety of reasons and under a number of
circumstances, this well-established study design may not be feasible or ethical,
and the use of existing “real-world” (RW) data (i.e., external or historical controls)
can provide the needed clinical evidence for granting conditional or accelerated
regulatory approvals for novel drugs.
Randomized controlled trials (RCTs) may not be feasible because the disease
under study is so rare, creating a significant challenge in recruiting a sufficient
number of patients or requiring an unreasonably long time period for patient
recruitment. Similarly, if a new treatment is targeting a relatively rare “molecular-
identified” subgroup of patients, this may require screening a large number of
patients to identify the target patient population. In addition, analytical challenges
such as patient crossover from the control group to the new treatment group can
threaten the study integrity and confound survival analyses. For these feasibility
reasons and other ethical considerations (see below), physicians and patients may
refuse to participate in RCTs under these circumstances.
The ethical considerations that can impose significant challenges include a lack of
equipoise resulting from the early findings of new treatment or the dismal outcomes
of the current standard of care (SOC) for many serious life-threatening illnesses.
Even if a disease is not life threatening, a study could involve potentially invasive or risky monitoring or follow-up procedures, administered to controls who have little to gain in terms of improvement or benefit in their disease status under the current SOC. All of these scenarios could render an RCT unethical.
Under such circumstances, single-arm clinical trials, accompanied by historical controls, can provide the needed comparative data for evaluation of outcomes for
patients who did not receive the new treatment. Such studies can be faster and more
efficient than RCTs and still generate needed clinical evidence. In some cases,
comparative data may be obtained from readily available study group data from
completed clinical trial treatment and/or control cohorts, patient cohorts from pre-
vious clinical case series, or meta-analyses of previous RCT or observational patient
data. However, for more precise comparisons, accounting for important clinical
characteristics likely requires individual patient-level data. Use of these data is the
focus of the discussion in this chapter.
The broad concept of historical controls has been described previously (Pocock
1976) and has been given various labels such as “nonrandomized control group,”
“external control,” “synthetic control,” “natural history comparison,” or “historical
comparator.” Although there are subtle distinctions across these different labels,
generally they are used to refer to a comparison group which is not randomized, can
be concurrent or historical, and can be derived from a single or multiple sites or data
sources. For purposes of discussion in this chapter, we will use the term “historical
control” to describe this type of comparison group. These data could be historical or
contemporaneous with the new treatment group; with either type of data, similar
study design and analytical principles would apply (with the exception of the
assessment of trends over time for the historical data). The goal of using historical
controls is to provide evidence of the expected patient outcomes in the hypothetical
randomized control group in the absence of a randomized controlled study. Additionally,
the data could improve the generalizability of the study findings.
Considering the potential ethical and feasibility challenges described above,
situations that favor the use of historical controls involve the following:
This list is not intended to be exhaustive of all potential scenarios where historical
controls may be advantageous or feasible, nor is it a requirement that all of these
attributes be present for the use of historical controls in lieu of a randomized control
group. Data quality and accessibility are key components to the use of historical
controls for evidence generation and regulatory review. Each disease area and
specific indication may necessitate one type of data or another.
Controls derived from historical clinical trial data, when “reused” for other
studies, are considered observational data (Desai et al. 2013) and, because they
involve rigorous data collection and validation procedures, can provide robust
historical control data. These data are often limited in how many patients may be
eligible to be evaluated as they represent highly selected patient populations. The
characteristics of the included patients should be carefully considered for represen-
tativeness and comparability to the population being considered. Other data sources
include electronic health records, disease registries, insurance administrative claims
records, and clinical data abstracted for use as historical controls (USFDA 2018).
These data can often provide large numbers of patients but may not have as meticulous data collection as clinical trials. But when the inclusion/exclusion criteria are matched to those of the population being considered, the outcomes can reasonably reflect what would occur if patients did not receive the investigational intervention. Regardless of the type of data source, historical control data need to include a sufficient number of patients; critical exposure, outcome, and covariate information; and systematic and sufficient follow-up to provide reliable comparative
information. The potential for bias can be reduced by assuring similar covariate,
endpoint, and exposure (e.g., previous treatment information, comorbidity data)
definitions, similar SOC across the historical patient cohort, and unbiased patient
selection processes. It is also important to fully understand data sources and data systems: how data are recorded, which patient populations are captured in the data source (e.g., an understanding of patient referral patterns), reasons for missing data,
geographic distribution of data, variation in standards of care, and other critical data
characteristics or system attributes (Dreyer 2018).
In addition to consideration and evaluation of the quality of historical control data
sources, appropriate study design and analytical methods that are aimed at reducing
bias and improving comparability/balance between patients receiving new treatment
and historical controls are critical aspects in providing an accurate comparison to the
new treatment patient group. Study design considerations and analytical strategies to address bias due to confounding generally and, in particular, confounding by indication include descriptive, stratified, and weighted analyses; multivariable modeling; propensity score (PS) methods for weighting and matching; and, when appropriate, Bayesian approaches for statistical analysis. Sensitivity analyses should also be proposed to assess the impact of study assumptions. A key to the scientific integrity
and successful regulatory evaluation of this process is the upfront specification of
these design and analysis strategies. Further details, examples from oncology, and
discussion on these topics are provided in this chapter.
In the first section of this chapter, appropriate study design is discussed. Selecting
the appropriate data source for the historical control group is critical, and evaluating
what data will satisfy the needs is the first key step. Next, defining the study variables
between the trial and historical controls is also critical. Without appropriately defined
exposures and endpoints, any comparisons will be limited. Finally, sources of bias that must be addressed are discussed. Randomization typically balances differences between patient populations, but use of historical controls
will almost certainly be biased without proper design and analysis. After study
design, a number of different analytic options are presented, and examples of
comparing single-arm trials to historical controls are discussed.
Study Design
Data Sources
There are four primary options for assembling control data, each with its own strengths and limitations: (1) prospective primary data collection; (2) clinical data abstraction from patient charts or clinical databases with a standardized case report form; (3) existing secondary databases, such as comprehensive electronic medical record (EMR) data, administrative claims data, EMR data linked to administrative claims, disease registry data, or national health screening data; and (4) former clinical trial data. The options that make the most sense will be determined by the disease of interest and what the data will contextualize. The specific details of each approach are detailed below.
Primary prospective data collection will provide the most in-depth and complete
data but will be the most costly and time-consuming effort. This option may be
needed if the required data are not regularly collected in clinical practice and cannot
be derived from routinely collected measures (e.g., in graft-versus-host disease,
measurement of response assessment to therapy by NIH 2014 Consensus Response
Criteria Working Group is widely collected and reported in clinical trials but is not
integrated into routine clinical care (Lee et al. 2015)). In situations where temporality
of the data is particularly important, for example, if standard of care has changed
substantially over time, prospective data collection may be the only feasible option.
For instance, in 2005, the use of bortezomib was fully approved for the treatment of
relapsed or refractory multiple myeloma after compelling data from the phase 3
trial demonstrated bortezomib superiority to dexamethasone in overall survival
(Richardson et al. 2005). The introduction of bortezomib changed the SOC of
multiple myeloma, and as a result, a reasonable comparison of outcomes for
myeloma patients would need to include data after 2005 when bortezomib became
a backbone of myeloma therapy. Additionally, some endpoints may have a much
more complicated ascertainment than what is routinely conducted under typical
clinic care and cannot be derived from existing data. In these circumstances, the
only way to collect these assessments is to prospectively design a study that collects
such data. This approach will require the most time as sites will need to be enrolled
and each patient will need to be screened and provide informed consent.
Retrospective extraction of clinical data using a standardized case report form can
provide depth of clinical data with less complexity and time than a prospectively
designed study. This option is a good choice if the data needed are commonly
collected but not necessarily in a structured field of an electronic medical record
(EMR) (e.g., to measure response to therapy for an acute lymphoblastic leukemia
patient (Gokbuget et al. 2016a)). The primary benefit of doing a retrospective data
collection directly from clinical sites is that most centers will have years of data
available. With a specific case report form, a focused effort can extract just the
necessary data. In some instances, centers may maintain a database or registry that
contains most, if not all, of the data elements needed, which will streamline the data
abstraction process. However, many centers do not keep a routine database of
clinical data for research purposes. The biggest barrier for these sites will be the process of medical chart abstraction, which requires extensive staff support. Data abstraction and entry are often slow and expensive due to the labor involved. Additionally, many sites and investigators may be less interested in
participating in retrospective studies which may not be as novel or impactful as
investigational therapies.
Use of databases can be the most time- and cost-efficient method. This option is
feasible when the primary endpoints are routinely collected in everyday medical
practice in structured EMR or insurance claims diagnosis (e.g., incidence of bleeding
events in patients with thrombocytopenia (Li et al. 2018)). The most common types
of data used would likely be an EMR linked to an administrative claims database.
These data are more likely to be highly generalizable as these data provide a large
sampling of centers and providers from a geographic region. Additionally, the
sample size provided by these datasets is likely to be far greater than a clinical
study. Despite these advantages, however, the appropriateness of the databases needs
to be considered. For instance, existing electronic databases may lack the specificity
and depth of data needed for comparative purposes to clinical trial data. Additionally,
because the data are not provided on a protocol or for research purposes, there may
be missing elements that were never or rarely collected. Some covariate and end-
point assessments may be less frequently or sporadically measured, as these data-
bases reflect real-world medical practice. Lastly, many types of endpoints cannot be assessed in these types of databases. Understanding the limitations of the data is key to determining whether this approach is feasible.
Previous clinical trials can provide robust control data. Clinical trials are
considered the gold standard of clinical evidence as they are highly controlled
and perform thorough data collection. Many variables may be collected for
completeness. Additionally, adherence to medications is often more closely
monitored and most potential variables are recorded. This allows investigators
to evaluate a wide range of variables during the design and analysis phase.
However, clinical trials tend to have highly selected populations, which limits
the generalizability and applicability to many populations. Often, these data will not mimic the population, intervention, or inclusion/exclusion criteria of the trial to be compared against, making it difficult to use these data as a source for controls.
Defining Variables
Defining exposure/treatment can vary depending on the type of data collection and
study being conducted. In prospective data collection, exposure definition, dates,
duration, dose, and any changes due to adverse events can be matched exactly to the
trial assessment schedule and definitions. In retrospective data abstraction efforts, it
may be straightforward to identify what specific regimen or protocol was anticipated
for the patient. However, there may be specific details missing such as the exact
dose(s) administered or adjustments for toxicity. In large databases, the treatment regimen typically must be derived using an algorithm, which can be prone to errors and assumptions, particularly for multi-agent treatment regimens. For prescriptions that are filled through a pharmacy and not administered in a clinic setting, one knows with certainty only that the prescription was picked up, not whether it was actually taken by the patient. This highlights that in using historical data,
assumptions must be made and algorithms developed that need to be validated where
possible and/or evaluated in sensitivity analyses (Table 1).
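As a toy illustration of such an algorithm (the drug names, dates, and 14-day window are entirely hypothetical), a multi-agent regimen might be inferred by grouping drugs whose administration begins within a fixed window of the first dose:

```python
from datetime import date

# Hypothetical administration records from a claims extract
admins = [
    ("drug_A", date(2020, 1, 3)),
    ("drug_B", date(2020, 1, 5)),
    ("drug_A", date(2020, 3, 20)),
]

def infer_regimen(records, window_days=14):
    """Group drugs first administered within window_days of the earliest
    administration into one multi-agent regimen (a common heuristic that
    should be validated and stress-tested in sensitivity analyses)."""
    start = min(d for _, d in records)
    regimen = {drug for drug, d in records if (d - start).days <= window_days}
    return sorted(regimen)

print(infer_regimen(admins))  # ['drug_A', 'drug_B']
```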
Collection of relevant prognostic covariates is important for assessment of patient
population comparability. In a prospective data collection, all baseline covariates can
be ascertained with a complete baseline assessment. In a retrospective collection,
typically, disease-relevant clinical covariates will be routinely collected. However, a
complete assessment as typically done on a trial will not be conducted unless a
patient has specific medical conditions necessitating it. In a database, some labs or
imaging data may not be routinely available. Claims typically do not contain specific
lab values; EMR typically do not contain comorbid conditions captured outside of
that specific clinic.
Carefully determined endpoints are critical to study success. In prospective data collection, endpoints can follow assessment schedules and definitions just as in the clinical trial. In retrospective data sources where death is captured, overall survival is generally feasible to assess.
Table 1 Suitability of data collection methods for ascertaining exposures, covariates, and endpoints

Prospective:
  Exposure: collect exact dates, duration, dosing, and changes.
  Covariates: can do a full baseline assessment for all relevant covariates.
  Endpoints: set the exact definitions of endpoints and the assessment schedule.

Retrospective chart abstraction or clinical database:
  Exposure: identify treatments, but may be missing exact doses and fine clinical details.
  Covariates: demographics and disease-relevant clinical characteristics; some comorbidities.
  Endpoints: can be routinely collected in medical practice, but the schedule of assessments is less frequent than in a trial.

Claims, registry, or EMR database:
  Exposure: can be inconsistent in details; may require algorithms to identify treatments.
  Covariates: claims can identify demographics and comorbidities; EMR can identify demographics and some clinical characteristics.
  Endpoints: may lack some endpoints that require lab or imaging results.
Bias
In 1976, Pocock described the use of historical controls in clinical trials (Pocock 1976); the acceptability of a historical control group requires that it meet the following conditions:
1. Exposure: such a group must have received a precisely defined standard treatment
which must be the same as the treatment for randomized controls.
2. Patient selection: the group must have been part of a recent clinical study which
contained the same requirements for patient eligibility.
3. Outcome: the methods of treatment evaluation must be the same.
4. Covariates: the distributions of important patient characteristics in the group
should be comparable with those in the new trial.
5. Site selection: the previous study must have been performed in the same organi-
zation with largely the same clinical investigators.
6. Confounding: there must be no other indications leading one to expect differing
results between the randomized and historical controls.
Only if all these conditions are met can one safely use the historical controls as
part of a randomized trial. Although meeting all of these conditions would result in
an ideal historical comparator, it is not always feasible. In previous sections, types of data sources and how to define exposures, outcomes, and covariates were discussed. Later, analytic methods to assess and control for confounding will be discussed; in this section, Pocock’s conditions, bias, and how to mitigate bias are addressed.
Selection Bias
Selection bias may be introduced if the patients selected for the historical compar-
ators are not comparable to the clinical trial-treated subjects (other than treatment
exposure) or do not contain the same patient eligibility requirements. Prospective
study designs would minimize this bias the most, as investigators are able to define
eligibility criteria similar to those of the clinical trial, with the goal of including patients who would have been able to participate in the clinical trial. For studies
using existing data, either from clinical sites or through large existing databases,
careful selection of patients restricting inclusion and exclusion criteria is required;
however, not all clinical trial eligibility criteria may be found in these types of data
sources. Additionally, for existing data sources, selection of patients when the
outcome is known can bias the results to produce a favorable evaluation for the
drug or device under study. This bias may be mitigated by including all patients who
meet the eligibility criteria. If a random sample of patients is selected, then the
outcome must be blinded. Random selection can help provide a mix of patients at
various lines of therapy. When there are multiple treatments received over time (e.g.,
different lines of therapy), bias may be introduced when selecting the treatment line
for which to assess outcomes. For instance, if subjects in the clinical trial had to have
previously failed at least two prior lines of therapy and the majority of subjects only
had two prior lines of therapy, bias would be introduced if the majority of patients in
the historical comparator had three prior lines of therapy.
In addition to bias resulting from patient selection, for studies using existing data
from clinical sites, investigators need to ensure the clinical sites, including type of site
(e.g., academic hospitals, large specialty centers), country of site, standard treatments
used, and types of patients undergoing treatment at those clinics are comparable to the
clinical trial sites and subjects. However, the comparator study does not necessarily have to be performed at the same sites or with the same investigators as the clinical trial.
Finally, the time period in which the historical comparator is drawn from will
need to be carefully assessed for comparators that are nonconcurrent. For instance, if
there have been significant changes in medicine or technology over time (e.g., earlier
diagnosis of disease, changes in treatment effectiveness, or better supportive care
measures), having nonconcurrent comparators can bias the results to favor the drug
or device. Thus, it is important to carefully assess changes in the treatment landscape
over time when selecting patients for historical comparators.
Information Bias
Consistent definitions and uniform ascertainment of study variables help reduce information or measurement bias. When data are collected from multiple
existing data sources (e.g., multiple clinical sites), standardization of data collection
forms will minimize measurement error. An issue in comparative effectiveness observational research is the inappropriate selection of follow-up time. This creates a bias
in favor of the treatment group where during a period of follow-up, an event cannot
occur due to a delay or wait period for the treatment to be administered. This is
known as immortal time bias. Immortal time bias can be appropriately handled by
assigning follow-up when events can occur and using a time-varying analysis, not a
fixed-time analysis. Finally, the length of follow-up time after the treatment exposure
can impact outcomes. For instance, when following patients for death, the longer the
follow-up time, the more likely events will accrue. The historical comparator must
have a comparable follow-up time as the clinical trial patients.
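A toy illustration of correct person-time assignment (hypothetical records and field names; a sketch, not a full time-varying survival analysis): follow-up accrued before treatment actually starts must be attributed to the untreated state, never to the treated group.

```python
# Hypothetical patient records: follow-up runs from day 0 (diagnosis) to
# day `end`; treatment, if any, starts at day `tx_start`.
patients = [
    {"end": 400, "tx_start": 90},    # treated on day 90
    {"end": 250, "tx_start": None},  # never treated
]

def person_time(records):
    """Split follow-up into untreated and treated person-time so that the
    pre-treatment ('immortal') interval is never credited to treatment."""
    untreated = treated = 0
    for r in records:
        start = r["tx_start"]
        if start is None:
            untreated += r["end"]
        else:
            untreated += start           # days 0..tx_start count as untreated
            treated += r["end"] - start  # only post-treatment days count as treated
    return untreated, treated

print(person_time(patients))  # (340, 310)
```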
Analytic Methods
Simple stratified or weighted comparisons are a natural starting point, but caution is warranted when using such a simple adjustment method, as it only adjusts for a few characteristics with simple groupings. But when there are not many prognostic covariates to consider
or the populations are relatively well balanced on measured covariates without adjustment, simple weighting may provide adequate and easy-to-interpret compari-
sons. Another option is to use a multiple regression model. Multiple regression will
adjust estimates of an endpoint measure with the assumptions normally associated
with fitting multiple covariates in the model. However, assumptions such as linearity or the distributional form of the covariates may not be met, leading to model misspecification and biased estimates.
A more flexible method that reduces bias in comparisons is the use of propensity scores for adjustment (D’Agostino 1998). The propensity score estimates the
probability of being assigned to a treatment group based on baseline covariates
entered in a model. First, the propensity score for each patient is derived using a
logistic model with many baseline covariates. It is possible to account for many
covariates, including interactions between variables, in deriving the propensity
scores. A wide range of variables can be accounted for compared to a traditional
multivariable logistic or Cox regression model. The distribution of propensity scores in both cohorts should be described, and then the adjustment method can be chosen.
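A minimal sketch of this workflow (the covariates, the simulated data, and the use of scikit-learn's logistic regression are illustrative assumptions, not a prescription):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical pooled data: trial patients (treated = 1) and historical
# controls (treated = 0) with a few baseline covariates.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.normal(60, 10, n),
    "prior_lines": rng.integers(1, 4, n),
    "ecog": rng.integers(0, 3, n),
    "treated": rng.integers(0, 2, n),
})

X, z = df[["age", "prior_lines", "ecog"]], df["treated"]
ps = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]

# Describe the propensity score distribution in each cohort (overlap check)
print(pd.Series(ps).groupby(z.values).describe())

# One possible adjustment: inverse probability of treatment weights
weights = np.where(z == 1, 1 / ps, 1 / (1 - ps))
```

Whether matching or weighting is then preferable depends on the observed overlap and the available sample size, as discussed next.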
Two predominant methods of using propensity scores exist: matching or
weighting. When sample size is abundant, matching has some advantages. However,
when sample size is a concern, weighting provides a bit more flexibility at the cost of
less direct matching. Risk of outcomes associated with treatment exposure can be
presented as propensity score-adjusted odds ratios (OR) or hazard ratios (HR) with
95% CIs. Additionally, other covariates may be assessed and adjusted for in the risk
models. Propensity score models have been used in comparing phase 2 single-arm trials to historical or real-world studies. Examples include a comparison of blinatumomab to chemotherapy for relapsed/refractory acute lymphoblastic leukemia, discussed below (Gokbuget et al. 2016b), and a comparison of alectinib to real-world clinic outcomes in non-small cell lung cancer (Davies et al. 2018). When the sample size of the assessed studies is small, statistical power can be boosted by the use of a Bayesian prior. This method “borrows” data from a similar statistical analysis (typically the same disease with a similar drug) to augment the effect estimates. This may lend additional credibility to study results if the prior data are relevant.
Additional analytics to consider are outlined well elsewhere (Lim et al. 2018).
Lim et al. describe several examples where drug approvals have used historical
comparator data in settings of rare diseases, including oncology and other life-
threatening conditions (Lim et al. 2018). One of these examples was for acute
lymphoblastic leukemia (ALL) and the use of historical comparators to put the
results from the blinatumomab single-arm phase 2 clinical trial into context
(Gokbuget et al. 2016b). In this example, historical data were pooled from large
study groups and individual clinical sites treating patients with Philadelphia chromosome-negative relapsed or refractory ALL.

Summary and Conclusion
The use of historical controls can serve to provide an alternate form of evidence for
understanding treatment effects in nonrandomized studies. In appropriate situations
where a randomized trial may be unethical and/or the unmet need is great, historical controls may be particularly useful for contextualizing study outcomes in the absence of randomized trials. However, many methodologic issues must be considered, such as the appropriateness of data sources, the specific data that need to be collected, application of appropriate inclusion/exclusion criteria, accounting for
bias, and statistical adjustment for confounding. When these considerations
are appropriately handled, use of historical controls can have immense value by
providing further evidence of the efficacy, effectiveness, and safety in the develop-
ment of novel therapies and increase efficiency of the conduct of clinical trials and
regulatory approvals.
Key Facts
• External or historical controls can provide the needed clinical evidence for
granting conditional or accelerated regulatory approvals for novel drugs.
• Methodologic issues in the use of external or historical controls must be addressed in order to appropriately compare real-world data to clinical trial data.
• When the methodologic considerations are appropriately handled, use of histor-
ical controls can have immense value by providing evidence to support the
efficacy and safety of novel therapies.
Cross-References
Funding Statement and Declarations of Conflicting Interest CK, VC, and MK are employees
and shareholders of Amgen Inc.
References
D’Agostino RB Jr (1998) Propensity score methods for bias reduction in the comparison of a
treatment to a non-randomized control group. Stat Med 17:2265–2281
Davies J et al (2018) Comparative effectiveness from a single-arm trial and real-world data:
alectinib versus ceritinib. J Comp Eff Res. https://doi.org/10.2217/cer-2018-0032
Desai JR, Bowen EA, Danielson MM, Allam RR, Cantor MN (2013) Creation and implementation
of a historical controls database from randomized clinical trials. J Am Med Inform Assoc 20:
e162–e168. https://doi.org/10.1136/amiajnl-2012-001257
Dreyer NA (2018) Advancing a framework for regulatory use of real-world evidence: when real is
reliable. Ther Innov Regul Sci 52:362–368. https://doi.org/10.1177/2168479018763591
Outcomes in Clinical Trials

Contents
  Introduction, Definitions, and General Considerations
    What Is an Outcome?
    Outcomes in Clinical Trials
    Where Are We Going?
  Types of Outcome Measures
    Clinical Distinctions Between Outcomes
    Quantitative and Qualitative Descriptions of Outcomes
    Safety
  Choosing Outcome Measures
    Nonstatistical and Practical Considerations
    Assessing Outcome Measures
    Statistical Considerations
    Reporting Outcomes
  Multiple Outcomes
    Multiple (Possibly Related) Measures
    Longitudinal Studies
  Summary/Conclusion
  Key Facts
  References
Abstract
Selecting outcomes for clinical trials requires a wide range of considerations relating
to clinical interpretation, ethics relating to therapy effectiveness and safety, and
statistical optimality of measures. Appropriate outcome choice plays a key role in
determining the usefulness and/or success of a study and can affect whether a proposed study is viewed favorably by funding and regulatory agencies. Many
regulatory and funding agencies provide guidance on the types of outcomes that
are appropriate or acceptable in various contexts, and it is important to understand
J. M. Leach · I. Aban · G. R. Cutter (*)
Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
e-mail: [email protected]; [email protected]; [email protected]
Keywords
Primary outcomes · Secondary outcomes · Biomarker · Multiple measures
What Is an Outcome?
The FDA defines a clinical outcome as “an outcome that describes or reflects how an individual feels, functions, or survives” (FDA-NIH Biomarker Working Group).
Choices regarding outcomes in clinical trials are often further constrained compared
to studies in general. Ethical considerations guide the choice of outcomes, and regulators, e.g., the Food and Drug Administration (FDA), the European Medicines Agency (EMA), institutional review boards, and/or other relevant funding agencies, ensure that these ethical considerations are taken into account. Many clinical trials involve therapies that carry significant
risks, especially in the case of surgical interventions or drugs. In phase III trials, the
FDA requires that the effects of therapies under consideration be clinically mean-
ingful to come to market (Sullivan n.d.). The expectation is that a therapy’s benefits
will sufficiently outweigh the risks. The FDA gives three reasons for which patients
reasonably undertake treatment risks:
The focus of this chapter is the mathematical, clinical, and practical considerations
necessary to determine appropriate outcomes. Section “Types of Outcome Mea-
sures” introduces and discusses biomarkers and direct and surrogate outcomes and
defines mathematical descriptions of variables. Section “Choosing Outcome Mea-
sures” considers the clinical, practical, and statistical considerations in outcome
choice. Finally, Section “Multiple Outcomes” examines the benefits and complica-
tions of using multiple outcomes.
Types of Outcome Measures
Direct Endpoints
The FDA defines direct endpoints as outcomes that directly describe patient well-
being; these are categorized as objective or subjective measures. Objective measures
explicitly describe and/or measure clinical outcomes and leave little room for
individual interpretation. Some common objective measures are as follows:
1. Patient survival/death.
2. Disease incidence; e.g., did the subject develop hypertension during the study
period given they were free of hypertension at the start of the study?
3. Disease progression; e.g., did the subject’s neurological function worsen during
the study period?
4. Clinical events; e.g., myocardial infarction, stroke, multiple sclerosis relapse.
Subjective measures often depend upon a subject’s perception. For health out-
comes, this is often in terms of disease symptoms or quality of life (QoL) scores.
Subjective endpoints are complicated by their openness to interpretation, either between or within subjects’ responses or raters’ assessments, and whether or which measures adequately capture the quality of interest is often debatable.
Ensuring unbiased ascertainment and uniformity of measurement interpretation is
difficult when the outcome is, say QoL, global impressions of improvement, etc.
compared to objective endpoints such as death or incident stroke. Measure assess-
ment is covered in detail in section “Choosing Outcome Measures.”
Note that regulatory agencies prefer direct endpoints as primary outcomes,
particularly for new drug approval. Several issues arise from using what we will denote as the elusive surrogate measures or biomarkers, and these issues can make their use less than optimal.
Surrogate Endpoints
Surrogate endpoints are substitutes for direct or clinically meaningful endpoints and
are typically employed in circumstances where direct endpoints are too costly, are
too downstream in time or complexity, or are unethical to obtain. Few true surrogates
exist if one uses the definition provided by Prentice (1989). In the Prentice definition, the surrogate is tantamount to the actual outcome of interest, but this is often unachievable. While there is some concurrence on the existence of a so-called
surrogate, these are often laboratory measures or measurable physical attributes
from subjects, such as CD4 counts in HIV trials, although still lacking in meeting
the Prentice definition.
Surrogate endpoints may avoid costly or unethical situations, but the researcher
must provide strong evidence that the surrogate outcome is predictive of, correlated
with, and/or preferably in the therapeutic pathway between the drug or treatment and
expected clinically significant benefit. Importantly, while the Prentice criteria argue
for complete replacement of the endpoint by the surrogate, the generally accepted
goal of a surrogate endpoint is to be sufficiently predictive of the direct endpoint.
In the case of sufficiently severe illness, researchers may obtain “accelerated
approval” for surrogate endpoints, but further trials demonstrating the relation
between surrogate and direct endpoints are typically required despite initial
approval. Surrogate endpoints can be classified into stages of validation (Surrogate Endpoint Resources for Drug and Biologic Development n.d.).
For validation, regulatory agencies prefer more than one study establishing the
relationship between direct and surrogate endpoints. A major drawback to surrogate
outcomes is that relationships between surrogate and direct endpoints may not be
causal even when the correlation is strong; even if the relationship is (partially)
causal, surrogate outcomes may not fully predict the clinically relevant outcome,
especially for complicated medical conditions. Two problems thus arise:
1. A drug could have the desired beneficial effect on the surrogate outcome but also
have a negative effect on an (possibly unmeasured) aspect of the disease, render-
ing the drug less effective than anticipated/believed.
2. Drugs designed to treat a medical condition may have varying mechanisms of
action, and it does not follow that validated surrogate endpoints are equally valid
for drugs with differing mechanisms of action.
These drawbacks can bias the estimate of a benefit-risk ratio, especially in smaller or
shorter studies, where there may be insufficient sample size or follow-up time to
capture a representative number of adverse events. Pairing underestimation of
adverse events with too-optimistic beliefs regarding the therapeutic benefits can
result in overselling a mediocre or relatively ineffective therapy.
Surrogates are often used first in phase II studies before they can be accepted as legitimate endpoints for phase III trials, since surrogate endpoints are not often clinically meaningful in their own right. Phase II trials can also use biomarkers or indicators of processes that are not necessarily surrogates.
Biomarkers
A biomarker is “a defined characteristic that is objectively measured as an indicator
of normal biological processes, pathologic processes, or responses to an exposure or
intervention, including therapeutic interventions” (FDA-NIH Biomarker Working
Group). Biomarkers are often useful as secondary outcomes bearing on subject safety, as validation that a therapy induces the expected biological response, or as primary outcomes in phase II proof-of-concept trials. Most validated surrogate endpoints are in fact biomarkers.
Biomarkers are often chosen as outcomes for the same reasons that surrogates are
used: shortened trials, smaller sample sizes, etc. However, biomarkers are often more
specific to certain treatments than surrogates. For example, in multiple sclerosis, MRI scans reveal small areas of inflammation when viewed after injection of a contrast agent, gadolinium. Gadolinium-enhancing lesions are used in phase II trials as proof-of-concept primary outcomes, but they do not clearly predict disability outcomes, which are the goal of disease-modifying therapy, and they are clinically meaningful only through their repeated linkage to successful drug treatment.
Sormani et al. have shown that they are acting as surrogates at the study level
(Sormani et al. 2009, 2010). These counts of enhancing lesions seem to be bio-
markers for inflammation, and their absence following treatment has been taken as a
sign of efficacy. However, there are now drugs that virtually eliminate these enhanc-
ing lesions, yet progression of disability still occurs, so they are not a good choice for
an outcome comparing two effective drugs where both may eliminate enhancing
lesions but have differences in their effects on disability.
Biomarkers are useful not only as outcome variables but also as predictors of outcomes on which to enrich a trial, making it easier to see changes in the biomarkers or primary outcomes. Biomarker-responsive trials select individuals who, because they carry certain biomarkers or certain levels of biomarkers, are at increased risk of events or are more responsive to treatment. This seems like a rational approach, but there are several caveats to the uncritical use of such selection. Simon and Maitournam point out that the efficiency of these designs is often not realized unless the proportion of biomarker-positive patients is less than 50% and the response in those who are biomarker negative is negligible (Simon and Maitournam 2004). The reason for this counterintuitive finding is that the cost of screening can overwhelm the dilution of the response by biomarker-negative individuals, making biomarker selection an added logistic burden that does not enhance the design over simple increases in sample size and stratification. In other situations, the biomarker’s behavior needs to be carefully considered. In Alzheimer’s trials, it has been argued that more efficient trials could be done if patients were selected based on tau protein found in their cerebrospinal fluid, because tau-positive patients have more rapid declines in their disease as measured by the usual cognitive test outcomes. However, Kennedy et al. (2015) showed that when designing a study based on selection for tau positivity, the gains in sample size reduction due to the greater cognitive declines, which make percent changes easier to detect, were offset by the increased variation in cognitive decline among the biomarker-positive subset. The apparent gain rests on assuming that the variance in the biomarker-positive subset is the same as or smaller than that in the larger population, which need not hold.
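The tradeoff can be made concrete with the standard two-sample sample size formula, n per group = 2σ²(z₁₋α/₂ + z₁₋β)²/δ². The numbers below are invented rather than taken from Kennedy et al. (2015), but they show how a larger effect in an enriched subset can be more than offset by larger variability.

```python
# Sketch of the enrichment tradeoff: bigger effect but bigger variance.
from scipy.stats import norm

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Per-group n for a two-sample comparison of means (normal approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (sd * z / delta) ** 2

print(n_per_group(delta=1.0, sd=4.0))   # general population: ~251 per group
print(n_per_group(delta=1.5, sd=6.5))   # enriched subset: ~295 per group,
                                        # despite the 50% larger effect
```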
Thus far we have discussed what we want to measure rather than how to quantify the measurement. The biological quantity of interest may be clear, but decisions about how to measure that quantity can affect the viability of a study and the reasonableness of the results. Outcomes can be described as either quantitative or qualitative. In the following sections, we distinguish between these, give several examples, and discuss common subtypes of outcomes.
Quantitative Outcomes
Quantitative outcomes are measurements that correspond to a meaningful numeric scale and can be broken down into continuous and discrete measurements (or variables). In mathematical terms, a continuous variable can take any of the infinitely many values between any two numbers in its support; its possible values are uncountable. Discrete variables, on the other hand, are countable. A few examples should help clarify the difference.
Systolic blood pressure is a continuous outcome. In theory, blood pressure can
take any value between zero and infinity. For example, a subject participating in a
hypertension study may have a baseline systolic blood pressure measurement of
133.6224. We may round this number for simplicity, but it is readily interpretable as
it stands. In most cases discrete quantitative outcomes consist of positive whole
numbers and often represent counts. For example, how many cigarettes did a subject
smoke in the last week? The answer is constrained to nonnegative whole numbers: 0,
1, 2, 3, . . .. While it is possible to conceive of smoking half a cigarette, the researcher needs to decide a priori whether to record fractions and allow the data collection system to accept them, or to develop clear rules so that discrete values are recorded in the same way for all participants.
Categorical Outcomes
Categorical outcomes, or qualitative variables, have neither natural order nor inter-
pretation on a numeric scale and result from dividing study participants into cate-
gories. Many drugs are aimed at reducing the risk of negative health outcomes,
which are often binary in nature, and common trial aims are reducing the risk of
death, stroke, heart attack, or progression in multiple sclerosis. The use of these
binary outcomes is not simply convenience or custom; rather, they are much more easily interpreted as clinically meaningful. Reducing blood pressure in a trial by 4.5 mmHg is by convention a positive result, but it is not in and of itself immediately clinically meaningful, whereas if the group that experienced the 4.5 mmHg greater change also had lower mortality rates, it would be easier to call this clinically meaningful.
Categories need not be binary in nature. For example, consider a study where it is
likely that, in the absence of treatment, patient health is expected to decline over
time. A successful drug might slow the decline, stop the decline but not improve
patient health, or improve patient health, and so researchers could categorize the
subjects as such.
Common Measures
Often outcomes are raw patient measures, e.g., patient blood pressure or incident
stroke. However, summary measures are often relevant. Incidence counts the number of new cases of a disease per unit of time; it can be expressed as cumulative incidence, such as that occurring over the course of the entire study, or as incidence over a defined interval, such as 30-day mortality following surgery. Incidence can pertain to both chronic conditions, e.g., diabetes, and discrete health events, e.g., stroke, myocardial infarction, or an adverse reaction to a drug.
Another common summary measure is the proportion: how many patients out of
the total sample experienced a medical event or possess some quality of interest? The
incidence proportion or cumulative incidence is the proportion of previously healthy
patients who developed a health condition or experienced an adverse health event.
Incidence is also commonly described with an incidence rate; that is, per some
number of patients, called the radix, often 1000, how many will develop the
condition, experience the event, etc.; e.g., supposing 4% of patients developed
diabetes during the study, then an incidence rate would say that we expect 40 out
of every 1000 (nondiabetic) subjects to develop diabetes. Note that incidence differs
from prevalence, which is the proportion of all study participants who have the
condition. Prevalence can be this simple proportion at some point in time, known as point prevalence, or the proportion over some defined period, known as period prevalence. Period prevalence is often used in studies from administrative databases and counts the number of cases divided by the average population over the period of interest, whereas point prevalence is the number of cases divided by the population at one specific point in time.
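The following toy calculation illustrates these summary measures; all counts are hypothetical.

```python
# Incidence and prevalence with invented counts
n_at_risk = 2500          # participants free of diabetes at baseline
new_cases = 100           # incident cases during the study
cumulative_incidence = new_cases / n_at_risk    # 0.04, i.e., 4%
rate_per_1000 = cumulative_incidence * 1000     # radix of 1000 -> 40 per 1000

n_enrolled = 3000         # all participants, incident and prevalent cases
cases_at_visit = 240      # cases present at one time point
point_prevalence = cases_at_visit / n_enrolled  # 0.08
print(cumulative_incidence, rate_per_1000, point_prevalence)
```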
A measure related to incidence is the time to event; that is, how long into the study
did it take for the incident event to occur? This is often useful for assessing a
therapy’s effect on survival or health state. For example, a cancer therapy may be
considered successful in some cases if patient survival is lengthened; similarly, some
therapies may be considered efficacious if they extend the time to a stroke or other
adverse health events.
Safety
Measuring/Summarizing Safety
In addition to efficacy outcomes, safety outcomes are also important to consider. We
mentioned above that there is a necessary balance between the risks and potential
benefits of a therapy. Thus, we need information regarding potential risks, particu-
larly side effects and adverse events. It is possible that a therapy could be highly
effective for treating a disease and yet introduce additional negative health conse-
quences that make it a poor option. Safety endpoints can be direct or surrogate
endpoints. Some therapies may increase the risk of adverse health outcomes like
stroke or heart attack; these direct endpoints can be collected. We often classify
events into side effects, adverse effects, and serious adverse effects.
Side Effects: A side effect is an undesired effect that occurs when the medication
is administered regardless of the dose. Unlike adverse events, side effects are mostly
foreseen by the physician, and the patient is told to be aware of the effects that could
happen while on the therapy. Side effects also differ from adverse events in that they typically resolve on their own with time.
Adverse Events: An adverse event is any new, undesirable medical occurrence or
change (worsening) of an existing condition in a subject that occurs during the study,
whether or not considered to be related to the treatment.
Serious Adverse Events: A serious adverse event is defined by regulatory
agencies as one that suggests a significant hazard or side effect, regardless of the
investigator’s or sponsor’s opinion on the relationship to investigational product.
This includes, but may not be limited to, any event that (at any dose) is fatal, is life
threatening (places the subject at immediate risk of death), requires hospitalization
or prolongation of existing hospitalization, results in persistent or significant disability/incapacity, or is a congenital anomaly/birth defect. Important medical events that
may not be immediately life threatening or result in death or hospitalization but may
jeopardize the subject or require intervention to prevent one of the outcomes listed
above, or result in urgent investigation, may be considered serious. Examples
include allergic bronchospasm, convulsions, and blood dyscrasias.
Collecting and monitoring these events are the responsibility of the researchers as well as oversight committees such as Data and Safety Monitoring Committees. Collection can range from complicated, such as when treatments are tested in intensive care units where nearly all actions could be linked to one or another type of event, to relatively straightforward. Regulators have tried to standardize the recording of these events into System Organ Classes (SOCs) using the Medical Dictionary for Regulatory Activities (MedDRA) coding system. This standardized and validated system allows a virtually infinite vocabulary of events to be mapped into medically meaningful classes – infections, cardiovascular events, etc. – for comparison between groups and among treatments. These classes aid in the assessment of benefits versus risks by allowing comparisons of the rates of medical events that occur within specific organs or body functions.
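As a sketch of how such SOC-level summaries support between-group comparison, the snippet below assumes events have already been coded to (invented) SOC labels and tabulates the proportion of subjects in each arm with at least one event per class; the data and enrollment counts are hypothetical.

```python
# Tabulating adverse events by System Organ Class and arm (toy data)
import pandas as pd

ae = pd.DataFrame({
    "subject": [1, 1, 2, 3, 4, 5, 5],
    "arm":     ["drug", "drug", "drug", "placebo", "placebo", "drug", "drug"],
    "soc":     ["Infections", "Cardiac", "Infections",
                "Infections", "Cardiac", "Cardiac", "Cardiac"],
})
n_per_arm = {"drug": 50, "placebo": 50}   # assumed numbers enrolled

# Count subjects with >= 1 event in each SOC per arm, then convert to rates
counts = ae.drop_duplicates(["subject", "soc"]).groupby(["soc", "arm"]).size()
rates = counts.unstack(fill_value=0).apply(lambda col: col / n_per_arm[col.name])
print(rates)
```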
While some health-related outcomes have obvious metrics, others do not. For instance, if a study is conducted on a drug designed to lower or control blood pressure, then it is straightforward to see that the patient’s blood pressure is almost certainly the best primary outcome. However, for many complex medical conditions, arriving at a reasonable metric requires a considerably more winding path. For example, in multiple sclerosis (MS), the aim is to reduce MS-related disability, but “disability” in such patients is a multidimensional problem consisting of both cognitive and physical dimensions and thus requires a complex summary metric. Sometimes the challenge is choosing a metric, and other times it is how to use a metric appropriately. For example, smoking cessation studies should record whether patients quit smoking, but we may debate just how long a subject must have quit to be considered a verified nonsmoker, or we may require biological evidence of cessation such as cotinine levels; in MS studies, by contrast, the debate is more often over which metric most adequately captures patient disability.
Validity
Validity is the ability of the outcome metric to measure that which it claims to
measure. In cases where the outcome is categorical, it is common to assess validity
with sensitivity and specificity. Sensitivity is the ability of a metric to accurately
determine patients who have the medical condition or experienced the event, and
specificity is the ability of a metric to accurately determine which patients do not
have the medical condition or did not experience the event. Both sensitivity and
specificity should be high for a good metric; for example, consider the extreme case
where a metric always interprets the patient as having a medical condition. In such a
case, we will identify 100% of the patients with the medical condition (great job!)
and 0% of the patients without the medical condition (poor form!). Note that while
these concepts are often understood in terms of medical conditions and events, they
need not be confined in such a way.
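A minimal sketch of these two quantities, including the extreme always-positive metric just described:

```python
# Sensitivity and specificity from a 2x2 classification table
def sens_spec(tp, fn, tn, fp):
    """tp/fn/tn/fp = true/false positives and negatives (hypothetical counts)."""
    return tp / (tp + fn), tn / (tn + fp)

print(sens_spec(tp=80, fn=20, tn=90, fp=10))    # (0.8, 0.9): a useful metric
print(sens_spec(tp=100, fn=0, tn=0, fp=100))    # always positive: (1.0, 0.0)
```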
For continuous, and often ordinal, measures, assessing validity is somewhat more
complicated. One could impose cutoffs on the continuous measure to categorize the
variable, only then using sensitivity or specificity to assess validity. However, this is
a rather clumsy approach in many cases; we want continuous outcome measures to
capture the continuous value with as little measurement error as possible. This is
often most relevant for medical devices. For example, a wrist-worn measure of blood glucose would need to be within ± some tolerance of the actual glucose level in the blood to demonstrate validity. Researchers often use regression analyses to demonstrate that a purported measure agrees with a gold standard, but it should be kept in mind that a high correlation by itself does not demonstrate validity: the regression should also have a slope of 1 and an intercept of 0.
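The sketch below shows why correlation alone is insufficient: a simulated device with a constant bias correlates perfectly with the gold standard yet fails the intercept check. The data are simulated and no particular device is implied.

```python
# Agreement check: regress device readings on a gold standard
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
gold = rng.uniform(70, 180, 100)   # gold-standard glucose values, mg/dL
device = gold + 15                 # constant +15 bias; correlation is still 1

fit = sm.OLS(device, sm.add_constant(gold)).fit()
print(fit.params)                        # intercept ~15, slope ~1: biased
print(np.corrcoef(gold, device)[0, 1])   # ~1.0 despite the bias
```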
Sensitivity is also used to describe whether an outcome measure can detect change at a reasonable resolution. Consider a metric for disability on a scale from 1 to 3, where higher scores indicate increased disability. This will be a good metric if a patient’s disability increase generally results in a corresponding increase on the scale; but if the measure is too coarse, it could be the case, for example, that many patients are having disability increases that are not sufficient to move from a 1 to a 2 on the scale. A measure insensitive to worsening at the participant level would have high specificity (because substantial disability would have occurred before the scale recognized it) but poor sensitivity (because a negative result does not indicate that the participant has not progressed). When the metrics pertain to patient well-being or health dimensions of which a patient is conscious, it is expected that when the patient notices a change, the metric will reflect that change. This is particularly important for determining the effectiveness of therapies. A measure that is insensitive to change could mask a therapy’s ineffectiveness by incorrectly suggesting that patient conditions are not generally worsening or, on the flip side, portray an effective therapy as ineffective because it does not detect positive change. Further, if a participant feels they are worsening but the measure is insensitive, this can lead to dropout from the trial.
Reliability
Reliability is a general assessment of the consistency of a measure’s results upon
repeated administrations. There are several relevant dimensions to reliability. A measure can be accurate but not reliable; this occurs when the measure is accurate on average but highly variable. Various types or aspects of reliability are
often discussed. Perhaps most prominent is interrater reliability. Many trials require
raters to assess patients and assign scores or measures describing the patient’s
condition, and interrater reliability describes the consistency across raters when
presented with the same patient or circumstance. A reliable measure will result in
(properly trained) raters assigning the same or similar scores to the same patient.
When a metric is proposed, interrater reliability is a key consideration and is
typically measured using a variant of the intraclass correlation coefficient (ICC),
which should be high if the reliability is good (Bartko 1966, Shrout and Fleiss 1979).
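As an illustration, the following computes one common variant, ICC(2,1) of Shrout and Fleiss (1979), from scratch for a small invented matrix of ratings.

```python
# ICC(2,1): two-way random effects, absolute agreement, single rater
import numpy as np

x = np.array([[4.0, 5.0, 4.5],     # rows = patients, columns = raters
              [2.0, 2.5, 2.0],
              [7.0, 6.5, 7.0],
              [5.0, 5.5, 5.0]])
n, k = x.shape
grand = x.mean()
ssr = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between-patient sum of squares
ssc = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between-rater sum of squares
sse = ((x - grand) ** 2).sum() - ssr - ssc
msr, msc, mse = ssr / (n - 1), ssc / (k - 1), sse / ((n - 1) * (k - 1))
icc = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
print(round(icc, 3))   # ~0.97 here: raters agree closely on these patients
```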
Intersubject reliability is also a concern; that is, subjects with similar or the same
health conditions should have similar measures. This differs from interrater reliabil-
ity in that it is possible for raters to be highly consistent within a subject, but
inconsistent across subjects, or vice versa. Interrater reliability measures whether
properly trained raters assign sufficiently similar scores to the same patient; that is, is
the metric such that sufficiently knowledgeable individuals would agree about how
to score a specific subject? Intersubject reliability measures whether a metric assigns
sufficiently similar scores to sufficiently similar subjects.
Other Concerns
There are several other issues in evaluating outcome measures. Practice effects occur when patients’ scores on some measure improve over time due to practice rather than therapy. Studies involving novel outcome measures should verify that either no
practice effects are present or that the practice effects taper off; for example, in
developing the Multiple Sclerosis Functional Composite (MSFC), practice effects
were observed, but these tapered off by the fourth administration (Cohen et al. 2001).
Practice effects are problematic in that they can lead to overestimates of a treatment
effect because the practice effect improvement is ignored when comparing a post-
intervention measure to a baseline measure. In randomized clinical trials, we can
assume both groups experience equivalent practice effects and the difference
between the two groups is still an unbiased estimate of the treatment effect, but
how much actual improvement was achieved is biased unless the practice effects can
be eliminated prior to baseline by multiple administrations or adjusted for in the
analyses. Practice effects are often complex and require adjustments to the measure
or its application; for example, the Paced Auditory Serial Addition Test (PASAT), a
measure of information processing speed (IPS) in MS, was shown to have practice effects that increased with the speed of stimulus presentation and were more prominent in relapsing-remitting MS than in chronic-progressive MS (Barker-Collo 2005). Therefore, using the PASAT in MS research requires either slower stimulus presentation or some correction accounting for these effects.
Another source of (unwanted) variability in outcome measures, particularly sub-
jective measures, is response shift. Response shift occurs when a patient’s criteria for a
subjective measure change over the course of the study. It is clearly a problem if the
meaning of the same recorded outcome is different at different times in a study, and
therefore response shift should be considered and addressed when subjective measures
and/or patient-reported outcomes are employed (Swartz et al. 2011). This is often seen in long-term chronic conditions such as multiple sclerosis, where participants report on their quality of life in the early stages of the disease and, when reassessed years later with increased disability, record the same quality of life scores. Adaptation and other factors are at the root of these response shifts, but outcome measures subject to this type of variability can be problematic to use.
Statistical Considerations
Statistical Definitions
In hypothesis testing, variable selection, etc. there are two kinds of errors to
minimize. The first is the false positive, formally Type I error, which is the proba-
bility that we detect a therapy effect, given that one does not exist. False positives are
typically controlled by assignment of the statistical significance threshold, α, which
is generally interpreted as the largest false-positive rate that is acceptable; by
convention, α = 0.05 is usually adopted.
The second error class is the false negative, or Type II error, which occurs when
we observe no significant therapy effect, when in reality one exists. Power is given
as one minus the false-negative rate and refers to the power to detect a difference,
given that one exists. For a given statistical method, the false-positive rate should be
as low as possible and the power as high as possible. However, there is a tension in
controlling these errors because controlling one in a stronger manner corresponds to
a reduction in ability to control the other.
There are several primary reasons that these errors arise in practice (and theory).
Sampling variability allows for occasionally drawing samples that are not represen-
tative of the population; this problem may be exacerbated if the study cohort is
systematically biased in recruitment or changes over time during the recruitment
period so that it is doubtful that the sample can be considered as drawn from the
population of interest. The second primary reason for errors is sample size. In the
absence of compromising systematic recruitment bias, a larger sample size can often
increase the chance that we detect a treatment difference if one exists. The funda-
mental reason for improvement is that sample estimates will better approximate
population parameters with less variation about the estimates, on average. Small
sample sizes can counterintuitively make it difficult to detect significant effects,
overstate the strength of real effects, and more easily find spurious effects. This is
because in small samples a relatively small number of outliers can have a large
biasing effect, and in general sampling variability is larger for smaller compared to
larger samples.
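As a sketch of how these quantities interact, the following uses the two-sample t-test power routines in statsmodels (one of several tools that could be used); the effect size and design values are hypothetical.

```python
# Power and sample size for a two-sample t-test
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Per-group n to detect a standardized effect of 0.3 at alpha = 0.05, 80% power
print(analysis.solve_power(effect_size=0.3, alpha=0.05, power=0.80))  # ~175
# Power achieved if only 50 per group are enrolled
print(analysis.solve_power(effect_size=0.3, alpha=0.05, nobs1=50))    # ~0.32
```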
The choice of statistical significance thresholds can also contribute to errors in
inference. If the threshold is not sufficiently severe, then we increase the risk of
detecting a spurious effect. Correspondingly, a significance threshold that is too
severe may prevent detection of treatment differences in all but the most extreme
cases. Errors related to the severity of threshold can affect both large and small
samples.
Note that caution must be employed with respect to interpreting differences. Statistically significant differences are not necessarily clinically significant or meaningful. Continuous outcomes are sometimes dichotomized to ease clinical interpretation, but that can lead to arbitrary cutoffs, and without a priori specification of the outcome, finding a cut point that “works” certainly changes the chances of a false-positive result.
On the other hand, it is sometimes useful to treat a discrete variable as continuous;
the most common instance of this is count data, where counts are generally very large.
In some cases, ordinal data may be treated as continuous with reasonable results. Even though analyzing ordinal data as continuous implicitly assumes that each step on the scale has the same meaning, doing so provides a simple summary of response. Alternatively, treating ordinal data as ranks can be shown to have reasonable properties for detecting treatment effects. However, one should use caution when applying statistical methods designed for continuous outcomes to ordinal data; in particular, models for continuous outcomes perform badly when the number of ranks is small and/or the distribution of the ordinal variable is skewed or otherwise not approximately normal (Bauer and Sterba 2011; Hedeker 2015).
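A brief sketch contrasting a rank-based test with a t-test on skewed four-level ordinal data; the scores and group probabilities are simulated, not drawn from any trial.

```python
# Rank-based vs. normal-theory analysis of skewed ordinal scores
import numpy as np
from scipy.stats import mannwhitneyu, ttest_ind

rng = np.random.default_rng(1)
control = rng.choice([0, 1, 2, 3], size=100, p=[0.70, 0.20, 0.07, 0.03])
treated = rng.choice([0, 1, 2, 3], size=100, p=[0.55, 0.25, 0.13, 0.07])

print(mannwhitneyu(treated, control, alternative="two-sided"))  # rank-based
print(ttest_ind(treated, control))   # assumes normality; suspect with 4 levels
```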
Reporting Outcomes
There are several main approaches for assessing outcomes: patient-reported out-
comes, clinician-reported outcomes, and observer-reported outcomes. We define and
discuss each below.
Patient-Reported Outcomes
Patient-reported outcomes (PROs) are outcomes dependent upon a patient’s subjec-
tive experience or knowledge; PROs do not exclude assessments of health that could
be observable to others and may include the patient’s perception of observable health
outcomes. Common examples include quality of life or pain ratings (FDA-NIH
Biomarker Working Group). These outcomes have gained a lot of acceptance
since the Patient-Centered Outcomes Research Institute (PCORI) came into exis-
tence. The FDA and other regulators routinely ask for such outcomes as they are
indicative of the meaningfulness of treatments. Rarely have patient-reported out-
comes been used as primary outcomes in phase III trials, except in those instances,
such as pain, where the primary outcomes are only available in this manner. Most
often they are used to provide adjunctive information on the patient perspective of
the treatments or study. Nevertheless, researchers should be cautioned not simply to
accept the need for PROs, but rather think carefully about what and when to measure
PROs. PROs are subjective assessments and can be influenced by a wide variety of
variables that may be unrelated to the actual treatments or interventions under study.
Asking a cancer patient about their quality of life during chemotherapy may not support conclusions about survival benefit because of the timing of the ascertainment.
Similarly, a participant in a trial who is severely depressed may be underwhelmed
with the benefits of a treatment that doesn’t address this depression. In addition, the
frame of reference needs to be carefully considered. For example, when assessing
quality of life, should one use a tool that is a general measure, such as the Short-
Form-36 Health Survey (SF36), or one that is specific to the disease under study?
This depends on the question being asked and should be factored into the design for
any and all data to be collected.
PROs, like many outcomes, are subject to biases. If participants know that they
are on an active treatment arm versus a placebo, then their reporting of the specific
outcome being assessed may be biased. Similarly, participants who know or suspect
that they are on a placebo may report they are doing poorly simply because of this
knowledge rather than providing accurate assessments as per the goal of the instru-
ment. A general rule is that blinded assessments should be made whenever possible. A more detailed discussion of the intricacies involved in PROs is found in Swartz et al. (2011), and the FDA provides extensive recommendations and discussion (FDA 2009).
Clinician-Reported Outcomes
Clinician-reported outcomes (CRO) are assessments of patient health by medical or
otherwise healthcare-oriented professionals and are characterized by dependence on
professional judgment, algorithmic assessment, and/or interpretation. These are typi-
cally outcomes requiring medical expertise, but do not encompass outcomes or
symptoms that depend upon patient judgment or personal knowledge (FDA-NIH
Biomarker Working Group). Common examples are rating scales or clinical events,
e.g., Expanded Disability Status Scale, stroke, or biomarker data, e.g., blood pressure.
Observer-Reported Outcomes
Observer-reported outcomes (OROs) are assessments that require neither medical
expertise nor patient perception of health. Often OROs are collected from parents,
caregivers, or more generally individuals with knowledge of the patient’s daily life
and often, but not always, are useful for assessing patients who cannot, for reasons of
age or impairment, reliably assess their own health (FDA-NIH Biomarker Working
Group). For example, in epilepsy studies caregivers often keep seizure diaries to
establish the nature and number of seizures a patient experiences.
Multiple Outcomes
Most studies have multiple outcomes (e.g., primary, secondary, and safety out-
comes), but it is sometimes desirable or necessary to include multiple primary
outcomes. These generally consist of repeatedly measuring the same (or similar)
outcomes over time and/or including multiple measures, which can encompass
multiple primary outcomes or multiple secondary outcomes. This section describes
common situations where multiple outcomes are employed and discusses relevant
considerations arising thereof.
When the efficacy of a clinical therapy depends on more than one dimension, it may be inappropriate to prioritize one dimension or to ignore lower-priority dimensions; composite outcomes, discussed next, are one response to this problem.
Composite Outcomes
Composite outcomes are functions of several outcomes. These can be relatively
simple, e.g., the union of several outcomes. Such an approach is common for time-
to-event data, e.g., major adverse cardiovascular events (MACE) or to define an
event as when a patient experiences one of the following: death, stroke, or heart
attack. Composite outcomes can also be more complex in nature. For example, many
MS trials are focused on MS-related disability metrics, which tend to be composites
of multiple outcomes of interest and which may be related to either physical or
cognitive disability; two common options are the Expanded Disability Status Scale
(EDSS) and the Multiple Sclerosis Functional Composite (MSFC).
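A minimal sketch of how a union-type composite such as MACE can be assembled, assuming per-subject component event times are available; all times below are invented, with np.inf marking an unobserved event.

```python
# Composite event time = earliest component event, censored if none occurs
import numpy as np

death  = np.array([np.inf, 400.0, np.inf])   # days to each component event
stroke = np.array([250.0, np.inf, np.inf])
mi     = np.array([np.inf, 300.0, np.inf])
followup = np.array([365.0, 365.0, 365.0])   # planned follow-up per subject

first_event = np.minimum.reduce([death, stroke, mi])
event = first_event <= followup                 # composite event indicator
time = np.where(event, first_event, followup)   # event time or censoring time
print(time, event)   # [250. 300. 365.] [ True  True False]
```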
Composite events are often used in time-to-event studies to increase the number of outcomes and thus, for the same relative risk reduction, increase the power, since the power of a time-to-event trial is directly related to the number of events. When such events are not reasonably correlated, however, care must be taken not to let noise dominate the signal. For example, as noted previously, MS impacts patients differently and variably. The EDSS assesses seven functional systems and combines them into a single ordinal number ranging from 0 to 10 in 0.5 increments. For a person who is impacted in only one functional system, the overall EDSS may not move even with changes in that one functional system, thereby reducing its sensitivity to change.
Multiple outcomes also raise multiplicity concerns: each additional comparison increases the chance of a false-positive finding, and prespecifying outcomes, for example in registries such as ClinicalTrials.gov, is one safeguard. While the specific corrections are beyond the scope of this chapter, it is useful to contrast traditional approaches to false-positive control with modern extensions.
In many cases traditional methods for controlling the familywise error rate (FWER) or false discovery rate (FDR) have been adapted to handle multiple testing in a more nuanced manner. Traditionally, one defined a family of tests and then applied a particular method for controlling false positives, but a study may reasonably consist of more than one family of tests; e.g., one may divide primary and secondary outcome analyses into two separate families. Furthermore, families need not be treated as having equal importance, which is the basis for hierarchically ordered families or so-called step-down approaches. In these approaches, FWER or FDR is controlled by requiring that a “win” criterion be achieved in the family of primary endpoints before secondary (and possibly tertiary) outcomes are tested. The two most common frameworks are α-propagation and gatekeeping. α-Propagation divides the significance level across a series of ordered tests, and when a test is significant, its portion of the α is “propagated,” or passed, to the next test. Gatekeeping approaches depend on a hierarchy of families: in regular gatekeeping, a second family is tested only if the first family passes some “win” criterion, though some gatekeeping procedures allow for retesting, and many methods incorporate both gatekeeping and α-propagation. A detailed discussion is found in Huque et al. (2013).
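The sketch below implements the simplest member of this family, fixed-sequence testing, in which each hypothesis is tested at the full α only while every earlier hypothesis in the prespecified order has been rejected. The p-values are hypothetical, and real gatekeeping procedures are considerably more elaborate (Huque et al. 2013).

```python
# Fixed-sequence (hierarchical) testing with full alpha propagation
def fixed_sequence(p_values, alpha=0.05):
    decisions = []
    for p in p_values:                # prespecified order: primary first
        if p <= alpha:
            decisions.append(True)    # rejected; alpha passes to the next test
        else:
            decisions.append(False)
            alpha = 0.0               # once a test fails, later tests cannot win
    return decisions

print(fixed_sequence([0.01, 0.03, 0.20, 0.04]))  # [True, True, False, False]
```

Note that the final p-value of 0.04 is not declared significant: its “budget” was spent when the third test failed, which is exactly how the procedure protects the familywise error rate.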
Defining power and calculating the required sample size in complex multiplicity
situations may not be straightforward (Chen et al. 2011). However, using traditional methods in the absence of updated methodology is likely to yield conservative results: many methods control the FWER at or below the specified significance level, and this extra conservatism means that larger sample sizes are required to achieve the desired power. Note that there are no
multiplicity issues when considering coprimary endpoints, since each must success-
fully reject the null hypothesis, but power calculations are nonetheless complicated
and require considering the dependency between test statistics; unnecessarily large
sample sizes will be required if an existing dependency is ignored (Sozu et al. 2011);
for a detailed discussion on power and sample size with coprimary endpoints, see
Sozu et al. (2012).
Longitudinal Studies
Many trials follow participants over time, measuring outcomes repeatedly or waiting for clinical events to occur. In such cases therapies may be distinguished by whether they prolong the time to an event instead of, or in addition to, whether they prevent the event entirely.
such as exacerbations (relapses) in MS, can occur repeatedly over time, and thus
assessing these for the intensity of their occurrence (often summarized as the
annualized relapse rate) is common. Other less extreme examples involve recording
patient attributes, e.g., quality of life assessments or biomarkers like blood pressure
or cholesterol; these and similar situations assess changes over time and address the
existence and/or form of differences between treatment groups.
Benefits
A major benefit of recording patients at multiple time points is that it allows for a
better understanding of the trajectory of change in a patient’s outcome; for example,
nonlinear trends may be discovered and modeled, thereby allowing researchers to
seek understanding of the clinical reasons for each part of the curve. For example,
longitudinal studies of blood pressure have established that human blood pressure
varies according to a predictable nonlinear form (Edwards and Simpson 2014). Such
a model may be used to better define and evaluate healthy ranges for patient blood
pressure. Second, patient-specific health changes and inference are available when a
patient is followed over time. A simple example is given by comparing a cross-
sectional study, a rarity in clinical trials, to a pre-post longitudinal study. Whereas in
a cross-sectional study, we only have access to observed group differences in raw
scores, longitudinal studies provide insight into whether and how a particular
patient’s metrics have changed over time; generalizing beyond pre-post to many
observations allows better modeling for individuals as well as groups. Additionally,
unless the repeated measures are perfectly correlated, the power to detect group
differences is generally increased when using longitudinal data and may require a
smaller sample size to detect a specified treatment difference.
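As a sketch of a typical longitudinal analysis, the following fits a linear mixed model with a random intercept per subject using statsmodels; the data are simulated and the column names hypothetical. The treatment-by-time interaction estimates the difference in slopes between groups.

```python
# Linear mixed model for repeated measures (simulated trial data)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_subj, n_visits = 60, 4
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subj), n_visits),
    "time": np.tile(np.arange(n_visits), n_subj),
    "group": np.repeat(rng.integers(0, 2, n_subj), n_visits),
})
subj_effect = np.repeat(rng.normal(0, 1, n_subj), n_visits)  # random intercepts
df["y"] = (1 + 0.5 * df["time"] - 0.3 * df["time"] * df["group"]
           + subj_effect + rng.normal(0, 1, len(df)))

fit = smf.mixedlm("y ~ time * group", df, groups=df["subject"]).fit()
print(fit.summary())   # time:group term is the group difference in slopes
```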
Summary/Conclusion
Key Facts
References
Barker-Collo SL (2005) Within session practice effects on the PASAT in clients with multiple
sclerosis. Arch Clin Neuropsychol 20:145–152. https://doi.org/10.1016/j.acn.2004.03.007
Bartko JJ (1966) The intraclass correlation coefficient as a measure of reliability. Psychol Rep
19:3–11
Bauer DJ, Sterba SK (2011) Fitting multilevel models with ordinal outcomes: performance of
alternative specifications and methods of estimation. Psychol Methods 16(4):373–390. https://
doi.org/10.1037/a0025813
Benjamini Y, Cohen R (2017) Weighted false discovery rate controlling procedures for clinical
trials. Biostatistics 18(1):91–104. https://doi.org/10.1093/biostatistics/kxw030
FDA-NIH Biomarker Working Group (2016) BEST (Biomarkers, EndpointS, and other Tools) resource. Food and Drug Administration (US), Silver Spring; co-published by National Institutes of Health (US), Bethesda. https://www.ncbi.nlm.nih.gov/books/NBK326791/
Carpenter JR, Kenward MG (2007) Missing data in randomised controlled trials – a practical guide.
National Institute for Health Research, Birmingham
Chen J, Luo J, Liu K, Mehrotra DV (2011) On power and sample size computation for multiple
testing procedures. Comput Stat Data Anal 55:110–122
Cohen JA, Cutter GR, Fischer JS, Goodman AD, Heidenreich FR, Jak AJ, . . . Whitaker JN (2001) Use
of the multiple sclerosis functional composite as an outcome measure in a phase 3 clinical trial.
Arch Neurol 58: 961–967
Cohen JA, Reingold SC, Polman CH, Wolinsky JS (2012) Disability outcome measures in multiple
sclerosis clinical trials: current status and future prospects. Lancet Neurol 11:467–476
Cutter GR, Baier ML, Rudick RA, Cookfair DL, Fischer JS, Petkau J, . . . Reingold S (1999)
Development of a multiple sclerosis functional composite as a clinical trial outcome measure.
Brain 122(5): 871–882. https://doi.org/10.1093/brain/122.5.871
Edwards LJ, Simpson SL (2014) An analysis of 24-hour ambulatory blood pressure monitoring data
using orthonormal polynomials in the linear mixed model. Blood Press Monit 19(3):153–163.
https://doi.org/10.1097/MBP.0000000000000039
FDA (2009) Guidance for industry patient-reported outcome measures: use in medical product
development to support labeling claims. U.S. Department of Health and Human Services Food
and Drug Administration. Retrieved from https://www.fda.gov/downloads/drugs/guidances/
ucm193282.pdf
FDA (2017) Multiple endpoints in clinical trials: guidance for industry. U.S. Department of Health
and Human Services Food and Drug Administration. Retrieved from https://www.fda.gov/
downloads/drugs/guidancecomplianceregulatoryinformation/guidances/ucm536750.pdf
Haime M (2016) Somahlution announces study results showing DuraGraft® vascular graft treatment improves long-term outcomes in coronary artery bypass grafting surgery. European Association for Cardio-Thoracic Surgery Annual Meeting, Barcelona. Retrieved from https://www.somahlution.com/vascular-graft-treatment/
Hedeker D (2015) Methods for multilevel ordinal data in prevention research. Prev Sci
16(7):997–1006. https://doi.org/10.1007/s11121-014-0495-x
Hedeker D, Gibbons RD (2006) Longitudinal data analysis. Wiley, Hoboken
Huque MF, Dmitrienko A, D’Agostino R (2013) Multiplicity issues in clinical trials with multiple
objectives. Stat Biopharmaceut Res 5(4):321–337. https://doi.org/10.1080/19466315.2013.
807749
Kennedy RE, Cutter GR, Wang G, Schneider LS (2015) Using baseline cognitive severity for
enriching Alzheimer’s disease clinical trials: how does mini-mental state examination predict
rate of change? Alzheimer’s Dementia: Transl Res Clin Interven 1:46–52. https://doi.org/10.
1016/j.trci.2015.03.001
Mallinckrodt CH, Sanger TM, Dubé S, DeBrota DJ, Molenberghs G, Carroll RJ, . . . Tollefson GD
(2003) Assessing and interpreting treatment effects in longitudinal clinical trials with missing
data. Biol Psychiatry 53:754–760
Motooka Y, Matsui T, Slaton RM, Umetsu R, Fukuda A, Naganuma M, . . . Nakamura M (2018)
Adverse events of smoking cessation treatments (nicotine replacement therapy and non-nicotine
prescription medication) and electronic cigarettes in the Food and Drug Administration Adverse
Event Reporting System, 2004–2016. SAGE Open Med 6:1–11. https://doi.org/10.1177/
2050312118777953
Pocock SJ (1997) Clinical trials with multiple outcomes: a statistical perspective on their design,
analysis, and interpretation. Control Clin Trials 18:530–545
Powney M, Williamson P, Kirkham J, Kolamunnage-Dona R (2014) A review of handling missing
longitudinal outcome data in clinical trials. Trials 15:237. https://doi.org/10.1186/1745-6215-
15-237
Prentice RL (1989) Surrogate endpoints in clinical trials: definition and operational criteria. Stat
Med 8(4):431–440. https://doi.org/10.1002/sim.4780080407
Shrout PE, Fleiss JL (1979) Intraclass correlations: uses in assessing rater reliability. Psychol Bull
86(2):420–428
Simon R, Maitournam A (2004) Evaluating the efficiency of targeted designs for randomized clinical
trials. Clin Cancer Res 10:6759–6763. https://doi.org/10.1158/1078-0432.CCR-04-0496
Sormani MP, Bonzano L, Roccatagliata L, Cutter GR, Mancardi GL, Bruzzi P (2009) Magnetic resonance
imaging as a potential surrogate for relapses in multiple sclerosis: a meta-analytic approach. Ann
Neurol 65:268–275. https://doi.org/10.1002/ana.21606
Sormani MP, Bonzano L, Roccatagliata L, Mancardi GL, Uccelli A, Bruzzi P (2010) Surrogate endpoints for
EDSS worsening in multiple sclerosis. A meta-analytic approach. Neurology 75(4):302–309.
https://doi.org/10.1212/WNL.0b013e3181ea15aa
Sozu T, Sugimoto T, Hamasaki T (2011) Sample size determination in superiority clinical trials with
multiple co-primary correlated endpoints. J Biopharm Stat 21:650–668. https://doi.org/10.1080/
10543406.2011.551329
Sozu T, Sugimoto T, Hamasaki T (2012) Sample size determination in clinical trials with multiple
co-primary endpoints including mixed continuous and binary variables. Biom J 54(5):716–729.
https://doi.org/10.1002/bimj.201100221
Sullivan EJ (n.d.) Clinical trials endpoints. U.S. Food and Drug Administration. Retrieved November 19, 2018, from https://www.fda.gov/downloads/Training/ClinicalInvestigatorTrainingCourse/UCM337268.pdf
Surrogate Endpoint Resources for Drug and Biologic Development (n.d.) U.S. Food and Drug
Administration. Retrieved November 19, 2018, from https://www.fda.gov/Drugs/
DevelopmentApprovalProcess/DevelopmentResources/ucm606684.htm
Swartz RJ, Schwartz C, Basch E, Cai L, Fairclough DL, Mendoza TR, Rapkin B (2011) The king’s
foot of patient-reported outcomes: current practice and new developments for the measurements
of change. Qual Life Res 20:1159–1167. https://doi.org/10.1007/s11136-011-9863-1
Wolfe GI, Kaminski HJ, Aban IB, Minisman G, Kuo H-C, Marx A, . . . Evoli A (2016) Randomized
trial of thymectomy in myasthenia gravis. N Engl J Med 375(6):511–522. https://doi.org/10.
1056/NEJMoa1602489
Patient-Reported Outcomes
51
Gillian Gresham and Patricia A. Ganz
Contents
Introduction
Role of PROs in Clinical Trials
PROs to Support Labeling Claims in the United States
PROs to Support Labeling Claims in Other Countries
Types of PROs
Health-Related QOL (HRQOL)
Healthcare Utility and Cost-Effectiveness
PROs Developed by the National Institutes of Health (NIH)
Patient-Reported Common Terminology Criteria for Adverse Events (PRO-CTCAE)
Design Considerations
Selection of the Instrument
Modes of Administration and Data Collection Methods
Frequency and Duration of PRO Assessments
Other Design Considerations
Clinical Trial Protocol Development
Reporting PRO Results from Clinical Trials
Summary and Conclusion
Key Facts
References
G. Gresham
Samuel Oschin Comprehensive Cancer Institute, Cedars-Sinai Medical Center,
Los Angeles, CA, USA
e-mail: [email protected]
P. A. Ganz (*)
Jonsson Comprehensive Cancer Center, University of California at Los Angeles,
Los Angeles, CA, USA
e-mail: [email protected]
Abstract
Patient-reported outcomes (PROs) are defined as any report that comes directly
from a patient. Their use as key outcomes in clinical trials has increased signif-
icantly, especially during the last decade. PROs encompass a variety of measure-
ments including health-related quality of life (HRQOL), symptoms, functional
status, safety, utilities, and satisfaction ratings. Selection of the PRO in a trial will
depend on a variety of factors including the trial’s objectives, study population,
disease or condition, as well as the type of treatment or intervention. PROs can be
used to inform clinical care and to support drug approval and labeling claims.
This chapter will provide an overview of the different types of PROs
with examples and their role in healthcare within the context of clinical trials.
Summaries of important regulatory documents including the FDA PRO guidance
and recommendations (SPIRIT and CONSORT PRO extensions) will also be
provided. Considerations when designing clinical trials are described in the last
section, highlighting important issues and topics that are unique to PROs. Many
methodologic and analytic features of PROs are similar to those of any outcomes
used in clinical trials; thus they require the same methodological rigor with
special attention to missing data. This chapter is written with a focus on the use
of PROs in interventional trials in the United States, although most information
can be applied to any context. Information presented in this chapter is relevant to
clinicians, researchers, policy makers, regulatory and funding agencies, as well as
patients. When used appropriately, PROs can generate high-quality data about the
effects of a particular intervention on a patient’s physical, psychological, func-
tional, and symptomatic experience.
Keywords
Patient-reported outcomes (PROs) · Quality of life · Health-related quality of
life · Health status · Symptoms · Adverse events · NIH PROMIS
Introduction
During the last two decades, patient-reported outcomes (PROs) have increasingly
been used in clinical trials to inform clinical care and to support drug approval and
labeling claims. PROs are defined as: “any report of the status of a patient’s health
condition that comes directly from the patient, without interpretation of the patient’s
response by a clinician or anyone else” (FDA 2009). A list of key terms related to
PROs and their definitions is included in Table 1.
There are several key regulatory and academic events that led to the increased
incorporation of PROs in clinical trials. Some of the first research studies to use
PROs included the Alameda County Human Population Laboratory Studies, the
RAND Health Insurance Experiment, and the Medical Outcome Studies occurring in
the 1970s and 1980s (Ganz et al., cited in Kominski 2013).

Table 1 (continued)

Questionnaire: A set of questions or items shown to a respondent to get answers for research purposes. Types of questionnaires include diaries and event logs (b)
Reliability: The ability of a PRO instrument to yield consistent, reproducible estimates of true treatment effect (b)
Sign: Any objective evidence of a disease, health condition, or treatment-related effect. Signs are usually observed and interpreted by the clinician but may be noticed and reported by the patient (b)
Structure of care: Refers to how medical and other services are organized in a particular institution or delivery system (a)
Symptom: Any subjective evidence of a disease, health condition, or treatment-related effect that can be noticed and known only by the patient (b)
(a) Definitions transcribed from Ganz et al. (2014)
(b) Definitions transcribed from US Department of Health and Human Services, Food and Drug Administration (2009)

In July 1990, the US
National Cancer Institute held an all-day workshop to discuss the inclusion of PROs
in cancer clinical trials, informing subsequent strategies for their use in federally
funded cancer trials (Nayfield et al. 1992). Additional regulatory advancements in
the early 2000s include the draft of the EMA Reflection Paper on HRQL (2005) and
release of the Food and Drug Administration (FDA) draft guidance in 2006. In 2009,
the FDA published final guidance regarding the use of PROs for medical product
development to support labeling claims (FDA 2009). The guidance provides infor-
mation related to the evaluation of a PRO instrument and clinical trial design and
protocol considerations. This guidance will be used throughout this chapter to
support recommendations for the design of clinical trials that incorporate PROs.
Another key event in the history of PROs in the United States was the establish-
ment of the Patient-Centered Outcomes Research Institute (PCORI) as part of the
2010 Affordable Care Act (Frank et al. 2014). PCORI is an independent nonprofit,
nongovernmental organization that funds a wide range of research that incorporates
the patient perspective and will improve healthcare delivery and patient outcomes. It
is the largest public research funder that focuses on comparative effectiveness research (CER), having funded over $2 billion in research and related projects
to date. A unique feature of PCORI includes the active engagement of patients and
stakeholders throughout the research and review process (Frank et al. 2014). The
PCORI merit review includes five criteria that address the “impact of the condition
on the health of individuals and populations, population for the study to improve
health care and outcomes, technical merit, patient-centeredness, and patient and
stakeholder engagement” (Frank et al. 2014). Additional funding information and
details about PCORI can be found at the following web link: https://www.pcori.org/.
The purpose of this chapter is to provide a summary of the role of PROs within the context of clinical trials, describe different types of PROs currently being used, and discuss design and reporting considerations unique to PROs.

Role of PROs in Clinical Trials
PROs play different, but equally important roles across each phase of drug devel-
opment (e.g., Phase I, II, III, IV). The role of the PRO depends largely on the clinical
trial endpoint model, defined in the FDA 2009 PRO guidance as “a diagram of the
hierarchy of relationships among all endpoints, both PRO and non-PRO, that
corresponds to the clinical trial’s objectives, design, and data analysis plan” (FDA
2009). The conceptual framework for each endpoint model is illustrated in Figs. 1
and 2 as adapted from the FDA guidance. A PRO may be defined as the primary,
secondary, or exploratory endpoint in a clinical trial. If used as a primary endpoint,
the methods for sample size determination and power calculations should be
included, as described further in ▶ Chap. 92, “Statistical Analysis of Patient-
Reported Outcomes in Clinical Trials.” Regardless of the type of outcome, the
specific PRO hypothesis and objectives should be clearly stated a priori along with
details of the instrument’s conceptual framework and measurement properties in the
protocol.
The use of PROs across all phases of clinical trials, as registered in clinicaltrials.gov between 2000 and 2018, has increased over time as displayed in Fig. 3. Based on a general search of clinicaltrials.gov, we identified approximately 146 trials that included PROs or HRQOL as outcomes in 2000, 693 in 2005, 1056 in 2010, and 1505 in 2018. The majority of trials registered during this period were phase 2 (n = 6768) and phase 3 (n = 5807), followed by phase 4 (n = 2715) and phase 1 (n = 1769).
In early development trials (phase I/II), PROs can provide important information
about specific toxicities and symptoms before the treatment progresses to the next
phase (Lipscomb et al. 2005; Basch et al. 2016). In the earlier phases of clinical
trials, disease-specific measures and PROs may be more useful and clinically
relevant to clinicians and patients (Spilker 1996). Because early phase trials enroll
selective groups of subjects, PROs should be employed cautiously (Lipscomb et al.
2005). Early phase trials are less likely to support PRO labeling claims but can
provide early insight into toxicity and tolerability of new drugs or devices.
PROs to Support Labeling Claims in the United States
There has been an increasing call for the incorporation of the patient voice into FDA
drug approval and labeling claims. This led to the release of PRO-specific FDA
guidance for industry (FDA 2009). The 2009 FDA guidance for use of PROs to
support labeling claims provides recommendations for areas that should be
addressed in PRO documents that are submitted to the FDA for review (FDA
2009). To ensure high-quality PRO data that is used to support the labeling claims,
the guidance focuses on the evaluation of the PRO instrument and design consider-
ations. Briefly, these areas include as follows: (I) the PRO instrument being used
along with instructions; (II) targeted claims or target product profile related to the
trial outcome measures (e.g., disease or condition, intended population, data analysis
plan); (III) the endpoint model; (IV) the PRO instrument’s conceptual framework;
(V) documentation for content validity; (VI) assessment of other measurement
properties (e.g., reliability, construct validity); (VII) interpretation of scores; (VIII)
language translations (if applicable); (IX) the data collection method; (X) any
modifications to the original instrument with justification; (XI) the protocol includ-
ing PRO-specific content; and (XII) key references. Detail and explanation for each
of these topics are included in the Appendix of the final 2009 FDA guidance
document and described within the context of clinical trial design in a later section.
While the FDA PRO guidance marks a significant advancement for the field, the
use of PROs to support labeling claims and their inclusion in published reports of
clinical trials remains low (Basch 2012). In the United States, it has been reported
that approximately 20–25% of all drug approvals were supported by PROs between
2006, when the FDA draft guidance was released, and 2015, despite the fact that
50% of drug approval packages included PRO endpoints (Basch 2012; DeMuro
et al. 2013). A literature review identified 182 NDAs between 2011 and 2015, for
which 16.5% had PRO labeling, defined as any treatment benefit related to PROs
that are mentioned in the FDA product label (Gnanasakthy et al. 2017). The authors
found that the majority of PRO labeling was based on primary outcomes in
PRO-dependent diseases (e.g., mental, behavioral, and neurodevelopmental disorders,
diseases of the respiratory system, and diseases of the musculoskeletal system).
While other countries have not issued formal PRO-specific guidelines such as the
2009 US FDA guidance, recommendations exist for the use of PROs to support the
evaluation and approval of drug products. The perspectives of international
regulatory scientists on the incorporation of PROs into the regulatory
decision-making process have been described in a recent paper by Kluetz
et al. (2018). In Europe, the European Medicines Agency (EMA) published a
reflection paper that provides recommendations on HRQOL evaluation within the
context of clinical trials (European Medicines Agency 2005; DeMuro et al. 2013). A
review published in 2013 compared PRO label claims granted by the FDA to those
granted by the EMA between 2006 and 2010 (DeMuro et al. 2013). The authors
found that the EMA granted more PRO labels: 47% of products had at least one
EMA-granted PRO label, compared with 19% for the FDA. They also observed that the
majority of FDA claims focus on symptoms, while the EMA-granted claims are
more likely to approve treatments based on higher-order concepts such as HRQOL,
patient global ratings, and functioning. Of the 52 PRO label claims granted across
the two agencies, 14 products received a PRO label from both the FDA and EMA
(DeMuro et al. 2013). The UK National Institute for Clinical Excellence (NICE) has also provided
guidance for the inclusion of patients and public in treatment evaluations (NICE
2006).
In most parts of Europe, Australia, and Canada, PROs are included as an
important component of health technology assessment (HTA). The WHO defines
HTA as: “The systematic evaluation of properties, effects, and/or impacts of health
technology.” HTA was first developed in the United States in the 1970s as a policy
tool and later introduced to Europe, Canada, the United Kingdom, and Scandinavia
(Banta 2003). HTA is used to support healthcare policy and to inform reimbursement
and coverage decisions by insurers (Banta 2003). Although the United States supports
HTA research, including as a component of comparative effectiveness research, no
formal HTA agency exists there. Survival, QOL, and cost-effectiveness outcomes
generated from clinical trials are
subsequently used to inform HTA. Thus, it is important that trials are designed to
incorporate these important and informative outcomes in order to generate high-
quality evidence.
Types of PROs
HRQOL is a special type of PRO that is widely used in clinical trials (Piantadosi
2017). It is considered an outcome of care that focuses on a “patient’s own percep-
tion of well-being, and the ability to function as a result of health status or disease
experience” (Ganz et al. 2014). The World Health Organization (WHO) defines
health as: “A state of complete physical, mental and social well-being and not merely
the absence of disease or infirmity,” and health has been schematically represented
in three levels of QOL: the overall assessment of well-being, broad domains, and
the specific components of each domain (Spilker 1996). Early reviews
of the MEDLINE database demonstrated an exponential increase in the number of
studies that used QOL evaluation, with only five studies identified in 1973 and 9291
articles in 2010 (Testa and Simonson 1996; Ganz et al. 2014). By 2018, this number
had grown to just over 34,000 articles using “quality of life” as a topic heading, with 1532 if
restricted to human clinical trials. In a clinicaltrials.gov search of the term “quality of
life,” a total of 28,886 interventional trials started between 01/01/2000 and 12/31/
2018 were identified. It is anticipated that this number will continue to grow as the
relevance and value of QOL are increasingly recognized.
HRQOL is measured using instruments (self-administered questionnaires) that
contain questions or items that are subsequently organized into scales (Ganz et al.
2014). HRQOL instruments may be general, addressing common components of
functioning and well-being, or disease-targeted, focusing on the specific impact
of the particular organ dysfunctions that affect HRQOL (Patrick and Deyo 1989, cited in
Ganz et al. 2014). HRQOL is considered multidimensional, consisting of different
domains (physical, psychological, economic, spiritual, social) and specific
measurement scales within each domain (Testa and Simonson 1996; Spilker 1996).
The overall assessment of well-being, or global HRQOL score, is often obtained by
summing scores across the different domains or may be captured by a single global
rating scale. Scores can also be computed for each broad domain (e.g., the physical
domain) or for specific items (e.g., symptoms, functioning, disability).
One of the most common generic HRQOL instruments, emerging from the
Medical Outcomes Study (RAND Corporation), is the SF-36. It is composed of 36
items across 8 scales (physical functioning, role-physical, bodily pain, general
health, vitality, social functioning, role-emotional, mental health), which can be
further grouped into physical health and mental health summary measures. An
abbreviated but largely equivalent version of the SF-36, the 12-item SF-12, was also
developed through the MOS and is summarized into the mental and physical
domains. The SF-36 and SF-12 have been particularly useful in comparing the
relative burden of different diseases. They also have potential for evaluating the
QOL burden from different treatments (Spilker 1996).
An example of a disease-specific instrument of QOL is the European Organiza-
tion for Research and Treatment of Cancer (EORTC) QLQ-C30 questionnaire
(Aaronson et al. 1993). The QLQ-C30 is a cancer-specific questionnaire that was
developed from an international field study of 305 patients from centers across 13
different countries. The QLQ-C30 is composed of nine multi-item scales, which
include functional scales (physical, role, emotional, cognitive, social), symptom
scales (fatigue, nausea/vomiting, pain, dyspnea, sleep disturbance, appetite loss,
constipation, diarrhea, financial impact), and a global health and QOL scale. The
EORTC questionnaire and its many cancer-specific modules have been used
throughout the world to evaluate HRQOL outcomes in cancer treatment trials
(https://www.eortc.org/). A comprehensive inventory of HRQOL instruments is avail-
able online at https://eprovide.mapi-trust.org/ with information on over 2000 generic
and disease-specific instruments.
HRQOL scores can be used as utility coefficients to weight or adjust outcomes such as
survival or progression-free survival and to inform policy and healthcare decisions. For
example, quality-adjusted life years (QALYs) are a measure of life expectancy that
adjusts for QOL (Kominski 2014). QALYs are often used to evaluate programs and
assist in decision-making; the Institute of Medicine (IOM) Committee on Public
Health Strategies to Improve Health recommended that QALYs be used to monitor
the health status of all communities (Institute of Medicine 2011).
Consequently, QALYs have appeared in an increasing number of studies, and
advances have been made in methodologies for measuring and reporting them
(Ganz et al. in Kominski 2014). QALYs can
also be combined with cost evaluations, where the cost per QALY can be used to
show the relative efficiency of different health programs. While QALYs have played
an important role in health policy, they are also associated with limitations and
obstacles; for example, the Affordable Care Act states that QALYs cannot be used for
resource allocation decisions, and thus studies funded by PCORI do not include
these assessments (Ganz et al. in Kominski 2014). However, in the United Kingdom,
and other jurisdictions, their assessments play an important role in drug and device
approval processes.
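As a minimal sketch of the QALY arithmetic, QALYs sum the time spent in each health state weighted by that state's utility (1.0 for full health, 0.0 for death), and the incremental cost per QALY gained compares two programs. All utility weights, durations, and costs below are hypothetical.

```python
# Minimal sketch of QALY arithmetic with hypothetical values:
# QALYs = sum of (years in a health state) x (utility weight of that state).
def qalys(states):
    """states: list of (years_in_state, utility_weight) pairs."""
    return sum(years * utility for years, utility in states)

# Hypothetical comparison of two programs:
treatment = qalys([(2, 0.9), (3, 0.7)])   # 3.9 QALYs
comparator = qalys([(2, 0.8), (2, 0.6)])  # 2.8 QALYs
incremental_cost = 22_000                 # hypothetical extra cost
print(incremental_cost / (treatment - comparator))  # cost per QALY gained: 20000.0
```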
Another measure that incorporates quantity and quality of life is the quality-
adjusted time without symptoms of disease and toxicity of treatment (Q-TWIST).
It is based on the concept of QALYs and represents a utility-based approach to QOL
assessment in clinical trials (Spilker 1996). Q-TWIST is calculated by subtracting
the number of days with specified disease symptoms or treatment-induced toxicities
from the total number of days of survival (Lipscomb et al. 2005). It requires
calculation of QOL-oriented clinical health states, for which treatments can then
be compared using a weighted sum of the mean duration of each health state with
weights being utility-based (Spilker 1996; Lipscomb et al. 2005). Often, the area
under the Kaplan-Meier survival curves is partitioned to calculate the average time a
patient spends in each clinical health state. An example of where Q-TWIST was
informative is in the case of adjuvant chemotherapy with tamoxifen for breast cancer
versus tamoxifen alone, where findings demonstrated the additional burden the
cytotoxic chemotherapy imposed on patients (Gelber et al. cited in Lipscomb et al.
2005). Another example comes from a trial that demonstrated lengthened progression-
free survival (0.9 months) with zidovudine in patients with mildly symptomatic HIV
infection; Q-TWIST analysis, however, showed that patients treated with zidovudine
did worse (Spilker 1996). Additional details and analysis information related to
QALYs and Q-TWIST are described in ▶ Chap. 92, “Statistical Analysis of Patient-
Reported Outcomes in Clinical Trials.”
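A minimal sketch of the Q-TWIST calculation just described, assuming hypothetical mean state durations (obtained in practice by partitioning the area under Kaplan-Meier curves) and hypothetical utility weights; time without symptoms or toxicity (TWiST) is conventionally counted at full value.

```python
# Minimal sketch: Q-TWIST as a utility-weighted sum of mean time in each
# clinical health state. Durations (months) and utilities are hypothetical.
def q_twist(tox, twist, rel, u_tox=0.5, u_rel=0.5):
    """tox:   mean time with treatment toxicity
    twist: mean time without symptoms of disease or toxicity (weight 1.0)
    rel:   mean time after relapse/progression"""
    return u_tox * tox + 1.0 * twist + u_rel * rel

# Hypothetical two-arm comparison (e.g., chemotherapy+tamoxifen vs tamoxifen):
print(q_twist(tox=6.0, twist=40.0, rel=10.0))   # 48.0 quality-adjusted months
print(q_twist(tox=0.5, twist=42.0, rel=12.0))   # 48.25 quality-adjusted months
```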
NIH PROMIS®
In 2004, the NIH established the Patient-Reported Outcomes Measurement Infor-
mation System (PROMIS®) as part of a multicenter cooperative group with the
purpose of developing and validating PROs for clinical research and practice (Cella
et al. 2010). PROMIS® consists of over 300 self-reported and parent-reported
measures of global, physical, mental, and social health. The domain mapping
process was based on the WHO framework of physical, mental, and social
health. Qualitative research and item response theory (IRT) analysis of 11 large
datasets informed an item library of close to 7000 PRO items to be further reviewed
and evaluated in field testing (Cella et al. 2010). PROMIS® measurement tools have
been validated in adults and children from the general population and those living
with chronic conditions. Additional information and PROMIS® measures are avail-
able through the official information and distribution center: HealthMeasures (http://
www.healthmeasures.net/index.php).
The adult PROMIS® measures framework is composed of the PROMIS® profile
domains and additional domains, which are further categorized into physical, mental,
or social health components (HealthMeasures 2019). Physical health measures include
fatigue, pain intensity, pain interference, physical function, and sleep disturbance.
Mental health profile domains include anxiety and depression, and social health includes
the ability to participate in social roles and activities. A general global health measure
also makes up the self-reported health framework. The complete framework and a list
of additional PROMIS® domains can be accessed at http://www.healthmeasures.net/
explore-measurement-systems/promis/intro-to-promis.
A framework for pediatric self-reported and proxy-reported health was also
developed including the same health measures (physical, mental, social, global
health) with slightly different profile and additional domains. For instance, physical
health adds mobility and upper extremity function to the list of physical profile
domains, while sleep disturbance is not included. Anxiety and depressive symptoms
represent profile domains for mental health, while peer relationships are assessed as
part of social health.
PROMIS® measures can be administered using computer adaptive testing or on
paper through short forms or profiles (HealthMeasures 2019). The PROMIS® self-
report measures are intended to be completed by respondents themselves without
help from others; if respondents are unable to answer on their own, a
parent or proxy measure may be used. Computer adaptive tests and short forms can
be imported into common data platforms and web applications such as REDCap,
Epic, OBERD, the Assessment Center (SM), and the Assessment Center Application
Programming Interface (API), which can connect any data collection software
application with the full library of PROMIS ® measures.
PROMIS® items include questions accompanied by Likert-type responses (e.g., not
at all, very little, somewhat, quite a lot, cannot do), each of which is associated with a
numerical score (e.g., 1–5). The item scores are summed and then converted into standardized
T-scores through the HealthMeasures scoring service, automatic scoring in data
collection tools, or manually. The T-score metric has a mean of 50 and a standard deviation
of 10, making it easy to compare scores with reference populations, including the general
population and clinical samples (e.g., cancer or pain populations) (Cella et al. 2007).
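A hedged illustration of this scoring flow follows. Real raw-score-to-T-score conversions come from published IRT-calibrated lookup tables via the HealthMeasures scoring service; the short form, item coding, and lookup values below are entirely hypothetical.

```python
# Hedged sketch of PROMIS-style scoring: item responses are summed and the
# raw sum is mapped to a T-score (mean 50, SD 10) via a lookup table.
# The table values below are HYPOTHETICAL, not official conversions.
RAW_TO_T = {4: 22.9, 8: 33.7, 12: 41.1, 16: 48.0, 20: 56.9}  # hypothetical

def t_score(responses):
    """responses: per-item scores on a 5-point scale (coded 1-5 here)."""
    raw = sum(responses)
    return RAW_TO_T[raw]  # real tables cover every attainable raw score

print(t_score([3, 3, 3, 3]))  # raw 12 -> T = 41.1 under the hypothetical table
```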
PROMIS® scores can also be converted to similar items from different instruments
such as the SF-36. For example, a PROMIS® physical T-score for physical function
can be linked to the SF-36 physical function score (www.prosettastone.org). This
allows for PRO evaluation and comparisons even when different measures are used.
The use of PROMIS® measures in clinical trials has increased significantly since
their development, with over 2000 observational or interventional studies identified
in clinicaltrials.gov as of January 2021. The majority of trials that use NIH
PROMIS® measures are for pain conditions, musculoskeletal diseases, and mental and
psychotic disorders. Their use is also increasing in cancer clinical trials, especially
breast cancer, and diseases of the central nervous system.
PRO-CTCAE
The National Cancer Institute developed a patient-reported outcomes version of the
Common Terminology Criteria for Adverse Events (PRO-CTCAE) (Basch
et al. 2014; Atkinson et al. 2017; Dueck et al. 2015), reflecting toxicities and adverse
events typically measured by clinical trial staff as part of the Common Terminology
Criteria for Adverse Events (CTCAE) assessment system. The PRO-CTCAE mea-
surement system includes 78 treatment toxicities that patients can systematically use
to document the frequency, severity, and interference of each toxicity (Basch et al.
2014). PRO-CTCAE includes a library of 124 questions from which items can be
selected as relevant to a specific trial. Each question measures the
frequency of the AE (e.g., “never,” “rarely,” “occasionally,” “frequently,” “almost
constantly”), the severity (“none,” “mild,” “moderate,” “severe,” “very severe”), or
the interference with usual or daily activities (“not at all,” “a little bit,” “somewhat,”
“quite a bit,” “very much”) (Basch et al. 2016). The patient-reported AEs can be
systematically collected at baseline, during active treatment, and during follow-up
using a pre-populated questionnaire.
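A minimal sketch of how a PRO-CTCAE-style item response might be represented, using the three attributes and response options quoted above; the symptom term and this data layout are illustrative, not the official instrument format.

```python
# Illustrative representation of a PRO-CTCAE-style item response using the
# frequency/severity/interference attributes quoted in the text above.
from dataclasses import dataclass

FREQUENCY = ["never", "rarely", "occasionally", "frequently", "almost constantly"]
SEVERITY = ["none", "mild", "moderate", "severe", "very severe"]
INTERFERENCE = ["not at all", "a little bit", "somewhat", "quite a bit", "very much"]

@dataclass
class ProCtcaeResponse:
    symptom: str        # e.g., "nausea" (illustrative)
    frequency: int      # index into FREQUENCY (0-4)
    severity: int       # index into SEVERITY (0-4)
    interference: int   # index into INTERFERENCE (0-4)

r = ProCtcaeResponse("nausea", frequency=3, severity=2, interference=1)
print(FREQUENCY[r.frequency], SEVERITY[r.severity], INTERFERENCE[r.interference])
```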
An electronic platform for PRO-CTCAE also exists, allowing data collection to be
customized to a particular treatment schedule and incorporating patient reminders
and clinician alerts (Dy et al. 2011; Basch et al. 2017b). Studies of both the PRO-
CTCAE and electronic PRO-CTCAE symptom collection system have demon-
strated feasibility, validity, and reliability of both the PRO-CTCAE and electronic
PRO-CTCAE in cancer patients (Dy et al. 2011; Basch et al. 2017a). Their use has
also been associated with enhanced care, improved QOL, and survival, possibly as
the result of earlier responsiveness to patient symptoms by medical personnel (Basch
et al. 2014, 2017; Aaronson et al. 1993).
Design Considerations
Selection of the PRO in a trial will depend on a variety of factors including the trial’s
objectives, study population, and disease or condition, as well as, to some extent, the
type of treatment or intervention (Piantadosi 2017). When designing a clinical trial, the
instrument(s) used should be specified a priori and appropriately selected for the
specific population being enrolled in the clinical trial. For instance, additional
measurement considerations may need to be accounted for when assessing PROs
in pediatric, cognitively impaired, or seriously ill patients (FDA 2009). Investigators
should first determine whether an adequate PRO instrument exists to assess and
measure the concepts of interest (FDA 2009). In some cases, a new PRO instrument
may be developed or modified, with additional steps that would need to be taken to
ensure validity and reliability.
Validity and reliability should be supported before using an instrument in a
clinical trial. The FDA guidance requires that content validity, as well as other
measurement properties such as reliability and construct validity, be documented
for any instrument used to support labeling claims (FDA 2009).
There are several different ways that PROs can be administered including
self-administration, interview-administered, telephone-administered, surrogate-
(or proxy) administered, or a combination of modes (Spilker 1996; Lipscomb
et al. 2005; FDA 2009). When selecting the mode of administration and data
collection method in a trial, it is important to consider its intended use, the cost,
and how missing data can be reduced (Lipscomb et al. 2005). While self-admin-
istered PROs require minimal resources, they are associated with an increased
likelihood of missing items, misunderstanding, and lower response rates
(Spilker 1996). For both face-to-face and telephone interviews, response rates are
maximized, while missing data and errors of misunderstanding are minimized
(Spilker 1996). Disadvantages of these methods are that more time and resources
are required to train the interviewers and administer the questionnaires. Addition-
ally, for telephone interviews, the format of the instrument is further limited
(Spilker 1996). A third option is the use of surrogate responders to complete the
assessments. An advantage of using surrogates or proxies is that this approach is
more inclusive of patients who may not be able to complete the questionnaires
themselves, such as children and those who are cognitively impaired or face
language barriers (Spilker
1996). A risk associated with using surrogate responders is that the perceptions of
the surrogate may be different from those of the target group and not accurately
represent the patient’s perspective. For example, proxy reports of more observable
domains such as physical or cognitive function domains may be overestimated,
while symptoms or signs may be underestimated by proxy respondents (Spilker
1996). Thus, it is important to consider the strengths and weaknesses of each mode
of administration and identify the mode that is most relevant and appropriate for
each context.
The frequency and duration of PRO assessments must correspond to the specific
research question and objectives of the particular clinical trial. The frequency of
assessments will depend on the natural history of disease, the timing of the thera-
peutic and diagnostic interventions, and the likelihood of changes in the outcome
within the time period (Lipscomb et al. 2005; FDA 2009). Clinical trials with PROs
will often require at least one baseline assessment and several PRO assessments over
the course of the study period. Assessments should be frequent enough to capture the
meaningful change without introducing additional burden to the patients. They
should also not be more frequent than the specific period of recall, explained in
the next section, as defined in the instrument. For example, if an instrument has a 1-
month recall period, assessments should not occur weekly or daily.
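This constraint is easy to enforce mechanically. The helper below is a small hypothetical sketch, not part of any guidance: it checks that scheduled assessment intervals are no shorter than the instrument's recall period.

```python
# Hypothetical helper enforcing the rule just described: scheduled PRO
# assessment intervals should not be shorter than the instrument's recall period.
def check_schedule(assessment_days, recall_days):
    """assessment_days: sorted study days on which the PRO is administered."""
    gaps = [b - a for a, b in zip(assessment_days, assessment_days[1:])]
    return all(gap >= recall_days for gap in gaps)

print(check_schedule([0, 7, 14, 21], recall_days=30))   # False: weekly vs 1-month recall
print(check_schedule([0, 30, 60, 90], recall_days=30))  # True: monthly schedule
```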
The duration of the assessment will also depend on the research question and
should cover the period of time that is sufficient to observe changes in the outcome
(Lipscomb et al. 2005). Investigators should also consider whether they are inter-
ested in specific changes that occur during therapy or the long-term effect of the
therapy on that particular outcome. Therefore, the duration of follow-up with a PRO
assessment may be the same as for other measures of efficacy or may be longer in
duration if the study objectives require continued assessment. In the former case, it is
important that efforts are made to reduce missing data and loss to follow-up.
Investigators should also take the recall period for the PRO instrument into
consideration when designing a trial. This is defined as the period of time patients
are asked to consider in responding to a PRO or question and can be momentary or
retrospective of varying lengths (FDA 2009). The recall period will depend on the
purpose and intended use of the PRO instrument, the concept being measured, and
the specific disease and treatment schedule. Items with shorter recall periods are
preferred over longer retrospective ones because patients are likely to be influenced
by their current state at the time of recall (FDA 2009). Thus, PRO instruments with
short recall periods administered at regular intervals (e.g., 2, 4, 6 weeks) may
enhance data quality and reduce the risk of recall bias, though at a greater
chance of missing data.
The FDA PRO guidance also reviews clinical trial design principles unique to PRO
endpoints. The first consideration relates to masking (blinding) and randomization of
trial participants. In the case of open-label trials, patients may overestimate the
treatment benefit in their responses, while those who are not receiving active
treatment may underreport any potential improvements (FDA 2009). The guidance
suggests administering the PRO assessments prior to clinical assessments or procedures
to minimize the potential influence on patient perceptions. It is rare that such open-
label trials will be adequate to support labeling claims based on PRO instruments. In
masked clinical trials, there is still a possibility for inadvertent unblinding if a
treatment has obvious effects, such as adverse events. Consequently, similar over-
or underreporting of the treatment effect may occur if a patient thinks they are
receiving one treatment over the other. To decrease the risk of possible unblinding,
the guidance suggests using response options that ask for current status, not giving
patients access to previous responses, and using instruments that include many items
about the same concept (FDA 2009). Investigators should also take specific host
factors into consideration, where randomization may not achieve balance with
regard to the specific PROs (psychological and functional outcomes) that partici-
pants have at baseline or develop as a result of treatment.
The FDA guidance provides additional recommendations for clinical trial quality
control when using PROs in order to ensure standardized assessments and processes.
Specifically, the protocol should include information on how both patients and
interviewers (if applicable) will be trained for the PRO instrument along with
detailed instructions. The protocol should also include instructions regarding the
supervision, timing, and order of questionnaire administration as well as the pro-
cesses and rules for the questionnaire review for completeness; documentation of
how and when data are filed, stored, and transmitted to/from a clinical trial site; and
plans for confirmation of the instrument’s measurement properties using the clinical
trial data (FDA 2009).
A third recommendation as it relates to the use of PROs is to provide detailed
plans on how investigators will minimize and handle missing data (FDA 2009).
Because longitudinally measured PROs are subject to informative missingness, they
can introduce bias and interfere with the ability to compare effects across groups.
Thus, the protocol should include plans for collecting reasons that patients
discontinued treatment or withdrew their participation. Efforts should also be
made to continue to collect PRO data, regardless of whether patients discontinued
treatment, and a process should be established for how to obtain PRO measurement
before or after patient withdrawal to prevent loss to follow-up. Details on statistical
methods to account for missing data in the analysis plan are further described in
▶ Chap. 92, “Statistical Analysis of Patient-Reported Outcomes in Clinical Trials,”
of this book. Despite more stringent guidance on the assessment of PROs in clinical
research, there remains a need for a more standardized and coordinated approach to
further improve the efficiency with which PROs are collected and to maximize the
benefits of PROs in healthcare (Calvert et al. 2019).
It is essential that clinical trial protocols incorporating PROs are designed with the
same methodological rigor and detail as any other clinical trial protocol. To improve
standardization and enhance quality across clinical trial protocols, the SPIRIT
(Standard Protocol Items: Recommendations for Interventional Trials) statement
was developed, with the most recent version being published in 2013 (Chan et al.
2013) (Appendix 10.7). It consists of 33 recommended items to include in clinical
trial protocols organized by protocol section. To address PRO content-specific
recommendations, a PRO extension of the SPIRIT statement was developed in
2017. In addition to the 33 checklist items from the SPIRIT statement, the PRO
extension includes 11 extensions and 5 elaborations that focus on PRO-specific
issues across each protocol section (Calvert et al. 2018). The PRO-specific
elaborations and extensions to the standard SPIRIT checklist are paraphrased in the
following section.
As part of the administrative information and introduction components of the trial
protocol, PRO elaborations and extensions include SPIRIT-5a specifying the indi-
vidual(s) responsible for the PRO content of the trial; SPIRIT-6a describing the
PRO-specific research question and rationale for PRO assessment and summarizing
PRO findings in relevant studies; and SPIRIT-7 stating the PRO objectives or
hypotheses. PRO extensions related to the methods section of the protocols are:
SPIRIT-10 Specify any PRO-specific eligibility criteria. If PROs are only collected in
a subsample, provide rationale and description of methods for obtaining the PRO
subsample.
SPIRIT-12 Specify the PRO concepts/domains used to evaluate the intervention and,
for each one, the analysis metric and principal time point or period of interest.
SPIRIT-13 Include a schedule of PRO assessments (with rationale for the time
points) and specify time windows and whether order of administration will be
standardized if using multiple questionnaires.
SPIRIT-14 State the required sample size and how it was determined (if PRO is the
primary endpoint). If sample size is not established based on PRO, discuss the
power of the principal PRO analyses.
SPIRIT-18a(i) Justify the PRO instrument that will be used and describe domains,
number of items, recall period, instrument scaling, and scoring. Provide informa-
tion about the instrument measurement properties, interpretation guidelines,
patient acceptability and burden, as well as the user manual, if available.
SPIRIT-18a(ii) Include the data collection plan that outlines the different modes of
administration and the setting.
SPIRIT-18a(iii) Specify whether more than one language version will be used and
state whether translated versions have been developed and will be used.
SPIRIT-18a(iv) If trial requires a proxy-reported outcome, state and justify the use of
a proxy respondent and cite the evidence of the validity of the proxy assessment,
if available.
SPIRIT-18b(i) Specify the PRO data collection and management strategies used to
minimize missing data.
SPIRIT-18b(ii) Describe the process of PRO assessment for participants who dis-
continue or deviate from the assigned intervention.
SPIRIT-20a State PRO analysis methods and include plans for addressing multiplic-
ity and type 1 error.
SPIRIT-20c State how missing data will be described and outline methods for
handling missing items or entire assessments.
SPIRIT-22 State whether PRO data will be monitored during the study to inform
patient care and how it will be managed in a standardized way.
Standards also exist to improve the quality and completeness of reporting PROs from
clinical trials (Calvert et al. 2013). The Consolidated Standards of Reporting Trials
(CONSORT) statement, first published in 1996 to improve clinical trial reporting, is
intended as a tool for authors, reviewers, and consumers and is endorsed by major
journals and editorial groups (Calvert et al. 2013) (Appendix 10.5). Since then,
extensions have been developed to address specific trial designs and methods,
including the CONSORT PRO extension, published in 2013, which provides
recommendations for RCTs that incorporate PROs in order to facilitate interpretation
of PRO results and inform clinical care. The CONSORT PRO extension describes
five specific recommendations that supplement the standard CONSORT guidelines:
(1) PROs should be identified as primary or secondary outcomes in the abstract;
(2) the scientific rationale and hypothesis for the PRO assessment should be
described; (3) evidence of the PRO instrument's validity and reliability should be
provided or cited; (4) statistical approaches for dealing with missing PRO data
should be explicitly stated; and (5) PRO-specific limitations and the generalizability
of the results to clinical practice should be discussed.
Summary
PROs are important sources of information for the evaluation of clinical trial out-
comes. There is an increasing emphasis on the use of PROs to inform healthcare policy
and clinical care. Routine collection of PROs has also been shown to improve quality
of care, engagement, and survival in some cases (Basch 2010). To ensure high-quality
PRO data, guidance and recommendations exist for the design of trials (FDA PRO
guidance), protocol development (SPIRIT-PRO), as well as the reporting of PROs
(CONSORT). The role of PROs in clinical trials will depend on the study objectives,
the disease/condition, and the treatment/intervention. It is important to recognize some
of the challenges of using PROs in clinical trials, such as the difficulty of validating
different PRO measures, missing data, and the reporting biases that inherently exist in
their assessment. Given the added value that PROs provide as well as their potential
for improving patient-centered care, PROs should be incorporated into clinical trial
design following the available guidelines and recommendations that exist to ensure
high-quality PRO data.
Key Facts
– The use of PROs in clinical trials can provide added value about the efficacy and
tolerability of an intervention
– PROs are increasingly being used as primary or secondary outcomes to support
labeling claims in the United States and inform clinical decision making
– There remains a need to standardize methods and coordinate efforts in the
collection, assessment, analysis, and reporting of PROs
– Guidance and recommendations have been established to improve the quality and
methodologic rigor of clinical trials that incorporate PROs
References
Aaronson NK, Ahmedzai S, Bergman B, Bullinger M, Cull A, Duez NJ, Filiberti A, Flechtner H,
Fleishman SB, de Haes JC et al (1993) The European Organization for Research and Treatment
of Cancer QLQ-C30: a quality-of-life instrument for use in international clinical trials in
oncology. J Natl Cancer Inst 85(5):365–376
Atkinson TM, Stover AM, Storfer DF, Saracino RM, D’Agostino TA, Pergolizzi D, Matsoukas K,
Li Y, Basch E (2017) Patient-reported physical function measures in cancer clinical trials.
Epidemiol Rev 39(1):59–70
Banta D (2003) The development of health technology assessment. Health Policy 63(2):121–132
Basch E (2010) The missing voice of patients in drug-safety reporting. N Engl J Med
362(10):865–869
Basch E (2012) Beyond the FDA PRO guidance: steps toward integrating meaningful
patient-reported outcomes into regulatory trials and US drug labels. Value Health
15(3):401–403
Basch E (2018) Patient-reported outcomes: an essential component of oncology drug development
and regulatory review. Lancet Oncol 19(5):595–597
Basch E, Reeve BB, Mitchell SA, Clauser SB, Minasian LM, Dueck AC, Mendoza TR, Hay J,
Atkinson TM, Abernethy AP, Bruner DW, Cleeland CS, Sloan JA, Chilukuri R, Baumgartner P,
Denicoff A, Germain DS, O’Mara AM, Chen A, Kelaghan J, Bennett AV, Sit L, Rogak L, Barz
A, Paul DB, Schrag D (2014) Development of the National Cancer Institute’s patient-reported
outcomes version of the common terminology criteria for adverse events (PRO-CTCAE). J Natl
Cancer Inst 106(9):dju244. https://doi.org/10.1093/jnci/dju244. PMID: 25265940; PMCID:
PMC4200059
Basch E, Rogak LJ, Dueck AC (2016) Methods for implementing and reporting patient-reported
outcome (PRO) measures of symptomatic adverse events in cancer clinical trials. Clin Ther 38
(4):821–830
Basch E, Deal AM, Dueck AC, Scher HI, Kris MG, Hudis C, Schrag D (2017a) Overall survival
results of a trial assessing patient-reported outcomes for symptom monitoring during routine
cancer treatment. JAMA 318(2):197–198
Basch E, Dueck AC, Rogak LJ, Minasian LM, Kelly WK, O’Mara AM, Denicoff AM, Seisler D,
Atherton PJ, Paskett E, Carey L, Dickler M, Heist RS, Himelstein A, Rugo HS, Sikov WM,
Socinski MA, Venook AP, Weckstein DJ, Lake DE, Biggs DD, Freedman RA, Kuzma C,
Kirshner JJ, Schrag D (2017b) Feasibility assessment of patient reporting of symptomatic
adverse events in multicenter cancer clinical trials. JAMA Oncol 3(8):1043–1050
Calvert M, Blazeby J, Altman DG, Revicki DA, Moher D, Brundage MD (2013) Reporting of
patient-reported outcomes in randomized trials: the CONSORT PRO extension. JAMA 309
(8):814–822
Calvert M, Kyte D, Mercieca-Bebber R, Slade A, Chan A-W, King MT, and the SPIRIT-PRO
Group (2018) Guidelines for inclusion of patient-reported outcomes in clinical trial protocols:
the SPIRIT-PRO extension. JAMA 319(5):483–494
Calvert M, Kyte D, Price G, Valderas JM, Hjollund NH (2019) Maximising the impact of patient
reported outcome assessment for patients and society. BMJ 364:k5267
Cella D, Yount S, Rothrock N, Gershon R, Cook K, Reeve B, Ader D, Fries JF, Bruce B,
Rose M (2007) The patient-reported outcomes measurement information system (PROMIS):
progress of an NIH roadmap cooperative group during its first two years. Med Care
45(5 Suppl 1):S3
Cella D, Riley W, Stone A, Rothrock N, Reeve B, Yount S, Amtmann D, Bode R, Buysse D, Choi S,
Cook K, Devellis R, DeWalt D, Fries JF, Gershon R, Hahn EA, Lai JS, Pilkonis P, Revicki D,
Rose M, Weinfurt K, Hays R (2010) The patient-reported outcomes measurement information
system (PROMIS) developed and tested its first wave of adult self-reported health outcome item
banks: 2005–2008. J Clin Epidemiol 63(11):1179–1194
Chan A-W, Tetzlaff JM, Altman DG, Laupacis A, Gøtzsche PC, Krleža-Jerić K, Hróbjartsson A,
Mann H, Dickersin K, Berlin JA, Doré CJ, Parulekar WR, Summerskill WSM, Groves T,
Schulz KF, Sox HC, Rockhold FW, Rennie D, Moher D (2013) SPIRIT 2013 statement:
defining standard protocol items for clinical trials. Ann Intern Med 158(3):200–207
Testa MA, Simonson DC (1996) Assessment of quality-of-life outcomes. N Engl J Med 334
(13):835–840
US Department of Health and Human Services (USDHHS): Food and Drug Administration (2006)
Draft guidance for industry. Patient-reported outcome measures: use in medical product
development to support labeling claims. February 2006. www.ispor.org/workpaper/
FDAPROGuidance2006.pdf. Accessed 1 Mar 2019
US Department of Health and Human Services (USDHHS): Food and Drug Administration (2009)
Guidance for industry. Patient-reported outcome measures: use in medical product development
to support labeling claims. December 2009. www.fda.gov/downloads/Drugs/
GuidanceComplianceRegulatoryInformation/Guidances/UCM193282.pdf. Accessed 1 Mar 2019
Translational Clinical Trials
52
Steven Piantadosi
Contents
Introduction
Definitions
Issues in Translational Trials
Examples
Reducing Uncertainty
Sample Size
Safety Versus Efficacy
Summary and Conclusion
Key Facts
Cross-References
References
Abstract
Translational research and clinical trials are often discussed, especially in aca-
demic centers, from the perspective that such efforts are difficult or endangered.
Yet every new therapeutic must pass through this stage of investigation where
promising evidence supports staged development with the goal of product regis-
tration. Many clinical investigators instinctively know that relatively small clin-
ical trials can be essential in translation, but this is often contrary to the statistical
rigors of later development. This chapter attempts to reconcile these equally valid
perspectives.
Keywords
Translational research · Translational clinical trials · Biomarkers · Information ·
Entropy · Sample size
S. Piantadosi (*)
Department of Surgery, Division of Surgical Oncology, Brigham and Women’s Hospital, Harvard
Medical School, Boston, MA, USA
e-mail: [email protected]
Introduction
The terms basic research, clinical research, and translational research require some
practical definitions. Basic and clinical investigations are the ends of a spectrum and
therefore have always been part of the research landscape by default. Basic implies
that the research does not need an immediate application but represents knowledge
for its own sake. Clinical is the direct application to care or prevention of illness –
literally “at the bedside.” Without fanfare, translational research takes place between
the two ends of the spectrum, so it too has always been with us but perhaps less
obvious.
Historically, academic centers seem to have concentrated on basic and clinical
research, while commercial entities focused on translational research. Translational
research began to be characterized explicitly mostly in the 1990s and later.
Translation is difficult to define universally, but many academic institutions and
sponsors try to foster it alongside basic and clinical research at least in the sense that
they know it when they see it. A problem with a precise definition is that there is no
single anchor for the domain of translation like the laboratory or clinic. In fact, the
apparent gap between basic and clinical research often seems to be widening (Butler
2008). There is a National Center for Advancing Translational Sciences (NCATS) as
part of the National Institutes of Health. But NCATS does not offer a simple clear
definition.
Translational clinical trials are elusive because the label is often applied even
when the research has conventional developmental purposes such as dose finding. A
search of ClinicalTrials.gov (National Library of Medicine 2019) found only 243 tri-
als with the term “translational” out of over 300,000 entries in the database. The
search filters were “recruiting,” “enrolling by invitation,” “interventional,” “early
phase 1,” “phase 1,” and “phase 2.” Removing all filters except “interventional”
raised the number to 2888. These are imperfect snapshots. For example, a random-
ized trial came up under the filter “early phase 1.” ClinicalTrials.gov does not have
an explicit filter for translational trials.
The clinical scope of trials captured was universal, as one might expect. Most of
the trials were probably not translational in the sense to be developed in this chapter.
But many trials have components or sub-studies that have translational objectives.
Many other studies in the database used similar descriptors but were not obviously
interventional trials. It seems unlikely that less than one in a thousand trials is
translational, but this low number points to nonuniformity in the way the concept
is used. The characterizations of translational trials in this chapter will push toward
common concepts and usage.
Presently there are about 100 medical journals in at least 5 languages with the
word “translational” in their title or self-identified as translational. Most of these are
online. A wide range of disciplines and diseases are represented among them,
including pediatrics, cardiovascular disease, cancer, psychiatry, immunology, and
informatics. As one might expect from the minority of clinical trials that are
classified as translational, publications in such journals seem to emphasize techno-
logical and pre-developmental studies.
This chapter will attempt to define and characterize translational clinical trials as a
distinct type apart from the typical developmental classifications such as phase I, II,
or III. Major conceptual differences are that unlike developmental trials, translational
clinical trials are unlikely to employ clinical outcome measures and may beget
laboratory experiments as readily as additional clinical trials. Due to the high
uncertainty regarding treatment effects when they are begun, even the relatively
weak evidence produced from translational trials is informative for initiating or
suspending developmental steps.
Definitions
For this chapter, I will define translational research simply as converting observa-
tions from the laboratory into clinical interventions. This definition keeps away from
the ends of the spectrum in the following sense. Basic science observations must
already exist: translational research does not discover them. Clinical interventions
are created and await confirmation: translational research does not prove health
benefits.
With this simple definition of translational research, a translational clinical trial
(TCT) will be seen to be something special. First, although a TCT takes place in the
clinic, it relies heavily on laboratory foundations. Second, a TCT will not provide
strong evidence for therapeutic efficacy. Third, a TCT must inform the subsequent
clinical trials or laboratory experiments that will lead to strong evidence.
For the purposes of this chapter, I will use the following definition of a transla-
tional clinical trial (Piantadosi 2005):
A clinical trial where the primary outcome: 1) is a biological measurement (target) derived
from a well-established paradigm of disease, and 2) represents an irrefutable signal regarding
the intended therapeutic effect. The design and purposes of the trial are to guide further
experiments in the laboratory or clinic, inform treatment modifications, and validate the
target, but not necessarily to provide reliable evidence regarding clinical outcomes.
Therapy acts on a signal derived from a disease model but measured in the clinic
where it implicates definitive effects (Fig. 1). A TCT should inform the design of
subsequent experiments by reducing uncertainty regarding effects on the target.
Hence it should be designed to yield useful evidence whether the treatment succeeds
or fails. It must contain two explicit definitions for lack of effect: one for each study
subject and another for the entire study cohort. A TCT will not carry formal
hypothesis tests regarding therapeutic efficacy. One result for such a trial might be
that the treatment requires modifications or replacement: translational trials are
circular between the laboratory and clinic.
A translational treatment is not fixed. Imagine needing to modify a small mole-
cule or antibody based on an initial human trial. Not only might the treatment itself
change, but it is being tracked by a signal derived from the laboratory and measured
in people. Clinical outcomes will come only in later more definitive studies.
Fig. 1 Translational clinical trial paradigm. The irrefutable signal is derived from understanding of
the disease model but it is measured in human subjects. The treatment modifies the signal which
then implies alteration in a definitive outcome
TCTs may focus on any of several natural questions. The most pointed question is
targeting – does the therapy hit the biological target and does it appear to produce
the intended effect. A small molecule or biologic might target a cellular receptor,
enzyme, antibody ligand, or gene, for example. The answer to this question implies
quantitative measurement of the relevant product.
Another question in translation might be signaling – are there products or effects
(signals) downstream from the target that reveal the intended effect after treatment. If
we observe and measure the signal, we can reasonably infer that the target was hit. A
signal might be a change in activity level or switching of a gene expression, for
example.
A TCT could focus on feasibility – can we successfully implement and refine an
unproven complex method for delivering a therapy. Feasibility is vital when it is a
legitimate question but a straw man otherwise. Feasibility is not an appropriate
objective when the ability to administer a treatment is neither complex nor ques-
tionable. Such questions are sometimes chosen merely to deflect criticism from small
or poorly designed studies.
In a feasibility study, two definitions for infeasible are essential because it represents
a type of failure mentioned above. One definition pertains to each study participant so
there is an outcome measure relevant to the primary objective in every subject. A
second definition of infeasible refers to the entire study cohort so there will be a
prespecified measure of success. That tolerance specification also needs a precision
or confidence level. For example, suppose our translational trial addresses delivery
feasibility and we have an appropriate list of show-stopping problems that could be
encountered. Assume we can continue development only if 85% or more of subjects
have no clinically significant delivery problems. The smallest possible study that can
satisfy our requirement is 20 subjects, none of whom could have feasibility problems.
Then the lower 95% confidence bound on the success rate would exceed 85%.
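This 20-subject calculation can be verified with the exact one-sided (Clopper-Pearson) lower confidence bound, computed from the beta distribution; the 85% tolerance and the 20/20 outcome are the numbers from the example above.

```python
# Sketch verifying the worked example: with 20/20 subjects free of delivery
# problems, the exact one-sided lower 95% confidence bound on the success
# rate (Clopper-Pearson) just exceeds the 85% tolerance.
from scipy.stats import beta

def lower_bound(successes, n, conf=0.95):
    """Exact one-sided lower confidence bound for a binomial proportion."""
    if successes == 0:
        return 0.0
    return beta.ppf(1 - conf, successes, n - successes + 1)

print(lower_bound(20, 20))  # ~0.861 > 0.85
print(lower_bound(19, 20))  # ~0.78: a single failure drops below the tolerance
```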
The potential scope of delivery issues in translation is wide as a few examples will
illustrate. A drug might depend on its size, polarity, or solubility for reaching its target.
The dose or schedule will determine blood and tissue levels. For oral medications
subject behavior, adherence, or diet may also be factors. Individual genetic or epige-
netic characteristics can affect drug exposure via metabolism. For gene or cell
therapies, properties of the vector, dose or schedule, need for replication, immunolog-
ical characteristics of the recipient, and individual genetic or epigenetic characteristics
may affect delivery. Devices or skill-dependent therapies may depend on procedural
technique, function of a device, or the anatomy of the subject or disease. This daunting
array of issues indicates that study goals must be set thoughtfully, and off-the-shelf
drug development designs may not be appropriate for many feasibility questions.
TCTs may involve biomarkers at any of several levels. They are not unique in
this regard, but some uses of biomarkers are specific to translational trials. A
biomarker is an objective measurement that informs disease status or treatment
effects. Surrogate outcomes likewise track the effects of treatment but, in addition,
change in proportion to the way a definitive outcome would respond to the treatment.
From the perspective of trial design, a biomarker creates a subset in the study population.
It predicts whether a treatment works or carries information regarding prognosis. A
principal design question for a biomarker is whether the study population should be
enriched with respect to it. If the biomarker indicates definitively whether treatment
will work, the population should be selected accordingly.
In some cases, companion diagnostics are at issue in translational trials. A
companion diagnostic is the test that reveals the biomarker level or presence. Such
a test might be new and could be based on evolving technology. The diagnostic test
could therefore be refined alongside the therapeutic. Different companion
diagnostic tests, if they exist, could yield different study compositions and poten-
tially different results.
Examples
As a hypothetical example of a TCT, suppose we have a local gene therapy for brain
tumors to be delivered using a well-studied (safe) lentiviral vector. The virus delivers
the gene product to tumor cells where it stops their growth and kills them. Suppose
further that production of the delivery vector is straightforward, injecting the viral
particles is feasible, and the correct dose is known from previous studies. The
treatment will be administered days prior to routine surgical resection of the tumor
so effects can be measured clinically and in the specimens.
An appropriate design for the first human trial in this circumstance will not
emphasize a dose question. To rule out adverse events at a 10% threshold requires
at least 30 subjects yielding zero such events. Then the upper 95% confidence bound
on the adverse event rate would be 0.1. This would be a relatively large early clinical
trial. However, we might reliably establish the presence of gene product in resected
tumor cells with many fewer subjects. Furthermore, seeing a handful of tumor
responses prior to surgery would also be promising evidence of efficacy. Hence a
translational trial could reveal more about efficacy than safety.
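The safety arithmetic in this example follows from the exact bound for zero observed events; a minimal sketch (the familiar "rule of three," 3/n, is the approximation):

```python
# Sketch of the safety calculation above: with 0 adverse events in n subjects,
# the exact one-sided upper 95% confidence bound on the event rate is
# 1 - 0.05**(1/n), roughly the "rule of three" value 3/n.
def upper_bound_zero_events(n, conf=0.95):
    return 1 - (1 - conf) ** (1 / n)

print(upper_bound_zero_events(30))  # ~0.095, i.e., about the 10% threshold
print(3 / 30)                       # rule-of-three approximation: 0.10
```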
An example of an actual translational trial is in the ClinicalTrials.gov database at
record NCT02427581 which is a breast cancer vaccine trial (Gillanders et al. 2019).
Subjects in this trial have not had a complete response after chemotherapy and are at
very high risk for disease recurrence. The planned sample size is 15, and the primary
outcome measure is grade and frequency of adverse events at 1 year. Secondary
outcomes of the trial are immunogenicity measures for the vaccine.
Another translational trial is illustrated by a test of mushroom powder for
secondary breast cancer prevention in 24 subjects (Palomares et al. 2011). The
trial was formally described as “dose finding” although up to 13 g of white button
mushroom was not a difficult tolerance challenge, especially compared to typical
cancer therapies. The translational objectives were aromatase inhibition, with
response defined as a 50% decrease in free estradiol. No participants met the
predefined response criterion in this trial according to the report.
A final example of a translational clinical trial is a test of valproate for
upregulating CD20 levels in patients with chronic lymphocytic leukemia (Scialdone
et al. 2017). This trial was planned in four subjects but one dropped out due to a
hearing disorder. No upregulation of CD20 mRNA or protein could be detected
in vivo in cells from patients on this trial according to the report.
Reducing Uncertainty
Clinicians want translational clinical trials to be small. I have often heard excellent
clinical investigators indicate that there would be much to learn from an early test of
a new therapeutic idea in a handful of subjects, say 6 or 8, for example. This is
consistent with some of the examples just cited. Resource limitations are partly at the
root of clinicians’ concerns to make TCTs small. But the reasoning is more consid-
ered than resources alone. Small experiences can be critical when uncertainty is high.
But statisticians tend to disrespect this notion because statistical learning is
connected to narrowing confidence or probability intervals around estimands. A
handful of observations does not do much by those measures.
Both perspectives are correct. Statisticians are usually interested in definitive
evidence in service of decisions. Clinical investigators often seek ways to reduce
uncertainty regarding the biological effects of a new therapy as a prerequisite for
further clinical trials. In settings of high uncertainty, relatively few observations can
reduce uncertainty and lead in the correct direction for further experimentation. Here
I will illustrate some hypothetical circumstances of high uncertainty and its reduction
using small trials.
Suppose that our therapy can yield a positive, neutral, or negative outcome.
Before the trial, assume we are maximally uncertain as to the effect that the treatment
will produce. This means that we hypothesize that each outcome is equally likely.
We hope our experiment will reveal a dominance of positive outcomes. When the
trial is over, we assess the information gained and decide what studies should be
performed next. We will temporarily put aside concerns over sample size and focus
only on the outcome frequencies produced by such a trial.
Table 2 shows a hypothetical example. Before the trial, each outcome is assumed
to have an equal chance of happening. After the trial, suppose half the subjects have
a positive result, and 25% each have a negative or neutral result. The gain in
information can be calculated using entropy (Shannon 1948) and yields a value of
about 0.06. Alternatively, the relative information in these results can be calculated
using the Kullback-Leibler divergence (Kullback and Leibler 1951), which yields a
value of 0.05. The Kullback-Leibler divergence is sometimes called relative entropy and
can be taken as a measure of surprise. Most of us do not have an intuitive feel for
these information values, but both indicate a small gain in information. Depending
on the clinical context, this might be a very promising result because half the subjects
seemed to benefit from the therapy.
The consequences of assuming too much prior to the experiment can be
serious. Consider the hypothetical results in Table 3 where optimism prior to
the trial was very strong. The same trial results from Table 2 are measured
against that very optimistic initial hypothesis. The information value is 0.52,
and the divergence is 0.28, indicating that before and after are quite different.
However, we seem to know less after the trial than before because we have “lost”
information. In a sense this is true because the initial hypothesis implied too
Table 2 Hypothetical prior (before) and outcome (after) probabilities for a translational clinical
trial. Before the trial, maximum uncertainty regarding the treatment effect was assumed
Outcome probabilities
Time Neg Neutral Pos
Before 0.33 0.33 0.33
After 0.25 0.25 0.50
Table 3 Hypothetical prior (before) and outcome (after) probabilities for a translational clinical
trial. Before the trial, strong assumptions were made regarding the treatment effect. Implications are
different compared to Table 2, despite the outcome data being identical
Outcome probabilities
Time Neg Neutral Pos
Before 0.05 0.10 0.85
After 0.25 0.25 0.50
Table 4 Hypothetical prior (before) and outcome (after) probabilities for a translational clinical
trial. Prior assumptions were strong. Formally there is no gain in information but the results are
divergent and carry biological implications
Outcome probabilities
Time Neg Neutral Pos
Before 0.10 0.50 0.40
After 0.50 0.40 0.10
much certainty. This might be a very unpromising result if the strong optimism
prior to the trial was justified.
As a third example, consider the scenario in Table 4. There the anticipated
outcome probabilities are the same as those observed after the trial but rearranged.
The information gain is apparently zero. The divergence is 0.5, indicating before and
after are substantially different. Such a result would likely prevent development
because half the subjects show negative outcomes.
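These values are straightforward to reproduce. The sketch below computes, in natural logarithms, the entropy change and the Kullback-Leibler divergence for the distributions in Tables 2 and 3; small discrepancies from the quoted values reflect rounding and the choice of logarithm base and direction of the divergence.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

uniform = [1/3, 1/3, 1/3]        # Table 2 prior: maximal uncertainty
observed = [0.25, 0.25, 0.50]    # outcome frequencies in Tables 2 and 3

print(entropy(uniform) - entropy(observed))   # ~0.06: small information gain
print(kl(uniform, observed))                  # ~0.06: small divergence

optimistic = [0.05, 0.10, 0.85]  # Table 3 prior: strong optimism
print(entropy(optimistic) - entropy(observed))  # ~ -0.52: information "lost"
print(kl(optimistic, observed))                 # ~0.28: before and after differ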
These examples have assumed we have reasonable estimates of the classification
or response probabilities when the TCT is complete. In small sample sizes, the
calculated information and its variance are biased (Butler 2008). The bias can be
substantially reduced in modest sample sizes, meaning that we can then obtain
reasonable quantitative estimates of information gain.
These examples show that simple prior hypotheses coupled with outcome summaries yield quantifiable information and relative information. It seems appro-
priate to assume we are uncertain before human data are obtained. The clinical
setting and properties of existing treatments must be brought into the assessment of
how promising results are.
Sample Size
We can now return to the question of how large such trials need to be. What does it
take to get reasonable estimates of information gain? Simulation according to the
following algorithm provides an answer. We begin with a fixed sample size and
combinatorially enumerate all possible outcomes, calculating entropy and diver-
gence for each. Each outcome has a probability of occurring according to a multi-
nomial distribution that is assumed to represent the truth of nature.
Fig. 2 CDF of entropy as defined in Shannon (1948). Three outcome categories were used as in Table 1, and the multinomial probabilities were uniform. Curves correspond to sample sizes ranging from 6 to 36; a reference curve for N = 100 indicates how the small sample sizes perform relative to it.
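The enumeration just described can be sketched directly. The code below is a minimal illustration assuming three outcome categories and uniform multinomial probabilities, as in the figure: it enumerates every outcome of a size-n trial, attaches each outcome's multinomial probability, and accumulates the CDF of the resulting entropy values.

```python
import math
import numpy as np

def entropy(counts):
    """Shannon entropy (nats) of the empirical outcome distribution."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -np.sum(p * np.log(p))

def entropy_cdf(n, probs=(1/3, 1/3, 1/3)):
    """Enumerate every outcome (a, b, c) of a size-n trial with three
    categories, attach its multinomial probability under `probs`, and
    return entropy values with their cumulative probabilities."""
    pts = []
    for a in range(n + 1):
        for b in range(n - a + 1):
            c = n - a - b
            logp = (math.lgamma(n + 1) - math.lgamma(a + 1) - math.lgamma(b + 1)
                    - math.lgamma(c + 1) + a * math.log(probs[0])
                    + b * math.log(probs[1]) + c * math.log(probs[2]))
            pts.append((entropy([a, b, c]), math.exp(logp)))
    pts.sort()  # order by entropy to build a CDF
    ent, prob = zip(*pts)
    return np.array(ent), np.cumsum(prob)

ent, cdf = entropy_cdf(12)  # one curve of the figure, N = 12
```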
Information gained in this way is valuable relative to medical practice and can guide subsequent experimental steps. A TCT shares as many characteristics with lab experiments as with definitive clinical trials. The decisions faced after a TCT concern the treatment effect on the irrefutable signal and the choice of the best next experiment. Whether or not to accept a treatment for wide use is not an issue. Therefore, lesser evidence is required to guide the next experimental steps.
Another dimension for translational clinical trials is the extent to which they can
reassure investigators as to the safety of the intervention in question. Safety is a
ubiquitous concern in clinical trials but is more aspirational than real in small studies.
There are two strong reasons for this. One is biological: there are no surrogates for
safety, so we never have more promising evidence than zero events in the trial
cohort. In contrast, efficacy may show promise via the irrefutable signal, which is its intended purpose.
A second reason why proof of safety is unattainable in TCTs derives from
statistical considerations. Safety is not a measurement but is a clinical judgment
based on an informative absence of events. Only sizable experiences are informative.
For example, if 10% is the upper tolerance for serious adverse events, data would
have to show zero events in 30 subjects for the confidence bound on the event rate to
be below the 10% limit. A 5% limit would require 60 event-free subjects. These are
realistic tolerances for serious adverse events putting proof of safety out of reach of
many TCTs that enroll only a few subjects.
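The arithmetic behind these tolerances follows from the exact one-sided confidence bound for zero observed events: if no events occur in n subjects, the 95% upper bound on the event rate is 1 − 0.05^(1/n), approximately 3/n (the "rule of three"). A quick check:

```python
# Exact one-sided 95% upper confidence bound on the event rate when zero
# events are observed in n subjects: solve (1 - p)^n = 0.05 for p.
for n in (30, 60):
    upper = 1 - 0.05 ** (1 / n)
    print(n, round(upper, 3))  # ~0.095 for n = 30, ~0.049 for n = 60
```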
By contrast, far fewer subjects could show evidence of efficacy. For example, a
handful of disease responses out of 10 or 12 participants in a TCT could be very
promising. Hence, we must be alert to the chance to learn about efficacy before
safety in such trials. This is a reverse of the stereotypical sequence.
The goal of this chapter has been to describe features of translational clinical trials
that codify them as a distinct type of study. There are trials that fit this paradigm very
well. However, many studies labeled as such have only certain components that are
translational. Still others are relatively ordinary developmental trials that carry a
translational label probably to make them appear more vital because of the impor-
tance of the term.
The key feature of a translational trial is the intersection of a well-characterized
laboratory model of disease with the actual human condition. An irrefutable signal is
derived and validated from laboratory studies but measured as an outcome in the
clinical trial. If it does not reflect favorably on the treatment, investigators must
either abandon or redesign the therapy.
Statistical precision is not obtained in the small experiences of many translational
trials. But measurable gains in information relative to a state of maximal uncertainty
can be obtained. The acceptability of this perspective may be a point of divergence
for the instincts of clinician investigators versus statisticians. In any case, transla-
tional trials must provide enough information to justify performing more expensive
developmental studies or to terminate or not begin development at all. Trialists
should be alert to the needs of clinical studies in translation so they are not forced
into unserviceable commonplace designs.
Key Facts
Translation is a key step in the evolution of a new therapeutic idea, whether viewed
as predevelopment or as part of staged development. It constitutes the bridge
between basic science and preclinical investigations and human trials. Relatively
small translational trials can reduce the considerable uncertainty regarding therapeu-
tic effects at this stage of development. Such trials rely on biological models as
implemented in the laboratory and biological targets and signaling. Simple
approaches to quantifying information gained show that small sample sizes can
reduce uncertainty enough to guide subsequent clinical trials.
Cross-References
References
Butler D (2008) Translational research: crossing the valley of death. Nature 453(7197):840–842
Gillanders WE et al (2019) Safety and immunogenicity of a personalized synthetic long peptide
breast cancer vaccine strategy in patients with persistent triple-negative breast cancer following
neoadjuvant chemotherapy. ClinicalTrials.gov identifier: NCT02427581
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
National Library of Medicine (2019) ClinicalTrials.gov. https://clinicaltrials.gov
Palomares MR et al (2011) A dose-finding clinical trial of mushroom powder in postmenopausal
breast cancer survivors for secondary breast cancer prevention. J Clin Oncol 29(15_suppl):
1582–1582
Piantadosi S (2005) Translational clinical trials: an entropy-based approach to sample size. Clin
Trials 2:182–192
Scialdone A et al (2017) The HDAC inhibitor valproate induces a bivalent status of the CD20
promoter in CLL patients suggesting distinct epigenetic regulation of CD20 expression in CLL
in vivo. Oncotarget 8(23):37409–37422. https://doi.org/10.18632/oncotarget.16964
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27(3):379–423
Dose-Finding and Dose-Ranging Studies
53
Mark R. Conaway and Gina R. Petroni
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 952
Designs Based on Increasing Dose-Toxicity Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 952
Rule-Based Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 954
Interval-Based Methods for Dose-Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 955
Model-Based Methods for Dose-Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 957
Semiparametric and Order-Restricted Methods for Dose-Finding . . . . . . . . . . . . . . . . . . . . . . . . . 959
Evaluating Methods for Dose-Finding and Dose-Ranging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 960
Operating Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 960
Ease of Implementation and Adaptability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 961
Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 961
Extensions Beyond Single-Agent Trials with a Binary Toxicity Outcome . . . . . . . . . . . . . . . . . . . . 963
Time-to-Event Toxicity Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 963
Combinations of Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 964
Heterogeneity of Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 965
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 967
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 967
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 967
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 968
Abstract
There is a growing recognition of the importance of well-designed dose-finding
studies in the overall development process. This chapter is an overview of designs
for studies that are meant to identify one or more doses of an agent to be tested
in subsequent stages of the drug development process. The chapter also provides
M. R. Conaway (*)
University of Virginia Health System, Charlottesville, VA, USA
e-mail: [email protected]
G. R. Petroni
Translational Research and Applied Statistics, Public Health Sciences, University of Virginia
Health System, Charlottesville, VA, USA
e-mail: [email protected]
a summary of dose-finding designs that have been developed to meet the chal-
lenges of contemporary dose-finding trials, including the use of combinations of
agents, more complex outcome measures, and heterogeneous groups of
participants.
Keywords
Interval-based methods · Model-based methods · Operating characteristics ·
Coherence · Combinations of agents · Patient heterogeneity
Introduction
This chapter describes the design of studies that have the goal of identifying one or
more doses of an agent to be tested in subsequent stages of the drug development
process. Piantadosi (2017) makes a distinction between dose-ranging studies, in
which doses are to be explored without a pre-specified objective, and dose-finding,
where the objective is to find a dose that meets a pre-specified criterion such as a
target rate of toxicities. This chapter provides a description of designs for both dose-
ranging and dose-finding studies.
Some early-phase studies are designed as randomized trials (Eussen et al. 2005;
Partinen et al. 2006; Schaller et al. 2010; Vidoni et al. 2015), with participants
allocated randomly among several pre-specified doses. Design considerations for
these trials, such as the choice of outcome measures, sample size, the use of
stratification factors, or interim monitoring, are common to randomized clinical
trials and are covered in Sections 4 (▶ Bias Control and Precision) and 5 (▶ Basics
of Trial Design). The distinguishing feature of the designs described in this chapter is
that allocation of participants to study dose is done sequentially; the choice of a dose
for a participant is determined by the observed outcomes from participants previ-
ously treated in the trial. These trials often involve the first use of an agent or
combination of agents, and the sequential allocation is intended to avoid exposing
participants to undue risk of adverse events. While there are examples of sequential
dose-ranging studies in many fields, including anesthesiology (Sauter et al. 2015)
and addiction research (Ezard et al. 2016), these designs are most commonly
associated with “phase I” trials in oncology.
A primary goal of a dose-finding trial is to learn about the safety of, and adverse events related to, the study agent(s); such trials are often labeled "phase I." Historically, for phase I trials
involving cytotoxic agents in oncology, a main objective was to identify the “max-
imum tolerated dose” (MTD). This remains a main objective even for noncytotoxic
agents. The MTD is defined as the highest dose that can be administered to
participants with an “acceptable” level of toxicity where toxicity is assessed based
upon observed adverse events. The amount of toxicity at a given dose level is
considered “acceptable” if the proportion of participants treated at that dose who
experience a “dose-limiting toxicity” (DLT) is less than or equal to a target level of
toxicity. The definition of a DLT is study specific, depends on the type of agent being
studied, and is used to set the study target level. Traditionally, the target level has
been in the range of 20–33%. Participants are sequentially assigned to dose levels
with the starting dose being the lowest dose. Dose allocations can be done for
individual participants or in groups of participants where groups of participants are
referred to as cohorts. Each participant is assigned to a single dose level and is
observed on a binary outcome measure specifying whether or not the participant
experienced a DLT. Many of the methods for these trials were developed for
cytotoxic agents, where it is assumed that the dose-toxicity and dose-efficacy
relationships are monotonic, in which the probability of a DLT and the potential
for clinical benefit, often termed “efficacy,” both increase with dose (see Fig. 1).
In this setting, the MTD is to be chosen from a pre-specified set of doses,
d1 < d2 < . . . < dK, with the probability that a participant given dose level dk
experiences a DLT denoted by πk, k = 1, . . ., K. The probability of a DLT is assumed
to increase with dose, π1 < π2 < . . . < πK. At any point in the trial, there are nk
participants who have been observed on dose level dk, and of the participants treated,
Yk have experienced a DLT. The target level of toxicity is denoted by θ.
Although the discussion in this chapter will center on dose-finding and dose-
ranging studies in oncology, the designs can be applied more widely to any clinical
setting in which both the probability of an adverse event and efficacy can be
expected to increase with dose. One example is the study of a new anesthetic. The
probability of an adverse event and efficacy, defined in this case as sufficient
sedation, both increase with dose. As in the oncology studies, the goal is to find
the highest dose that can be administered safely, with the added requirement that the
dose yields sufficient sedation to an acceptable proportion of participants.
For these trials the statistical design revolves around two questions: (1) How should doses be allocated to participants as the trial proceeds? and (2) At the end of the trial, what dose should be nominated as the MTD? There are, of course, many other clinical and statistical decisions to be made in carrying out a dose-finding or dose-ranging trial, including the
pre-specification of dose levels and the definition of a DLT (Senderowicz 2010), but
this chapter focuses primarily on the statistical issues of dose allocation and estima-
tion of the MTD at the end of the study.
Methods in this situation are broadly categorized as “rule-based,” in which
decisions to decrease, increase, or assign the same dose level for a new participant
are determined by rules for the observed proportion of toxicities at the current dose,
or “model-based,” in which a parametric model is fit to all the accumulated data and
used to guide dose allocation and the estimation of the MTD (Le Tourneau et al.
2009). In practice, the distinction between rule-based and model-based is not
completely clear, as there are methods that use rule-based dose allocation (Storer
1989; Stylianou and Flournoy 2002; Ji et al. 2007) but use a parametric model or
isotonic regression at the end of the trial to estimate the MTD. Other methods (Shen
and O’Quigley 1996; Wages et al. 2011a) are model-based but start with an initial
rule-based stage before using the parametric model. In this chapter, methods are
designated as rule-based or model-based depending on how participants are allo-
cated to doses.
Rule-Based Algorithm
The best-known rule-based algorithm is the "3+3" design; a chief criticism is that it determines the MTD from a small sample, which provides little information on what proportion of future
patients would experience a DLT. Similar results are found in Storer (1989), Reiner
et al. (1999), and Iasonos et al. (2008). More general versions of this design, “A+B,”
have been proposed (Ananthakrishnan et al. 2017). An R Shiny app that can be used
to investigate the operating characteristics of A+B designs, including the 3+3, is
given by Wheeler et al. (2016).
Interval-Based Methods for Dose-Finding

The original interval-based method is the "cumulative cohort design" (CCD) proposed by Ivanova et al. (2007). Given a target toxicity probability, θ, the term "interval-based" derives from basing the decision to treat the next set of participants at a lower, higher, or the same dose as the current cohort on intervals placed around θ. If the current dose level is dk and Yk of the nk participants treated at that dose have experienced a DLT, the decision rules proposed by Ivanova et al. (2007) compare the observed proportion Yk/nk with an interval around θ: escalate when the proportion falls below the interval, de-escalate when it falls above, and stay at the current dose otherwise.
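A minimal sketch of such an interval rule follows, assuming a hypothetical symmetric interval of half-width Δ around the target; the published designs (including the mTPI-2 and BOIN variants discussed below) calibrate the interval boundaries more carefully than this.

```python
def interval_decision(y, n, theta, delta):
    """Interval-based dose decision: compare the observed DLT proportion at
    the current dose with the interval (theta - delta, theta + delta)."""
    p_hat = y / n
    if p_hat <= theta - delta:
        return "escalate"
    if p_hat >= theta + delta:
        return "de-escalate"
    return "stay"

# e.g., 1 DLT in 6 participants with target 0.25 and half-width 0.05
print(interval_decision(1, 6, 0.25, 0.05))  # 'escalate' (1/6 is below 0.20)
```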
The similarity in the dose allocation rules across the various interval-based methods is not surprising; Clertant and O’Quigley
(2017, 2019) develop a semiparametric method that can be calibrated to produce
identical operating characteristics to all of these individual methods.
Model-Based Methods for Dose-Finding

The most widely recognized model-based method for dose-finding trials is the
continual reassessment method (CRM). The original CRM paper proposed a sin-
gle-stage design; a later version (Shen and O’Quigley 1996) is a two-stage design
using a rule-based algorithm in the first stage and maximum likelihood estimation in
the second stage. An excellent overview of the theoretical properties and guidelines
for the practical application of the method is given in Cheung (2011) and O’Quigley
and Iasonos (2012). The CRM assumes a parametric model for the dose-toxicity
curve, but it does not require that the model be correct across all the doses under
consideration. The model only needs to be increasing in dose and be such that there
is a parameter value that enables the function to equal the target value, θ, at the true
MTD. The original CRM paper discussed one- and two-parameter models but
focused primarily on one-parameter models because the simpler models tended to
have better properties in terms of identifying the correct MTD.
The most common implementation of the CRM uses the "empiric" model for πk; the probability of a DLT at dose level dk is assumed to be equal to

πk = φk^exp(a),

where 0 < φ1 < φ2 < . . . < φK < 1 are pre-specified constants, often referred to as the "skeleton" values, and "a" is a scalar parameter to be estimated from the data. The
parametrization exp(a) ensures that the probability of toxicity is increasing in dose
for all values of the parameter a. The original CRM paper (O’Quigley et al. 1990)
was a Bayesian method that put a prior distribution on the parameter, a. The paper
provided guidance on eliciting a gamma prior for exp(a) but noted that in many
cases, the special case of an exponential prior with mean 1 gave satisfactory
performance. The skeleton values can be based on the prior, as suggested in the
original CRM paper, or by using the method of Lee and Cheung (2009) where the
skeleton values are calibrated in a way to give good performance for the CRM across
a variety of true dose-toxicity curves.
Once the prior and skeleton values are chosen, the first participant is assigned to
the dose level with prior probability closest to the target θ. After that, the CRM
allocates participants sequentially, with each participant assigned to the dose level
with the model-based estimated probability of toxicity closest to the target. To be specific, suppose that j − 1 participants have been observed on the trial, with nk ≥ 0 participants observed on dose level k, k = 1, . . ., K. Of the nk participants, Yk participants experienced a DLT. Using the data accumulated from the j − 1 participants observed, the updated model-based toxicity probabilities are

π̂k = φk^exp(â),
where â can be the posterior mean computed via numerical integration, an approx-
imation to the posterior mean as in O’Quigley et al. (1990), the posterior mode, or
the posterior median (Chu et al. 2009). The next participant is assigned to the dose
level k with the estimated toxicity probability closest to the target, where “closest” is
according to a pre-specified measure of distance between the estimate and the target.
The original paper uses a quadratic distance, but asymmetric distance functions,
which give greater loss to deviations above the target than below, could also be used.
The updating of “a” and the allocation of participants to the dose with updated
toxicity probability closest to the target continues until a pre-specified number of
participants have been observed. At the end of the study, the MTD is taken to be the
dose that the next participant would have received had the trial not ended.
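The updating step can be sketched compactly. The code below assumes the empiric model with a hypothetical five-dose skeleton, a target of θ = 0.25, and an Exponential(1) prior on exp(a), estimating the posterior mean of exp(a) by numerical integration. It illustrates a single CRM update, not a complete trial implementation.

```python
import numpy as np
from scipy import integrate

skeleton = np.array([0.05, 0.12, 0.25, 0.40, 0.55])  # hypothetical skeleton
theta = 0.25                                          # hypothetical target rate

def crm_next_dose(doses, dlts):
    """One CRM update: given 0-indexed dose assignments and 0/1 DLT outcomes
    for the participants observed so far, return the next recommended dose."""
    phi = skeleton[np.asarray(doses)]
    y = np.asarray(dlts)

    def lik(beta):
        p = phi ** beta                 # empiric model with beta = exp(a)
        return np.prod(p ** y * (1 - p) ** (1 - y))

    prior = lambda beta: np.exp(-beta)  # Exponential(1) prior on exp(a)
    norm, _ = integrate.quad(lambda b: lik(b) * prior(b), 0, 50)
    num, _ = integrate.quad(lambda b: b * lik(b) * prior(b), 0, 50)
    p_hat = skeleton ** (num / norm)    # plug in the posterior mean of exp(a)
    return int(np.argmin(np.abs(p_hat - theta)))  # dose closest to the target

# e.g., three participants treated at dose index 2, one DLT observed
print(crm_next_dose([2, 2, 2], [0, 0, 1]))
```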
A two-stage version of the CRM is presented by Shen and O’Quigley (1996). The
first stage uses a rule-based design and continues until at least one participant
experiences a DLT and one participant does not experience a DLT. Once heteroge-
neity in responses is observed, the trial proceeds as in the original CRM, except that
the estimate of the parameter “a” is based on maximum likelihood. The paper uses a
rule-based design using single-participant cohorts: if a participant does not experi-
ence a DLT, the next participant is treated at the next higher dose level, but the
authors note that any rule-based design could be used in stage I until heterogeneity is
observed.
The “escalation with overdose control” (EWOC) method of Babb et al. (1998) is a
popular method for dose-finding. As with the previous dose-finding methods, Babb
et al. (1998) set a target toxicity probability, θ, and assume that the true MTD,
defined as the dose that has a toxicity probability equal to the target, is in a pre-
specified interval [Xmin, Xmax]. Their method is designed to identify the MTD while
providing “overdose control,” limiting the proportion of participants exposed to
doses above the MTD.
The EWOC design is based on a two-parameter model for the probability of a
DLT at dose x in [Xmin, Xmax]. One of several possibilities for the dose-toxicity relationship model is the logistic model, under which the probability of a DLT at dose x is exp(β0 + β1x)/(1 + exp(β0 + β1x)).
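Under such a logistic model, the dose at which the toxicity probability equals the target θ is obtained by inverting the logistic function. A small sketch with hypothetical parameter values:

```python
import math

def logistic_tox(x, b0, b1):
    """Probability of a DLT at dose x under the logistic model."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def mtd(theta, b0, b1):
    """Dose at which the logistic dose-toxicity curve equals the target."""
    return (math.log(theta / (1 - theta)) - b0) / b1

# hypothetical parameters: intercept -4, slope 0.05 per unit dose, target 0.33
print(mtd(0.33, -4.0, 0.05))  # ~65.8
```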
Semiparametric and Order-Restricted Methods for Dose-Finding

These methods do not fall neatly into either the interval-based or model-based
classifications. The semiparametric dose-finding method of Clertant and O’Quigley
(2017) is more of a class of methods than a specific design. If parametric conditions
are added to the class, the result is the CRM design. Using less structure on the class
results in the interval-based designs, including the CCD, mTPI-2, and BOIN.
Clertant and O’Quigley (2017) present results for a “semiparametric” design that
corresponds to CRM with an additional nuisance parameter. This formulation leads
to a design that reduces the dependence of the CRM on a single model and produces
results similar to those of the CRM.
The methods of Leung and Wang (2001) and Conaway et al. (2004) are based on
methods for order-restricted inference. These methods are, in a way, model-based
designs but rely only on the assumption that the probability of a DLT increases with
dose. The methods do not specify a fully parametrized model. Recent work (Wages and Conaway 2018) has shown that the order-restricted methods are competitive in performance with the CRM, the method that generally has the best operating characteristics (as described in the following section), over a wide range of dose-finding scenarios.
Evaluating Methods for Dose-Finding and Dose-Ranging

There are a number of criteria on which the methods can be compared. These include
the statistical properties, the ease of implementation and adaptability to changes in
the study conduct, and principles for conducting early-stage studies.
Operating Characteristics
One common summary is the accuracy index of Cheung (2011), which weights the probability of recommending each dose by ρk, a measure of the deviation of the true toxicity probability, πk, at dose k from the target toxicity probability θ. Cheung (2011) gives several choices for ρk, including an absolute deviation, ρk = |πk − θ|. The accuracy index has a maximum
value of 1, which occurs when the design always recommends the correct MTD.
With few exceptions, comparisons among the methods are made based on
simulations, and historically, these simulations were done using a limited number
of true dose-toxicity curves. Even when exact small sample results are available
(Durham and Flournoy 1994, 1995; Lin and Shih 2001), these results depend on the
true unknown underlying dose-toxicity curve. This can make comparisons difficult,
since every method has some scenarios under which it will perform well. A benchmark tool for evaluating the properties of designs is given in O’Quigley et al. (2002) and Paoletti et al. (2004). The benchmark cannot be used in practice because it requires knowl-
edge of the true underlying dose-toxicity curve; however, it has been shown to be
useful in investigating the efficiency of proposed designs (Wages et al. 2013) in the
context of studies with monotone dose-toxicity curves.
It is important, when evaluating a design, to consider the performance across a
broad range of scenarios, varying the location of the MTD and the steepness of the
dose-toxicity curve. To this end, a number of families of dose-toxicity curves have
been proposed. Evaluating methods for specific dose-toxicity curves randomly
sampled from the family of curves is intended to test the method across a range of
curves that vary in MTD location and steepness. One of the first families of curves
was generated by Paoletti et al. (2004); subsequent proposals can be found in Horton
et al. (2017), Clertant and O’Quigley (2017), and Conaway and Petroni (2019a).
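The construction details differ across these proposals; the sketch below illustrates only the general idea (it is not the Paoletti et al. (2004) algorithm): draw a random monotone curve, choose the MTD location uniformly among the doses, and anchor the toxicity probability at the MTD near the target.

```python
import numpy as np

rng = np.random.default_rng(2019)

def random_monotone_curve(K, theta):
    """Illustrative random monotone dose-toxicity curve over K doses: the
    MTD location is uniform over the doses, and the curve is shifted so the
    toxicity probability at the MTD equals the target theta."""
    mtd = int(rng.integers(K))
    curve = np.sort(rng.uniform(0.01, 0.80, size=K))  # monotone in dose
    curve = curve + (theta - curve[mtd])              # anchor the MTD at theta
    return np.clip(curve, 0.001, 0.999), mtd

curve, mtd = random_monotone_curve(K=5, theta=0.25)
```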
Ease of Implementation and Adaptability

Rule-based methods have the practical advantage of being simpler to carry out
because all of the decision rules can be laid out in a table prior to starting the
study. The model-based methods also have some practical advantages over rule-
based methods. Model-based methods can enroll participants even if the follow-up
period for previously enrolled participants is not yet complete. Model-based
methods can accommodate revisions to data errors: on subsequent review, participants thought to have had DLTs could be found not to have had them, or, vice versa, participants thought not to have had DLTs could be reclassified as having had one. Subsequent allocations can proceed based on models fit to the corrected data.
Principles
Cheung (2005) defined the principle of coherence for single-agent dose-finding and
dose-ranging studies. By this definition, a method is coherent for dose escalation if
the method does not increase the dose following an observed DLT and coherent for
dose de-escalation if the method does not decrease the dose following the observa-
tion of a non-DLT. This is an important principle in implementing a study, since
clinicians can be reluctant to follow an incoherent design, particularly one that is
incoherent in dose escalation.
The 3+3, despite its poor operating characteristics, is at least coherent by this definition. The biased coin design (Durham et al. 1997), EWOC (Tighiouart and Rogatko 2014), and the semiparametric CRM (Clertant and O’Quigley 2017) are all coherent. Cheung (2011) shows that the one-stage Bayesian CRM is coherent and that the two-stage CRM is coherent as long as it does not produce an incoherent transition between the rule-based and model-based stages. In general, the interval-based
methods are not coherent by this definition, and in practice, incoherent decisions
occur frequently with these designs (Wages et al. 2019 under review).
Interval-based designs defined a separate principle and unfortunately also used
the term “coherence” (Liu and Yuan 2015). The principle, defined as “long-term
memory coherence,” means that a method will not increase the dose for the next
participant if the observed proportion of participants at the current dose who have
experienced a DLT exceeds the target toxicity rate and the method will not reduce the
dose for the next participant if the observed proportion of DLTs at the current dose is
less than the target. By construction, the interval-based designs are all “long-term
memory coherent.”
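The two principles are easy to state as checks on a single dose decision. The sketch below encodes "long-term memory coherence"; the original coherence principle would instead condition on the outcome of the most recently observed participant.

```python
def long_term_memory_coherent(y, n, theta, action):
    """Check a dose decision against 'long-term memory coherence': with y
    DLTs observed in n participants at the current dose, never escalate when
    the observed proportion exceeds the target, and never de-escalate when
    it is below the target."""
    p_hat = y / n
    if action == "escalate":
        return p_hat <= theta
    if action == "de-escalate":
        return p_hat >= theta
    return True  # staying at the current dose is always permitted

# e.g., 1 DLT in 3 at the current dose with target 0.25: escalation violates
# the principle because 1/3 > 0.25
print(long_term_memory_coherent(1, 3, 0.25, "escalate"))  # False
```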
An example from Wages et al. (2019 under review) serves to illustrate the
difference between the original definition of coherence and “long-term memory
coherence.” Table 2 shows the dose allocations from BOIN and Keyboard for a
simulated trial using the settings in Scenario 2 in Table S2 of Zhou et al. (2011).
Decisions at participants 9 and 13 represent incoherence in dose de-escalation; in
each case, the next participant is treated at a lower dose immediately following a
non-DLT. Both of these decisions are “long-term memory coherent.” The decision at
participant 16 is incoherent in dose escalation. Even though this participant
experienced a DLT on dose level 2, both BOIN and Keyboard treat the next
participant at a higher dose, dose level 3, which has an observed proportion of
toxicities greater than the target.
Extensions Beyond Single-Agent Trials with a Binary Toxicity Outcome

Time-to-Event Toxicity Outcomes

The original CRM paper (O’Quigley et al. 1990) noted that the observation of a
toxicity does not occur immediately. As a result, there may be new participants ready
to enroll in the study before all the prior participants have been observed for a DLT.
O’Quigley et al. (1990) suggested treating the new participants at the last allocated dose or, given the uncertainty in the dose allocations, treating participants one level above or one level below the most recent model-recommended dose.
Cheung and Chappell (2000) proposed an extension to the continual reassessment
method known as the “time-to-event CRM” (TITE-CRM). This method allows for a
weighted toxicity model, with weights proportional to the time that the participant
has been observed. They consider a number of weight functions, but simulation
results suggested that a simple linear weight function of the form w(u) = u/T, where
T is a fixed length of follow-up observation time for each participant, is adequate.
If a participant is observed to have a toxicity at time u < T, the follow-up time u is set
to equal T. If DLT information has been observed for all participants, the method
reduces to the continual reassessment method of O’Quigley et al. (1990).
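A sketch of the linear weight and the corresponding weighted likelihood contribution for one participant, following the form described above:

```python
def tite_weight(u, T):
    """Linear TITE-CRM weight w(u) = u/T: the fraction of the follow-up
    window T for which a participant has been observed, capped at 1."""
    return min(u, T) / T

def weighted_contribution(p, y, u, T):
    """One participant's contribution to the weighted likelihood,
    (w*p)^y * (1 - w*p)^(1 - y); a participant with a DLT (y = 1) receives
    weight 1, as if followed for the full window."""
    w = 1.0 if y == 1 else tite_weight(u, T)
    return (w * p) ** y * (1 - w * p) ** (1 - y)
```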
Normolle and Lawrence (2006) discuss the use of the TITE-CRM in radiation
oncology studies, where the toxicities tend to occur late in the follow-up period.
Polley (2011) observes that in studies with rapid participant accrual and late toxic-
ities, the TITE-CRM can allocate too many participants to overly toxic doses. The paper compares a modification of the TITE-CRM that was suggested in the original TITE-CRM paper with a modification that incorporates wait times between participant accruals. A version of EWOC with time-to-event endpoints is
described in Mauguen et al. (2011) and Tighiouart et al. (2014a).
A modification of the 3+3, called the rolling 6 (Skolnik et al. 2008), is meant to
reduce the time to completion of a dose-ranging trial by allowing participants to
enter a trial before complete information is available for participants in the prior
cohort. Zhao et al. (2011) showed that this method is less efficient and less
accurate than TITE-CRM.
Combinations of Agents
The simplest form of a study with a combination of two agents is depicted below.
Each of two agents (A and B) is being studied at two dose levels (“low” and “high”).
The probability of a DLT with agent A at level “a” and agent B at level “b” is denoted
by πab. The problem of dose-finding or dose-ranging differs from the single-agent
case in that the probabilities of toxicity no longer follow a complete order, in which
the ordering of any two toxicity parameters is known, but instead follow a “partial
order” (Robertson et al. 1988). In a partial order, there are pairs of parameters whose
ordering is not known. For example, it is not known whether πHL > πLH or πHL < πLH.
A second distinction from the single-agent case is that in a combination study, there
may be more than one “MTD,” meaning more than one dose combination with a
toxicity probability close to the target.
The initial suggestion for studies of drug combinations was to lay out a specific
ordering of the combinations (Korn and Simon 1993; Kramar et al. 1999). While
simple to implement, this approach follows only one path through the combinations
and could have poor properties in identifying an MTD, particularly if the assumed
ordering is incorrect.
As in the single-agent case, methods can broadly be classified as rule-based or
model-based methods. The interval-based “Bayesian optimal interval design
(BOIN)” has been extended to studies involving combinations (Lin and Yin 2017).
Based on the observed proportion of participants experiencing a DLT at the current
dose, a decision is made to “escalate,” “de-escalate,” or “stay” at the current dose.
More than one dose combination might be considered an “escalation” or a “de-
escalation,” and Lin and Yin (2017) propose pre-specifying, for each dose combi-
nation, a set of “admissible escalation doses” and “admissible de-escalation doses.”
A similar idea had been proposed by Conaway et al. (2004), who used the estimation
method of Hwang and Peddada (1994) as well as “possible escalation sets” for each
dose combination to guide dose combination allocations. Similarly, bivariate iso-
tonic regression was the basis of the method proposed by Wang and Ivanova (2005).
This method estimates the probability of a DLT for each combination under the assumption that for a fixed row, the toxicity probabilities increase across columns and, for a fixed column, the toxicity probabilities increase across rows. In Table 3
for example, it is known that πLL < πLH, πHL < πHH, πLL < πHL and πLH < πHH.
The majority of methods for dose-finding and dose-ranging for combinations of
agents are model-based. Extensions of the CRM were proposed by Wages et al.
(2011a, b). These methods consider either a subset or all possible orders of the
toxicity probabilities. For example, in Table 3 for the simplest case, there are two possible orders: πLL < πLH < πHL < πHH and πLL < πHL < πLH < πHH.
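These two orders are exactly the total orders compatible with the partial-order constraints, as a brute-force enumeration confirms:

```python
from itertools import permutations

combos = ["LL", "LH", "HL", "HH"]
# partial-order constraints from Table 3: within each row and each column of
# the 2 x 2 layout, toxicity increases with the dose of the other agent
constraints = [("LL", "LH"), ("HL", "HH"), ("LL", "HL"), ("LH", "HH")]

orders = []
for perm in permutations(combos):
    rank = {c: i for i, c in enumerate(perm)}
    if all(rank[a] < rank[b] for a, b in constraints):
        orders.append(perm)

print(orders)  # two compatible total orders: LH before HL, or HL before LH
```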
The CRM is fit separately within each order. After each participant is observed,
the recommendation is based on the order that yields a greater value of the likeli-
hood. In many studies with more than two dose levels for each of the agents, there
can be too many possible orderings to specify all of them. In this case, clinical
judgment can be used to guide the choice of orders under consideration, or the
default set of orders recommended by Wages and Conaway (2013) can be used.
Yin and Yuan (2009) also generalize the single-agent CRM for studies consider-
ing combinations of J levels of agent A and K levels of agent B. They pre-specify values p1 < p2 < . . . < pJ for agent A and values q1 < q2 < . . . < qK for agent B and
use a model for the probability of a DLT that depends on the pre-specified values as
well as three parameters to be estimated from the data. At the end of the trial, the
estimate of the MTD is the dose with the estimated DLT probability closest to the
pre-specified target.
Other model-based methods are based on a full mathematical specification of the
probabilities of toxicity for each combination. Thall et al. (2003) propose a nonlinear
six-parameter model to describe how the probability of toxicity depends on the dose
combination.
All of the previous methods are for studies in which a discrete set of combinations has been pre-specified. For dose-finding studies, Shi and Yin (2013) and Tighiouart et al. (2014b) have generalized the EWOC method for combinations. Both of these generalizations use a logistic model for the probability of toxicity that includes main effects for the dose level of each agent and a multiplicative interaction term to
specify the joint effect of the two agents.
Heterogeneity of Participants
In some dose-finding trials, there are several groups of participants, and the goal is to
estimate a MTD within each group. These groups may be defined by the participants’
degree of impairment at baseline (Ramanathan et al. 2008; LoRusso et al. 2012) or
genetic characteristics (Kim et al. 2013). For example, Ramanathan et al. (2008)
enrolled 89 participants with varying solid tumors to develop dosing guidelines for
the administration of imatinib in participants with liver dysfunction. Prior to dosing,
participants were stratified into “none,” “mild,” “moderate,” or “severe” liver dysfunction groups.
The shift model, proposed by O’Quigley and Iasonos (2014), takes a different approach to generalizing the CRM
to two ordered groups. For two groups, the assumption underlying this method is
that the MTD in group 2 will be Δ dose levels less than the MTD in group 1, with Δ a
nonnegative integer. O’Quigley and Iasonos (2014) restrict Δ to be 0, 1, 2, or 3
levels, but their method applies to any shift in the MTD. Horton et al. (2019a)
generalize the shift model to more than two groups and to either completely or
partially ordered groups.
Babb and Rogatko (2001) extend the EWOC method to allow for a continuous
covariate. In their application (Babb and Rogatko 2001), the covariate was protec-
tive, with increasing levels of the covariate associated with a lower probability of a
DLT. Data from a previous study of the agent allowed the investigators to set bounds
on the permissible doses for a participant with a specific covariate value. With this
method, participants can receive individualized doses according to their level of the
covariate. Similar methods are found in Tighiouart et al. (2012).
Summary and Conclusion

This chapter has presented a number of designs for dose-finding and dose-ranging studies in which a single agent is studied and the primary objective is to establish a maximum safe dose. Many of the applications of these designs are in
oncology, which is currently undergoing a change in the complexity and objectives
of dose-finding and dose-ranging studies. This chapter has also reviewed a number
of designs that have been developed recently to meet the challenges of contemporary
dose-finding and dose-ranging studies.
Key Facts
1. The dose-finding design must be conducted in a way that addresses the study
objectives.
2. Dose-finding studies are an integral part of the overall drug development process.
3. The commonly used “3+3” algorithm has poor operating characteristics.
4. Interval-based dose-finding designs are simple to implement and provide better
operating characteristics than the “3+3.”
5. In general, model-based dose-finding designs have superior operating character-
istics and provide the flexibility needed to handle data revisions and delayed
dose-limiting toxicities.
Cross-References
References
Ananthakrishnan R, Green S, Chang M, Doros G, Massaro J, LaValleya M (2017) Systematic
comparison of the statistical operating characteristics of various phase I oncology designs.
Contemp Clin Trials Commun 5:34–48
Babb J, Rogatko A (2001) Patient specific dosing in a cancer phase I clinical trial. Stat Med
20:2079–2090
Babb J, Rogatko A, Zacks S (1998) Cancer phase I clinical trials: efficient dose escalation with
overdose control. Stat Med 17:1103–1120
Cheung YK (2005) Coherence principles in dose-finding studies. Biometrika 92:203–215
Cheung YK (2011) Dose finding by the continual reassessment method. Chapman and Hall/CRC
Biostatistics Series, New York
Cheung YK, Chappell R (2000) Sequential designs for phase I clinical trials with late-onset
toxicities. Biometrics 56:1177–1182
Chu PL, Lin Y, Shih WJ (2009) Unifying CRM and EWOC designs for phase I cancer clinical trials.
J Stat Plann Inference 139:1146–1163
Clertant M, O’Quigley J (2017) Semiparametric dose finding methods. J R Stat Soc Ser B 79
(5):1487–1508
Clertant M, O’Quigley J (2019) Semiparametric dose finding methods: special cases. Appl Stat 68
(2):271–288
Conaway M (2017a) A design for phase I trials in completely or partially ordered groups. Stat Med
36(15):2323–2332
Conaway M (2017b) Isotonic designs for phase I trials in partially ordered groups. Clin Trials 14
(5):491–498
Conaway M, Petroni G (2019a) The impact of early stage design on the drug development process.
Clin Cancer Res 25(2):819–827
Conaway M, Petroni G (2019b) The role of early-phase design-response. Clin Cancer Res 25
(10):3191
Conaway M, Dunbar S, Peddada S (2004) Designs for single- or multiple-agent phase I trials.
Biometrics 60:661–669
Durham S, Flournoy N (1994) Random walks for quantile estimation. In: Gupta S, Berger J (eds)
Statistical decision theory and related topics V. Springer, New York, pp 467–476
Durham S, Flournoy N (1995) Up-and-down designs I: stationary treatment distributions. In:
Flournoy N, Rosenberger W (eds) Adaptive designs. Institute of Mathematical Statistics,
Hayward, pp 139–157
Durham S, Flournoy N, Rosenberger W (1997) A random walk rule for phase 1 clinical trials.
Biometrics 53(2):745–760
Eussen S, de Groot L, Clarke R, Schneede J, Ueland P, Hoefnagels W, van Staveren W (2005) Oral
cyanocobalamin supplementation in older people with vitamin B12 deficiency: a dose-finding
trial. Arch Intern Med 165:1167–1172
Ezard N, Dunlop A, Clifford B, Bruno R, Carr A, Bissaker A, Lintzeris N (2016) Study protocol: a
dose-escalating, phase-2 study of oral lisdexamfetamine in adults with methamphetamine
dependence. BMC Psychiatry 16:428
Guo W, Wang S-J, Yang S, Lynna H, Ji Y (2017) A Bayesian interval dose-finding design
addressing Ockham’s razor: mTPI-2. Contemp Clin Trials 58:23–33
Horton B, Wages N, Conaway M (2017) Performance of toxicity probability interval based designs
in contrast to the continual reassessment method. Stat Med 36:291–300
Horton BJ, Wages NA, Conaway MR (2019a) Shift models for dose-finding in partially ordered
groups. Clin Trials 16(1):32–40
Horton BJ, O’Quigley J, Conaway M (2019b) Consequences of performing parallel dose finding
trials in heterogeneous groups of patients. JNCI Cancer Spectrum. https://doi.org/10.1093/
jncics/pkz013. Online ahead of print
Hwang J, Peddada S (1994) Confidence interval estimation subject to order restrictions. Ann Stat
22:67–93
Iasonos A, Wilton AS, Riedel ER, Seshan VE, Spriggs DR (2008) A comprehensive comparison of
the continual reassessment method to the standard 3+3 dose escalation scheme in phase I dose-
finding studies. Clin Trials 5(5):465–477
Ivanova A, Flournoy N, Chung Y (2007) Cumulative cohort design for dose-finding. J Stat Plann
Inference 137:2316–2327
Ji Y, Li Y, Bekele B (2007) Dose-finding in phase I clinical trials based on toxicity probability
intervals. Clin Trials 4:235–244
Kim K, Kim H, Sym S, Bae K, Hong Y, Chang H, Lee J, Kang Y, Lee J, Shin J, Kim T (2013) A
UGT1A1*28 and *6 genotype-directed phase I dose-escalation trial of irinotecan with fixed-
dose capecitabine in Korean patients with metastatic colorectal cancer. Cancer Chemother
Pharmacol 71:1609–1617
Korn E, Simon R (1993) Using tolerable-dose diagrams in the design of phase I combination
chemotherapy trials. J Clin Oncol 11:794–801
Kramar A, Lebecq A, Candalh E (1999) Continual reassessment methods in phase I trials of the
combination of two agents in oncology. Stat Med 18:849–864
Le Tourneau C, Lee J, Siu L (2009) Dose escalation methods in phase I clinical trials. J Natl Cancer
Inst 101:708–720
Lee S, Cheung YK (2009) Model calibration in the continual reassessment method. Clin Trials
6:227–238
Leung D, Wang Y-G (2001) Isotonic designs for phase I trials. Clin Trials 22:126–138
Lin R (2018) R codes for interval designs. https://github.com/ruitaolin/IntervalDesign
Lin Y, Shih W (2001) Statistical properties of traditional algorithm-based designs for phase I cancer
clinical trials. Biostatistics 2(2):203–215
Lin R, Yin G (2017) Bayesian optimal interval design for dose finding in drug-combination trials.
Stat Methods Med Res 26(5):2155–2167
Liu S, Yuan Y (2015) Bayesian optimal interval designs for phase I clinical trials. J R Stat Soc Ser C
Appl Stat 32:2505–2511
LoRusso P, Venkatakrishnan K, Ramanathan R, Sarantopoulos J, Mulkerin D, Shibata S, Hamilton
A, Dowlati A, Mani S, Rudek M, Takimoto C, Neuwirth R, Esseltine D, Ivy P (2012)
Pharmacokinetics and safety of Bortezomib in patients with advanced malignancies and varying
degrees of liver dysfunction: phase I NCI Organ Dysfunction Working Group Study NCI-6432.
Clin Cancer Res 18(10):1–10
Mauguen A, Le Deleya M, Zohar S (2011) Dose-finding approach for dose escalation with overdose
control considering incomplete observations. Stat Med 30:1584–1594
Normolle D, Lawrence T (2006) Designing dose-escalation trials with late-onset toxicities using the
time-to-event continual reassessment method. J Clin Oncol 24:4426–4433
O’Quigley J (2006) Phase I and phase I/II dose finding algorithms using continual reassessment
method. In: Crowley J, Ankherst D (eds) Handbook of statistics in clinical oncology, 2nd edn.
Chapman and Hall/CRC Biostatistics Series, New York
O’Quigley J, Iasonos A (2012) Dose-finding designs based on the continual reassessment method.
In: Crowley J, Hoering (eds) Handbook of statistics in clinical oncology, 3rd edn. Chapman and
Hall/CRC Biostatistics Series, New York
Tighiouart M, Liu Y, Rogatko A (2014a) Escalation with overdose control using time to toxicity for
cancer phase I clinical trials. PLoS One 9(3):e93070
Tighiouart M, Piantadosi S, Rogatko A (2014b) Dose finding with drug combinations in
cancer phase I clinical trials using conditional escalation with overdose control. Stat Med 33
(22):3815–3829
Vidoni ED, Johnson DK, Morris JK, Van Sciver A, Greer CS, Billinger SA et al (2015) Dose-
response of aerobic exercise on cognition: a community-based, pilot randomized controlled
trial. PLoS One 10(7):e0131647
Wages NA, Conaway MR (2013) Specifications of a continual reassessment method design for
phase I trials of combined drugs. Pharm Stat 12(4):217–224
Wages N, Conaway M (2018) Revisiting isotonic phase I design in the era of model-assisted dose-
finding. Clin Trials 15(5):524–529
Wages N, Conaway M, O’Quigley J (2011a) Dose-finding design for multi-drug combinations. Clin
Trials 8:380–389
Wages N, Conaway M, O’Quigley J (2011b) Continual reassessment method for partial ordering.
Biometrics 67:1555–1563
Wages N, Conaway M, O’Quigley J (2013) Performance of two-stage continual reassessment
method relative to an optimal benchmark. Clin Trials 10:862–875
Wages NA, Iasonos A, O’Quigley J, Conaway MR (2019) Coherence principles in interval-based
dose-finding. Submitted
Wang K, Ivanova A (2005) Two-dimensional dose finding in discrete dose space. Biometrics
61:217–222
Wheeler G, Sweeting M, Mander A (2016) AplusB: a web application for investigating A+B designs for phase I cancer clinical trials. PLoS One 11(7):e0159026. https://doi.org/10.1371/journal.pone.0159026
Yan F, Mandrekar S, Ying Y (2017) Keyboard: a novel Bayesian toxicity probability interval design
for phase I clinical trials. Clin Cancer Res 23(15):3994–4003
Yin G, Yuan Y (2009) Bayesian dose finding in oncology for drug combinations by copula
regression. Appl Stat 58(2):211–224
Yuan Z, Chapell R (2004) Isotonic designs for phase I cancer clinical trials with multiple risk
groups. Clin Trials 1(6):499–508
Zhao L, Lee J, Mody R, Braun T (2011) The superiority of the time-to-event continual reassessment
method to the rolling six design in pediatric oncology phase I trials. Clin Trials 8(4):361–369
Inferential Frameworks for Clinical Trials
54
James P. Long and J. Jack Lee
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 974
Inferential Frameworks: Samples, Populations, and Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 976
Frequentist Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 978
Optimizing Trial Design Using Frequentist Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 981
Limitations and Guidance on Frequentist Statistical Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . 981
Bayesian Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 983
Sequential Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 986
Bayesian Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 986
Prior Distributions and Schools of Bayesian Thought . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 987
Guidance on Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 989
Inferential Frameworks: Connections and Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 990
Reducing Subjectivity and Calibrating P-Values with Bayes Factors . . . . . . . . . . . . . . . . . . . . . 990
Model-Based and Model-Assisted Phase I Designs Constructed Using Inferential
Frameworks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 992
Model-Based and Model-Assisted Phase II Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 993
Inferential Frameworks and Modern Trial Design Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 996
Precision Medicine, Master Protocols, Umbrella Trials, Basket Trials, Platform Trials,
and Adaptive Randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 996
Multiple Outcomes and Utility Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 998
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 998
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 999
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1000
Abstract
Statistical inference is the process of using data to draw conclusions about
unknown quantities. Statistical inference plays a large role both in designing
clinical trials and in analyzing the resulting data. The two main schools of
inference, frequentist and Bayesian, differ in how they estimate and quantify
uncertainty in unknown quantities. Typically, Bayesian methods have clearer
interpretation at the cost of specifying additional assumptions about the unknown
quantities. This chapter reviews the philosophy behind these two frameworks
including concepts such as p-values, Type I and Type II errors, confidence
intervals, credible intervals, prior distributions, posterior distributions, and
Bayes factors. Application of these ideas to various clinical trial designs including
3 + 3, Simon’s two-stage, interim safety and efficacy monitoring, basket,
umbrella, and platform drug trials is discussed. Recent developments in comput-
ing power and statistical software now enable wide access to many novel trial
designs with operating characteristics superior to classical methods.
Keywords
Bayes factors · Bayesian · Confidence intervals · Credible intervals · Frequentist ·
Hypothesis testing · P-value · Prior distribution · Sequential designs · Statistical
inference
Introduction
Statistical inference is the process of using data to draw conclusions about unknown
quantities. Rigorous frameworks for statistical inference ensure that the conclusions
are backed by some form of guarantee. For example, in drug development, one is
interested in estimating the response rate of a new agent with certain precision and/or
concluding whether the new agent has at least 30% response rate with certain
confidence. The two dominant frameworks for statistical inference, Bayesian and
Frequentist, both provide these guarantees in the form of probabilistic statements.
Inference frameworks play an important role both in designing clinical trials and in
analyzing and interpreting the information contained in the collected data.
Consider the popular oncology Phase I 3 + 3 dose escalation design with a set of
predefined dose levels (Storer 1989). Starting at the lowest dose level, three patients
are enrolled. If none experience a toxicity, the dose is escalated. If one experiences a
toxicity, an additional three are enrolled at the same dose level. Among the additional
three patients, if none experience a toxicity, the dose is escalated. If one experiences
a toxicity, the current dose or one dose lower is defined as the maximum tolerated
dose (MTD). If two or more patients experience toxicity in three or six patients, the
dose exceeds the MTD.
Table 1 lists the probability of dose escalation for given toxicity probability p
based on Eq. 1.
The table provides assurance that as the probability of toxicity of the current dose
increases, the chance of escalation decreases. This table requires simple probability
computations to construct.
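These computations can be reproduced in a few lines. The sketch below is a minimal illustration, not necessarily the chapter's exact Eq. 1: it takes the escalation probability at a dose as P(0 of 3 toxicities) plus P(exactly 1 of 3) times P(0 of the 3 additional patients), following the rule described above.

```python
# Escalation probability for one dose in a 3 + 3 design whose true
# toxicity rate is p, following the escalation rule described above.
from scipy.stats import binom

def prob_escalate(p: float) -> float:
    none_of_3 = binom.pmf(0, 3, p)   # 0/3 toxicities: escalate immediately
    one_of_3 = binom.pmf(1, 3, p)    # exactly 1/3 toxicities: enroll 3 more
    return none_of_3 + one_of_3 * binom.pmf(0, 3, p)

for p in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"p = {p:.1f}: P(escalate) = {prob_escalate(p):.3f}")
```

As in the table, the escalation probability falls as the toxicity probability p rises.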
The 3 + 3 design (and variants; see Le Tourneau et al. (2009), Lin and Shih
(2001)) is a rule-based method in which the trial design and interpretation of results
are typically not performed within a formal statistical inference framework. This
results in several weaknesses:
1. The algorithm does not explicitly produce an estimate of the toxicity rate at the
MTD. Studies show that the targeted toxicity rate is not the widely believed 33%
and depends on extraneous factors such as the number of dose levels used (Lin
and Shih 2001). In most typical applications, the targeted toxicity rate resulting
from a 3 + 3 design is between 23% and 28%, but it can be much lower or higher
(Smith et al. 1996).
2. Depending on the circumstances, investigators can elicit maximum targeted
toxicities for drugs (higher for drugs with higher potential efficacy and lower
for preventive agents), yet the 3 + 3 algorithm cannot incorporate this information
when finding the MTD.
In the frequentist framework, unknown quantities are treated as fixed while the data are
random; in the Bayesian framework, the data are considered fixed since they have been
collected. Bayesian inference applies probabilistic statements
to estimate these unknowns directly given the data. While the frameworks exhibit
important differences, the fact that both have been successfully used in nearly every
scientific field demonstrates their breadth and versatility. These characteristics are
critical for clinical trials which seek answers to a broad range of questions, including
evaluation of efficacy, safety monitoring, dose identification, treatment assignment,
adaptation to interim events, go/no-go decision on drug development, cost-benefits
analysis, etc.
In the context of clinical trials, frequentist inference dominated the design and
analysis of clinical trials in the twentieth century. However, in the last 25 years,
Bayesian methods have increased in popularity due to a number of factors including
advances in computing power, better computational algorithms, deficiencies in
frequentist measures of evidence (such as the p-value), and the need for complex,
adaptive trial designs (Biswas et al. 2009; Lee and Chu 2012; Tidwell et al. 2019).
This chapter is organized as follows. Section “Inferential Frameworks: Samples,
Populations, and Assumptions” contrasts inferential frameworks with rule-based
methods for inference and discusses sampling assumptions common to both
frequentist and Bayesian frameworks. The two dominant inferential frameworks,
frequentist and Bayesian, are reviewed in sections “Frequentist Framework” and
“Bayesian Framework,” respectively. Popular designs are discussed such as the
frequentist Simon’s optimal two-stage design (Simon 1989) and the Bayesian design
of Thall et al. (1995). In section “Inferential Frameworks: Connections and
Synthesis,” efforts to connect the frameworks as well as Bayesian-frequentist hybrid
designs are reviewed. Section “Inferential Frameworks and Modern Trial Design
Challenges” describes recent developments and challenges in trial design. Section
“Summary and Conclusion” concludes with a discussion.
Consider, as a running example, a single-arm trial of an experimental treatment with
unknown population response rate p. Questions of interest include:
1. What is a reasonable best guess (i.e., estimate) for p? Typically, this sample-based
estimate of p is denoted by $\hat{p}$ (read "p hat").
2. Is the new treatment better than standard of care? If standard of care produces a
response in 20% of patients, this statement becomes is p greater than 20%?
3. What is a range of plausible values for p?
4. What is the probability that the response rate is greater than 40%?
All of these questions except question #4 can be addressed in the frequentist and
Bayesian inferential frameworks. Question #4 can only be answered by the Bayesian
framework because it assumes that the unknown response rate is random, while in the
frequentist framework, the unknown response rate is considered fixed and does not
have a distribution. Note that the unknown, population response rate p is treated
separately from the sample-based estimate $\hat{p}$. This separation of sample and population
is critical for formal inferential reasoning and is not typically discussed in the 3 + 3
dose finding algorithm. In 3 + 3, the MTD is defined based on the sample. In fact, 3 + 3
produces a number more akin to $\widehat{\mathrm{MTD}}$, an estimate of the true MTD. The 3 + 3 design
does not precisely define the true MTD which it is attempting to estimate. However,
one could posit the true MTD as the highest of the prespecified doses which will
produce a toxicity in no more than 33% of all patients in the population. With this
definition, one can now ask questions akin to 2 and 3, such as how likely is $\widehat{\mathrm{MTD}}$ to be
no higher than MTD? The 3 + 3 design does not offer any answer to this seemingly
important question and in fact makes this question difficult to even ask by not
conceptually separating the estimate, $\widehat{\mathrm{MTD}}$, from what is being estimated, MTD.
Inferential frameworks require the existence of a population on which inferences
are to be drawn. Typically, the population of interest in clinical trials is all patients
who would receive the treatment if it were to be approved by a regulatory agency. In
the case of the experimental cancer drug, this may be all present and future patients
in the United States with Stage III pancreatic cancer.
Nearly all clinical trial designs, both Bayesian and frequentist, assume that the
sample is collected by randomly selecting patients from the population with each
patient being equally likely to be selected. In practice this random sampling is not
done for several reasons including that there do not exist easily accessible lists of all
patients in the population and that the resulting selection would result in a geograph-
ically diverse set of patients who would be difficult to treat. Instead patients are
enrolled at one or a small set of usually academic institutions (thus geographically
biasing the sample) and must consent to be treated (thus biasing the sample towards
individuals with a high desire for new treatments). The resulting sample may differ
from the population in terms of socioeconomic status, ethnicity, present health
condition, disease severity, etc.
The extent to which deviations in sampling assumptions occur is situation
dependent. Phase II clinical trials typically seek to demonstrate treatment efficacy
relative to efficacy on historical controls. Observed differences between treatment
and control may be the result of treatment efficacy or differences between trial
subjects and historical controls, e.g., the trial may systematically select healthier
patients than the historical control group. In this setting, standard Bayesian and
frequentist methods will both produce biased results. Phase III double-blinded
randomized studies are less likely to suffer from selection bias caused by deviations
in sampling assumptions. Yet problems can arise regarding generalization of study
findings to patient groups not meeting study eligibility criteria. See Jüni et al. (2001)
for a discussion of these issues. In addition, journals are more likely to publish
studies with positive findings. Hence, readers must be aware that publication bias
may portray a rosier picture than the truth.
Frequentist Framework
Fig. 1 Binomial probability mass function f(x | p = 20%) for x = 0, 1, . . . , 10 responses. This is a graphical illustration of Eq. 1 with p = 20%
While the sample proportion may seem like the only reasonable estimator, the Bayesian
inferential approach discussed in the next section typically suggests a different estimate.
In this response example, maximum likelihood estimation (MLE) and method of moments
(MOM) theory do not produce a particularly surprising result. However, for more
complicated statistical models, these methods can be used to derive estimators when
intuition fails. For example, suppose dozens of candidate biomarkers (SNPs, protein
expression levels, prior treatments, etc.) are measured and some of them are associ-
ated with the response status. MLE, in conjunction with a logistic regression
statistical model, could be used to identify combinations of biomarkers which
predict response.
Hypothesis testing in conjunction with p-values is used to answer question 2. Typically,
the hypothesis that the experimental treatment does not exceed standard of care
(p = 20%) is termed the null hypothesis, while experimental treatment beating
standard of care (p > 20%) is termed the alternative hypothesis. These may be
written as
$$H_0: p \le 20\%, \qquad H_a: p > 20\%.$$
A confidence interval, which addresses question 3, provides a range of plausible values
for the parameter. In the context of the example, with six in ten responses, a 95% confidence
interval for p may span 31–83% (Wilson binomial confidence interval; Wilson
1927). Here, again, the natural interpretation is to claim that the probability that p
is in the interval 31–83% is 95%. However, the frequentist inferential framework
cannot apply probabilities to p, so this statement does not have meaning. Instead, the
95% refers to what would occur if the trial were run again and again, and in each
trial, one made a 95% confidence interval. Approximately 95% of these intervals
would contain p. However, any individual interval either does or does not contain p.
Again, under the frequentist approach, the data is random while the parameter is
unknown but fixed.
Figure 2 illustrates this concept. Here, 100 hypothetical trials were run, each with
10 patients. The true response rate is 40%. For each trial, a 95% confidence interval
is produced. The horizontal lines in Fig. 2 show the confidence intervals for each
trial. The black lines denote trials in which the resulting confidence interval includes
the true response rate, while the red lines denote trials in which the resulting
confidence interval does not include the true response rate. For about 95% of trials
(in this particular example, 96%), the confidence interval will include the true
response rate. However, under the frequentist paradigm, any particular trial confi-
dence interval either does or does not contain the true response rate, so the 95%
interpretation can only be applied to the average coverage probability of confidence
intervals, not any single interval. The "frequentist" framework studies the long-run
frequency properties of estimators across repeated experiments, hence the name. In
both the hypothesis testing and confidence interval estimation, the frequentist
approach answers the question of interest indirectly.
Fig. 2 Ninety-five percent confidence intervals from 100 hypothetical trials (x-axis: Percentage Remission, 0–100)
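A simulation in the spirit of Fig. 2 can be written in a few lines; the sketch below assumes the Wilson (1927) interval, implemented directly, and a fixed random seed, both illustrative choices.

```python
# Coverage simulation: 100 trials of 10 patients, true response rate 40%,
# Wilson 95% confidence interval computed for each trial.
import numpy as np

rng = np.random.default_rng(0)
z = 1.96                       # approximate 97.5th percentile of N(0, 1)
p_true, n, n_trials = 0.4, 10, 100

covered = 0
for _ in range(n_trials):
    x = rng.binomial(n, p_true)
    p_hat = x / n
    # Wilson score interval
    center = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z / (1 + z**2 / n)) * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    covered += (center - half <= p_true <= center + half)

print(f"{covered} of {n_trials} intervals contain the true rate")  # typically ~95
```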
The American Statistical Association's statement on p-values (Wasserstein and Lazar
2016) offers six principles:
• P-values can indicate how incompatible the data are with a specified statistical
model.
• P-values do not measure the probability that the studied hypothesis is true, or the
probability that the data were produced by random chance alone.
• Scientific conclusions and business or policy decisions should not be based only
on whether a p-value passes a specific threshold.
• Proper inference requires full reporting and transparency.
• A p-value, or statistical significance, does not measure the size of an effect or the
importance of a result.
• By itself, a p-value does not provide a good measure of evidence regarding a
model or hypothesis.
The frequentist hypothesis testing paradigm does not offer a method to conclude
that the null hypothesis is true. The result of the test is generally a p-value and a
conclusion that the null was rejected at some Type I error control rate (usually 0.05)
or the null hypothesis was not rejected. The conclusion is never that the null
hypothesis is true or highly likely to be true. Bayesian hypothesis testing permits
reaching these conclusions as will be discussed in Sections “Bayesian Framework”
and “Reducing Subjectivity and Calibrating P-Values with Bayes Factors.”
In other words, since null hypothesis significance testing (NHST) calculates the
p-value assuming that the null hypothesis H0 is true, the p-value is not the probability
that H0 is true. No specific alternative hypothesis H1 needs to be specified. No
estimation of the treatment effect is given. The inference is based on unobserved data
and violates the likelihood principle (Berger and Wolpert 1988).
In a recent special issue of the American Statistician, a collection of 43 articles
provides further discussion on p-values and statistical inference in general (Wasser-
stein et al. 2019). There are a few "Do's" and "Don'ts" offered in the editorial. For
the "Don'ts":
• Don’t base your conclusions solely on whether an association or effect was found
to be “statistically significant” (i.e., the p-value passed some arbitrary threshold
such as p < 0.05).
• Don’t believe that an association or effect exists just because it was statistically
significant.
• Don’t believe that an association or effect is absent just because it was not
statistically significant.
• Don’t believe that your p-value gives the probability that chance alone produced the
observed association or effect or the probability that your test hypothesis is true.
• Don’t conclude anything about scientific or practical importance based on statis-
tical significance (or lack thereof).
For the "Do's":
• Accept Uncertainty
• Be Thoughtful
• Be Open
• Be Modest
Bayesian Framework
A Bayesian analysis yields direct probability statements about unknown parameters,
for example: given the observed data, there is a high probability that the response rate
exceeds 20%. This statement avoids the opaque interpretation of p-values and can be seen as a
benefit of the Bayesian inferential framework (Berger 2003). Bayesian approaches to
questions 1, 2, 3, and 4 are now discussed in the context of the response example.
More complex analyses are discussed in references (Berry et al. 2010; Gelman et al.
2013).
In order to make probabilistic statements, prior knowledge about the unknown
parameters is formalized into a prior distribution. The prior is then combined with
the data to produce a posterior distribution. The mathematical machinery for com-
bining the prior with the data is Bayes theorem, from whence the framework gets its
name (Bayes 1763). Bayes theorem states that for two events A and B
PðBjAÞPðAÞ
PðAjBÞ ¼ :
Pð BÞ
The left side of the equation is read “the probability that A is true given that B is
true.” In the context of the response example, A could be “the population response
rate p is greater than 20%” and B could be data such as six out of ten patients
responded. The left side of the equation, P(A|B), then reads the probability that the
response rate is greater than 20% (in the population) given that six out of ten
responses were observed in the sample.
The central concept of Bayesian statistics is information synthesis. Specifically,
Bayesian 1-2-3 is that prior plus data becomes posterior. The current posterior can be
considered as an updated prior for future data acquisition. The Bayesian method
takes a “learn-as-we-go” approach to perform continual learning by synthesizing all
available information at hand.
Note that it is a misconception that frequentist statisticians do not use Bayes
theorem. Bayes theorem is a mathematical fact, accepted and used by all statisticians.
The frequentist framework objects to the representation of unknown quantities using
distributions, in particular the prior, not the existence of Bayes theorem.
Figure 3 demonstrates simple Bayesian statistics with the response example. The
x-axis represents different possible response rates p for the experimental drug. The
blue and red curves and area under the corresponding curves represent prior knowl-
edge about p (prior distribution) and the updated knowledge of p after observing the
data (posterior distribution). The word prior refers to prior to data collection. The
mathematical form for this prior is a beta distribution. For Fig. 3, left panel,
$$\pi(p) = \frac{1}{B(0.6,\,1.4)} \left(\frac{p}{100}\right)^{-0.4} \left(1 - \frac{p}{100}\right)^{0.4} \qquad (3)$$
This prior favors low response percentages with a mean response rate of 0.3 and
an effective sample size of 2. The source of prior knowledge can be subjective and is
sometimes controversial in Bayesian statistics. However, it is reasonable to formu-
late the prior distribution based on response rates observed with past experimental
drugs.
The prior distribution in Eq. 3 is combined with the likelihood function of Eq. 2 to
produce a posterior distribution. The posterior distribution of p after observing six
responses out of ten evaluated patients is mathematically computed by
$$\pi(p \mid x = 6) = \frac{f(x = 6 \mid p)\,\pi(p)}{\int f(x = 6 \mid p)\,\pi(p)\,dp} = \frac{1}{B(6.6,\,5.4)} \left(\frac{p}{100}\right)^{5.6} \left(1 - \frac{p}{100}\right)^{4.4}.$$
Fig. 3 (Left) Prior and posterior distributions. The prior is Beta(0.6,1.4), while the posterior is Beta
(6.6,5.4). (Right) A second example with the same data but a different prior. The prior is Beta
(1.4,0.6) and the posterior is Beta(7.4,4.6). The posterior distribution differs from that in (Left),
reflecting the subjective nature of Bayesian analyses
The first equality is Bayes theorem, while the second equality involves algebraic
manipulations. Notice the probabilities have shifted considerably after observing the
data of six responses out of ten patients. While the prior states that the probability that p is
greater than 20% is only 53%, the posterior assigns a greater than 99% chance to this
event.
The posterior distribution in red is used to answer questions 1, 2, 3, and 4. The
most common Bayesian point estimator is the posterior mean or the average value of
p as indicated by the posterior distribution. With six out of ten responses, the
posterior mean is 55%. This differs from the sample response percentage of
60%. The reason for this difference is that the prior distribution (in blue) favored low
percentages and is still exerting an effect on the point estimate even after collecting
the data. As the amount of data increases, the prior will have less and less effect. The
posterior mean estimator will become closer to the percentage of responses in the
sample. For example, with 60 responses out of 100 patients, the posterior mean
(using the blue prior) is 59%. Since the typical frequentist estimate is the sample
proportion, 60%, this example illustrates that as the sample size increases, Bayesian
and frequentist methods come into closer agreement. The same behavior occurs in many
other settings.
For addressing question 2, one can use the posterior distribution which shows
there is a 99.6% chance that the experimental treatment exceeds standard of care
(20% response). This is determined by calculating the percentage of the red posterior
area which is greater than 20% on the x-axis. This efficacy result would likely form
the basis for proceeding to a Phase III trial. For example, one could decide to proceed
to a Phase III trial if there is at least a 95% chance that the experimental treatment
exceeds standard of care. In practice, such decisions usually involve a number of
factors including toxicity analysis. In section “Multiple Outcomes and Utility
Functions” statistical methodology for formal incorporation of multiple objectives
(e.g., efficacy and safety) in decisions is discussed.
For addressing question 3, the red posterior distribution indicates there is a 95%
chance that the response rate is between 28% and 81%. This is known as a 95%
credible interval, the Bayesian equivalent of a confidence interval. These endpoints,
28% and 81%, are the 2.5 and 97.5 percentiles of the red curve (i.e., the area of the
red region to the left of 28% is 0.025, and the area of the red region to the left of 81%
is 0.975).
For addressing question 4, the tail probability of the response rate greater than
40% can be calculated by summing up or integrating the area under the curve from
the response rate of 0.4 to 1.0, which is 85%.
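All four answers follow from the Beta(6.6, 5.4) posterior and can be checked with a few lines of code; the sketch below works on the 0–1 scale rather than the percentage scale used in Fig. 3.

```python
# Conjugate beta-binomial update for the response example.
from scipy.stats import beta

a, b = 0.6, 1.4                 # prior: Beta(0.6, 1.4), mean 0.3
x, n = 6, 10                    # observed: 6 responses in 10 patients
post = beta(a + x, b + n - x)   # posterior: Beta(6.6, 5.4)

print(post.mean())                  # Q1: posterior mean, ~0.55
print(1 - post.cdf(0.20))           # Q2: P(p > 20% | data), ~0.996
print(post.ppf([0.025, 0.975]))     # Q3: 95% credible interval, ~(0.28, 0.81)
print(1 - post.cdf(0.40))           # Q4: P(p > 40% | data), ~0.85
```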
The frequentist testing paradigm treats the null and alternative hypotheses asym-
metrically, and there is no clear way to conclude the null is true or make a statement
about confidence in the null hypothesis. In contrast Bayesian hypothesis tests treat
the null and alternative symmetrically. A posterior probability for each hypothesis
may be reported (sum of probabilities of null and alternative will of necessity
equal 1). In section "Reducing Subjectivity and Calibrating P-Values with Bayes
Factors," we will discuss Bayesian hypothesis testing using Bayes factors.
Sequential Design
In many data analysis applications, the sample size is fixed and the data is analyzed
only after collection. However, in clinical trials patients accrue sequentially, offering
the opportunity for stopping based on interim analysis of safety and efficacy results.
The Bayesian framework is particularly simple because the reason for stopping
formally has no impact on the subsequent data analysis (Berger and Wolpert
1988). A caveat to this message is that data-dependent stopping rules can increase
sensitivity to prior distribution assumptions. For example, Bayesian credible inter-
vals constructed using conservative priors can have anti-conservative coverage
probability (lower than the specified amount) when data-dependent stopping rules
are used (Rosenbaum and Rubin 1984). Frequentist measures of evidence, such as p-
values, explicitly require incorporating the reason for stopping. This can be
implemented a priori, such as in Simon’s optimal two-stage design, in which interim
stopping is accounted for in the hypothesis test decision with the specified Type I and
Type II error rates.
Thall et al. (1995) proposed a popular Bayesian sequential design for Phase II clinical
trials. The design is used for monitoring multiple outcomes, such as safety and
efficacy. Data can be monitored patient by patient, but typically interim analyses are
conducted in cohort sizes of five or ten to reduce logistical burden. At each interim
analysis, such as five patients with toxicity and efficacy data available, the method
produces a probability distribution for the parameters, similar to Fig. 3. Prior to the
trial, tolerable safety and efficacy boundaries are established. For example, stop if the
probability of efficacy being greater than 20% is less than 5% or the probability of
toxicity greater than 25% is greater than 95%. Thus, the trial is stopped whenever
one becomes confident in efficacy less than 20% or toxicity greater than 25%. From
these probabilistic thresholds, one can determine stopping boundaries (number of
toxicities or treatment failures which will terminate the trial) at various interim
analysis points. In this way the trial design produces simple rules for continuing or
terminating the trial, much like a 3 + 3 design, but with probabilistic guarantees
about the decisions. In addition to stopping boundaries, operating characteristics of
the trial such as the probability of stopping early given some efficacy and toxicity
levels can be tabulated prior to trial initiation.
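A simplified sketch of such interim rules follows. It monitors efficacy and toxicity with two independent Beta(1, 1)-prior models, whereas Thall et al. (1995) model the outcomes jointly; the probability cutoffs are the illustrative ones from the text, and the boundary at an early look may be empty.

```python
# Interim stopping rules in the spirit of Thall et al. (1995), simplified
# to two independent beta-binomial models with vague Beta(1, 1) priors.
from scipy.stats import beta

def stop_for_futility(responses: int, n: int, p0=0.20, cutoff=0.05) -> bool:
    """Stop if P(efficacy rate > p0 | data) drops below the cutoff."""
    post = beta(1 + responses, 1 + n - responses)
    return (1 - post.cdf(p0)) < cutoff

def stop_for_toxicity(toxicities: int, n: int, q0=0.25, cutoff=0.95) -> bool:
    """Stop if P(toxicity rate > q0 | data) exceeds the cutoff."""
    post = beta(1 + toxicities, 1 + n - toxicities)
    return (1 - post.cdf(q0)) > cutoff

# Tabulate the stopping boundaries at an interim look of n = 10 patients:
n = 10
futility = [r for r in range(n + 1) if stop_for_futility(r, n)]
toxicity = [t for t in range(n + 1) if stop_for_toxicity(t, n)]
print("stop if responses in", futility, "or toxicities in", toxicity)
```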
Bayesian Computation
Although Bayes theorem was published more than 250 years ago, its use was limited
to conjugate models in which the posterior distribution has the same parametric form
as the prior distribution (such as the beta-binomial example discussed in the previous
section). In these cases, analytic solutions are available for computing the posterior
distribution. Bayesian computation beyond conjugate cases can be demanding.
Since the late 1980s, a combination of faster computers and better algorithms
(e.g., Gibbs sampling, Metropolis-Hastings sampling, and general Markov Chain
Monte Carlo (MCMC)) has made Bayesian computation for clinical trial data sets
feasible and even routine. Software tools such as BUGS, JAGS, STAN, and SAS
PROC MCMC allow easy implementation of a wide spectrum of Bayesian models
(Spiegelhalter et al. 1996; Plummer 2003; Chen 2009; Carpenter et al. 2017).
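To illustrate what such samplers do under the hood, here is a minimal random-walk Metropolis-Hastings sampler for the response example. It is purely didactic, since for this conjugate model the Beta(6.6, 5.4) posterior is available in closed form.

```python
# Random-walk Metropolis-Hastings for the response-rate posterior.
import numpy as np
from scipy.stats import beta, binom

a, b, x, n = 0.6, 1.4, 6, 10      # prior parameters and observed data

def log_post(p: float) -> float:
    """Unnormalized log posterior: log prior + log likelihood."""
    if not 0 < p < 1:
        return -np.inf
    return beta.logpdf(p, a, b) + binom.logpmf(x, n, p)

rng = np.random.default_rng(1)
p, samples = 0.5, []
for _ in range(20_000):
    proposal = p + rng.normal(0, 0.1)               # random-walk proposal
    if np.log(rng.uniform()) < log_post(proposal) - log_post(p):
        p = proposal                                # accept the move
    samples.append(p)

draws = np.array(samples[2_000:])                   # discard burn-in
print(draws.mean(), (draws > 0.2).mean())           # ~0.55 and ~0.996
```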
Bayesian practice encompasses several schools of thought, distinguished largely by how
the prior distribution is chosen and used:
1. Empirical: Prior distributions are constructed from data, usually in the context of
a hierarchical model which has multiple levels of parameters (see diagnostic
testing example below for a definition and example of a hierarchical model).
While empirical Bayesian analyses use Bayes theorem, the resulting inferences
typically reported are frequentist (e.g., confidence intervals rather than credible
intervals, MLEs rather than posterior means). Many statisticians who consider
themselves frequentists use empirical Bayes methods.
2. Reference (or Objectivist): A set of default, or reference, prior distributions are
used which do not attempt to incorporate subjective knowledge about the param-
eters. Default priors may be chosen to have favorable mathematical properties
(Jeffreys 1946). Reference Bayesians may employ non-informative/vague priors
such as Uniform(0, 1), or improper priors, which do not correspond to prior belief
because they are not probability densities, such as a normal distribution prior with
infinite variance.
3. Proper (or Subjectivist): The prior is chosen to reflect subject matter knowledge
about the parameter. Different practitioners will have different beliefs about the
parameter, resulting in different informative priors and different inferences.
4. Decision Theoretic: Utility functions, which assign numeric values to different
outcomes, are combined with posterior distributions, which reflect uncertainty
about the state of nature, to make decisions. The particular choice of utility
function adds an additional level of subjectivity to the analysis, for example, in
the trade-off of toxicity versus efficacy or cost versus benefit.
An example from diagnostic testing may help illustrate the differences, similarities,
and continuity in these schools of thought. Suppose a test for lung cancer has 95%
sensitivity and 90% specificity. For a particular individual, let the parameter θ equal 1 if
the person has cancer and 0 if not. Let Y equal 1 if the test is positive for the individual
and 0 if negative. What is the probability that this person has cancer given she tests
positive, i.e., P(θ = 1 | Y = 1)? The following steps illustrate a dynamic back-and-forth
among the various schools of statistical thinking on how to address this question:
(i) Using Bayes theorem with an assumed disease prevalence p, the post-test
probability is P(θ = 1 | Y = 1) = 0.95p/(0.95p + 0.1(1 − p)). For example, with
p = 0.1, the post-test probability is 0.51. This could be viewed as a subjective
Bayesian analysis in which the prior was chosen based on the practitioner's belief
about disease prevalence.
(ii) A frequentist statistician may object to this analysis as subjective. Where does
the disease prevalence number originate? The frequentist statistician may
engage in a literature search and find that 100 individuals from the population
were given a gold standard lung cancer test (always provides correct result),
and X had cancer. The frequentist computes the (binomial) MLE $\hat{p} = X/n$, the
sample proportion who have the disease. The posttest probability estimate is

$$\hat{P}(\theta = 1 \mid Y = 1) = \frac{0.95\,\hat{p}}{0.95\,\hat{p} + 0.1\,(1 - \hat{p})}.$$
(iii) A Bayesian statistician may note that the population prevalence p is itself unknown
and put a prior on this quantity. This distribution represents belief about the
population prevalence prior to observing the gold standard data X. A subjectivist
Bayesian will incorporate beliefs into this prior, e.g., most diseases are not
common, so the prior on p will favor disease prevalences less than 0.5. In
contrast, the two objective Bayesian priors for this model are Beta(1/2,1/2)
(Jeffreys prior) and Beta(1,1) (suggested by Laplace); see Mossman and Berger
(2001).
(iv) While the empirical, reference, and proper Bayesian will all report
$\hat{P}(\theta = 1 \mid Y = 1)$, a decision theoretic Bayesian will go a step further and seek
to use the data X and Y (along with the prior distributions) to make some choice
of action. For example, given that an individual tests positive, should she
undergo an additional invasive test which has some potential side effects?
This decision requires balancing many factors, including the risks associated
with having undetected disease and side effects of the invasive test, along with
the measure $\hat{P}(\theta = 1 \mid Y = 1)$. The consideration of all these factors is typically
formalized in a utility function.
Mossman and Berger (2001) discuss this problem in the more complex case
where the sensitivity and specificity must be estimated from data and confidence
bounds (rather than simply point estimates) for $\hat{P}(\theta = 1 \mid Y = 1)$ are desired. They
find the objectivist Bayesian method has desirable properties relative to frequentist/
empirical Bayesian analyses.
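The contrast between the plug-in frequentist estimate and a simple objective-Bayes alternative can be sketched as follows; the gold-standard counts X = 10 and n = 100 are hypothetical values chosen only for illustration.

```python
# Post-test probability: frequentist plug-in vs Jeffreys-prior estimate.
sens, spec = 0.95, 0.90          # test sensitivity and specificity
X, n = 10, 100                   # hypothetical gold-standard data

def post_test(p: float) -> float:
    """P(disease | positive test) for a given prevalence p (Bayes theorem)."""
    return sens * p / (sens * p + (1 - spec) * (1 - p))

p_mle = X / n                             # frequentist plug-in estimate
p_jeffreys = (X + 0.5) / (n + 1)          # posterior mean under Beta(1/2, 1/2)

print(post_test(p_mle))       # ~0.51, matching the worked example
print(post_test(p_jeffreys))  # similar here; the two differ more for small n
```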
The schools of thought above involve considerable overlap. Some statisticians
advocate using multiple approaches within the same problem. For example, cali-
brated Bayes recommends using Bayesian (either reference or proper) analysis for
fitting statistical models while using frequentist methods to test the quality of the
model itself (Little 2006).
Several important elements for applying Bayesian methods deserve emphasis.
An important principle of Bayesian analysis is that one should not manipulate
the prior in order to obtain desirable results post data collection. After setting the
prior, sensitivity analysis can be applied to study its influence. For example, the two
posteriors in Fig. 3 are based on two priors (with different parameters a and b) and
can help illustrate to what extent posterior conclusions are influenced by the prior.
Inferential Frameworks: Connections and Synthesis

Rule-based, frequentist, and Bayesian trial designs all exist and are in use because
each approach has strengths: rule-based methods are simple to follow, frequentist
methods avoid prior selection, and Bayesian methods are flexible. The strengths of
each approach have motivated trial design and statistical methodology which incor-
porate ideas from multiple frameworks.
Reducing Subjectivity and Calibrating P-Values with Bayes Factors

In the response example above, the prior belief about p (blue curve in Fig. 3 left)
implied a prior belief about whether the response rate of the experimental treatment
exceeded that of the standard of care (20%). In particular, prior to collecting any
data, one assumed a 53% chance that the experimental treatment had response rate
greater than 20% (the area under the blue curve to the right of 20%) and hence a 47%
chance that the experimental treatment was worse than standard of care (the area
under the blue curve to the left of 20%). The act of assigning prior probabilities to the
null and alternative hypothesis may be seen as especially subjective.
One Bayesian remedy for this problem is to construct prior distributions for the
null hypothesis and alternative hypothesis separately. One may then calculate the
probability of the data given the null hypothesis is true and the probability of the data
given the alternative hypothesis is true. The ratio of these two quantities is known as
the Bayes factor. Letting D denote data, H0 denote the null hypothesis, and H1 denote
the alternative hypothesis, the Bayes factor links the posterior odds with the prior
odds:

$$\frac{P(H_0 \mid D)}{P(H_1 \mid D)} = \mathrm{BF} \times \frac{P(H_0)}{P(H_1)}, \qquad \mathrm{BF} = \frac{P(D \mid H_0)}{P(D \mid H_1)}.$$
Table 2 (adapted from Goodman (1999) Table 1) relates the Bayes factor, the prior
probability of the null hypothesis, and the posterior probability of the null hypoth-
esis. For example, with a Bayes factor of 1/5 and a prior probability on the null
hypothesis of 90%, the posterior probability of the null is 64%. This is determined by
noting that a 90% prior on the null is equivalent to 9:1 odds, so the posterior odds
are $\frac{1}{5} \times \frac{9}{1} = \frac{9}{5}$. The posterior odds are then converted to a posterior probability:
$\frac{9/5}{9/5 + 1} \approx 0.64$. Reporting the Bayes factor (BF) does not require specifying the prior
odds (i.e., the Bayes factor does not depend on the prior probability of the null
hypothesis) and is thus perceived as testing a hypothesis "objectively" (Berger
1985), removing "an unnecessary element of subjectivity" (Johnson and Cook
2009).
In Fig. 3 (left), the Bayes factor can be computed as a way to remove the influence
of the initially specified prior probability of the null, P(H0) = 0.47, and only consider
the conditional priors P(p | H0) and P(p | H1) when making a decision about the
veracity of the null hypothesis. The Bayes factor can be determined for Fig. 3 (left)
by noting that P(H0 | D) = P(p < 0.2 | data) = 0.0041 and thus P(H1 | D) =
P(p > 0.2 | data) = 0.9959, resulting in posterior odds of 0.0042. The prior distribution
Beta(0.6,1.4) implies P(H0) = 0.47 and P(H1) = 0.53, with prior odds of 0.87. Thus,
the Bayes factor, the posterior odds divided by the prior odds, is 0.0048
in favor of the alternative hypothesis that the response rate is greater than 0.2.
A user of the Bayes factor can specify his own prior probability and then compute
the posterior probability of the alternative or consider several possible prior proba-
bilities (as in Table 2). For example, a Bayes factor of 1/100 is considered strong to
very strong evidence for the alternative hypothesis because even assuming a 90%
prior probability on the null, the posterior probability of the null is only 8%.
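The odds calculus behind Table 2 is easy to script; the sketch below reproduces the two worked examples.

```python
# Combine a Bayes factor with a prior probability of the null hypothesis
# to obtain the posterior probability of the null.
def posterior_null(bayes_factor: float, prior_null: float) -> float:
    prior_odds = bayes_factor_input = prior_null / (1 - prior_null)
    posterior_odds = bayes_factor * prior_odds   # BF = P(D|H0) / P(D|H1)
    return posterior_odds / (1 + posterior_odds)

print(posterior_null(1 / 5, 0.90))    # ~0.64, as in the worked example
print(posterior_null(1 / 100, 0.90))  # ~0.08: strong evidence against H0
```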
Bayes factors can also be used to calibrate p-values in an attempt to find more
objective cutoffs than 0.05. Johnson (2013) developed connections between Bayes
factors and p-values based on uniformly most powerful Bayesian tests. He argued that
the p-value thresholds of 0.005 and 0.001 should be used to denote significant and
highly significant results in clinical trials, rather than the typical 0.05.
A simple version of this idea can be seen with the Gaussian null hypothesis
testing problem. Suppose the common scenario of testing a simple null hypothesis
(population mean equals some constant) and that under the null and alternative
hypotheses, the frequentist test statistic has a normal distribution (Z score). Then
for a given Z-score (equivalently p-value), one can derive a minimum Bayes factor
and hence a maximum level of support for the alternative hypothesis. Table 3
(adapted from Goodman (1999) Table 2) displays these relations. The common p-
value threshold of 0.05 implies a minimum Bayes factor of 0.15. If one assigns a
50% chance that the null hypothesis is true a priori, then the posterior on the null is
13%. Thus a p-value of 0.05 could be considered moderate, but certainly not
definitive, evidence for the alternative.
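The Gaussian minimum Bayes factor underlying Table 3 is exp(−z²/2), where z is the standard normal quantile matching a two-sided p-value (Goodman 1999). A sketch follows, assuming a 50% prior on the null for the posterior column.

```python
# Minimum Bayes factor implied by a two-sided Gaussian p-value, and the
# corresponding lower bound on the posterior probability of the null.
import numpy as np
from scipy.stats import norm

def min_bayes_factor(p_value: float) -> float:
    z = norm.ppf(1 - p_value / 2)     # two-sided p-value -> |Z| score
    return np.exp(-z**2 / 2)

for p in (0.05, 0.005, 0.001):
    mbf = min_bayes_factor(p)
    post_null = mbf / (1 + mbf)       # with a 50% prior, P(H0 | data) >= this
    print(f"p = {p}: min BF = {mbf:.3f}, posterior P(H0) >= {post_null:.2f}")
```

For p = 0.05 this yields a minimum Bayes factor near 0.15 and a posterior null probability of at least roughly 13%, matching the numbers quoted above.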
For more on Bayes factors, see Johnson and Cook (2009) for a Phase II single-arm
trial design and Goodman (1999) and Kass and Raftery (1995) for general reviews.
(Software for implementing clinical trial designs with Bayes factors is hosted at
https://biostatistics.mdanderson.org/softwaredownload/.)
Model-based methods such as the continual reassessment method (CRM; O'Quigley
et al. 1990) produce an estimate of the toxicity probability at each dose level, a feature
absent in traditional 3 + 3 designs.
The traditional 3 + 3 remains popular likely due to its simplicity. While dose levels
in the rule-based 3 + 3 can be assigned by any clinical trialist, CRM requires
evaluating the posterior distribution at each cohort, a task that requires computer
software and a statistician.
To address this concern, new trial designs have attempted to merge the good
operating characteristics of CRM with the simplicity of rule-based methods such as
3 + 3. These represent one sort of hybrid trial. A particular example of this new
approach is the Phase I Bayesian Optimal Interval Design (BOIN) (Liu and Yuan
2015). In BOIN, users specify a toxicity interval. The design seeks a dose with
toxicity in the interval. BOIN escalates or de-escalates doses based on the results of a
Bayesian hypothesis test of whether the current dose toxicity is in the acceptable
toxicity interval. Similar to CRM, the method can result in an arbitrary number of
escalations and de-escalations. However, unlike CRM, these decisions are made
only based on the toxicities observed at the current dose, not other doses. This
feature results in simple interval boundaries for the escalation decision which can be
pre-computed at the design level. Thus the trial runs without consulting computer
software, much like the 3 + 3 design. Figure 4 displays a decision flowchart for
a BOIN trial design with a targeted toxicity probability of 0.3. Dose escalation, de-
escalation, and retention decisions are entirely based on comparing the DLT rate at
the current dose to fixed thresholds provided in the diagram. BOIN designs (includ-
ing these operational flowcharts) can be created using freely available web
applications (http://trialdesign.org).
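A sketch of the BOIN decision rule follows. The interval bounds φ1 = 0.6φ and φ2 = 1.4φ are the commonly cited defaults from Liu and Yuan (2015), but verify the resulting thresholds against the design's own flowchart (e.g., Fig. 4) before use.

```python
# BOIN dose decision: compare the observed DLT rate at the current dose
# with two fixed thresholds derived from the target toxicity rate phi.
import numpy as np

def boin_thresholds(phi: float, phi1: float = None, phi2: float = None):
    """Escalation/de-escalation cutoffs (lambda_e, lambda_d)."""
    phi1 = 0.6 * phi if phi1 is None else phi1   # default sub-therapeutic rate
    phi2 = 1.4 * phi if phi2 is None else phi2   # default over-toxic rate
    lam_e = np.log((1 - phi1) / (1 - phi)) / np.log(phi * (1 - phi1) / (phi1 * (1 - phi)))
    lam_d = np.log((1 - phi) / (1 - phi2)) / np.log(phi2 * (1 - phi) / (phi * (1 - phi2)))
    return lam_e, lam_d

def boin_decision(n_dlt: int, n_treated: int, phi: float = 0.3) -> str:
    lam_e, lam_d = boin_thresholds(phi)
    rate = n_dlt / n_treated                      # observed DLT rate
    if rate <= lam_e:
        return "escalate"
    if rate >= lam_d:
        return "de-escalate"
    return "stay"

print(boin_thresholds(0.3))      # ~(0.236, 0.358) for a 0.3 target
print(boin_decision(1, 6))       # 1/6 ~ 0.167 <= 0.236 -> "escalate"
```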
The Phase II sequential monitoring design of Thall et al. (1995) (TSE) proposes early
stopping for safety and efficacy based on posterior probabilities. A disadvantage of
this procedure is that, unlike Simon’s two-stage design (Simon 1989), the method does
not provide a recommendation about proceeding to a Phase III trial. TSE suggest
performing a separate analysis on the final data using either a frequentist or Bayesian
framework. A challenge of using frequentist methods in this setting is that standard
frequentist hypothesis tests will not control Type I or Type II error at the specified level
due to the previous sequential monitoring. Control of these error rates is considered
desirable by regulatory agencies because frequentist error rates do not depend on prior
distributions and are thus seen as more objective than traditional Bayesian hypothesis testing.
Recent Bayesian designs calibrate parameters to obtain desirable frequentist
properties, such as Type I and Type II error control, while simultaneously making
interim decisions in a straightforward Bayesian probabilistic manner. Two such
Phase II designs are the predictive probability design (Lee and Liu 2008) and the
Bayesian Optimal Phase II design (BOP2) (Zhou et al. 2017).
The predictive probability design adds early stopping for efficacy based on the
probability that the trial will achieve its objective, given all current information.
Fig. 4 BOIN flowchart with a targeted probability of toxicity of 0.3. BOIN retains the operational
simplicity of 3 + 3 but with better statistical properties
The null hypothesis is that the true experimental efficacy rate p is no better than the
standard of care efficacy rate p0 (i.e., H0: p ≤ p0). The targeted efficacy for the
experimental treatment is p1. At the final sample size of N, the null hypothesis will
be rejected, and the treatment determined more effective than standard of care, if
P(p > p0 | data on N patients) > θT. Rather than waiting until all N patient responses
have been observed, the trial computes the probability of attaining this result in
cohorts of arbitrary size. The trial is terminated early if the probability falls below
θL (treatment determined ineffective early) or above θU (treatment determined
effective early). Similar to Simon’s two-stage design, the predictive probability
design optimizes over N, θU, θL, and θT to obtain specified Type I and Type II error
rates and minimum sample size. By controlling Type I and Type II error, the design
ensures good frequentist properties while making decisions based on straightfor-
ward posterior probabilities.
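The predictive probability computation itself is a short beta-binomial calculation. The sketch below assumes a Beta(1, 1) prior and an illustrative final-analysis cutoff θT = 0.90; an actual design would also tune θL, θU, and N as described above.

```python
# Predictive probability of end-of-trial success, in the spirit of Lee and
# Liu (2008): average the final success indicator over the posterior
# predictive distribution of the remaining responses.
from scipy.stats import beta, betabinom

def predictive_probability(x, n, N, p0=0.20, theta_T=0.90, a=1, b=1):
    """P(trial succeeds at size N | x responses in n patients so far)."""
    m = N - n                                       # patients still to enroll
    pp = 0.0
    for y in range(m + 1):                          # possible future responses
        w = betabinom.pmf(y, m, a + x, b + n - x)   # posterior predictive P(Y = y)
        final = beta(a + x + y, b + N - x - y)      # posterior after all N patients
        pp += w * ((1 - final.cdf(p0)) > theta_T)   # success: reject H0 at the end
    return pp

print(predictive_probability(x=8, n=20, N=40))
```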
The BOP2 design operates in a similar manner to Thall et al. (1995) (TSE) but
tunes the posterior probability cutoffs at each interim cohort in order to control Type
I error at some predefined threshold while maximizing power (i.e., minimizing Type
II error). The posterior probability cutoffs become more stringent as the trial pro-
gresses, requiring more evidence of treatment efficacy. Like TSE, BOP2 can monitor
multiple endpoints, such as safety and efficacy.
To illustrate BOP2, consider a hypothetical Phase II single-arm study with a
maximum of 50 patients. Each patient will be evaluated for a binary treatment
response and a binary toxicity response. BOP2 requires a user-specified null hypoth-
esis probability of efficacy, toxicity, and efficacy AND toxicity. These probabilities
could be determined from historical controls and are here assumed to be P(Eff) = 0.2,
P(Tox) = 0.4, and P(Eff & Tox) = 0.08. BOP2 then constructs a vague prior
distribution belonging to the Dirichlet class with these historical control probabilities
as the prior parameter means.
The power is computed at a user input value under the alternative hypothesis, here
specified as P(Eff) = 0.4, P(Tox) = 0.2, and P(Eff & Tox) = 0.08. A Type I error rate,
here 0.05, is chosen, as well as interim monitoring at 10, 20, 30, and 40 patients. BOP2
then seeks probability stopping thresholds to maximize power. These probability
thresholds are converted into stopping criteria in terms of number of toxicities and
number of responses at each interim cohort. The stopping boundaries are listed in
Table 4. This design obtains a power of 0.95 to reject the null hypothesis and conclude
that the treatment is more efficacious and less toxic than the null hypothesis values.
One can also assume particular efficacy and toxicity values and calculate the
probability of stopping, the probability of claiming the treatment is acceptable, and
the expected sample size. These are known as the operating characteristics for the
trial and are contained in Table 5. Higher efficacy and lower toxicity probabilities
lead to lower chance of early stopping and higher chance of claiming acceptable. At
the null value of Pr(Eff) = 0.2 and Pr(Tox) = 0.4 (second row), the probability of
claiming acceptable is 3.57%, below the specified 5% threshold. At the alternative
value of Pr(Eff) = 0.4 and Pr(Tox) = 0.2, the power of the test is 95.36%.
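Operating characteristics like those in Table 5 can be approximated by Monte Carlo simulation using the Table 4 boundaries. The sketch below draws each patient's joint (efficacy, toxicity) outcome from the four cells implied by the stated marginals and counts interim stops and final acceptances; it is an illustration, not the BOP2 software itself.

```python
# Monte Carlo check of BOP2 operating characteristics using the Table 4
# boundaries: n -> (stop if responses <= r, or toxicities >= t).
import numpy as np

rng = np.random.default_rng(2)
looks = {10: (0, 6), 20: (3, 10), 30: (5, 13), 40: (8, 16), 50: (12, 18)}

def simulate(p_eff, p_tox, p_both, n_sims=10_000):
    # joint cells: (eff & tox, eff only, tox only, neither)
    cells = [p_both, p_eff - p_both, p_tox - p_both, 1 - p_eff - p_tox + p_both]
    early, accept = 0, 0
    for _ in range(n_sims):
        draws = rng.choice(4, size=50, p=cells)
        eff = np.cumsum((draws == 0) | (draws == 1))  # cumulative responses
        tox = np.cumsum((draws == 0) | (draws == 2))  # cumulative toxicities
        for n, (r, t) in looks.items():
            if eff[n - 1] <= r or tox[n - 1] >= t:
                early += n < 50      # stopping at the final look is not "early"
                break
        else:
            accept += 1              # passed every boundary: claim acceptable
    return early / n_sims, accept / n_sims

print(simulate(0.2, 0.4, 0.08))  # cf. Table 5 row 2: ~0.87 and ~0.036
print(simulate(0.4, 0.2, 0.08))  # cf. Table 5 row 4: ~0.04 and ~0.95
```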
The dynamic performance of the BOP2 design can be visualized with animations.
A still image from one such animation is contained in Fig. 5. Patients are monitored
in cohorts of size 10 up to a maximum of 50. At each interim cohort, a Go/No Go
decision is made based on the number of responses (left plot) and toxicities (right
Table 4 Stopping rules for a BOP2 trial design with the null hypothesis of P(Eff) = 0.2, P(Tox) = 0.4, and P(Eff & Tox) = 0.08, 5% Type I error

Cohort size | STOP IF # responses ≤ | OR # toxicities ≥
10 | 0 | 6
20 | 3 | 10
30 | 5 | 13
40 | 8 | 16
50 | 12 | 18
Table 5 Operating characteristics for a BOP2 trial design with the null hypothesis of P(Eff) = 0.2, P(Tox) = 0.4, and P(Eff & Tox) = 0.08, 5% Type I error, and the alternative hypothesis of P(Eff) = 0.4, P(Tox) = 0.2, and P(Eff & Tox) = 0.08, 95% power

Pr(Eff) | Pr(Tox) | Pr(Eff & Tox) | Early stopping (%) | Claim acceptable (%) | Sample size
0.2 | 0.2 | 0.04 | 66.63 | 15.9 | 32.6
0.2 | 0.4 | 0.08 | 87.17 | 3.57 | 25.3
0.2 | 0.6 | 0.12 | 99.93 | 0 | 13.9
0.4 | 0.2 | 0.08 | 3.59 | 95.36 | 48.9
0.4 | 0.4 | 0.16 | 61.85 | 20.89 | 34.3
0.4 | 0.6 | 0.24 | 99.73 | 0.03 | 14.9
plot). This study did not stop early because the number of responses and toxicities
stayed within the green regions.
Inferential Frameworks and Modern Trial Design Challenges

The flexibility of the Bayesian and frequentist inferential frameworks enables adaptation
and development of new statistical methodology to address challenges and opportunities
in modern medicine. Here two recent directions for trial design are reviewed with an
emphasis on how the inferential frameworks are impacting design decisions.
The advent of widely available genetic tests enables selective targeting of drugs to
specific patient subpopulations. For example, cancer patients may be screened for
dozens of genetic mutations and then given a treatment which produces optimal
results for their genetic profile. Whereas traditional clinical trials focused on answer-
ing a single question (is this drug effective for a given patient population), newer trial
designs seek to find optimal matches between patient profiles and particular drugs.
Multiple Outcomes and Utility Functions

Clinical trials typically collect data on several outcomes, such as safety and
efficacy, thus offering several metrics on which to compare treatments. This can
make decisions about identifying the “best” treatment difficult, as one drug may be
more effective but also more toxic. A simple and common solution is to define a
maximum toxicity threshold and then select as superior the treatment with accept-
able toxicity and highest efficacy. Alternatively a new drug may undergo a
noninferiority trial for efficacy and then be deemed superior based on safety
(Mauri and D’Agostino Sr 2017).
Decision theory offers a more nuanced alternative. In decision theory, a utility
function determined a priori by the clinician is used to summarize potentially
complex treatment effects as a single number. Murray et al. (2016) discuss utility
functions in the context of cancer patients where treatment efficacy may be
recorded as complete response, partial response, stable disease, and progressive
disease (four categories) while nonfatal toxicities may be recorded as none, minor,
and major (three categories). Each patient has one of 13 possible responses to
treatment (four possible efficacies x three toxicity levels + death). The clinician
then defines a desirability, or utility, of each of these possible 13 responses. Murray
et al. (2016) recommend assigning a utility score of 100 to the best possible
response (complete response with no toxicity) and 0 to the worst outcome
(death). The utility is then computed for each patient enrolled in the trial. The
distribution of utilities in treatment and control groups can be compared using
either frequentist or Bayesian methods. Hypothesis testing can be used to deter-
mine if the experimental treatment convincingly delivers better average utility than
standard of care.
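As a toy illustration of this construction, the sketch below builds a 4 × 3 utility table (plus death scored 0) and computes each arm's mean utility; all utility values and outcome probabilities here are invented purely for illustration.

```python
# Utility-based summary of a 13-category outcome (Murray et al. 2016 style).
import numpy as np

# Hypothetical utilities: rows = efficacy (CR, PR, SD, PD),
# columns = toxicity (none, minor, major); death is scored 0.
utility = np.array([[100, 85, 60],
                    [ 80, 65, 45],
                    [ 50, 40, 25],
                    [ 25, 15,  5]])

def mean_utility(nonfatal_probs: np.ndarray) -> float:
    """Expected utility; death contributes 0, so only the 12 nonfatal
    outcome probabilities enter the weighted sum."""
    return float((nonfatal_probs * utility).sum())

# Invented outcome distributions for two arms (each sums to 0.96,
# leaving 0.04 probability of death):
treatment = np.array([[0.10, 0.08, 0.04],
                      [0.20, 0.12, 0.06],
                      [0.15, 0.09, 0.04],
                      [0.04, 0.03, 0.01]])
control = np.full((4, 3), 0.08)

print(mean_utility(treatment), mean_utility(control))
```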
The AWARD-5 trial used a utility function to find the optimal dose of dulaglutide
for treating type 2 diabetes patients (Skrivanek et al. 2014). The trial combined four
safety and efficacy measures (glycosylated hemoglobin A1c versus sitagliptin,
weight, pulse, and diastolic blood pressure) into a clinical utility index (CUI), with
larger values indicating a more favorable profile. The trial computed posterior distri-
butions for CUI at various dose levels and recommended a dose based on these
distributions.
Summary and Conclusion

Inferential frameworks enable clinicians to conduct clinical trials and draw princi-
pled conclusions from the resulting data. Four essential aspects of inferential frame-
works are:
1. Assumptions relating to the data collection of the sample from the population:
Both frequentist and Bayesian methods will produce valid results when the
sample is selected in an unbiased manner. Random sampling of patients,
Cross-References
References
Agresti A, Franklin CA (2009) Statistics: the art and science of learning from data. Prentice Hall,
Upper Saddle River
Alexander BM, Cloughesy TF (2018) Platform trials arrive on time for glioblastoma. Oxford
University Press US
Alexander BM et al (2018) Adaptive global innovative learning environment for glioblastoma:
GBM AGILE. Clin Cancer Res 24(4):737–743
Barker A et al (2009) I-SPY 2: an adaptive breast cancer trial design in the setting of neoadjuvant
chemotherapy. Clin Pharmacol Ther 86(1):97–100
Bayes T (1763) LII. An essay towards solving a problem in the doctrine of chances. By the late Rev.
Mr. Bayes, FRS communicated by Mr. Price, in a letter to John Canton, AMFR S. Philos Trans
R Soc Lond 53:370–418
Berger JO (1985) Statistical decision theory and Bayesian analysis. Springer Science & Business
Media
Berger JO (2003) Could fisher, Jeffreys and Neyman have agreed on testing? Stat Sci 18(1):1–32
Berger JO, Wolpert RL (1988) The likelihood principle. IMS
Berry DA (2015) The brave New World of clinical cancer research: adaptive biomarker-driven trials
integrating clinical practice with clinical research. Mol Oncol 9(5):951–959
Berry SM et al (2010) Bayesian adaptive methods for clinical trials. CRC press
Berry SM et al (2015) The platform trial: an efficient strategy for evaluating multiple treatments.
JAMA 313(16):1619–1620
Biswas S et al (2009) Bayesian clinical trials at the University of Texas MD Anderson cancer center.
Clin Trials 6(3):205–216
Carpenter B et al (2017) Stan: a probabilistic programming language. J Stat Softw 76(1)
Casella G, Berger RL (2002) Statistical inference. Duxbury Pacific Grove, Belmont
Chen F (2009) Bayesian modeling using the MCMC procedure. Proceedings of the SAS Global
Forum 2008 Conference. SAS Institute Inc., Cary
Gelman A et al (2013) Bayesian data analysis. Chapman and Hall/CRC
Goodman SN (1999) Toward evidence-based medical statistics. 2: the Bayes factor. Ann Intern Med
130(12):1005–1013
Herbst RS et al (2015) Lung Master Protocol (Lung-MAP) – a biomarker-driven protocol for
accelerating development of therapies for squamous cell lung cancer: SWOG S1400. Clin
Cancer Res 21(7):1514–1524
Hobbs BP, Landin R (2018) Bayesian basket trial design with exchangeability monitoring. Stat Med
37(25):3557–3572
Hobbs BP et al (2018) Controlled multi-arm platform design using predictive probability. Stat
Methods Med Res 27(1):65–78
Hyman DM et al (2015) Vemurafenib in multiple nonmelanoma cancers with BRAF V600
mutations. N Engl J Med 373(8):726–736
Jeffreys H (1946) An invariant form for the prior probability in estimation problems. Proc R Soc
Lond A Math Phys Sci 186(1007):453–461
Johnson VE (2013) Revised standards for statistical evidence. Proc Natl Acad Sci 110(48):19313–
19317
Johnson VE, Cook JD (2009) Bayesian design of single-arm phase II clinical trials with continuous
monitoring. Clin Trials 6(3):217–226
Jüni P et al (2001) Assessing the quality of controlled clinical trials. BMJ 323(7303):42–46
Kass RE, Raftery AE (1995) Bayes factors. J Am Stat Assoc 90(430):773–795
Kim ES et al (2011) The BATTLE trial: personalizing therapy for lung cancer. Cancer Discov 1
(1):44–53
Le Tourneau C et al (2009) Dose escalation methods in phase I cancer clinical trials. J Natl Cancer
Inst 101(10):708–720
Lee JJ, Chu CT (2012) Bayesian clinical trials in action. Stat Med 31(25):2955–2972
Lee JJ, Liu DD (2008) A predictive probability design for phase II cancer clinical trials. Clin Trials
5(2):93–106
Lin Y, Shih WJ (2001) Statistical properties of the traditional algorithm-based designs for phase I
cancer clinical trials. Biostatistics 2(2):203–215
Little RJ (2006) Calibrated Bayes: a Bayes/frequentist roadmap. Am Stat 60(3):213–223
Liu S, Lee JJ (2015) An overview of the design and conduct of the BATTLE trials. Chin Clin Oncol
4(3)
Liu S, Yuan Y (2015) Bayesian optimal interval designs for phase I clinical trials. J R Stat Soc Ser C
Appl Stat 64(3):507–523
Mandrekar SJ et al (2015) Improving clinical trial efficiency: thinking outside the box. American
Society of Clinical Oncology educational book. American Society of Clinical Oncology. Annual
Meeting
Mauri L, D’Agostino RB Sr (2017) Challenges in the design and interpretation of noninferiority
trials. N Engl J Med 377(14):1357–1367
Mossman D, Berger JO (2001) Intervals for posttest probabilities: a comparison of 5 methods. Med
Decis Mak 21(6):498–507
Mullard A (2015) NCI-MATCH trial pushes cancer umbrella trial paradigm. Nature Publishing
Group
Murray TA et al (2016) Utility-based designs for randomized comparative trials with categorical
outcomes. Stat Med 35(24):4285–4305
O’Quigley J, Chevret S (1991) Methods for dose finding studies in cancer clinical trials: a review
and results of a Monte Carlo study. Stat Med 10(11):1647–1664
O’Quigley J et al (1990) Continual reassessment method: a practical design for phase 1 clinical
trials in cancer. Biometrics 46(1):33–48
Plummer M (2003) JAGS: a program for analysis of Bayesian graphical models using Gibbs
sampling. In: Proceedings of the 3rd international workshop on distributed statistical computing.
Austria, Vienna
Redig AJ, Jänne PA (2015) Basket trials and the evolution of clinical trial design in an era of
genomic medicine. J Clin Oncol 33(9):975–977
Redman MW, Allegra CJ (2015) The master protocol concept. Seminars in oncology. Elsevier
Renfro L, Sargent D (2016) Statistical controversies in clinical research: basket trials, umbrella
trials, and other master protocols: a review and examples. Ann Oncol 28(1):34–43
Rosenbaum PR, Rubin DB (1984) Sensitivity of Bayes inference with data-dependent stopping
rules. Am Stat 38(2):106–109
Simon R (1989) Optimal two-stage designs for phase II clinical trials. Control Clin Trials 10
(1):1–10
Simon R (2017) Critical review of umbrella, basket, and platform designs for oncology clinical
trials. Clin Pharmacol Ther 102(6):934–941
Simon R et al (2016) The Bayesian basket design for genomic variant-driven phase II trials.
Seminars in oncology. Elsevier
Skrivanek Z et al (2014) Dose-finding results in an adaptive, seamless, randomized trial of once-
weekly dulaglutide combined with metformin in type 2 diabetes patients (AWARD-5). Diabetes
Obes Metab 16(8):748–756
Smith TL et al (1996) Design and results of phase I cancer clinical trials: three-year experience at
MD Anderson Cancer Center. J Clin Oncol 14(1):287–295
Spiegelhalter DJ et al (1996) BUGS: Bayesian inference using Gibbs sampling, version 0.5
(version ii). http://www.mrc-bsu.cam.ac.uk/bugs
Spiegelhalter DJ et al (2004) Bayesian approaches to clinical trials and health-care evaluation.
Wiley
Storer BE (1989) Design and analysis of phase I clinical trials. Biometrics 45(3):925–937
Thall PF et al (1995) Bayesian sequential monitoring designs for single-arm clinical trials with
multiple outcomes. Stat Med 14(4):357–379
Tidwell RSS et al (2019) Bayesian clinical trials at The University of Texas MD Anderson Cancer
Center: an update. Clin Trials:1740774519871471
Wasserstein RL, Lazar NA (2016) The ASA’s statement on p-values: context, process, and purpose.
Am Stat 70(2):129–133
Wasserstein RL et al (2019) Moving to a world beyond “p< 0.05”. Taylor & Francis
Wilson EB (1927) Probable inference, the law of succession, and statistical inference. J Am Stat
Assoc 22(158):209–212
Woodcock J, LaVange LM (2017) Master protocols to study multiple therapies, multiple diseases,
or both. N Engl J Med 377(1):62–70
Zhou X et al (2008) Bayesian adaptive design for targeted therapy development in lung cancer – a
step toward personalized medicine. Clin Trials 5(3):181–193
Zhou H et al (2017) BOP2: bayesian optimal design for phase II clinical trials with simple and
complex endpoints. Stat Med 36(21):3302–3314
Dose Finding for Drug Combinations
55
Mourad Tighiouart
Contents
Introduction 1004
Dose Finding to Estimate the Maximum Tolerated Dose Curve 1006
  Model 1006
  Trial Design 1007
  Operating Characteristics 1008
  Application to the CisCab Trial 1009
Attributable Toxicity 1011
  Dose-Toxicity Model 1012
  Dose Allocation Algorithm 1013
  Simulation Studies 1015
Phase I/II Dose Finding 1017
  Stage I 1017
  Stage II 1017
Discrete Dose Combinations 1023
  Illustration 1023
Summary and Conclusion 1025
Cross-References 1028
References 1028
Abstract
We present early phase cancer clinical trial designs for drug combinations
focusing on continuous dose levels. For phase I trials, the goal is to estimate
the maximum tolerated dose (MTD) curve in the two-dimensional Cartesian
plane. Parametric models are used to describe the relationship between the
doses of the two agents and the probability of dose limiting toxicity (DLT).
Trial design proceeds using cohorts of two patients receiving doses according
to univariate escalation with overdose control (EWOC) or the continual reassessment
method (CRM). The maximum tolerated dose curve is estimated as a function of
Bayes estimates of the model parameters. In the case where some DLTs can be
attributed to one agent but not the other, we describe how these parametric
designs can be extended to account for an unknown fraction of attributable
DLTs. For treatments where efficacy is resolved after a few cycles of therapy, it is
standard practice to perform single-arm or randomized phase II trials using the MTD(s)
obtained from a phase I trial. In our setting, we show how the MTD curve is
carried forward into a phase II trial where patients are allocated to doses likely to have
high probability of treatment efficacy using a Bayesian adaptive design. The
methodology is illustrated with an application to an early phase trial of cisplatin
and cabazitaxel in advanced stage prostate cancer patients with visceral metastasis.
Finally, we describe how these methods are adapted to the case of a pre-specified
set of discrete dose combinations.
Keywords
Dose finding · Drug combinations · MTD · DLT · EWOC · CRM · Adaptive
designs · Attributable toxicity · Efficacy · Cubic splines
Introduction
Early phase cancer clinical trials are small studies aimed at identifying tolerable
doses with promising signal for efficacy. These trials use drug combinations of
cytotoxic, biologic, immunotherapy, and/or radiotherapy agents to better target
different signaling pathways simultaneously and reduce potential tumor resistance
to chemo- or targeted therapy. However, most of these trials are designed to estimate
the maximum tolerated dose (MTD) of a single agent for fixed dose levels of the
other agents. This approach may provide a single safe dose for the combination, but
it may be suboptimal in terms of therapeutic effects. Statistical designs that allow
more than one drug to vary during the trial have been studied extensively in the last
decade (see, e.g., Thall et al. 2003; Wang and Ivanova 2005; Yin and Yuan 2009a, b;
Braun and Wang 2010; Wages et al. 2011; Shi and Yin 2013; Tighiouart et al. 2014b,
2016, 2017b; Riviere et al. 2014; Mander and Sweeting 2015). Some of these
designs are aimed at identifying a single MTD, whereas others can recommend
more than one MTD combination and even an infinite number of MTDs (Tighiouart
et al. 2014b, 2016, 2017b). Most of these methods use a parametric model for the
dose-toxicity relationship

$$\mathrm{Prob}(T = 1 \mid \mathbf{x}) = F(\mathbf{x}, \xi), \qquad (1)$$

where x = (x1, . . . , xk) is the dose combination of k drugs, F is a known link function, T is the indicator of dose limiting toxicity (DLT), and ξ ∈ ℝ^d is an unknown parameter. Let S be the set of all dose combinations available in the trial. The MTD is defined as the set C of dose combinations x such that the probability of DLT for a patient given dose combination x equals a target probability of DLT θ:

$$C = \{\mathbf{x} \in S : F(\mathbf{x}, \xi) = \theta\}. \qquad (2)$$
An alternative definition of the MTD is the set of dose combinations x that satisfy |F(x, ξ) − θ| ≤ δ, since the set C in (2) may be empty. This can happen, for example, when S is finite and the MTD is not among the dose combinations available in the trial. The threshold parameter δ is referred to as the 100δ point window in Braun and Wang (2010) and is pre-specified by the clinician. In general, the above methods proceed by treating successive cohorts of patients, with dose escalation starting from the lowest dose combination; the model parameters and estimated probabilities of toxicity are sequentially updated. Dose allocation to the next cohort of patients is
carried out by minimizing the risk of exceeding the target probability of DLT θ
according to some loss function. In section “Dose Finding to Estimate the Maximum
Tolerated Dose Curve” of this chapter, we present a drug combination design based
on escalation with overdose control (EWOC) (Babb et al. 1998; Tighiouart et al.
2005, 2012a, b, 2014a, 2017a; Tighiouart and Rogatko 2010, 2012; Chen et al.
2012a; Wheeler et al. 2017; Diniz et al. 2019). We will focus on combinations of
two agents with continuous dose levels, where the goal of the trial is to estimate the
MTD curve. The design proceeds by treating consecutive cohorts of two patients
receiving different dose combinations determined using univariate EWOC. In section “Attributable Toxicity,” we extend model (1) to account for an unknown fraction of attributable DLTs. This may arise when combining drugs with different mechanisms of action such as Taxotere and metformin. The design is similar to the one in
section “Dose Finding to Estimate the Maximum Tolerated Dose Curve” except that
the estimated doses for the next cohort of patients use the continual reassessment
method criteria (CRM) (O’Quigley et al. 1990; Faries 1994; Goodman et al. 1995;
O’Quigley and Shen 1996; Piantadosi et al. 1998).
In section “Phase I/II Dose Finding” of this chapter, we show how the estimated
MTD curve from a phase I trial can be used in a phase II study with the goal of
determining a dose combination along the MTD curve with maximum probability of
efficacy. This setting corresponds to a phase I/II cancer clinical trial design where the
MTD is first determined in a phase I trial and then is used in a phase II trial to
evaluate treatment efficacy. Such situations occur when response evaluation takes a few cycles of therapy or when the phase I and II patient populations are different. In the
case where both toxicity and efficacy are resolved within one or two cycles of
therapy, sequential designs that update the probabilities of DLT and efficacy are
used instead, and the goal is to determine a tolerable dose combination with
maximum probability of treatment response. Finally, we show how the methods
described in sections “Dose Finding to Estimate the Maximum Tolerated Dose
Curve” and “Attributable Toxicity” can be adapted to the setting of a discrete set
of dose combinations in section “Discrete Dose Combinations.” Properties of these
designs are evaluated by presenting operating characteristics derived under a large
number of practical scenarios. For phase I trials, summary statistics of safety and
precision of the estimate of the MTD curve are calculated. For the phase II trial,
Bayesian power and type I error probabilities are provided under scenarios favoring
the alternative and null hypotheses, respectively.
Dose Finding to Estimate the Maximum Tolerated Dose Curve

Model
The dose-toxicity relationship is modeled as

$$\mathrm{Prob}(T = 1 \mid x, y) = F\left(\eta_0 + \eta_1 x + \eta_2 y + \eta_3 x y\right), \qquad (3)$$

where T is the indicator of DLT, T = 1 if a patient given the dose combination (x, y) exhibits DLT within one cycle of therapy and T = 0 otherwise, x ∈ [Xmin, Xmax] is the dose level of agent A1, y ∈ [Ymin, Ymax] is the dose level of agent A2, and F is a known cumulative distribution function. Here, Xmin, Xmax and Ymin, Ymax are the lower and upper bounds of the continuous dose levels of agents A1 and A2, respectively. The doses of agents A1 and A2 are standardized to lie in the interval [0, 1] using the transformations h1(x) = (x − Xmin)/(Xmax − Xmin) and h2(y) = (y − Ymin)/(Ymax − Ymin), and the interaction parameter η3 > 0.
We will assume that the probability of DLT increases with the dose of either agent when the other is held constant. A necessary and sufficient condition for this property to hold is η1, η2 > 0. The MTD is defined as any dose combination (x, y) such that

$$\mathrm{Prob}(T = 1 \mid x, y) = \theta. \qquad (4)$$
The target probability of DLT θ is set relatively high when the DLT is a reversible
or nonfatal condition and low when it is life threatening. We reparameterize model
(3) in terms of parameters clinicians can easily interpret. One way is to use ρ10, the
probability of DLT when the levels of drugs A1 and A2 are 1 and 0, respectively; ρ01,
the probability of DLT when the levels of drugs A1 and A2 are 0 and 1, respectively;
and ρ00, the probability of DLT when the levels of drugs A1 and A2 are both 0. It can
be shown that
$$\eta_0 = F^{-1}(\rho_{00}), \qquad \eta_1 = F^{-1}(\rho_{10}) - F^{-1}(\rho_{00}), \qquad \eta_2 = F^{-1}(\rho_{01}) - F^{-1}(\rho_{00}). \qquad (5)$$
Using (3), the definition of the MTD in (4), and reparameterization (5), we obtain
the MTD curve C as a function of the model parameters ρ00, ρ01, ρ10, and η3 and
target probability of DLT θ as
$$C = \left\{ (x, y) : y = \frac{F^{-1}(\theta) - F^{-1}(\rho_{00}) - \left[F^{-1}(\rho_{10}) - F^{-1}(\rho_{00})\right] x}{F^{-1}(\rho_{01}) - F^{-1}(\rho_{00}) + \eta_3 x} \right\}. \qquad (6)$$
This reparameterization allows the MTD curve to lie anywhere within the dose range [Xmin, Xmax] × [Ymin, Ymax]. If there is strong a priori belief that Γ_{A1|A2=0}, the MTD of drug A1 when the level of drug A2 is equal to Ymin, is in the interval [Xmin, Xmax] and that Γ_{A2|A1=0}, the MTD of drug A2 when the level of drug A1 is equal to Xmin, is in the interval [Ymin, Ymax], then the reparameterization (ρ00, Γ_{A1|A2=0}, Γ_{A2|A1=0}, η3) is more convenient (see Tighiouart et al. 2014b for more details on this reparameterization).
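As a concrete illustration of Eq. (6), the following R sketch computes the MTD curve under a logistic link F (the link used later in the CisCab trial). The function name and the parameter values in the example call are illustrative, not values from any particular trial.

# Sketch of the MTD curve in Eq. (6) under a logistic link F;
# doses are standardized to [0, 1], and y-values falling outside
# [0, 1] lie outside the dose range.
logit <- function(p) log(p / (1 - p))   # F^{-1} for the logistic link

mtd_curve <- function(x, rho00, rho01, rho10, eta3, theta) {
  num <- logit(theta) - logit(rho00) - (logit(rho10) - logit(rho00)) * x
  den <- logit(rho01) - logit(rho00) + eta3 * x
  num / den
}

x <- seq(0, 1, by = 0.01)
y <- mtd_curve(x, rho00 = 0.05, rho01 = 0.3, rho10 = 0.3, eta3 = 10, theta = 0.33)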
A prior distribution on the model parameters is placed as follows: ρ01, ρ10, and η3 are independent a priori with ρ01 ~ beta(a1, b1), ρ10 ~ beta(a2, b2), and, conditional on (ρ01, ρ10), ρ00/min(ρ01, ρ10) ~ beta(a3, b3). The prior distribution on the interaction parameter η3 is a gamma with mean a/b and variance a/b². If Dk = {(xi, yi, Ti), i = 1, . . . , k} is
the data after enrolling k patients to the trial, the posterior distribution of the model
parameters is
$$\pi(\rho_{00}, \rho_{01}, \rho_{10}, \eta_3 \mid D_k) \propto \pi(\rho_{00}, \rho_{01}, \rho_{10}, \eta_3) \prod_{i=1}^{k} \left[G(\rho_{00}, \rho_{01}, \rho_{10}, \eta_3; x_i, y_i)\right]^{T_i} \left[1 - G(\rho_{00}, \rho_{01}, \rho_{10}, \eta_3; x_i, y_i)\right]^{1 - T_i},$$

where

$$G(\rho_{00}, \rho_{01}, \rho_{10}, \eta_3; x_i, y_i) = F\left(F^{-1}(\rho_{00}) + \left[F^{-1}(\rho_{10}) - F^{-1}(\rho_{00})\right] x_i + \left[F^{-1}(\rho_{01}) - F^{-1}(\rho_{00})\right] y_i + \eta_3 x_i y_i\right). \qquad (7)$$
Features of the posterior distribution are estimated using WinBUGS (Lunn et al.
2000) and JAGS (Plummer 2003).
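For readers working outside WinBUGS or JAGS, the unnormalized log-posterior implied by Eq. (7) can also be coded directly. The following R sketch assumes a logistic link; the hyperparameter defaults are placeholders rather than values from any particular trial.

# Unnormalized log-posterior for (rho00, rho01, rho10, eta3);
# x, y, t are vectors of standardized doses and DLT indicators.
logit <- function(p) log(p / (1 - p))
inv_logit <- function(u) 1 / (1 + exp(-u))

G <- function(rho00, rho01, rho10, eta3, x, y) {   # Eq. (7)
  inv_logit(logit(rho00) + (logit(rho10) - logit(rho00)) * x +
            (logit(rho01) - logit(rho00)) * y + eta3 * x * y)
}

log_post <- function(rho00, rho01, rho10, eta3, x, y, t,
                     a1 = 1, b1 = 1, a2 = 1, b2 = 1, a3 = 1, b3 = 1,
                     a = 1, b = 1) {
  p <- G(rho00, rho01, rho10, eta3, x, y)
  loglik <- sum(t * log(p) + (1 - t) * log(1 - p))
  m <- min(rho01, rho10)
  logprior <- dbeta(rho01, a1, b1, log = TRUE) +
              dbeta(rho10, a2, b2, log = TRUE) +
              dbeta(rho00 / m, a3, b3, log = TRUE) - log(m) +  # Jacobian of scaling
              dgamma(eta3, shape = a, rate = b, log = TRUE)
  loglik + logprior
}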
Trial Design
(i) The first two patients receive the same dose combination (x1, y1) = (x2, y2) = (0, 0), and let D2 = {(x1, y1, T1), (x2, y2, T2)}.
(ii) In the second cohort, patients 3 and 4 receive doses (x3, y3) and (x4, y4), respectively, where y3 = y1, x4 = x2, x3 is the α-th percentile of π(Γ_{A1|A2=y1} | D2), and y4 is the α-th percentile of π(Γ_{A2|A1=x2} | D2). Here, π(Γ_{A1|A2=y1} | D2) is the posterior distribution of the MTD of drug A1 given that the level of drug A2 is y1, given the data D2.
(iii) In the i-th cohort of two patients, if i is even, then patient (2i − 1) receives dose (x_{2i−1}, y_{2i−3}) and patient 2i receives dose (x_{2i−2}, y_{2i}), where x_{2i−1} = Π^{−1}_{Γ_{A1|A2=y_{2i−3}}}(α | D_{2i−2}) and y_{2i} = Π^{−1}_{Γ_{A2|A1=x_{2i−2}}}(α | D_{2i−2}). If i is odd, then patient (2i − 1) receives dose (x_{2i−3}, y_{2i−1}) and patient 2i receives dose (x_{2i}, y_{2i−2}), where y_{2i−1} = Π^{−1}_{Γ_{A2|A1=x_{2i−3}}}(α | D_{2i−2}) and x_{2i} = Π^{−1}_{Γ_{A1|A2=y_{2i−2}}}(α | D_{2i−2}). Here, Π^{−1}_{Γ_{A1|A2=y}}(α | D) denotes the inverse cdf of the posterior distribution π(Γ_{A1|A2=y} | D).
(iv) Repeat step (iii) until N patients are enrolled to the trial subject to the following
stopping rule.
Stopping Rule
Enrollment to the trial is suspended for safety if P(P(T = 1 | (x, y) = (0, 0)) > θ + ξ1 | data) > ξ2, i.e., if the posterior probability that the probability of DLT at the minimum available dose combination in the trial exceeds the target probability of DLT is high. The design parameters ξ1 and ξ2 are chosen to achieve desirable model operating characteristics.
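Given an MCMC sample from the posterior, the EWOC dose in steps (ii) and (iii) is simply a posterior quantile of the conditional MTD. A minimal R sketch follows, assuming a logistic link and a list `draws` of posterior draws of (ρ00, ρ01, ρ10, η3), e.g., returned by JAGS; the feasibility bound α = 0.25 is a typical but illustrative choice.

# Gamma_{A1|A2=y}: the MTD of agent A1 given A2 = y, obtained for each
# posterior draw by solving G(.; x, y) = theta for x.
logit <- function(p) log(p / (1 - p))

gamma_a1 <- function(draws, y, theta) {
  num <- logit(theta) - logit(draws$rho00) -
         (logit(draws$rho01) - logit(draws$rho00)) * y
  den <- logit(draws$rho10) - logit(draws$rho00) + draws$eta3 * y
  pmin(pmax(num / den, 0), 1)   # truncate to the standardized dose range
}

# EWOC dose for the next patient: the alpha-th posterior percentile
next_dose_a1 <- function(draws, y, theta, alpha = 0.25) {
  quantile(gamma_a1(draws, y, theta), probs = alpha, names = FALSE)
}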
At the end of the trial, we estimate the MTD curve using (6) as
$$C_{\mathrm{est}} = \left\{ (x, y) : y = \frac{F^{-1}(\theta) - F^{-1}(\hat{\rho}_{00}) - \left[F^{-1}(\hat{\rho}_{10}) - F^{-1}(\hat{\rho}_{00})\right] x}{F^{-1}(\hat{\rho}_{01}) - F^{-1}(\hat{\rho}_{00}) + \hat{\eta}_3 x} \right\}, \qquad (8)$$

where ρ̂00, ρ̂01, ρ̂10, and η̂3 are the posterior medians given the data DN.
Operating Characteristics
The performance of this design is evaluated for a prospective trial by assessing the
safety of the trial and efficiency of the estimated MTD curve under various plausible
scenarios elicited by the clinician in collaboration with the statistician.
Safety
For trial safety, the percent of DLTs across all patients and all simulated trials is
reported in addition to the percent of trials with an excessive DLT rate, for example,
greater than θ + 0.1. The latter is an estimate of the probability that a prospective trial
will result in a high rate of DLTs for a given scenario.
Efficiency
Uncertainty about the estimated MTD curve is evaluated by the pointwise average bias and percent selection. For i = 1, . . . , m, let Ci be the estimated MTD curve and Ctrue be the true MTD curve, where m is the number of simulated trials. For every point (x, y) ∈ Ctrue, let

$$d^{(i)}_{(x,y)} = \mathrm{sign}(y' - y) \min_{\{(x^*, y^*) \in C_i\}} \left[ (x - x^*)^2 + (y - y^*)^2 \right]^{1/2}, \qquad (9)$$
where y′ is such that (x, y′) ∈ Ci. This is the signed minimum distance of the point (x, y) on the true MTD curve to the estimated MTD curve Ci. Let

$$\bar{d}_{(x,y)} = \frac{1}{m} \sum_{i=1}^{m} d^{(i)}_{(x,y)}. \qquad (10)$$

Equation (10) can be interpreted as the pointwise average bias in estimating the MTD.
Let Δ(x, y) be the Euclidean distance between the minimum dose combination (0, 0) and the point (x, y) on the true MTD curve, and let 0 < p < 1. Let

$$P_{(x,y)} = \frac{1}{m} \sum_{i=1}^{m} I\left( \left|d^{(i)}_{(x,y)}\right| \le p\,\Delta(x, y) \right). \qquad (11)$$
This is the pointwise percent of trials for which the minimum distance of the point (x, y) on the true MTD curve to the estimated MTD curve Ci is no more than 100p% of the true MTD. This statistic is equivalent to drawing a circle with center (x, y) on the true MTD curve and radius pΔ(x, y) and calculating the percent of trials with MTD curve estimate Ci falling inside the circle. This gives the percent of trials with MTD recommendation within 100p% of the true MTD for a given tolerance p and is interpreted as the pointwise percent selection for that tolerance.
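A sketch of how Eqs. (9), (10), and (11) might be computed in R. It assumes each estimated MTD curve is stored as a data frame of (x, y) points sampled along the curve; the function and column names are illustrative.

# Pointwise average bias (Eq. 10) and percent selection (Eq. 11) at a
# point (x, y) on the true MTD curve; est_curves is a list of data frames.
pointwise_stats <- function(x, y, est_curves, p = 0.1) {
  d <- sapply(est_curves, function(C) {
    yprime <- approx(C$x, C$y, xout = x, rule = 2)$y   # (x, y') on C_i, Eq. (9)
    sign(yprime - y) * min(sqrt((x - C$x)^2 + (y - C$y)^2))
  })
  Delta <- sqrt(x^2 + y^2)   # distance from the minimum dose (0, 0)
  list(bias = mean(d), pct_selection = mean(abs(d) <= p * Delta))
}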
Application to the CisCab Trial

The algorithm described in section “Trial Design” was used to design the first part of a
phase I/II trial of the combination cisplatin and cabazitaxel in patients with prostate
cancer with visceral metastasis. A recently published phase I trial of this combination
by Lockhart et al. (2014) identified the MTD of cabazitaxel/cisplatin as 15/75 mg/m². This trial used a “3 + 3” design exploring three pre-specified dose levels 15/75, 20/75, and 25/75. In part 1 of the trial, nine patients were evaluated for safety, and no DLT was observed at 15/75 mg/m². In part 2 of the study, 15 patients were treated at 15/75 mg/m², and 2 DLTs were observed. Based on these results and other preliminary
efficacy data, it was hypothesized that there exists a series of active dose combinations
which are tolerable and active in prostate cancer. Cabazitaxel dose levels were selected in the interval [10, 25] mg/m², and cisplatin dose levels were selected in the interval [50, 100] mg/m², administered intravenously. The plan is to enroll N = 30 patients and
estimate the MTD curve. The target probability of DLT is θ = 0.33, and a logistic link function for F(·) in (3) was used. DLT is resolved within one cycle (3 weeks) of
treatment. Although the algorithm dictates that the first two patients receive dose
combination 10/50 mg/m², the clinician Dr. Posadas preferred to start with 15 mg/m² cabazitaxel and 75 mg/m² cisplatin since this combination was tolerable based on the
results of the published phase I trial and a number of patients he treated at this
combination. The prior distributions were calibrated so that the prior mean probability of DLT at the dose combination 15/75 mg/m² equals the target probability of DLT. Specifically, informative priors were used for the model parameters ρ01, ρ10 ~ beta(1.4, 5.6) and, conditional on (ρ01, ρ10), ρ00/min(ρ01, ρ10) ~ beta(0.8, 7.2), and a vague prior for η3 with mean 20 and variance 540 was used so that E(P(DLT | (15, 75))) ≈ 0.33 a priori. Operating characteristics were derived by simulating
m = 2000 trial replicates under various scenarios for the true MTD curve. Figure 1 shows the true and estimated MTD curves obtained using (6) with the parameters ρ00, ρ01, ρ10, and η3 replaced by their posterior medians averaged across all 2000 simulated
trials. Scenario A shown on the left panel of Fig. 1 is a case where the true MTD curve
passes through a point very close to the dose combination (15, 75) identified as the
MTD from the previous trial. Scenario B shown on the right panel is a case where the
MTD curve is way above this dose combination. In each case, the estimated MTD
curves are very close to the true MTD curves. This is also evidenced by the pointwise
bias and percent selection (graphs included in the supplement). The trial was also safe, since the percent of trials with DLT rate above θ + 0.1 was 3.5% for the scenario on the left and 5.0% for the scenario on the right.
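The prior calibration described above is easy to check by Monte Carlo. The following R sketch draws from the stated priors and evaluates the induced prior mean probability of DLT at 15/75 mg/m² (standardized doses 1/3 and 1/2); by the calibration, the result should be near 0.33.

logit <- function(p) log(p / (1 - p))
inv_logit <- function(u) 1 / (1 + exp(-u))
set.seed(1)
M <- 1e5
rho01 <- rbeta(M, 1.4, 5.6)
rho10 <- rbeta(M, 1.4, 5.6)
rho00 <- pmin(rho01, rho10) * rbeta(M, 0.8, 7.2)
eta3  <- rgamma(M, shape = 20^2 / 540, rate = 20 / 540)   # mean 20, variance 540
x <- (15 - 10) / (25 - 10)    # standardized cabazitaxel dose
y <- (75 - 50) / (100 - 50)   # standardized cisplatin dose
p_dlt <- inv_logit(logit(rho00) + (logit(rho10) - logit(rho00)) * x +
                   (logit(rho01) - logit(rho00)) * y + eta3 * x * y)
mean(p_dlt)   # prior mean P(DLT | (15, 75))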
Figures 2 and 3 show the pointwise average bias and percent selection for tolerances p = 0.05, 0.1 under scenarios A and B. In both cases, the absolute average bias is less than 0.05, which corresponds to 5% of the standardized dose range of either agent. We conclude that the pointwise average bias is practically negligible. The pointwise percent selection when p = 0.05 varies between 40% and 90% under scenario A for most doses and between 60% and 75% under scenario B. These are reasonable percent selections, comparable to dose combination phase I trials. Other scenarios were included in the clinical protocol.
Fig. 1 True and estimated MTD curves under two different scenarios for the MTD curve. The gray diamonds represent the last dose combination from each simulated trial along with a 90% confidence region
Fig. 2 Pointwise average bias (left) and percent selection (right) under scenario A
Fig. 3 Pointwise average bias (left) and percent selection (right) under scenario B
Attributable Toxicity
In section “Dose Finding to Estimate the Maximum Tolerated Dose Curve” and Eq.
(3), a DLT event is assumed to be caused by drug A1 or drug A2, or both. In some
applications, some DLTs can be attributed to one agent but not the other. For
example, in a drug combination trial of Taxotere, a known cytotoxic agent, and
metformin, a diabetes drug, in advanced or metastatic breast cancer patients, the
clinician expects that some DLTs can be attributable to either agent or both. For
example, a grade 3 or 4 neutropenia can only be attributable to Taxotere and not
metformin. In this section, we present a dose combination trial design that accounts
for an unknown fraction of attributable DLTs.
Dose-Toxicity Model
Let Fα(·) and Fβ(·) be parametric models for the probability of DLT of drugs A1 and A2, respectively. We specify the joint dose-toxicity relationship using the Gumbel copula model (see Murtaugh and Fisher 1990) as

$$\pi(\delta_1, \delta_2 \mid x, y) = F_\alpha(x)^{\delta_1}\left(1 - F_\alpha(x)\right)^{1-\delta_1} F_\beta(y)^{\delta_2}\left(1 - F_\beta(y)\right)^{1-\delta_2} + (-1)^{\delta_1 + \delta_2} F_\alpha(x)\left(1 - F_\alpha(x)\right) F_\beta(y)\left(1 - F_\beta(y)\right) \frac{e^{\gamma} - 1}{e^{\gamma} + 1}, \qquad (12)$$
where x and y are the standardized dose levels of drugs A1 and A2, respectively, δ1 and δ2 are the binary indicators of DLT attributed to drugs A1 and A2, respectively, and γ is the interaction coefficient. Similar to section “Dose Finding to Estimate the Maximum Tolerated Dose Curve,” we assume that the probability of DLT π = 1 − π_{(δ1=0, δ2=0)} increases with the dose of either agent when the other is held constant. A sufficient condition for this property to hold is that Fα(·) and Fβ(·) are increasing functions with α > 0 and β > 0. We take Fα(x) = x^α and Fβ(y) = y^β. Using (12), if the DLT is attributed exclusively to drug A1, then

$$\pi_{(\delta_1=1, \delta_2=0)} = x^{\alpha}\left(1 - y^{\beta}\right) - x^{\alpha}\left(1 - x^{\alpha}\right) y^{\beta}\left(1 - y^{\beta}\right) \frac{e^{\gamma} - 1}{e^{\gamma} + 1}. \qquad (13)$$
If the DLT is attributed exclusively to drug A2, then

$$\pi_{(\delta_1=0, \delta_2=1)} = y^{\beta}\left(1 - x^{\alpha}\right) - x^{\alpha}\left(1 - x^{\alpha}\right) y^{\beta}\left(1 - y^{\beta}\right) \frac{e^{\gamma} - 1}{e^{\gamma} + 1}. \qquad (14)$$
If the DLT is attributed to both drugs A1 and A2, then

$$\pi_{(\delta_1=1, \delta_2=1)} = x^{\alpha} y^{\beta} + x^{\alpha}\left(1 - x^{\alpha}\right) y^{\beta}\left(1 - y^{\beta}\right) \frac{e^{\gamma} - 1}{e^{\gamma} + 1}. \qquad (15)$$
Equation (13) represents the probability that A1 causes a DLT and drug A2 does
not cause a DLT. This can happen, for example, when a type of DLT of Taxotere,
such as grade 4 neutropenia, is observed. However, this type of DLT can never be
observed with metformin. This can also happen when the clinician attributes a grade
4 diarrhea to Taxotere but not to metformin in the case of a low dose level of the latter, even though both drugs share this common type of side effect. The fact that dose level y is present in Eq. (13) is a result of the joint modeling of the two marginals and accounts for the probability that drug A2 does not cause a DLT. This latter case is, of course, based on the clinician's judgment. Equations (14) and (15) can be interpreted
similarly. The probability of DLT is
$$\pi = \mathrm{Prob}(\mathrm{DLT} \mid x, y) = \pi_{(\delta_1=1,\delta_2=0)} + \pi_{(\delta_1=0,\delta_2=1)} + \pi_{(\delta_1=1,\delta_2=1)} = x^{\alpha} + y^{\beta} - x^{\alpha} y^{\beta} - x^{\alpha}\left(1 - x^{\alpha}\right) y^{\beta}\left(1 - y^{\beta}\right) \frac{e^{\gamma} - 1}{e^{\gamma} + 1}. \qquad (16)$$
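A direct R transcription of Eqs. (13), (14), (15), and (16), with Fα(x) = x^α and Fβ(y) = y^β; the function names are illustrative.

# Joint DLT probabilities under the Gumbel copula model
copula_term <- function(x, y, a, b, g) {
  x^a * (1 - x^a) * y^b * (1 - y^b) * (exp(g) - 1) / (exp(g) + 1)
}
p10 <- function(x, y, a, b, g) x^a * (1 - y^b) - copula_term(x, y, a, b, g)   # Eq. (13)
p01 <- function(x, y, a, b, g) y^b * (1 - x^a) - copula_term(x, y, a, b, g)   # Eq. (14)
p11 <- function(x, y, a, b, g) x^a * y^b + copula_term(x, y, a, b, g)         # Eq. (15)
p_dlt <- function(x, y, a, b, g)                                              # Eq. (16)
  p10(x, y, a, b, g) + p01(x, y, a, b, g) + p11(x, y, a, b, g)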
The MTD is any dose combination (x, y) such that Prob(DLT | x, y) = θ. It follows that the MTD set C(α, β, γ) is

$$C(\alpha, \beta, \gamma) = \left\{ (x, y) : y = \left[ \frac{-(1 - x^{\alpha} - \kappa) + \sqrt{(1 - x^{\alpha} - \kappa)^2 - 4\kappa(x^{\alpha} - \theta)}}{2\kappa} \right]^{1/\beta} \right\}, \qquad (17)$$

where

$$\kappa = x^{\alpha}\left(1 - x^{\alpha}\right) \frac{e^{\gamma} - 1}{e^{\gamma} + 1}.$$
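Since Eq. (17) is available in closed form, the MTD set is cheap to evaluate; a short R sketch, taking the positive root of the quadratic in y^β (valid for 0 < x < 1 so that κ > 0):

mtd_y <- function(x, alpha, beta, gamma, theta) {
  kappa <- x^alpha * (1 - x^alpha) * (exp(gamma) - 1) / (exp(gamma) + 1)
  disc  <- (1 - x^alpha - kappa)^2 - 4 * kappa * (x^alpha - theta)
  u <- (-(1 - x^alpha - kappa) + sqrt(disc)) / (2 * kappa)   # u = y^beta
  u^(1 / beta)
}
# Check: with alpha = beta = 1, gamma = 1, theta = 0.3, mtd_y(0.1, 1, 1, 1, 0.3)
# returns about 0.23, and p_dlt(0.1, 0.23, 1, 1, 1) is about 0.3.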
Let T be the indicator of DLT, T = 1 if a patient treated at dose combination (x, y) experiences DLT within one cycle of therapy that is due to either drug or both, and T = 0 otherwise. Among patients treated with dose combination (x, y) who exhibit DLT, suppose that an unknown fraction η of these patients has a DLT with known attribution, i.e., the clinician knows if the DLT is caused by drug A1 only, drug A2 only, or both drugs A1 and A2. Let A be the indicator of DLT attribution when T = 1. It follows that for each patient treated with dose combination (x, y), there are five possible toxicity outcomes: {T = 0}, {T = 1, A = 0}, {T = 1, A = 1, δ1 = 1, δ2 = 0}, {T = 1, A = 1, δ1 = 0, δ2 = 1}, and {T = 1, A = 1, δ1 = 1, δ2 = 1}. Using Eqs. (13), (14), (15), and (16) and Fig. 4, the likelihood function is

$$L(\alpha, \beta, \gamma, \eta \mid \mathrm{data}) = \prod_{i=1}^{n} \left[ \left( \eta\, \pi_i^{(\delta_{1i}, \delta_{2i})} \right)^{A_i} \left( (1 - \eta)\, \pi_i \right)^{1 - A_i} \right]^{T_i} \left( 1 - \pi_i \right)^{1 - T_i}, \qquad (18)$$

where π_i^{(δ1i, δ2i)} is the attributed DLT probability in (13), (14), or (15) and π_i is the total DLT probability in (16), evaluated at (x_i, y_i).
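Reusing the probability functions from the sketch after Eq. (16), the log of the likelihood in Eq. (18) can be written in R as follows; the data layout (columns t, attr, d1, d2, with d1 = d2 = 0 when attribution is unknown) is illustrative.

log_lik <- function(alpha, beta, gamma, eta, data) {
  # data: x, y (standardized doses), t (DLT indicator),
  # attr (attribution known), d1, d2 (attribution indicators)
  pi_tot <- p_dlt(data$x, data$y, alpha, beta, gamma)
  pi_att <- ifelse(data$d1 == 1 & data$d2 == 0,
                   p10(data$x, data$y, alpha, beta, gamma),
            ifelse(data$d1 == 0 & data$d2 == 1,
                   p01(data$x, data$y, alpha, beta, gamma),
                   p11(data$x, data$y, alpha, beta, gamma)))
  sum(data$t * (data$attr * log(eta * pi_att) +
                (1 - data$attr) * log((1 - eta) * pi_tot)) +
      (1 - data$t) * log(1 - pi_tot))
}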
Features of the posterior distribution are estimated using JAGS (Plummer 2003).
Dose Allocation Algorithm

(i) The first two patients receive the same dose combination (Xmin, Ymin).
(ii) In the i-th cohort of two patients:
• If i is even, patient (2i − 1) receives dose combination (x_{2i−1}, y_{2i−1}), where x_{2i−1} = argmin_u |P̂(DLT | u, y_{2i−3}) − θ| and y_{2i−1} = y_{2i−3}. For ethical reasons, if a DLT was observed in the previous cohort of two patients and was attributable to drug A1, then x_{2i−1} is further restricted to be no more than x_{2i−3}. Patient 2i receives dose combination (x_{2i}, y_{2i}), where y_{2i} = argmin_v |P̂(DLT | x_{2i−2}, v) − θ| and x_{2i} = x_{2i−2}, with an analogous restriction if a DLT attributable to drug A2 was observed.
• If i is odd, the roles of the two agents are interchanged.
Here, we used univariate CRM instead of EWOC to estimate the next dose for
computational efficiency. A comparison of the two methods in the drug combination setting can be found in Diniz et al. (2017).
Simulation Studies
Dose levels of drugs A1 and A2 are standardized to be in the interval [0.05, 0.30], and
we consider three scenarios for the true MTD curve shown by the black dashed
curves in Fig. 5. We evaluate the effect of toxicity attribution in these three scenarios
using four different values for η: 0, 0.1, 0.25, and 0.4. These values are reasonable because higher values of η are rarely encountered in practice. Data are randomly generated as
follows. For a given dose combination (x, y), a binary indicator of DLT T is generated from a Bernoulli distribution with probability of success computed using Eq. (16). If {T = 1}, we generate the attribution outcome A from a Bernoulli distribution with probability of success η. If {T = 1, A = 1}, we attribute the DLT to drug A1, A2, or to both with equal probabilities.
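A sketch of this data-generating step in R, reusing p_dlt from the earlier sketch:

gen_patient <- function(x, y, alpha, beta, gamma, eta) {
  t <- rbinom(1, 1, p_dlt(x, y, alpha, beta, gamma))   # DLT from Eq. (16)
  a <- if (t == 1) rbinom(1, 1, eta) else 0            # attribution known?
  d <- c(0, 0)
  if (t == 1 && a == 1)
    d <- switch(sample(3, 1), c(1, 0), c(0, 1), c(1, 1))   # A1, A2, or both
  list(t = t, attr = a, d1 = d[1], d2 = d[2])
}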
We assume that the model parameters α, β, γ, and η are independent a priori. We assign vague prior distributions to α, β, and γ as in Yin and Yuan (2009a), with α ~ Uniform(0.2, 2), β ~ Uniform(0.2, 2), and γ ~ Gamma(0.1, 0.1). The prior distribution for the fraction of attributable toxicities η is Uniform(0, 1). The true parameter values for each scenario are as follows: in scenario 1, α = β = 0.9 and γ = 1; in scenario 2, α = β = 1.1 and γ = 1; and in scenario 3, α = β = 1.3 and γ = 1. For each scenario, m = 1000 trials were simulated. The target risk of toxicity is fixed at θ = 0.3, the sample size is n = 40, and the values of ξ1 and ξ2 are 0.05 and 0.8,
respectively. Figure 5 shows the estimated MTD curves for each scenario as a
function of η. In general, increasing η toward 0.4 yields estimated MTD curves closer to the true MTD curve.
Table 1 shows the average percent of toxicities as well as the percent of trials with
toxicity rates greater than θ + 0.05 and θ + 0.1 for scenarios 1–3. In general, we
Fig. 5 Estimated MTD curves for m = 1000 simulated trials. The black dashed curve represents the true MTD curve, the gray dashed lines represent the contours at θ ± 0.05 and θ ± 0.10, and the solid curves represent the estimated MTD curves for each value of η
Fig. 6 Pointwise percent of MTD recommendation for m = 1000 simulated trials. Solid lines represent the pointwise percent of MTD recommendation when p = 0.2, and dashed lines represent the pointwise percent of MTD recommendation when p = 0.1
observe that increasing the fraction of toxicity attributions η reduces the average
percent of toxicities and percent of trials with toxicity rates greater than θ + 0.05 and
θ + 0.10. These results show that the design is safe in the sense that the probability
that a prospective trial will result in an excessive rate of toxicity (greater than
θ + 0.10) is less than 5%. Figure 6 shows the pointwise percent of MTD recommendation for the three proposed scenarios for each value of η. In general, increasing the value of η increases the pointwise percent of MTD recommendation, reaching up to 80% correct recommendation when p = 0.2 and up to 70% correct recommendation when p = 0.1. Based on these simulation results, we conclude that in the continuous dose setting, the approach of partial toxicity attribution generates safe trial designs and efficient estimation of the MTD. Further details about the approach and computer codes can be found in Jimenez et al. (2019).
Phase I/II Dose Finding

In this section, we describe a phase I/II design with the objective of determining a
tolerable dose level that maximizes treatment efficacy. For treatments where efficacy
is ascertained in a relatively short period of time such as one or two cycles of therapy,
sequential designs for updating the joint probability of toxicity and efficacy and
estimating the optimal dose have been studied extensively in the literature (for single
agent trials, see, e.g., Murtaugh and Fisher (1990), Thall and Russell (1998), Braun
(2002), Ivanova (2003), Thall and Cook (2004), Chen et al. (2015), and Sato et al.
(2016), and for dose combination trials Yuan and Yin (2011), Wages and Conaway
(2014), Cai et al. (2014), Riviere et al. (2015), and Clertant and Tighiouart (2017)).
For treatments where response evaluation takes a few cycles of therapy, it is standard
practice to perform a two-stage design where a maximum tolerable dose (MTD) of a
new drug or combinations of drugs is first determined, and then this recommended
phase II dose is studied in stage II and evaluated for treatment efficacy, possibly
using a different population of cancer patients from stage I (see Rogatko et al. 2008;
Le Tourneau et al. 2009; Chen et al. 2012b for a review of such a paradigm). For
drug combination phase I trials, more than one MTD can be recommended at the
conclusion of the trial and choosing a single MTD combination for efficacy study
may result in a failed phase II trial since other MTDs may present higher treatment
efficacy. Hence, adaptive or parallel phase II trials may be more suitable for searching for an optimal dose combination that is well tolerated with a desired level of efficacy.
Stage I

In stage I, the MTD curve is estimated using the phase I design described in section “Dose Finding to Estimate the Maximum Tolerated Dose Curve,” yielding the estimated curve Cest in (8).
Stage II
For every dose combination (x, y) ∈ Cest, let x be the unique vertical projection of (x, y) on the interval [X1, X2]. Next, denote by z ∈ [0, 1] the standardized dose of x ∈ [X1, X2] using the transformation z = h3(x) = (x − X1)/(X2 − X1). In the sequel, we will refer to z as a dose combination, since there is a one-to-one transformation mapping z ∈ [0, 1] to (x, y) ∈ Cest, x ∈ [X1, X2], y ∈ [Y1, Y2]. We model the probability of treatment response given dose combination z in Cest as
$$f(z; \psi) = \beta_0 + \beta_1 z + \beta_2 z^2 + \sum_{j=3}^{k} \beta_j \left( z - \kappa_j \right)_{+}^{3}, \qquad (21)$$

where ψ = (β, κ), β = (β0, . . . , βk), κ = (κ3, . . . , κk) with κ3 = 0. Let Dm = {(zi, Ei), i = 1, . . . , m} be the data after enrolling m patients in the trial, where Ei is the response of the i-th patient treated with dose combination zi, and let π(ψ) be a prior density on the parameter ψ. The posterior distribution is

$$\pi(\psi \mid D_m) \propto \prod_{i=1}^{m} \left[ F(f(z_i; \psi)) \right]^{E_i} \left[ 1 - F(f(z_i; \psi)) \right]^{1 - E_i} \pi(\psi). \qquad (22)$$
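A minimal R sketch of the spline mean in Eq. (21) and the induced response probability, assuming a logistic F; the coefficient vector beta and knot vector kappa are placeholders, with length(beta) = length(kappa) + 3.

f_spline <- function(z, beta, kappa) {
  # beta = (beta0, ..., betak); kappa = (kappa3, ..., kappak) with kappa3 = 0
  base <- beta[1] + beta[2] * z + beta[3] * z^2
  spl <- sapply(seq_along(kappa),
                function(j) beta[3 + j] * pmax(z - kappa[j], 0)^3)
  base + if (is.matrix(spl)) rowSums(spl) else sum(spl)
}
prob_eff <- function(z, beta, kappa) 1 / (1 + exp(-f_spline(z, beta, kappa)))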
Trial Design
(i) Randomly assign n1 patients to dose combinations z1, . . . , z_{n1} equally spaced along the MTD curve Cest so that each combination is assigned to one and only one patient.
(ii) Obtain a Bayes estimate ψ̂ of ψ given the data D_{n1} using (22).
(iii) Generate n2 dose combinations from the standardized density F(f(z; ψ̂)), and assign them to the next cohort of n2 patients.
(iv) Repeat steps (ii) and (iii) until a total of n patients have been enrolled in the trial, subject to pre-specified stopping rules.
The trial is stopped early for futility if, after j patients are evaluable for efficacy, no dose combination is likely to achieve a response probability above p0, i.e., Max_z [P(F(f(z; ψ)) > p0 | Dj)] < δ0, where δ0 is a small pre-specified threshold. In cases where the investigator is interested in stopping the trial early for superiority, the trial can be terminated after j patients are evaluable for efficacy if Max_z [P(F(f(z; ψ)) > p0 | Dj)] > δ1, where δ1 ≥ δu is a pre-specified threshold, and the corresponding dose combination z* = argmax_u {P(F(f(u; ψ)) > p0 | Dj)} is selected for future randomized phase II or III studies.
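Step (iii) draws doses from the density on [0, 1] proportional to F(f(z; ψ̂)). A simple rejection-sampling sketch in R, reusing prob_eff from the previous sketch:

sample_doses <- function(n2, beta_hat, kappa_hat) {
  zs <- seq(0, 1, length.out = 1001)
  m  <- max(prob_eff(zs, beta_hat, kappa_hat))   # envelope constant
  out <- numeric(0)
  while (length(out) < n2) {
    z <- runif(1)
    if (runif(1) < prob_eff(z, beta_hat, kappa_hat) / m) out <- c(out, z)
  }
  out
}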
Fig. 7 True and estimated efficacy curve under six scenarios favoring the null and alternative hypotheses
The scenarios were constructed so that the probability of response equals p0 at one dose combination only for scenarios (d), (e), and (f).
Operating Characteristics
For each scenario favoring the alternative hypothesis, we estimate the Bayesian
power as
$$\mathrm{Power} = \frac{1}{M} \sum_{i=1}^{M} I\left[ \max_z \left\{ P\left( F(f(z; \psi_i)) > p_0 \mid D_{n,i} \right) \right\} > \delta_u \right], \qquad (24)$$

where the posterior probabilities are approximated by

$$P\left( F(f(z; \psi_i)) > p_0 \mid D_{n,i} \right) \approx \frac{1}{L} \sum_{j=1}^{L} I\left[ F\left( f(z; \psi_{i,j}) \right) > p_0 \right], \qquad (25)$$
where ψ i,j, j ¼ 1, . . ., L is an MCMC sample from the i-th trial. For scenarios
favoring the null hypothesis, (24) is the estimated Bayesian type I error probability.
The optimal or target dose from the i-th trial is

$$z_i^* = \underset{z}{\operatorname{argmax}}\; P\left( F(f(z; \psi_i)) > p_0 \mid D_{n,i} \right). \qquad (26)$$
We also report the estimated efficacy curve, obtained by replacing ψ in (20) by the average of the posterior medians across all simulated trials (Eq. 27), and the mean posterior probability of efficacy curve

$$\frac{1}{M} \sum_{i=1}^{M} P\left( F(f(z; \psi_i)) > p_0 \mid D_{n,i} \right). \qquad (28)$$
The estimated efficacy curves shown in black dashed lines in Fig. 7 computed
using Eq. 27 are fairly close to the true probability of efficacy curve in all scenarios
except for scenario (c) near the lower edge of the MTD curve. The mean posterior probability of efficacy curve, shown as a red dashed line and computed using Eq. 28, is 80% or more at dose combinations where the true probability of efficacy is maximized for
scenarios (a, b) and close to 80% for scenario (c). Similar conclusions can be drawn
for scenarios favoring the null hypothesis where the maximum of the mean posterior
probability of efficacy is less than 50%. Figure 8 is the estimated density of the target
dose z defined in Eq. 26 under scenarios favoring the alternative hypothesis (a–c),
and the shaded region corresponds to dose combinations with corresponding true
Fig. 8 Estimated density of the target dose combination under three scenarios favoring the
alternative hypothesis
probability of efficacy greater than p0 = 0.15. The mode of these densities is close to the target doses. Moreover, the estimated probabilities of selecting a dose with true probability of efficacy greater than p0 = 0.15 vary between 0.90 and 0.96 across the three scenarios. The Bayesian power for scenarios (a–c) and type I error probability for scenarios (d–f), estimated using Eq. 24 with a threshold δu = 0.8, are reported in Table 2. Power varies between 0.81 and 0.92, and the type I error probability varies between 0.10 and 0.19. The coverage probability in the last column of Table 2 is the estimated probability of selecting a dose with true probability of efficacy greater than p0 = 0.15. We conclude that the design has good operating characteristics in identifying tolerable dose combinations with maximum benefit rate. We refer the reader to Tighiouart (2019) for sensitivity analyses regarding n1, n2, δu, and other values of p0 and effect size.
Discrete Dose Combinations

Suppose now that the doses of agents A1 and A2 are restricted to pre-specified discrete levels x1 < · · · < xr and y1 < · · · < ys, respectively. At the end of the phase I trial, a set of recommended MTDs is selected among the discrete combinations as follows.

(a) Let Γ_{A1} = ∪_{t=1}^{s} {(x, y_t) : x = argmin_{x_j} d((x_j, y_t), Cest)}, Γ_{A2} = ∪_{t=1}^{r} {(x_t, y) : y = argmin_{y_j} d((x_t, y_j), Cest)}, and Γ0 = Γ_{A1} ∩ Γ_{A2}.
(b) Let Γ = Γ0 \ {(x, y) : P(|P(DLT | (x, y)) − θ| > δ1 | Dn) > δ2}.

Here, d((x_j, y_t), Cest) denotes the Euclidean distance from the point (x_j, y_t) to the curve Cest.
The set Γ0 in (a) consists of dose combinations closest to the MTD curve obtained
by first minimizing the Euclidean distances across the levels of drug A1 and then
across the levels of drug A2. Doses in Γ0 that are either likely to be too toxic or
subtherapeutic are excluded in (b). The design parameter δ1 is selected after consultation with a clinician. The parameter δ2 is selected after exploring a large number of
practical scenarios when designing a trial. In our experience with the sample sizes
and scenarios used in Wang and Ivanova (2005), we found that δ2 = 0.3 or 0.35 results in good design operating characteristics.
Illustration
We consider five scenarios studied in Wang and Ivanova (2005) and shown in Table 3. The sample size is n = 54 for the first four scenarios and n = 60 for the last scenario. The target probability of DLT is θ = 0.2, and the prior distributions for ρ00, ρ01, ρ10, and η3 are described in section “Dose Finding to Estimate the Maximum Tolerated Dose Curve” with hyperparameters ai = bi = 1, i = 1, . . . , 3. A tight gamma(1, 1) prior was put on the interaction parameter η3 since the model in Wang and Ivanova (2005) has two parameters with no interaction coefficient. We assess the performance of the method by simulating m = 2000 trials and calculating the accuracy index introduced in Cheung (2011):
$$AI_n = 1 - K\, \frac{\sum_{k=1}^{K} \Delta_k\, p_{n,k}}{\sum_{k=1}^{K} \Delta_k}, \qquad (29)$$
where n is the trial sample size, K is the number of discrete doses available in the trial, p_{n,k} is the probability of selecting dose k in a trial with n patients, and Δk is a distance measure between the true probability of DLT pk at dose k and the target probability of DLT θ. It can be shown that AIn < 1, and higher values of AIn are desirable. We also report a measure of percent selection defined as follows. For a given scenario, let Γδ = {(xi, yj) : |P(DLT | (xi, yj)) − θ| < δ} be the set of true MTDs, where the threshold parameter δ is fixed by the clinician. Let Γi be the set of estimated MTDs at the end of the i-th trial as described in section “Discrete Dose Combinations,” i = 1, . . . , m. The percent of MTD selection is
$$\%\mathrm{Selection} = \frac{1}{m} \sum_{i=1}^{m} I\left( \Gamma_i \subseteq \Gamma_\delta \right). \qquad (30)$$
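In R, Eqs. (29) and (30) might be computed as follows, taking Δk = |pk − θ| as one common choice of distance and storing each estimated MTD set as a vector of dose-combination indices; names are illustrative.

accuracy_index <- function(pk, pnk, theta) {   # Eq. (29)
  Delta <- abs(pk - theta)
  1 - length(pk) * sum(Delta * pnk) / sum(Delta)
}
pct_selection <- function(est_sets, true_set) {   # Eq. (30)
  mean(sapply(est_sets, function(G) all(G %in% true_set)))
}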
Summary and Conclusion

Model-based designs for drug combinations in early phase cancer clinical trials have been studied extensively in the last decade. For phase I trials, these methods are designed to estimate one or more MTDs for use in future phase II trials. It is important to note that designs that can recommend more than one MTD for efficacy studies are preferable, as this may decrease the likelihood of a failed phase II trial. In
this chapter, we focused on dose finding using two drugs with continuous dose
levels. For a phase I trial design, consecutive cohorts of two patients were treated
simultaneously with different dose combinations to better explore the space of doses.
The method was studied extensively in Tighiouart et al. (2014b, 2016, 2017b) and
Diniz et al. (2017) via extensive simulations and was shown to be safe in general
with high percent of MTD recommendation. We also showed how this was applied
to design the first part of the CisCab trial using a relatively small sample size and
calibrate prior distributions of the model parameters. In practice, active involvement
of the clinician is required at the design stage of the trial to facilitate prior calibration
and to specify scenarios with various locations of the true MTD set of doses.
It is well known that optimal treatment protocols use drug combinations that have
nonoverlapping toxicities. However, cancer drugs with nonoverlapping toxicities of
any grade are rare. In this chapter, we described situations where the clinician is able to attribute the DLT to one or more drugs in an unknown fraction of patients by extending the previous statistical models. This is practically useful when the two drugs do not have many overlapping toxicities (see, e.g., Miles et al. 2002 for some examples of drug combination trials with these characteristics). We showed by
simulations that as the fraction of attributable toxicities increases, the rate of DLT
decreases, and there is a gain in the precision of the estimated MTD curve. In cases
where we expect a high percent of overlapping DLTs, designs that do not distinguish
between drug attribution listed in the introduction and described in section “Dose
Finding to Estimate the Maximum Tolerated Dose Curve” may be more appropriate.
It is also important to note that the method relies on clinical judgment regarding DLT
attribution.
In the second part of the chapter, we showed how the estimated MTD curve from
a phase I trial is carried to a phase II trial for efficacy study using Bayesian adaptive
randomization. This design can be viewed as an extension of the Bayesian adaptive
design comparing a finite number of arms (Berry et al. (2011)) to comparing an
infinite number of arms. In particular, if the dose levels of the two agents are discrete,
then methods such as the ones described in Thall et al. (2003), Wang and Ivanova
(2005), and Wages (2016) can be used to identify a set of MTDs in stage I, and the
trial in stage II can be done using adaptive randomization to select the most
efficacious dose. Unlike phase I/II designs that use toxicity and efficacy data
simultaneously and require a short period of time to resolve efficacy status, the use
of a two-stage design is sometimes necessary in practice if it takes few cycles of
therapy to resolve treatment efficacy or if the populations of patients in phases I and
II are different. In fact, for the CisCab trial described in section “Phase I/II Dose
Finding,” efficacy is resolved after three cycles (9 weeks) of treatment, and patients
in stage I must have metastatic, castration-resistant prostate cancer, whereas patients
in stage II must have visceral metastasis. The uncertainty of the estimated MTD
curve in stage I is not taken into account in stage II of the design in the sense that the
MTD curve is not updated as a result of observing DLTs in stage II. This is a
limitation of this approach since patients in stage II may come from a different
population and may have different treatment susceptibility relative to patients in
stage I. This problem is also inherent to single agent two-stage designs where the
MTD from the phase I trial is used in phase II studies and safety is monitored
continuously during this phase. Due to the small sample size, methods to estimate
the MTD curves for each subpopulation in the phase I trial (Diniz et al. 2018) may
not be appropriate. An alternative design would account for first, second, and third
cycle DLT in addition to efficacy outcome at each cycle. In addition, the nature of
DLT (reversible vs. nonreversible) should be taken into account since patients with a
reversible DLT are usually treated for that side effect and kept in the trial with dose
reduction in subsequent cycles. For the CisCab trial, a separate stopping rule using
Bayesian continuous monitoring for excessive toxicity is included in the clinical
protocol.
Cross-References
Acknowledgments This work is supported in part by the National Institute of Health Grant
Number R01 CA188480-01A1 and the National Center for Research Resources, Grant
UL1RR033176, and is now at the National Center for Advancing Translational Sciences, Grant
UL1TR000124, P01 CA098912, and U01 CA232859-01.
References
Babb J, Rogatko A, Zacks S (1998) Cancer phase I clinical trials: efficient dose escalation with
overdose control. Stat Med 17:1103–1120
Berry SM, Carlin BP, Lee JJ, Muller P (2011) Bayesian adaptive methods for clinical trials.
Chapman & Hall, Boca Raton
Braun TM (2002) The bivariate continual reassessment method: extending the CRM to phase I trials
of two competing outcomes. Control Clin Trials 23:240–256
Braun TM, Wang SF (2010) A hierarchical Bayesian design for phase I trials of novel combinations
of cancer therapeutic agents. Biometrics 66:805–812
Cai C, Yuan Y, Ji Y (2014) A Bayesian dose finding design for oncology clinical trials of
combinational biological agents. Appl Stat 63:159–173
Chen Z, Tighiouart M, Kowalski J (2012a) Dose escalation with overdose control using a quasi-
continuous toxicity score in cancer phase I clinical trials. Contemp Clin Trials 33:949–958
Chen Z, Zhao Y, Cui Y, Kowalski J (2012b) Methodology and application of adaptive and sequential approaches in contemporary clinical trials. J Probab Stat 2012:20
Chen Z, Yuan Y, Li Z, Kutner M, Owonikoko T, Curran WJ, Khuri F, Kowalski J (2015) Dose
escalation with over-dose and under-dose controls in phase I/II clinical trials. Contemp Clin
Trials 43:133–141
Cheung YK (2011) Dose-finding by the continual reassessment method, 1st edn. Chapman & Hall,
Boca Raton
Clertant M, Tighiouart M (2017) Design of phase I/II drug combination cancer trials using conditional continual reassessment method and adaptive randomization. In: JSM proceedings, biopharmaceutical section. American Statistical Association, Alexandria, pp 1332–1349
Diniz MA, Li Q, Tighiouart M (2017) Dose finding for drug combination in early cancer phase I trials using conditional continual reassessment method. J Biom Biostat 8:381. https://doi.org/10.4172/2155-6180.1000381
Diniz MA, Kim S, Tighiouart M (2018) A Bayesian adaptive design in cancer phase I trials using
dose combinations in the presence of a baseline covariate. J Probab Stat 2018:11
Diniz MA, Tighiouart M, Rogatko A (2019) Comparison between continuous and discrete doses for
model based designs in cancer dose finding. PLoS One 14:e0210139
Faries D (1994) Practical modifications of the continual reassessment method for phase I cancer
clinical trials. J Biopharm Stat 4:147–164
Goodman S, Zahurak M, Piantadosi S (1995) Some practical improvements in the continual
reassessment method for phase I studies. Stat Med 14:1149–1161
Ivanova A (2003) A new dose-finding design for bivariate outcomes. Biometrics 59:1001–1007
Jimenez JL, Tighiouart M, Gasparini M (2019) Cancer phase I trial design using drug combinations
when a fraction of dose limiting toxicities is attributable to one or more agents. Biom J 61
(2):319–332
Le Tourneau C, Lee JJ, Siu LL (2009) Dose escalation methods in phase I cancer clinical trials.
J Natl Cancer Inst 101:708–720
Lockhart AC, Sundaram S, Sarantopoulos J, Mita MM, Wang-Gillam A, Moseley JL, Barber SL,
Lane AR, Wack C, Kassalow L, Dedieu JF, Mita A (2014) Phase I dose-escalation study of
cabazitaxel administered in combination with cisplatin in patients with advanced solid tumors.
Investig New Drugs 32:1236–1245
Lunn DJ, Thomas A, Best N, Spiegelhalter D (2000) WinBUGS – a Bayesian modelling framework: concepts, structure, and extensibility. Stat Comput 10:325–337
Mander A, Sweeting M (2015) A product of independent beta probabilities dose escalation design
for dual-agent phase I trials. Stat Med 34:1261–1276
Miles D, Von Minckwitz GJ, Seidman AD (2002) Combination versus sequential single-agent
therapy in metastatic breast cancer. Oncologist 7:13–19
Murtaugh PA, Fisher LD (1990) Bivariate binary models of efficacy and toxicity in dose-ranging
trials. Commun Stat Theory Methods 19:2003–2020
O’Quigley J, Shen LZ (1996) Continual reassessment method: a likelihood approach. Biometrics
52:673–684
O’Quigley J, Pepe M, Fisher L (1990) Continual reassessment method: a practical design for phase I
clinical trials in cancer. Biometrics 46:33–48
Piantadosi S, Fisher JD, Grossman S (1998) Practical implementation of a modified continual
reassessment method for dose-finding trials. Cancer Chemother Pharmacol 41:429–436
Plummer M (2003) JAGS: a program for analysis of Bayesian graphical models using Gibbs
sampling. 3rd International Workshop on Distributed Statistical Computing (DSC 2003);
Vienna, Austria. 124
Riviere M, Yuan Y, Dubois F, Zohar S (2014) A bayesian dose-finding design for drug combination
clinical trials based on the logistic model. Pharm Stat 13:247–257
Riviere MK, Yuan Y, Dubois F, Zohar S (2015) A Bayesian dose-finding design for clinical trials
combining a cytotoxic agent with a molecularly targeted agent. J R Stat Soc Ser C 64:215–229
Rogatko A, Gosh P, Vidakovic B, Tighiouart M (2008) Patient-specific dose adjustment in the
cancer clinical trial setting. Pharm Med 22:345–350
Sato H, Hirakawa A, Hamada C (2016) An adaptive dose-finding method using a change-point
model for molecularly targeted agents in phase I trials. Stat Med 35:4093–4109
Shi Y, Yin G (2013) Escalation with overdose control for phase I drug-combination trials. Stat Med
32:4400–4412
Thall PF, Cook JD (2004) Dose-finding based on efficacy toxicity trade-offs. Biometrics 60:684–693
Thall PF, Russell KE (1998) A strategy for dose-finding and safety monitoring based on efficacy
and adverse outcomes in phase I/II clinical trials. Biometrics 54:251–264
Thall PF, Millikan RE, Mueller P, Lee SJ (2003) Dose-finding with two agents in phase I oncology
trials. Biometrics 59:487–496
Tighiouart M (2019) Two-stage design for phase I/II cancer clinical trials using continuous-dose
combinations of cytotoxic agents. J R Stat Soc Ser C 68(1):235–250
Tighiouart M, Rogatko A (2010) Dose finding with escalation with overdose control (EWOC) in
cancer clinical trials. Stat Sci 25:217–226
Tighiouart M, Rogatko A (2012) Number of patients per cohort and sample size considerations
using dose escalation with overdose control. J Probab Stat 2012:16
Tighiouart M, Rogatko A, Babb JS (2005) Flexible Bayesian methods for cancer phase I clinical
trials. Dose escalation with overdose control. Stat Med 24:2183–2196
Tighiouart M, Cook-Wiens G, Rogatko A (2012a) Escalation with overdose control using ordinal
toxicity grades for cancer phase I clinical trials. J Probab Stat 2012:18. https://doi.org/10.1155/
2012/317634
Tighiouart M, Cook-Wiens G, Rogatko A (2012b) Incorporating a patient dichotomous characteristic
in cancer phase I clinical trials using escalation with overdose control. J Probab Stat 2012:10
Tighiouart M, Liu Y, Rogatko A (2014a) Escalation with overdose control using time to toxicity for
cancer phase I clinical trials. PLoS One 9:e93070
Tighiouart M, Piantadosi S, Rogatko A (2014b) Dose finding with drug combinations in cancer
phase I clinical trials using conditional escalation with overdose control. Stat Med 33:3815–
3829
Tighiouart M, Li Q, Piantadosi S, Rogatko A (2016) A Bayesian adaptive design for combination of
three drugs in cancer phase I clinical trials. Am J Biostat 6:1–11
Tighiouart M, Cook-Wiens G, Rogatko A (2017a) A Bayesian adaptive design for cancer phase I
trials using a flexible range of doses. J Biopharm Stat 31:1–13
Tighiouart M, Li Q, Rogatko A (2017b) A Bayesian adaptive design for estimating the maximum
tolerated dose curve using drug combinations in cancer phase I clinical trials. Stat Med 36:280–
290
Wages NA (2016) Identifying a maximum tolerated contour in two-dimensional dose finding. Stat
Med 36:242–253
Wages NA, Conaway MR (2014) Phase I/II adaptive design for drug combination oncology trials.
Stat Med 33:1990–2003
Wages NA, Conaway MR, O’Quigley J (2011) Continual reassessment method for partial ordering.
Biometrics 67:1555–1563
Wang K, Ivanova A (2005) Two-dimensional dose finding in discrete dose space. Biometrics
61:217–222
Wheeler GM, Sweeting MJ, Mander AP (2017) Toxicity-dependent feasibility bounds for the
escalation with overdose control approach in phase I cancer trials. Stat Med 36:2499–2513
Yin GS, Yuan Y (2009a) A latent contingency table approach to dose finding for combinations of
two agents. Biometrics 65:866–875
Yin GS, Yuan Y (2009b) Bayesian dose finding by jointly modelling toxicity and efficacy as time-
to-event outcomes. J R Stat Soc Ser C Appl Stat 58:719–736
Yuan Y, Yin G (2011) Bayesian phase I/II adaptively randomized oncology trials with combined
drugs. Ann Appl Stat 5:924–942
Middle Development Trials
56
Emine O. Bayman
University of Iowa, Iowa City, IA, USA
Contents
Introduction 1032
Single-Arm Versus Two-Arm Phase II Trials 1032
Frequentist Two-Stage Designs 1033
Pitfalls with Conventional Frequentist Designs 1035
Bayesian Methods 1035
How to Construct Prior Distributions 1036
Noninformative Prior Distributions 1036
Beta-Binomial Example 1037
Bayesian Phase II Clinical Trials 1038
Predictive Probability Approach 1039
Oncology Example with the PP Approach 1042
Frequentist Two-Stage Design Versus Bayesian PP Approach 1043
Bayesian Phase I–II Trials 1044
Summary and Conclusion 1044
Key Facts 1044
Cross-References 1044
References 1044
Abstract
Phase I trials are the first application of a new treatment to humans. The main goal of the phase I trial is to establish the safety of the new treatment and determine the maximum tolerated dose for use in a subsequent phase II clinical trial. When moving from a phase I to a phase II trial, the focus shifts from toxicity (safety) to efficacy. In phase II trials, the aim is to decide whether the new treatment is sufficiently promising relative to the standard therapy for inclusion in a large-scale phase III clinical trial.
In this chapter, first, frequentist one-arm two-stage phase II clinical trials will be introduced. Then, a brief background for Bayesian trials will be provided. Finally, a one-arm Bayesian design using the predictive probability approach will be explained. Calculations or software to implement the examples will also be provided when available.
Keywords
Phase II · Clinical trial · Bayesian · Predictive probability · Two-stage design
Introduction
After one or more successful phase I trials have been completed, phase II trials may be initiated. Phase II clinical trials aim to decide whether a new treatment is sufficiently promising, relative to the standard therapy, to include in large-scale randomized clinical trials. Phase II trials provide a bridge between small phase I trials, where the maximum tolerated dose is determined, and large-scale randomized phase III trials. Compared to phase I trials, phase II trials run on larger groups of patients, generally 40 to 200. Compared to phase III trials, phase II trials tend to use surrogate markers as earlier endpoints (e.g., tumor shrinkage within the first few weeks instead of survival at 5 years) to shorten the study duration. Generally, the sample size is not large enough to provide sufficient power for definitive comparisons. The design provides decision boundaries, a probability distribution for the sample size at termination, and operating characteristics under fixed response probabilities with the new treatment.
There are three basic requirements for any clinical trial: (1) the trial should
examine an important research question; (2) the trial should use rigorous methodol-
ogy to answer the question of interest; and (3) the trial must be based on ethical
considerations and assure that risks to subjects are minimized. Because of the small
sample size, meeting these requirements in early-phase clinical trials can be more
challenging compared to phase III trials. Therefore, the importance of study planning
is magnified in these settings.
Single-Arm Versus Two-Arm Phase II Trials

Phase II studies could be single arm, where only the treatment of interest is tested, or
two-arm with a concurrent control group. One-arm designs are used more frequently
to expedite the phase II clinical trials and will be presented here.
The main goal of phase II studies is to provide an assessment of the efficacy of the treatment of interest. Accordingly, the goal is to determine whether the new treatment is sufficiently promising to justify inclusion in large-scale randomized trials. Otherwise, ineffective treatments should be screened out. In addition, the safety profile of
the treatment of interest is further characterized in a phase II trial. Generally, a binary primary endpoint of favorable/unfavorable outcome (efficacy/no efficacy) is used in phase II trials.
Frequentist Two-Stage Designs

The most commonly used frequentist two-stage designs are Gehan's design (Gehan 1961), Simon's optimal design, and the minimax design (Simon 1989). The optimal design minimizes the expected sample size under the null hypothesis, and the minimax design minimizes the maximum total sample size. The user needs to specify the fixed target response rate for the new therapy (p1) and the existing treatment (p0) along with the type I (α) and type II error (β) rates to
and the existing treatment ( p0) along with the type I (α) and type II error (β) rates to
obtain the sample sizes and stopping boundaries for each stage for two-stage
designs. In this setting, type I error rate can be interpreted as the probability of
finding the treatment of interest as efficacious and recommending for further study
when it is not in fact efficacious. Similarly, type II error is the probability of finding
the treatment of interest as not efficacious and not recommending for further study
when it is in fact efficacious. Therefore, in phase II trials, it is more important to
control type II error rate than the type I error rate so that efficacious treatments are
not missed. Type I and II error rates are larger in phase II trials, typically around 10%.
Because of the small sample size, when available, exact methods are preferred
(Jung 2013). More complex phase II designs with more than one interim analysis also exist (Yuan et al. 2016).
At the end of the first stage, frequentist two-stage designs allow early termination of the trial for futility if the interim data indicate that the new treatment is not effective (Lee and Liu 2008). Both the optimal and minimax designs can be implemented online for pre-specified inputs using the NIH Biometric Research Program website: https://linus.nci.nih.gov/brb/samplesize/otsd.html.
Example 1 Let the current favorable response rate with the standard therapy be 30% (p0 = 0.3). The new treatment is expected to increase the favorable response rate to 50% (p1 = 0.5). The outcome will be recorded as favorable versus unfavorable for each patient. Both type I and type II error rates will be kept below 10%. The null and alternative hypotheses for this study can be written as H0: p ≤ 0.3 versus H1: p ≥ 0.5. The stopping boundaries for each of the two stages can be calculated for both the optimal and minimax designs from the website provided above.
For the optimal design, 22 (n1) patients should be enrolled in the first stage. If there are 7 (r1) or fewer favorable outcomes out of these 22 patients, the trial should be stopped early for futility, and the new treatment should be declared ineffective. If there are more than 7 favorable outcomes, 24 more patients should be enrolled in the second stage, so that the overall sample size of the study is 46 (n). If there are 17 (r) or fewer favorable outcomes out of the 46 patients, the new treatment will be declared ineffective. If there are more than 17 favorable outcomes, the null hypothesis will be rejected and the new treatment declared effective (Fig. 1).
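These boundaries can be reproduced in R. The following sketch assumes the clinfun package, whose ph2simon function computes Simon two-stage designs:

library(clinfun)
# Simon two-stage designs for p0 = 0.3, p1 = 0.5, alpha = beta = 0.10
ph2simon(pu = 0.3, pa = 0.5, ep1 = 0.1, ep2 = 0.1)
# The optimal row should show r1/n1 = 7/22 and r/n = 17/46, as above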
The probability of early termination under the null hypothesis, PET(p0), is the probability of observing 0 to 7 favorable outcomes among the first 22 subjects when the probability of a favorable outcome is 0.3. Let x1 be the number of favorable outcomes among the first n1 patients in the first stage. The probability of early termination can be calculated in R as PET(p0) = Pr(x1 ≤ r1 | H0) = sum(dbinom(0:7, 22, 0.3)) = 0.67.
The expected sample size of the study under the null hypothesis is a combination of early termination at the end of the first stage and of not terminating and enrolling all 46 patients. Therefore, it can be written as E(N | p0) = [n1 × PET(p0)] + [n × (1 − PET(p0))] = [22 × 0.67] + [46 × (1 − 0.67)] = 29.9.
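These calculations can be consolidated in a few lines of base R:

n1 <- 22; r1 <- 7; n <- 46; p0 <- 0.3
PET <- pbinom(r1, n1, p0)        # Pr(x1 <= 7 | p = 0.3) = 0.67
EN  <- n1 * PET + n * (1 - PET)  # expected sample size under H0 = 29.9
c(PET = PET, EN = EN)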
A good two-stage frequentist design would have type I and type II error rates
lower than the initial constraints (0.1 for each in our example), high probability of
early termination, and small expected sample size under the null hypothesis (Lee and
Liu 2008).
In the sorafenib study, there was only 1 patient with PFR at 9 months out of the 26 patients enrolled in the first stage. Therefore, no more patients were enrolled in stage 2, and the study was stopped early for futility. It was concluded that sorafenib is ineffective with respect to PFR at 9 months (Ray-Coquard et al. 2012).
Multistage designs have better statistical properties than single-stage designs because they allow users to incorporate the interim data in decision-making (Lee and Liu 2008). The two-stage design also has limitations. As an extreme case, consider Example 1 presented above: p0 = 0.3, p1 = 0.5, n1 = 22, r1 = 7, n = 46, and r = 17. Assume 8 favorable outcomes were observed in the first stage of the study, so 24 more patients should be enrolled in the second stage. To show efficacy at the end of the second stage, at least 10 more favorable outcomes are needed (to exceed r = 17). If there are no favorable outcomes among the next 16 patients, it is impossible to observe the required number of favorable outcomes among the remaining 8 patients, and efficacy cannot be declared at the end of the second stage (Fig. 2). However, investigators cannot stop the study at this point under this design. In other words, 8 more patients must be enrolled even though their results are very likely to be unfavorable and the overall result of the study will not change. Therefore, more flexible designs that allow users to incorporate interim data at multiple stages of the study are needed.

Fig. 2 An extreme case for the two-stage design of Example 1: x = 8 favorable outcomes in stage I (n1 = 22); 0 favorable outcomes among the first 16 patients of stage II (n2 = 24), with 8 patients remaining (N = 46)
Bayesian Methods
In Bayesian designs, as in frequentist designs, the statistical analysis plan should be predefined. Similarly, the prior information should be identified in advance and justified.
A study design can be a stand-alone Bayesian design or a hybrid approach with
Bayesian and frequentist approaches for different outcomes. The FDA Guidance for
the Use of Bayesian Statistics in Medical Device Clinical Trials requires the
frequentist properties of Bayesian procedures to be investigated (FDA 2010).
To implement most of the commonly used frequentist designs for phase II trials, the clinician must specify a single value, p0, for the patients' favorable outcome rate under the standard therapy. In many cases there is uncertainty regarding p0. In contrast to frequentist designs, Bayesian designs treat the parameter of interest, p0, as a random variable with a prior distribution with density π(p0). Both in planning the phase II trial and in interpreting its results, a more realistic approach should explicitly account for the clinician's uncertainty regarding p0.
The design and conduct of phase II clinical trials would benefit from statistical
methods that can incorporate external information into the design process. With the
Bayesian design, the prior information and uncertainty can be quantified into a
probability distribution. This prior information can be updated and easily
implemented in a sequential design strategy.
Bayesian inference requires a joint distribution of the unknown parameter p and the data y. This is usually specified through a prior distribution π(p) over the parameter space and a likelihood, the conditional distribution of the data y given the parameter p. Bayesian inference about p is through the posterior distribution, the conditional distribution of p given y:

π(p | y) ∝ f(y | p) π(p).

As data accumulate, the prior distribution is updated, and the posterior distribution from the previous step becomes the prior distribution for the next. Therefore, there is continuous learning as data accumulate with the Bayesian approach.
If historical data are available for the standard therapy, they may be incorporated formally into the trial design and subsequent statistical inferences. If no such data exist, clinical experience and a clinician's current belief regarding the efficacy of the standard therapy may be represented by a probability distribution on p0. This prior probability distribution can be elicited from the subjective opinions of experts in the field (Chaloner and Rhame 2001) or from the subjective opinion of the investigator. In this case the Bayesian approach becomes even more appropriate.
When the prior distribution and the posterior distribution are from the same family, this is called conjugacy. For example, the beta family is a conjugate prior family for the binomial likelihood (Gelman et al. 2004), and the normal distribution with known variance is conjugate to itself (Chen and Peace 2011). When the prior distribution of p is not conjugate, the posterior distribution must be calculated numerically. It is often mathematically convenient to use a conjugate family of distributions so that the posterior distribution follows a known parametric form, although most real applied problems cannot be solved with conjugate prior distributions alone.
If there is some information about the distribution, the parameters of the prior distribution can be derived from it. For example, a binomial endpoint of favorable versus unfavorable outcomes is commonly used in phase II clinical trials. In such a case, because of conjugacy, using a beta prior distribution makes the calculations easier (Gelman et al. 2004). Assume, from historical data, that the median and an upper confidence bound are known for the favorable outcome rate. Using these two values, a search algorithm can find the parameters of the prior distribution for the favorable outcome rate. It is advisable to use a larger standard error to add some uncertainty to the prior distribution (Lynch 2007).
When conducting Bayesian analyses, it is recommended to use different prior
distributions as a sensitivity analysis to assess the robustness of the results
(Gelman et al. 2004).
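As an illustration of such a search, the following base-R sketch finds beta parameters matching an assumed prior median of 0.30 and 97.5th percentile of 0.50 (both targets are hypothetical):

# Squared-error criterion comparing beta quantiles to the target quantiles
target <- function(par) {
  a <- exp(par[1]); b <- exp(par[2])          # enforce positive parameters
  (qbeta(0.5, a, b) - 0.30)^2 + (qbeta(0.975, a, b) - 0.50)^2
}
fit <- optim(c(0, 0), target)                 # Nelder-Mead search
ab  <- exp(fit$par)                           # fitted prior parameters (a, b)
qbeta(c(0.5, 0.975), ab[1], ab[2])            # check: approximately 0.30 and 0.50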
Another approach is to make statistical inferences from a posterior distribution
based on simulation (Chen and Peace 2011). Modern computational methods can be
used to calculate posterior distributions. For example, WinBUGS is a popular software package specifically developed for Bayesian analyses that can be used to implement Markov chain Monte Carlo methods and generate a random sample from virtually any posterior distribution. A large class of prior distributions can be specified in WinBUGS. R packages such as R2WinBUGS (Sturtz et al. 2005) allow users to run WinBUGS from within R, which provides a relatively easy framework for many analyses.
Additionally, there are stand-alone R packages, such as MCMCpack (Martin et al.
2011), that can be used for Bayesian analyses. The MCMCpack can be used for an
extensive list of statistical models such as hierarchical longitudinal models and
multinomial logit model (Chen and Peace 2011).
Bayesian inferences are made based on the posterior distribution. It should also be noted that there is no p-value in a Bayesian analysis. Instead, the 95% (or 1 − type I error rate) credible interval of the posterior distribution can be used to evaluate the strength of evidence of the results.
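For example, for a Beta(12.65, 4.35) posterior (the posterior arising in the example of the next subsection), the central 95% credible interval is obtained in R as:

qbeta(c(0.025, 0.975), 12.65, 4.35)  # central 95% credible interval
12.65 / (12.65 + 4.35)               # posterior mean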
Beta-Binomial Example
Assume p is the favorable outcome rate of the new treatment, and the interest is to
test the following hypotheses in a one-arm phase II clinical trial: H0: p ≤ p0; H1: p > p1.
Let Y1, Y2, . . ., Yn denote patient responses to the new treatment, with each Yi = 1 or 0 for success or failure, respectively. Xn = Y1 + Y2 + . . . + Yn denotes the total number of favorable outcomes among the n subjects treated. Xn follows a binomial distribution with parameters n and p. Because of the conjugacy of the beta prior distribution for the binomial likelihood, it is common to use a beta prior
distribution for the favorable response rate, p. Let the prior distribution for p follow a beta distribution with parameters a and b:

π(p) = [Γ(a + b) / (Γ(a) Γ(b))] p^(a−1) (1 − p)^(b−1)
Efficacy, safety, and cost of the proposed therapy are assessed in phase II trials (Stallard 1998). In traditional frequentist two-stage phase II designs, the data can be assessed at only two stages. In contrast, Bayesian methods allow users to examine the interim data by updating the posterior probability of the parameters and to make relevant predictions and decisions at multiple stages. At each stage, the posterior distribution can be used to draw inferences concerning the parameter of interest. Accordingly, at each stage, there are three possible actions (Lee and Liu 2008):
I: Stop the study because of futility and declare that the new drug is not promising.
II: Stop the study because of efficacy and declare that the new drug is promising.
III: Continue the phase II study until the next inspection or until the maximum sample size is reached.
Let Nmax denote the maximum sample size, and suppose x favorable outcomes have been observed among the first n patients, so that m = Nmax − n patients remain. Let ye denote the number of favorable outcomes among the m future patients. The posterior predictive distribution of ye given the observed data y is

P(ye | y) = ∫₀¹ p(ye | p) p(p | y) dp

= (m choose y) [Γ(a + b + n) / (Γ(a + x) Γ(b + n − x))] ∫₀¹ p^y (1 − p)^(m−y) p^(a+x−1) (1 − p)^(b+n−x−1) dp

= [m! / (y! (m − y)!)] [Γ(a + b + n) / (Γ(a + x) Γ(b + n − x))] ∫₀¹ p^(y+a+x−1) (1 − p)^(b+n−x+m−y−1) dp

= (m choose y) [Γ(a + b + n) Γ(a + x + y) Γ(b + n − x + m − y)] / [Γ(a + x) Γ(b + n − x) Γ(a + b + n + m)],

that is, ye follows a beta-binomial distribution with parameters m, a + x, and b + n − x.
For each possible value Y = i of the future responses, define

Bi = Prob(p > p0 | x, Y = i).
Then this Bi value is compared to a threshold value, θT. If Bi is greater than θT for that realization Y = i, it is expected that the new treatment will be declared efficacious at the end of the trial. The predictive probability (PP) is the weighted average, over the predictive distribution of Y, of the indicator of a positive trial (Bi > θT) were the trial to continue until Nmax patients are enrolled (Lee and Liu 2008). The predictive probability approach thus measures the strength of evidence for concluding efficacy at the end of the trial, based on the current evidence in terms of the prior information and the data. The decision to stop the study early for efficacy or futility, or to continue because the current data are not conclusive, depends on this PP. If the PP is high, it is expected that the new treatment will be found efficacious at the end of the study, given the current data. On the other hand, a low PP indicates that the new treatment may not have sufficient activity by the end of the study. To prevent any ambiguity, lower (θL) and upper (θU) stopping thresholds should be pre-specified:
PP = Σ_{i=0}^{m} Pr(Y = i | x) × I( Pr(p > p0 | x, Y = i) > θT )

= Σ_{i=0}^{m} Pr(Y = i | x) I(Bi > θT)
The decision to stop early or to continue the trial will be based on the following thresholds:
If PP < θL: given the current information, it is unlikely that the response rate will be larger than p0 at the end of the trial. Stop for futility and reject H1.
If PP > θU: the current data suggest that, if the same trend continues, the treatment is highly likely to be found efficacious at the end of the trial. Stop for efficacy and reject H0.
If θL < PP < θU: continue to the next stage, until reaching Nmax patients.
Both the lower (θL) and upper (θU) stopping thresholds are set between 0 and 1. It is advisable to stop early if the drug is not promising; therefore, θL is chosen close to 0. In contrast, if the drug is promising, it is better not to stop the trial early; therefore, θU is chosen close to 1.
As a numerical illustration, let the prior be p ~ Beta(0.65, 0.35), and suppose x = 12 favorable outcomes are observed among the first n = 16 patients, so that p | x ~ Beta(12.65, 4.35). With y favorable outcomes among the m = 14 future patients (Nmax = 30), p | x, y ~ Beta(12.65 + y, 18.35 − y).
A table showing each possible number of favorable outcomes for the future 14 patients can be created (Table 1). Note that, for example, to calculate Prob(Y = 11 | X = 12) in column 2, the beta-binomial distribution should be used: beta-binomial(i = 11, m = 14, 12.65, 4.35). The dbetabinom.ab function in the VGAM package in R can be used to calculate this probability: dbetabinom.ab(11, 14, shape1 = 12.65, shape2 = 4.35). Similarly, to calculate the probability that p > 0.65 given x = 12 and i = 11, presented in column 3, the beta distribution should be used: Beta(23.65, 7.35). In R, 1 − pbeta(0.65, 23.65, 7.35) can be used. The last column is an indicator function showing whether Bi is greater than the threshold θT = 0.9.
For values of Y between 0 and 10, Bi is less than 0.9. In other words, if 0 to 10 favorable outcomes are observed among the future 14 patients, the null hypothesis will fail to be rejected, and it will be concluded that the new treatment is not effective. On the other hand, if the number of favorable outcomes among the future 14 patients is 11 or more, the new treatment will be deemed effective.
Finally, the PP for this table can be calculated as 0.1832 + 0.1706 + 0.1209 + 0.0509 = 0.5256. This PP value is greater than θL = 0.1; therefore, the trial cannot be stopped for futility. It is less than θU = 0.95; thus, it cannot be stopped for efficacy. According to this PP value, based on the interim data, the study should continue because the evidence is not yet sufficient to stop early for either futility or efficacy.
This predictive probability value can also be calculated with the R package “ph2bayes”; a base-R version of the calculation follows.
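The sketch below uses VGAM::dbetabinom.ab, as in the construction of Table 1 above:

library(VGAM)
# Prior Beta(0.65, 0.35); x = 12 responses in n = 16 patients; m = 14 future
# patients (Nmax = 30); p0 = 0.65; theta_T = 0.9, as in the example above
a <- 0.65; b <- 0.35; n <- 16; x <- 12; m <- 14
p0 <- 0.65; thetaT <- 0.9
i   <- 0:m
prY <- dbetabinom.ab(i, size = m, shape1 = a + x, shape2 = b + n - x)
Bi  <- 1 - pbeta(p0, a + x + i, b + n - x + m - i)   # Pr(p > p0 | x, Y = i)
sum(prY * (Bi > thetaT))                             # PP = 0.5256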
For the oncology example presented earlier, the 9-month PFR rate was assumed to be 12.7% (p0 = 0.127) for the standard therapy group, and it was expected that sorafenib would increase the 9-month PFR rate to 31.7% (p1 = 0.317). The same example can also be handled with the Bayesian PP approach. Let both type I and type II error rates be 10%. In addition to the existing frequentist assumptions, assume the data will start to be monitored after outcomes from the first 14 (Nmin = 14) patients are observed, with Nmax = 43. Let the response rate follow a beta prior distribution with parameters 0.127 and 0.873 (= 1 − 0.127). Note that this distribution has a prior mean of 0.127 (= 0.127/(0.127 + 0.873)) and is worth only one patient of information (0.127 + 0.873 = 1).
The corresponding rejection regions for this study would be 0/14, 1/24, 2/29, 3/33, 4/37, 5/39, 6/41, 7/42, and 8/43. The trial will stop for futility the first time the number of favorable outcomes falls into the rejection region. In the actual study,
Ray-Coquard et al. observed 1 patient with PFR at 9 months, out of the 26 enrolled
patients (Ray-Coquard et al. 2012). Consistent with the frequentist design, Bayesian
design would also recommend stopping at this point. Indeed, with the PP approach,
the study would have been stopped after 14 patients if there was no favorable
outcome or after only 24, instead of 26, patients if there was only 1 patient with
favorable outcome.
Rejection regions can also be calculated with the “ph2bye” R package; a base-R sketch of the underlying calculation follows.
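The sketch below shows how such futility boundaries arise from the PP rule; the thresholds θT = 0.9 and θL = 0.1 are illustrative assumptions, not necessarily the values used to generate the boundaries above:

library(VGAM)
pp <- function(x, n, nmax, a, b, p0, thetaT) {
  m <- nmax - n
  if (m == 0) return(as.numeric((1 - pbeta(p0, a + x, b + n - x)) > thetaT))
  i   <- 0:m
  prY <- dbetabinom.ab(i, size = m, shape1 = a + x, shape2 = b + n - x)
  Bi  <- 1 - pbeta(p0, a + x + i, b + n - x + m - i)
  sum(prY * (Bi > thetaT))
}
# For each n, the largest x that still triggers a futility stop (PP < theta_L);
# -1 means no futility stop is possible at that n
bounds <- sapply(14:43, function(n) {
  pps <- sapply(0:n, pp, n = n, nmax = 43, a = 0.127, b = 0.873,
                p0 = 0.127, thetaT = 0.9)
  max(c(-1, (0:n)[pps < 0.1]))
})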
In addition, following Lee and Liu's (2008) approach, the M.D. Anderson Cancer Center group developed software (https://biostatistics.mdanderson.org/SoftwareDownload/) that implements further calculations to determine stopping regions for futility and to search for Nmax, θL, θT, and θU. To use the software, users specify the minimum (Nmin) and maximum (Nmax) sample sizes, the response rates of the standard therapy (p0) and new treatment (p1) groups, the type I and type II error rates, and the parameters of the beta prior distribution for success in the standard therapy group (Berry et al. 2011). It is recommended to start monitoring the data after the first ten patients (Nmin = 10) have been treated and their outcomes observed. The software can be run for different Nmax values and can search the θL and θU space to generate designs satisfying both type I and type II error rate criteria (Berry et al. 2011).
In contrast to looking at the data only at the end of the first and second stages, as in the frequentist two-stage design, the Bayesian PP approach allows users to assess the data continuously once outcomes from at least ten patients have been observed. In other words, for the extreme case presented in Fig. 2, the frequentist design would not stop even if no favorable outcome were observed for the first 16 patients in the second stage of the study, whereas the Bayesian design would allow users to stop much earlier, in fact after the first six unfavorable outcomes in the second stage, for futility. The Bayesian PP approach allows more frequent monitoring and is therefore more flexible than the frequentist two-stage design; it enables users to stop at any time if the accumulating evidence does not support the new treatment's efficacy over the standard therapy.
It is most common to focus on the outcome of toxicity in phase I trials and on the outcome of efficacy in phase II trials. However, it is also possible, and may be more efficient, to consider a bivariate binary outcome of efficacy and toxicity in a single phase I–II clinical trial. Readers can learn more about Bayesian phase I–II trials in Yuan, Nguyen, and Thall (Yuan et al. 2016).
Key Facts
Cross-References
References
Berry SM, Carlin BP, Lee JJ, Muller P (2011) Bayesian adaptive methods for clinical trials.
Chapman & Hall/CRC Biostatistics Series, vol 38. CRC Press, Boca Raton
Biswas S, Liu DD, Lee JJ, Berry DA (2009) Bayesian clinical trials at the University of Texas M. D.
Anderson Cancer Center. Clin Trials (London, England) 6:205–216. https://doi.org/10.1177/
1740774509104992
Chaloner K, Rhame FS (2001) Quantifying and documenting prior beliefs in clinical trials. Stat Med
20:581–600. https://doi.org/10.1002/sim.694
Chen DG, Peace KE (2011) Clinical trial data analysis using R. CRC, Boca Raton, pp 1–357
Gehan EA (1961) The determination of the number of patients required in a preliminary and a
follow-up trial of a new chemotherapeutic agent. J Chronic Dis 13:346–353
Geller NL (ed) (2004) Advances in clinical trial biostatistics. Biostatistics series, vol 13. Marcel Dekker, New York
Gelman A, Carlin JB, Stern HS, Rubin DB (2004) Bayesian data analysis, 3rd edn. Chapman & Hall/CRC texts in statistical science. CRC Press, Boca Raton
FDA (2010) Guidance for the use of Bayesian statistics in medical device clinical trials. U.S. Department of Health and Human Services, Food and Drug Administration, Center for Devices and Radiological Health, Division of Biostatistics, Office of Surveillance and Biometrics. http://www.fda.gov/downloads/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/ucm071121.pdf
Jung S-H (2013) Randomized phase II cancer clinical trials. Chapman & Hall/CRC Biostatistics
Series. CRC Press/Taylor & Francis Group, Boca Raton
Lee JJ, Liu DD (2008) A predictive probability design for phase II cancer clinical trials. Clin Trials
(London, England) 5:93–106. https://doi.org/10.1177/1740774508089279
Lynch SM (2007) Introduction to applied Bayesian statistics and estimation for social scientists.
Statistics for social and behavioral sciences. Springer, New York
Martin AD, Quinn KM, Park JH (2011) MCMCpack: Markov Chain Monte Carlo in R. J Stat Softw 42
Ray-Coquard I et al (2012) Sorafenib for patients with advanced angiosarcoma: a phase II Trial
from the French Sarcoma Group (GSF/GETO). Oncologist 17:260–266. https://doi.org/10.
1634/theoncologist.2011-0237
Simon R (1989) Optimal two-stage designs for phase II clinical trials. Control Clin Trials 10:1–10
Stallard N (1998) Sample size determination for phase II clinical trials based on Bayesian decision
theory. Biometrics 54:279–294
Sturtz SL, Ligges U, Gelman A (2005) R2WinBUGS: a package for running WinBUGS from R. J
Stat Softw 12:1–16. https://doi.org/10.18637/jss.v012.i03
Yuan Y, Nguyen HQ, Thall PF (2016) Bayesian designs for phase I-II clinical trials. Chapman &
Hall/CRC Biostatistics Series. CRC Press, Taylor & Francis Group, Boca Raton
Zohar S, Teramukai S, Zhou Y (2008) Bayesian design and conduct of phase II single-arm clinical
trials with binary outcomes: a tutorial. Contemp Clin Trials 29:608–616. https://doi.org/10.
1016/j.cct.2007.11.005
Randomized Selection Designs
57
Shing M. Lee, Bruce Levin, and Cheng-Shiun Leu
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1048
Considerations for Designing Randomized Selection Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1050
Approaches to the Subset Selection Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1050
Acceptable Subset Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1051
Fixed Versus Random Subset Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1052
Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1053
The Simon et al. (1985) Fixed Sample Size Procedure (SWE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1053
Prescreening Using Simon’s Two-Stage Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1054
The Steinberg and Venzon (2002) Design (SV) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1054
The Levin-Robbins-Leu Family of Sequential Subset Selection Procedures . . . . . . . . . . . . . . 1055
Other Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1058
Applications of Randomized Selection Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1060
Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1063
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1065
Abstract
The general goal of a randomized selection design is to select one or more
treatments from several competing candidates to which patients are randomly
assigned, in such a way that selected treatment(s) are likely to be better than those
not selected. For example, if one treatment is clearly superior to all the others, we
may demand that the procedure select that treatment with high probability. The
experimental treatments could be different doses of a drug or intensities of a
behavioral intervention, different treatment schedules, modalities, or strategies, or may include the standard of care.
Keywords
Selection paradigm · Correct selection · Subset selection · Acceptable set · Phase
2 designs · Selection trials
Introduction
The goal of a randomized selection design is to select a truly best treatment (given a
suitable definition of “best”), or more generally to select a subset of b ≥ 1 treatments,
ideally containing the b truly best treatments, using methods that have certain
desirable operating characteristics. For example, assuming that one treatment is
truly better than all the rest to a prespecified degree, we may wish to select that
best treatment with a prespecified high probability of correct selection. Furthermore,
we may wish to select a treatment or a subset of treatments that are reasonably
“acceptable” (suitably defined) with prespecified high probability under any and all
circumstances, irrespective of the true efficacy differences among treatments. When
selecting treatments, the size of the subset b is typically fixed in advance – most often
being b = 1 – but we may also wish to select varying numbers of “best” treatments in
a data-dependent manner. Alternatively, we may wish to select a subset of treatments
of varying size that achieves a prespecified high probability of containing the one
truly best treatment. We may even wish to produce a reliable ranking of the c
candidate treatments. These treatments can be different treatment schedules, doses,
or strategies, and may include the standard of care. The statistical theory of ranking
and selection, with its strong ties to multiple decision theory, provides an overarch-
ing framework for all such selection goals.
It may be helpful here to introduce a taxonomy of randomized selection designs.
We shall say that such a trial is a pure selection trial if its primary purpose is to select
(or identify or rank) treatments with goals as stated above with no further intention of
making statements of statistical significance. Indeed, there is no concern about the
type 1 error rate because one simply cannot commit a type 1 error when one doesn’t
declare efficacy differences statistically significant at the 0.05 level! To the contrary,
in pure selection designs we are singularly uninterested in the null hypothesis of no
difference between treatments. What we care about is making correct selections (or,
more generally, acceptable selections – see below) with high probability when there
are clinically meaningful and worthwhile differences even if we cannot “prove” that
they are so. But if there are no meaningful or worthwhile efficacy differences
among any of the candidates, then we are generally indifferent to which treatment
(or treatments) is (or are) selected, other things being equal such as side effects,
tolerability, or costs. This is the so-called indifference zone approach of
Robert Bechhofer (1954). We will have more to say about this approach in section
“Considerations for Designing Randomized Selection Trials.”
At this point the reader might be wondering how one could dare to conduct a clinical experiment without some sort of tight control over type 1 error. It seems an unfortunate tendency in some quarters to reflexively identify “clinical research” as synonymous with “testing a null hypothesis at the 0.05 level of significance.” There
are good reasons regulatory agencies or journal editors insist on such a definition.
However, not every important question can or should be addressed this way. For
example, when kids in the schoolyard are choosing up sides for a baseball game, the
team captains are surely not interested in testing the null hypothesis that all the kids
have the same talent, controlling the type 1 error rate. To the contrary, the captains
want to select the kids best able to help their teams based on observations of their
performance. Similarly for choosing a winning horse at the race track or a profitable
portfolio for investment. As a further example, during the Ebola outbreak of 2014 in
West Africa, some argued (cogently, we believe) that the regulatory dictum – “we
must use a control group to test the hypothesis of no efficacy while controlling the
type 1 error rate, end of debate” – was misguided in the sense that the regulatory
mandate was perhaps not the most pressing or important question to settle at that
very moment. Rather, a pure randomized selection design of active treatments could
have been implemented rapidly to select the best candidate treatment option or
options, followed by careful rollout and watchful observation to see if patients
stopped dying. Best available supportive care, which in West Africa was no better
than no treatment at all given the resource poor environment, would not have been
required. It would seem the selection goal was urgent and important enough to set
aside the agnostic need to control the risk of a type 1 error. Even if the optimal
standard of care were available in West Africa and had some efficacy, more rapid
results might have been reached by including it in a pure randomized selection
design than to insist on its use to demonstrate another treatment’s significant
improvement over it. See Caplan et al. (2015a, b) for further discussion.
We continue the taxonomy of randomized selection trials in the following
sections. In section “Considerations for Designing Randomized Selection Trials”
some important considerations for the design of selection trials are introduced. In
section “Designs,” we discuss some specific pure selection designs that have been
proposed in the literature. In section “Other Designs,” we briefly mention some other
designs such as when selection procedures are used as a preliminary step for a
randomized controlled trial or when they are used to formally test whether better-
than-placebo treatments exist and if so, to select them. For simplicity in this chapter
we shall focus exclusively on the case of binary outcomes as the clinical endpoint of
interest, such as tumor response, and we shall use “response” or “success” synon-
ymously. A selection design for time to event outcomes is briefly mentioned in
section “Other Designs.” In section “Applications of Randomized Selection
Designs,” we present two examples of actual selection trials to illustrate some
practical implementation considerations, one using the prescreening selection
approach and the other using an adaptive Levin-Robbins-Leu procedure, both discussed in section “Designs.”
We conclude with a brief discussion in section “Discussion and Conclusion.”
Considerations for Designing Randomized Selection Trials

Approaches to the Subset Selection Problem

One classical approach to the subset selection problem is the Gupta subset selection approach (Gupta 1956, 1965), which selects a subset of treatments using a
fixed, prespecified sample size, the goal of which is to capture the one best treatment
in the subset with a prespecified high probability, no matter what the true response
probabilities are. To guarantee that, the size of the subset necessarily has to vary
randomly according to the observed data. For example, if there are only small
differences between success probabilities, all c treatments might have to be
“selected.” Clearly, the only role played by selection of subsets of size one or greater
in the Gupta approach is to assure capture of the best treatment among those selected
with high probability. More generally, to assure capture of b best populations with
high probability, subsets of size b or greater would have to be selected, with size
varying according to the observed data.
We shall not pursue the Gupta subset selection approach any further, referring the
reader instead to the book by Bechhofer et al. (1995), because in the sequel we shall
mean something very different by the term “subset selection.” Henceforth, by subset
selection we shall mean any procedure whose goal is explicitly to select subsets of b
best treatments, when b is fixed in advance. In section “Random Subset Size
Selection with LRL Procedures” we also briefly consider subset selection procedures
that identify subsets of random size, but still the goal will be to select best subsets of
treatments, albeit of varying size. Such subset selection procedures are called
random subset size procedures.
Acceptable Subset Selection

The most familiar goal for a pure randomized selection trial in the indifference zone
approach is to correctly identify the best treatment with high probability if a minimal
clinically meaningful and worthwhile difference exists between the best and the
second-best treatment. However, if the success probabilities for the several best
treatments are close to one another, assuring a high probability of correct selection is
neither meaningful nor possible. When several treatments are close to best in
efficacy, we may be indifferent as to which is selected, but we should still want a
high probability of selecting one of those near-best treatments if not technically the
best. This leads to the general notion of acceptable subset selection which offers
another resolution to the dilemma posed by ignorance of the true parameter values
and which, unlike Gupta’s approach, stays within the indifference zone approach.
We specify these ideas in some more detail next.
Precisely because we generally don’t know whether the treatment response
probabilities fall in the preference or indifference zone, we will want to know
whether or not a procedure will select, if not best subsets, then “acceptable” subsets,
with a prespecified high probability irrespective of the true population parameters.
We shall refer to such a property as acceptable subset selection, where the phrase,
“with pre-specified probability irrespective of the true population parameters”
should be understood. The following notions were introduced by Bechhofer et al.
(1968, p. 337, hereinafter BKS) and elaborated upon by Leu and Levin (2008b).
For any given success odds vector w = (w1, . . ., wc), where wj = pj/(1 − pj), and design odds ratio θ ≥ 1, we define certain integers s and t given by functions of w and θ, say s = s(w, θ) and t = t(w, θ), such that

w_{s+1} < θ·w_{b+1} ≤ w_s

and

w_{t+1} ≤ w_b/θ < w_t.
Fixed Versus Random Subset Sizes

Sometimes the subset size is dictated by practical considerations: for example, there may be several candidate treatments but the research budget allows for selecting two but not more. Or we may want to screen for three promising candidates in a search process, neither more nor fewer. Even though selection procedures most frequently aim to identify a single best candidate, that design specification should still be justified.
Often, however, it will not be clear to investigators how best to specify a desired
subset size prior to the experiment. For practical reasons we may wish to constrain
the size of the selected subset prior to the experiment. For example, budgetary
constraints may force us to select at most a given number of treatments, but we
may be content to select fewer than that number. Or, evidence from pilot work may
suggest that at least another given number of promising treatments may exist among
the c candidates, in which case we may wish to identify at least that number of good
treatments. Such practical needs of clinical research call for random subset size
selection procedures. Random subset size selection procedures should have the
acceptable subset selection property.
Designs
The Simon et al. (1985) (hereinafter “SWE”) design is a classical fixed sample size procedure for selecting a best treatment, of the kind discussed in Gibbons et al. (1977). We also discuss an adaptation of the SWE design following a collection of
parallel, single-arm, Simon two-stage designs used for prescreening the candidate
treatments. The prescreening allows incorporating historical control information to
help assure that the selected treatments are better than the current standard of care
(Liu et al. 2006). Another extension of the SWE design by Steinberg and Venzon
(2002) allows for interim stopping and early termination by requiring a prespecified
difference between treatment arms in order to select the most promising treatment at
interim looks. The aforementioned designs select one treatment given a fixed sample
size. Finally, we discuss the Levin-Robbins-Leu (LRL) family of sequential
subset selection procedures, which aim to select subsets of one or more treatments.
Extensions allow the size of the selected subset to vary, possibly with prespecified
constraints on the (random) size of the selected subsets for practical purposes. Both
the fixed- and random-subset size LRL procedures offer acceptable subset selection
with high probability, irrespective of true treatment efficacies, while allowing
sequentially adaptive elimination and recruitment of treatments.
The Simon et al. (1985) Fixed Sample Size Procedure (SWE)

Suppose we are interested in identifying the one best treatment (b = 1), i.e., the
treatment with highest success probability, from the c candidates. The SWE proce-
dure draws a fixed sample of size n for each of the c treatment arms, then selects the
treatment with the largest observed number of responses, or equivalently largest
proportion of responses. If there are ties among the largest, select one randomly as
the best (or, more realistically, select one on practical grounds such as side effects,
cost, etc.). Simon et al. (1985) provide formulas to calculate the probability of
correct selection (PCS) under the least favorable configuration, that is, the proba-
bility of correctly identifying the best treatment if a clinically meaningful difference
of Δ exists between the response probability of the best treatment (say p + Δ) and the
response probabilities of the remaining c–1 treatments (each assumed equal to p).
This configuration is called “least favorable” because any set of response probabil-
ities wherein the best differs from the others by Δ or more will have a PCS no smaller
than that for the least favorable configuration. For given c, p, and Δ, the sample size n
per arm may then be determined using the formulas by trial and error to achieve the
prespecified level of PCS. Table 3 of the SWE paper provides explicit sample sizes
per arm in the special case of an absolute difference in response rates of Δ = 0.15 for p ranging from 0.2 to 0.8 in increments of 0.1 with c = 2, 3, or 4 treatments that
achieve a 90% PCS under the least favorable configuration. Alternatively, tables
provided in Gibbons et al. (1977) may be used.
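A Monte Carlo sketch of the PCS under the least favorable configuration follows; the SWE paper gives exact formulas, so this is only a simulation check, and the arm count, p, and Δ below are illustrative:

pcs_sim <- function(n, arms = 3, p = 0.3, Delta = 0.15, reps = 1e5) {
  best   <- rbinom(reps, n, p + Delta)                  # the truly best arm
  others <- matrix(rbinom(reps * (arms - 1), n, p), nrow = reps)
  maxo   <- apply(others, 1, max)
  ntied  <- rowSums(others == best)                     # competitors tied with best
  mean((best > maxo) + (best == maxo) / (ntied + 1))    # random tie-breaking
}
pcs_sim(n = 50)  # increase n until the simulated PCS reaches, e.g., 0.90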
Prescreening Using Simon's Two-Stage Design

Liu et al. (2006) describe an adaptation of the SWE procedure following pre-
screening of several candidate treatments using Simon’s two-stage design (Simon
1989). Patients are randomized to the c candidate treatments of interest and parallel
single-arm phase II trials are conducted using Simon’s two-stage design to screen out
nonpromising treatments. We then apply the SWE selection rule among the c′ treatments that passed the prescreening step. Note that the sample size for this
adaptation is based on the sample size calculation for the original Simon two-stage
trials and may not be adequate to guarantee a high probability of correct selection. Of
course, the number of treatments that will pass the prescreening step is not known in
advance and this complicates the calculation of the overall probability of correct
selection. One feature is clear though, which is that the overall PCS for the Liu et al.
proposal must be smaller than the probability of declaring a treatment promising in
the Simon two-stage design under the design alternative. This is because not only
must the best treatment pass the Simon prescreening, but in case other less effica-
cious treatments do too, the best must have more responses than the competitors. We
illustrate this feature in the example given in section “Applications of Randomized Selection Designs.”
The Steinberg and Venzon (2002) Design (SV)

The design by Steinberg and Venzon (2002) is an extension of the SWE approach. It
differs by allowing an interim look at the data to perform an early selection and
terminate the trial. The maximum sample size for the design is calculated using the
same method as the SWE procedure. In addition, let d be a prespecified integer such
that if the difference in the number of successes between the apparently best and
second-best treatment arm is at least d at the interim look, the procedure stops,
whereupon the leading treatment is selected as the best. Otherwise, the trial con-
tinues to complete its planned accrual. If there is no early stopping, the SWE
selection rule is applied. Steinberg and Venzon describe how to determine d to
limit the probability of making an incorrect selection in a two-arm trial. They also
provide a table with values of d that limit the probability of making an incorrect
selection to 0.5%, 1%, 5%, or 10%, assuming a difference of 0.10 or 0.15 in the
probability of response between the two treatment arms. These authors also propose
their interim stopping rule for use in the context of parallel Simon two-stage screening
trials, in effect allowing for early stopping in the design discussed in section “Pre-
screening Using Simon’s Two-Stage Design.” Early stopping takes place if the best
response tally exceeds the Steinberg and Venzon criterion d among the subset of
treatments that have passed the first stage of screening. If the criterion is not met,
enrollment continues to complete the second stage with the passing subset and a final
selection takes place as in section “Prescreening Using Simon’s Two-Stage Design.”
The Levin-Robbins-Leu Family of Sequential Subset Selection Procedures

LRL Procedure N
The sampling proceeds vector-at-a-time, meaning that patients are assigned randomly to each of the c treatments in blocks of size c. If desired, the blocking on c patients can also incorporate matching on prognostic factors. Let r_j^(n) denote the number of responses on treatment j after n rounds of c-tuplets have been randomized, for j = 1, . . ., c. We call the r_j^(n) “response tallies.” In addition, let r_[1]^[n] ≥ r_[2]^[n] ≥ · · · ≥ r_[c]^[n] denote the ordered response tallies, where the subscripts refer to ordering the number of responses from the greatest ( j = 1) to the least ( j = c). Now let d be a positive integer chosen in advance. Procedure N stops the first time that the bth largest response tally leads the (b + 1)st largest by d, that is, when r_[b]^[n] − r_[b+1]^[n] ≥ d, whereupon the b treatments with the largest tallies are selected.
For procedure N, the probability of correct selection (cs) satisfies the lower bound

P_w[cs] ≥ (w_1 · · · w_b)^d / Σ_(b) w_(b)^d.
The sum in the denominator is over all possible b-tuples, which we denote generically as (b) = (i_1, . . ., i_b) with integers 1 ≤ i_1 < · · · < i_b ≤ c, and where the summands use the convenient notation w_(b)^d = w_{i_1}^d · · · w_{i_b}^d. The above lower bound allows us to choose d in designing selection experiments as follows. It is easy to see that the right-hand side of the above inequality is minimized, for any w in the preference zone Pref(b, c, θ) = {w : w_b/w_{b+1} ≥ θ}, at w = w·(θ, . . ., θ, 1, . . ., 1) for any positive constant w. Then
P_w{correct selection} ≥ θ^(db) / Σ_{i=0}^{b∧(c−b)} C(b, i) C(c−b, i) θ^(d(b−i)),

where C(·, ·) denotes a binomial coefficient and b ∧ (c − b) = min(b, c − b).
It follows that we can choose a value of d depending only on b, c, θ, and P*, say d = d(b, c, θ, P*), such that the right-hand side of the inequality is at least P* for any given P* satisfying 1/C(c, b) < P* < 1. We exclude the trivial case P* = 1/C(c, b), since no formal procedure is needed to achieve that lax goal.
Levin and Leu (2013) demonstrate that the lower bound formula holds in fact for each adaptive member of the LRL family of sequential subset selection procedures for any number of treatments c ≤ 7, and numerical evidence strongly supports the conjecture that the inequality holds in complete generality, as it does for the nonadaptive procedure N.
These procedures select acceptable subsets with probability at least P* for any and all true success probabilities. Thus we can say that the LRL family of subset selection procedures has the acceptable subset selection property mentioned in section “Acceptable Subset Selection.” See Leu and Levin (2008b) and Levin and Leu (2016) for further details.
Other operating characteristics of LRL subset selection procedures, such as the expected number of rounds (number of vectors) E_w[N], the expected total number of patients (sample size), the expected number of failures, etc., are typically obtained via simulation.
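For instance, a simulation sketch of a b = 1 procedure of this type, which stops the first time the leading tally exceeds the runner-up by d (the success probabilities and d below are illustrative):

sim_lrl <- function(p = c(0.65, 0.65, 0.40), d = 4, max_rounds = 200) {
  tally <- numeric(length(p))
  for (round in seq_len(max_rounds)) {
    tally <- tally + rbinom(length(p), 1, p)      # one c-tuple per round
    srt <- sort(tally, decreasing = TRUE)
    if (srt[1] - srt[2] >= d)                     # leader ahead by d: stop
      return(c(selected = which.max(tally), rounds = round))
  }
  c(selected = which.max(tally), rounds = max_rounds)  # truncated
}
sims <- replicate(1e4, sim_lrl())
table(sims["selected", ]) / 1e4   # empirical selection probabilities
mean(sims["rounds", ])            # expected number of rounds (vectors)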
Other Designs
In preliminary selection designs, an early selection stage feeds into a larger randomized trial whose primary purpose is to test the null hypothesis of no efficacy (even with the preferred treatment from the selection stage) with traditional control of the type 1 error rate, whereas the early-stage selection feature is of secondary interest, used for the sake of seamless efficiency leading into the larger study. Because such preliminary selection typically introduces some
degree of selection bias in the final evaluation of all the data, special statistical
adjustments must be made to account for that bias. For example, if the null hypoth-
esis were true, one would be capitalizing purely on chance by selecting the treatment
with the apparently best performance, thereby introducing a selection bias when
comparing that treatment to a placebo control. We shall not discuss preliminary
selection designs further here except to note some examples of such methods: see
Stallard and Todd (2003), Stallard and Friede (2008), Levy et al. (2006), Kaufmann
et al. (2009), Levin et al. (2011), and the various methods used in adaptive trial
design (see, e.g., Coffey et al. 2012, and references cited therein).
A sequential selection design featuring a response-adaptive randomized play-
the-winner rule with sequential elimination of inferior treatments was studied by
Coad and Ivanova (2005). They show a desirable savings in total sample size
compared with fixed sample size procedures together with a palpable increase in
the proportion of patients allocated to the superior treatment. They demonstrate
the practical benefits of their procedure using a three-treatment lung cancer study
and argue that these benefits extend also to dose-finding studies. The same
authors also studied the selection bias involved in the maximum likelihood
estimators of the success probabilities after the trial stops, using a key identity
between the bias and the expected reciprocal stopping time; see Coad and
Ivanova (2001).
A novel design called the comparative selection design was introduced by Leu
et al. (2011) which combines features of a pure selection design with a hypothesis
test. One supposes there are one or more active candidate treatments and one or more
placebo or other control arms (such as an attention control group or best available
standard of care). The primary goal is to test the null hypothesis that there does not
exist a better-than-placebo (BTP) subset of active treatments against the alternative
that there does exist a BTP subset of active treatments; and, if the null hypothesis is
rejected in favor of the alternative, to select one such BTP subset of active treat-
ments. The type 1 error rate may be controlled at conventional levels of statistical
significance and the probability of correctly selecting a BTP subset of active
treatments can be made arbitrarily high given a sufficiently wide separation between
the efficacy of the BTP active treatments and that of the placebo or other control
arms. We refer the reader to Leu et al. (2011) for further details of the comparative
selection design.
While we have focused exclusively on randomized selection designs with binary
outcomes in this chapter, a selection design has also been proposed for time-to-event
outcomes. We refer the reader to Liu et al. (1993) for further details and to Herbst
et al. (2010) for an application of the design to evaluate concurrent chemotherapy
with Cetuximab versus chemotherapy followed by Cetuximab in patients with lung
cancer.
Applications of Randomized Selection Designs

Though randomized selection designs have much to offer, they have seldom been
applied in practice. Here we provide two examples of trials using such designs.
The first example is a recently published paper by Lustberg et al. (2010). The
study evaluated two schedules of the combination of mitomycin and irinotecan in
patients with esophageal and gastroesophageal adenocarcinoma with the goal of
selecting the most promising schedule. Patients were randomized (1:1) to either
6 mg/m2 mitomycin C on day 1 and 125 mg/m2 irinotecan on days 2 and 9 or 3 mg/
m2 mitomycin C on days 1 and 8 and 125 mg/m2 irinotecan on days 2 and 9. The
Simon two-stage design was used to prescreen the treatments with the primary
outcome being response to treatment. Each treatment arm was designed to detect a
20% difference with an alpha of 0.10 and a beta of 0.10 assuming response rates of
30% under the null and 50% under the alternative hypothesis. This required
enrolling 28 patients in the first stage of the Simon two-stage design. If 8 or more
responded then an additional 11 patients would be enrolled for a total of 39 patients
per arm. Treatment(s) with 16 or more responders would be considered worthy of
further evaluation and the SWE procedure would be used to select the most prom-
ising one.
Prescreening with Simon’s two-stage design followed by an application of the
SWE procedure can be considered as an overall selection procedure. Though the
probability of passing the screening step with Simon’s two-stage procedure is 0.900
in the example at the design alternative of a 50% response rate, the overall PCS is
only 0.888 for c = 2 treatments assuming the inferior treatment has a response rate of
30%. This is because there is a non-negligible chance the inferior treatment will pass
the Simon prescreen (with probability 0.094) followed by a small chance that its
response tally will actually exceed (or equal) that of the better treatment, leading to
an incorrect selection (or a 50% chance of an incorrect selection in the case of a tie).
As the response rate of the inferior treatment approaches that of the superior
treatment, the PCS decreases. For example, with a 35% response rate, the overall
PCS is 0.855 and with a 40% response rate, the overall PCS falls to 0.780.
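These figures can be checked by simulation under the per-arm design described above (continue past stage 1 if at least 8 of 28 respond; pass the prescreen with at least 16 of 39); the sketch below implements that check:

draw_arm <- function(p) {
  s1 <- rbinom(1, 28, p)
  if (s1 < 8) return(c(pass = 0, resp = s1))   # stopped at stage 1
  tot <- s1 + rbinom(1, 11, p)
  c(pass = as.numeric(tot >= 16), resp = tot)
}
one_trial <- function(p_best = 0.5, p_inf = 0.3) {
  b <- draw_arm(p_best); i <- draw_arm(p_inf)
  if (!b["pass"]) return(0)        # best arm screened out: incorrect
  if (!i["pass"]) return(1)        # only the best arm passes: correct
  if (b["resp"] > i["resp"]) 1 else if (b["resp"] == i["resp"]) 0.5 else 0
}
mean(replicate(1e5, one_trial())) # approximately 0.888, as quoted above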
In the conduct of the actual study, only 6 mg/m2 mitomycin C on day 1 and
125 mg/m2 irinotecan on days 2 and 9 passed the screening and was considered worthy of further evaluation. Thus, selection of the most promising treatment was
not needed.
The second example is a randomized selection trial the present authors designed
using the Levin-Robbins-Leu selection procedures. This study is currently enrolling
patients at Columbia University Irving Medical Center to evaluate three types of
garments (gloves and socks) for the prevention of taxane-induced peripheral neu-
ropathy, with the goal of selecting the most promising intervention and evaluating it
in a large randomized controlled clinical trial. Breast-cancer patients are being
randomized in triplets to cryotherapy (cold garments), compression therapy (tight-
fitting garments), or placebo (loose garments) with stratification for the chemother-
apy schedule. Previous smaller studies have shown that both cryotherapy and
compression may be efficacious at preventing taxane-induced peripheral neuropathy.
Table 1 Operating characteristics of the selection procedure for peripheral neuropathy prevention

            Scenario 1          Scenario 2          Scenario 3          Scenario 4          Scenario 5
            (0.79, 0.65, 0.65)  (0.79, 0.65, 0.40)  (0.75, 0.65, 0.40)  (0.65, 0.65, 0.40)  (0.65, 0.65, 0.65)
P[cs]       0.838               0.912               0.831               0.498               0.336
P[as]       0.838               0.912               0.999               0.997               1.00
P[N = 45]   0.134               0.272               0.214               0.153               0.075
P[trunc]    0.303               0.163               0.231               0.324               0.489
P[N < 60]   0.322               0.514               0.435               0.343               0.201
P[N < 80]   0.534               0.719               0.638               0.537               0.362
Mean N      75.2                65.7                69.8                74.8                83.1
Median N    76                  59                  65                  75                  99

P[cs] is the probability of correct selection overall
P[as] is the probability of an acceptable selection overall
P[N = 45] is the probability of reaching a decision after exactly 45 patients have been randomized
P[trunc] is the probability that the trial will be truncated before the second elimination time
P[N < 60] is the probability that the total number of patients will be less than 60 at stopping
P[N < 80] is the probability that the total number of patients will be less than 80 at stopping
Mean N is the mean of the distribution of the (random) total number of patients
Median N is the median of the distribution of the total number of patients
The operating characteristics of the design shown in Table 1 were evaluated by simulation studies using 100,000 replications per scenario. The characteris-
tics evaluated were the PCS, the probability of an acceptable selection, the
probability of stopping at the first look with N ¼ 45 patients, the probability of
truncation at N ¼ 100 patients, the probability of the trial concluding with a sample
size below 60 or below 80 (to assess accrual feasibility), and the mean and median
sample size.
At the end of the trial the sample proportion of patients with a change in FACT
NTX < 5 from baseline to week 12 will be reported for each arm. Additionally, the
likelihood of the response tallies for the intervention selected together with the first
runner-up will be calculated. This likelihood is given by
L(p_i, p_j | r_i^(n), r_j^(n)) = p_i^(r_i^(n)) (1 − p_i)^(n − r_i^(n)) p_j^(r_j^(n)) (1 − p_j)^(n − r_j^(n)),

where r_i^(n) and r_j^(n) are the observed tallies for the selected and first runner-up interventions and where p_i and p_j are the respective true response probabilities. The likelihood of the observed response tallies will also be calculated under the assumption that we erred in our selection and that the true probabilities are those for the two interventions transposed, namely, L(p_j, p_i | r_i^(n), r_j^(n)). The likelihood ratio, or LR, is
the ratio of these two likelihoods. It can be shown that LR equals the true odds ratio raised to the fourth power,

LR = L(p_i, p_j | r_i^(n), r_j^(n)) / L(p_j, p_i | r_i^(n), r_j^(n)) = { [p_i/(1 − p_i)] / [p_j/(1 − p_j)] }^4,
in the case where the trial ends meeting the selection criterion. In the case of truncation, the exponent 4 is replaced by r_i^(n) − r_j^(n). LR will be evaluated at the adjusted sample proportions for p_i and p_j, namely, (r_i^(n) + 0.5)/(n + 1) and (r_j^(n) + 0.5)/(n + 1), respectively.
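A small helper illustrating this calculation with hypothetical tallies:

lr_weight <- function(ri, rj, n, expo = 4) {   # expo = ri - rj under truncation
  pi_hat <- (ri + 0.5) / (n + 1)               # adjusted sample proportions
  pj_hat <- (rj + 0.5) / (n + 1)
  ((pi_hat / (1 - pi_hat)) / (pj_hat / (1 - pj_hat)))^expo
}
lr_weight(ri = 20, rj = 16, n = 30)            # e.g., tallies 20 vs 16 after n = 30 rounds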
The likelihood ratio is an important measure of the weight of evidence in favor of
a correct selection after the trial concludes (see, e.g., Royall 1997 and 2000). For
example, it indicates strong evidence of correct selection if LR > 10 or only weak
evidence if it is near 1, and if the placebo arm should actually be the selected
intervention arm, there would presumably be either weak or even strong evidence
against either active intervention being the best. Thus the LR will play a crucial role
in deciding whether or not to mount a subsequent phase III trial.
Discussion and Conclusion

We began this chapter with the somewhat provocative assertion that not every
clinical research problem can – nor should – be addressed with a study design that
tests a null hypothesis of no treatment differences while controlling the type 1 error
rate to conventional levels such as 0.01, 0.05, or 0.10. Selection problems, addressed
by randomized selection trials, are prime examples. While a pure selection design
can always be viewed as a multiple decision procedure that tests the null hypothesis
of no difference which rejects that hypothesis when one selects the best performing
treatment, this view misses the point, which is that when the goal is to select a best
treatment, one really doesn’t care about the null hypothesis. Indeed, the reason pure
selection designs can achieve good PCS (a.k.a. “power” in the hypothesis test context) with smaller sample sizes than conventional phase III designs is exactly because selection trials control the type 1 error rate only at level α = 1/c (or 1/C(c, b) for selecting subsets of size b). Pure selection trials are precisely the right tool for the
job when a choice between competing alternatives must be made with no negative
consequences if the treatments are all of equal efficacy.
Some authors have raised concerns with the use of randomized selection designs
in clinical research apart from the hypothesis testing issue; see, e.g., Rubinstein et al.
(2009), echoed by Green et al. (2016). Referring to the original SWE design,
Rubinstein et al. (2009, p.1886) write,
The weakness in the original design is that it does not assure that the (sometimes nominally)
superior experimental regimen is superior to standard therapy. It was occasionally argued
that an ineffective experimental regimen could act as a control arm for the other regimen, but
the design was not constructed to be used in this way, since, as designed, one of the two
experimental regimens would always be chosen to go forward, even if neither was superior
to standard treatment.
To address this concern the authors suggest prescreening the candidates with Simon's
two-stage design, citing Liu et al. (2006). The concern presumes, of course, that
standard therapy is not one of the treatments considered for selection. Apart from
potential problems with the rate of accrual, there is no intrinsic reason why standard
treatment cannot be included among those to be studied in a selection trial, and this
option should be carefully considered in the planning stages of the trial. The concern
largely evaporates when standard treatments are included.
Whether or not standard treatment is included among the candidates, the stated
concern does raise an important question: If one is to give up on statements of
statistical significance in the selection paradigm, what then can be said about the
quality of the selected treatment(s) based on the accumulated data, a question that will
inevitably be asked once the trial ends? We believe that an assessment of the weight of
evidence using likelihood ratio methodology is perhaps the most appropriate answer,
such as was illustrated in the peripheral neuropathy selection trial discussed in section
“Applications of Randomized Selection Designs.” This approach is not only reason-
able insofar as it addresses the right question – How strong is the evidence in favor of
having selected the truly best treatment? – it also accords with the current trend away
from too-heavy reliance on null hypothesis significance testing and p-values. Weight-
of-evidence considerations are especially germane if standard treatment is included
among the candidate treatments, but even if not, likelihood ratios against historical
control parameters can be quite illuminating.
Such weight-of-evidence considerations complement the more traditional
frequentist responses to the concern: (a) one has high confidence P* that the
selected treatments are acceptable because the procedure has that operating
characteristic (while acknowledging that this does not pertain to any
particular trial result); and (b) a descriptive review of the maximum likelihood
estimates of the treatment response probabilities, possibly together with an estimate
of the selection bias adhering thereto following the methods of Coad and Ivanova
(2001), can be revealing.
Nevertheless, we suspect some researchers will be unable to overcome the reflex
to test some hypothesis, in which case the preliminary selection design or the
comparative selection trial mentioned in section “Other Designs” may hold appeal.
With the former design, a confirmatory phase III trial ultimately assesses the efficacy
of the selected treatment against a control treatment. With the latter design, given
several active treatments and possibly several control treatments, it seems quite
natural to test whether there is a better-than-placebo active treatment (or a subset
of them) and even more reasonable to then wish to correctly identify it (or them) with
a high probability of correct (or acceptable) selection.
Finally, in this chapter we have focused on the LRL family of sequential subset
selection procedures because we believe its flexibility, adaptive features, and acceptable
subset selection property make it an attractive option for randomized selection trials.
References
Bechhofer RE (1954) A single-sample multiple decision procedure for ranking means of normal
populations with known variances. Ann Math Stat 25:16–39
Bechhofer RE, Kiefer J, Sobel M (1968) Sequential identification and ranking procedures. University
of Chicago Press, Chicago
Bechhofer RE, Santner TJ, Goldsman DM (1995) Design and analysis of experiments for statistical
selection, screening, and multiple comparisons. Wiley, New York
Caplan A, Plunkett C, Levin B (2015a) Selecting the right tool for the job (invited paper). Am J
Bioeth 15(4):4–10. (with open peer commentaries, pp. 33-50)
Caplan A, Plunkett C, Levin B (2015b) The perfect must not overwhelm the good: response to open
peer commentaries on “selecting the right tool for the job”. Am J Bioeth 15(4):W8–W10
Coad DS, Ivanova A (2001) Bias calculations for adaptive urn designs. Seq Anal 20:91–116
Coad DS, Ivanova A (2005) Sequential urn designs with elimination for comparing K ≥ 3
treatments. Stat Med 24:1995–2009
Coffey CS, Levin B, Clark C, Timmerman C, Wittes J, Gilbert P, Harris S (2012) Overview, hurdles,
and future work in adaptive designs: perspectives from an NIH-funded workshop. Clin Trials 9
(6):671–680
Gibbons JD, Olkin I, Sobel M (1977) Selecting and ordering populations: a new statistical
methodology. Wiley, New York
Green S, Benedetti J, Smith A, Crowley J (2016) Clinical trials in oncology, 3rd edn. Chapman and
Hall/CRC Press, Boca Raton
Gupta SS (1956) On a decision rule for a problem in ranking means, mimeograph series 150,
Institute of Statistics. University of North Carolina, Chapel Hill
Gupta SS (1965) On some multiple decision (selection and ranking) rules. Technometrics 7:225–
245
Herbst RS, Kelly K, Chansky K, Mack PC, Franklin WA, Hirsch FR, Atkins JN, Dakhil SR, Albain
KS, Kim ES, Redman M, Crowley JJ, Gandara DR (2010) Phase II selection design trial of
concurrent chemotherapy and cetuximab versus chemotherapy followed by cetuximab in
advanced-stage non-small-cell lung cancer: Southwest Oncology Group study S0342. J Clin
Oncol 28(31):4747–4754
Kaufmann P, Thompson JLP, Levy G, Buchsbaum R, Shefner J, Krivickas LS, Katz J, Rollins Y,
Barohn RJ, Jackson CE, Tiryaki E, Lomen-Hoerth C, Armon C, Tandan R, Rudnicki SA,
Rezania K, Sufit R, Pestronk A, Novella SP, Heiman-Patterson T, Kasarskis EJ, Pioro EP,
Montes J, Arbing R, Vecchio D, Barsdorf A, Mitsumoto H, Levin B, for the QALS Study Group
(2009) Phase II trial of CoQ10 for ALS finds insufficient evidence to justify phase III. Ann
Neurol 66:235–244
Leu C-S, Levin B (2008a) A generalization of the Levin-Robbins procedure for binomial subset
selection and recruitment problems. Stat Sin 18:203–218
Leu C-S, Levin B (2008b) On a conjecture of Bechhofer, Kiefer, and Sobel for the Levin-Robbins-
Leu binomial subset selection procedures. Seq Anal 27:106–125
Leu C-S, Levin B (2017) Adaptive sequential selection procedures with random subset sizes. Seq
Anal 36(3):384–396
Leu C-S, Cheung Y-K, Levin B (2011) Chapter 15, Subset selection in comparative selection trials.
In Bhattacharjee M, Dhar SK, Subramanian S (eds) Recent advances in biostatistics: false
discovery, survival analysis, and other topics. Series in biostatistics 4:271–288. World Scientific
Levin B, Leu C-S (2013) On an inequality that implies the lower bound formula for the probability
of correct selection in the Levin-Robbins-Leu family of sequential binomial subset selection
procedures. Seq Anal 32(4):404–427
Levin B, Leu C-S (2016) On lattice event probabilities for Levin-Robbins-Leu subset selection
procedures. Seq Anal 35(3):370–386
Levin B, Thompson JLP, Chakraborty B, Levy G, MacArthur RB, Haley EC (2011) Statistical
aspects of the TNK-S2B trial of Tenecteplase versus Alteplase in acute ischemic stroke: an
efficient, dose-adaptive, seamless phase II/III design. Clin Trials 8:398–407
Levy G, Kaufmann P, Buchsbaum R, Montes J, Barsdorf A, Arbing R, Battista V, Zhou X,
Mitsumoto H, Levin B, Thompson JLP (2006) A two-stage design for a phase II clinical trial
of coenzyme Q10 in ALS. Neurology 66:660–663
Liu PY, Moon J, LeBlanc M (2006) Phase II selection designs. In: Crowly J, Ankerst DP (eds)
Handbook of statistics in clinical oncology, 2nd edn. Chapman and Hall/CRC, Boca Raton, pp
155–164
Liu PY, Dahlberg S, Crowley J (1993) Selection designs for pilot studies based on survival.
Biometrics 49:391–398
Lustberg MB, Bekaii-Saab T, Young D et al (2010) Phase II randomized study of two regimens of
sequentially administered mitomycin C and irinotecan in patients with unresectable esophageal
and gastroesophageal adenocarcinoma. J Thorac Oncol 5:713–718
Royall R (1997) Statistical evidence: a likelihood paradigm. Chapman and Hall, London
Royall R (2000) On the probability of observing misleading statistical evidence. J Am Statist Assoc
95(451):760–768
Rubinstein L, Crowley J, Ivy P, LeBlanc M, Sargent D (2009) Randomized phase II designs. Clin
Cancer Res 15(6):1883–1890
Simon R (1989) Optimal two-stage designs for phase II clinical trials. Control Clin Trials 10:1–10
Simon R, Wittes RE, Ellenberg SE (1985) Randomized phase II clinical trials. Cancer Treat Rep
69:1375–1381
Stallard N, Friede T (2008) A group-sequential design for clinical trials with treatment selection.
Stat Med 27(29):6209–6227
Stallard N, Todd S (2003) Sequential designs for phase III clinical trials incorporating treatment
selection. Stat Med 22(5):689–703
Steinberg SE, Venzon DJ (2002) Early selection in a randomized phase II clinical trial. Stat Med
21:1711–1726
Futility Designs
58
Sharon D. Yeatts and Yuko Y. Palesch
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1068
Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1069
Superiority Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1070
Single-Arm Futility Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1071
Case Study: Creatine and Minocycline in Early Parkinson Disease . . . . . . . . . . . . . . . . . . . . . . . 1073
Concurrently Controlled Futility Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1074
Case Study: Deferoxamine in Intracerebral Hemorrhage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1075
Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1076
Sample Size Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1076
Interim Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1078
Protocol Adherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1079
Sequential Futility Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1080
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1080
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1081
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1081
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1081
Abstract
Limited resources require that interventions be evaluated for an efficacy signal in
Phase II prior to initiation of large and costly confirmatory Phase III clinical trials.
The standard concurrently controlled superiority design is not well-suited for
this evaluation. Because the Phase II superiority design is often underpowered to
detect clinically meaningful improvements, investigators are left to make
subjective decisions in the face of a nonsignificant test result. The futility design
reframes the statistical hypothesis in order to discard interventions which do not
demonstrate sufficient promise. The alternative hypothesis is that the effect is less
than some minimally worthwhile threshold. In this way, the trial can be
appropriately powered to evaluate whether the intervention is worth pursuing in
Phase III and thus provides a clear “no go” signal. We briefly describe the
superiority design in order to compare and contrast with the futility design. We
then describe both the single-arm and concurrently controlled futility designs and
present case studies of each. Lastly, we discuss some key considerations related to
sample size calculation and interim analysis.
Keywords
Phase II · Futility design · Single-arm futility design · Concurrently controlled
futility design · Calibration control
Introduction
Each phase of clinical testing has its own objectives, and the optimal trial design
should be tailored to the research question at hand. Phase I trials are typically
designed to identify the dose (or range of doses) which has desired properties,
usually related to safety, and there is a growing body of statistical literature describ-
ing various dose-finding/dose-ranging designs, such as the continuous reassessment
method (Garrett-Mayer 2006). In Phase II, the selected doses are evaluated for an
efficacy signal, in addition to further assessment of safety. Those with sufficient
promise then proceed to a confirmatory evaluation of efficacy in Phase III, for which
the randomized controlled clinical trial is generally regarded to be the gold standard.
Sacks et al. (2014) conducted a retrospective evaluation of marketing applications
submitted to the US Food and Drug Administration for new molecular entities and
found that only 50% were approved on first submission. Similarly, Hwang et al.
(2016) found that 54% of 640 novel therapeutics entering confirmatory testing
between 1998 and 2008 failed and the failure was related to efficacy in 57% of
these. This success rate appears to differ by therapeutic area, and the data is
conflicting. Hwang et al. (2016) report a failure rate of nearly 70% in cancer, whereas
Sacks et al. (2014) report a first round approval rate of 72% in oncology. Djulbegovic
et al. (2008) evaluated 624 Phase III clinical trials completed by the National Cancer
Institute cooperative groups between 1955 and 2000 and found that only 30% of
randomized comparisons were statistically significant and 29% were inconclusive
(defined as “equal chance that standard treatment better than experimental or vice
versa”). Among trials evaluating treatments for acute ischemic stroke, Kidwell et al.
(2001) reported that 23% were considered positive by the reporting authors, but only
3% yielded a positive response on a prespecified primary endpoint at the typical level
of significance of 0.05. Chen and Wang (2016) reviewed 430 drugs considered for
the treatment of stroke between 1995 and 2015 and found that 70% were
discontinued.
Given the disappointing performance of candidate treatments in Phase III clinical
trials, there is a need to better screen therapies prior to the implementation of
expensive Phase III clinical trials (Brown et al. 2011; Sacks et al. 2014; Levin 2015).
Clinical trial conduct requires extensive resources financially (for research personnel
efforts and infrastructure support) and in terms of patients with the condition of
interest. The available resources are clearly limited, and properly vetting interven-
tions in Phase II allows the finite resources available to be targeted toward
confirming efficacy in those with most promise.
The standard concurrently controlled Phase II trial design is often powered to
detect very large effect sizes in order to keep the sample size feasible, and hence, it
can be criticized as an underpowered Phase III trial (Levin 2015). Failure to find
significance may be the result of inadequate power at effect sizes which are still
clinically meaningful. The trial results, then, do not provide a clear “go/no go” signal
as to whether the intervention should move forward for confirmatory efficacy
testing. Consequently, even when the outcome analysis fails to achieve statistical
significance, the standard Phase II design assumes that the intervention will move
forward.
Rather than evaluating whether an intervention has sufficient promise, the futility
design seeks to discard an intervention which clearly lacks sufficient promise. The
statistical implication of this distinction is impactful – the futility design can be
appropriately powered to declare futility (and hence provide a clear “no go” signal)
when an intervention has little or no effect.
Background
The methodologic basis for the futility design stems from the field of cancer clinical
trials. To eliminate ineffective therapies from future development, a single-arm
clinical trial would be conducted in order to compare the resulting outcome to
some minimally acceptable level (Herson 1979).
In recent years, the futility design has received increased attention, particularly in
the field of neurology, as a mechanism to weed out interventions which are not
sufficiently promising. The IMS I Investigators (2004) adapted the futility design to
the acute ischemic stroke treatment with the single-arm futility trial evaluating the
effect of intravenous plus intra-arterial tPA. In 2005, Palesch et al. applied this
methodology to six past Phase III trials in ischemic stroke and found that the futility
design could have prevented three such trials for which the treatment was ultimately
determined to be ineffective. The NINDS NET-PD Investigators used the single-arm
futility design to test whether creatine or minocycline (2006), as well as Co-Q10 or
GPI-1485 (2007), warranted definitive confirmatory testing for Parkinson disease.
Kaufmann et al. (2009) conducted a concurrently controlled, adaptive, two-stage
selection and futility design to evaluate the promise of coenzyme Q10 in
amyotrophic lateral sclerosis (ALS).
As we have stated previously, the traditional concurrently controlled Phase II
clinical trial, designed to evaluate the efficacy signal in a test for superiority, can be
criticized as an underpowered Phase III trial. We first briefly review the superiority
setting in order to demonstrate this point. We then introduce the single-arm futility
design and discuss its advantages and disadvantages. Finally, we describe the
concurrently controlled futility design.
Superiority Setting
H0: πtx − πctrl = 0
HA: πtx − πctrl ≠ 0
A Type I error is the rejection of a true null hypothesis, which here means that the
treatment arms are declared different when in fact they are not. The commonly used
term “false positive” reflects both the statistical framework and the conclusion about
the intervention; the investigators declare a positive finding (a difference between the
treatments) when none exists. In the superiority setting, the level of significance,
which reflects our willingness to make this error, is typically set at 0.05. The
scientific community may be more or less willing to tolerate this error, depending
on its consequences. The type, safety profile, and cost of the intervention or other
considerations may factor into the general willingness to accept the conclusion that a
treatment is efficacious when it is not.
A Type II error is the failure to reject a false null hypothesis, which here means
that the statistical test fails to conclude a difference when in fact one exists. Again,
the commonly used term “false negative” reflects both the statistical framework and
the conclusion about the intervention. The investigators consider the trial to be
negative (are unable to declare a difference between the treatments) despite a
nonzero treatment effect. The willingness to accept such an error is typically set at
0.2 or less; in other words, the statistical power of the trial is set to 0.8 or greater.
Again, the scientific community may be more or less willing to tolerate this error,
depending on the same factors as for the Type I error.
In order to justify the criticism of a Phase II superiority design as an underpow-
ered Phase III trial, consider a hypothetical Phase II trial intended to evaluate the
efficacy signal associated with a new treatment for intracerebral hemorrhage. As the
binomial proportion has maximum variance at 0.5, we assume this to be the control
proportion to represent the worst case scenario. A concurrently controlled superiority
trial of 300 subjects, 150 in each arm, has 81% power to detect an improvement of 16
percentage points. In stroke, recent confirmatory trials have been designed to detect a
minimum clinically relevant difference of 10 percentage points; the design has only
about 40% power to detect an improvement of that size.
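These power figures can be checked with a standard normal-approximation calculation for two proportions. Below is a minimal sketch in Python; the unpooled form of the test statistic is an assumption, since the chapter does not specify the exact calculation used:

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p_ctrl, p_tx, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test of proportions
    (unpooled normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt((p_ctrl * (1 - p_ctrl) + p_tx * (1 - p_tx)) / n_per_arm)
    return NormalDist().cdf(abs(p_tx - p_ctrl) / se - z)

print(round(power_two_proportions(0.50, 0.66, 150), 2))  # ~0.81 for a 16-point effect
print(round(power_two_proportions(0.50, 0.60, 150), 2))  # ~0.42 for a 10-point effect
```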
Single-Arm Futility Design
In the single-arm futility design, all subjects enrolled are treated with the interven-
tion, in order to compare the outcome against a prespecified reference value. The
design is intended to establish whether the outcome on the intervention represents
less than some minimally clinically relevant improvement over the prespecified
reference value, which would lead us to declare the intervention futile. The alterna-
tive hypothesis, then, represents futility, and conversely, the null hypothesis assumes
that the intervention is not futile. Let πtx represent the true proportion of subjects with
good outcome on the intervention, and let π0 represent this clinically relevant
improvement over the reference; we now refer to this improvement as the futility
threshold. The statistical hypotheses are written as shown:
H0: πtx ≥ π0
HA: πtx < π0
A Type I error is still the rejection of a true null hypothesis; in the context of
futility, this means that the treatment response is declared to be less than the
threshold when in fact it is not. The commonly used term “false positive” here
reflects the statistical framework but does not well describe our conclusions about
the intervention; the investigators declare a negative finding (the intervention is
futile) when it is not. The prespecified level of significance should take into account
both the consequences of this error and the phase of the study. The consequence,
here, is that a useful intervention may be unnecessarily discarded. We want to
minimize the chance of abandoning effective therapies, certainly, but the community
may be more willing to tolerate a Type I error in the futility context than in the
superiority context, where the result of such an error is that patients are unnecessarily
exposed to an ineffective therapy. In addition, the sample size associated with Phase
II trials is expected to be relatively small, at least in comparison to the confirmatory
setting. Balancing these needs, a 0.10 level of significance has been suggested
(Tilley et al. 2006). Note that the alternative hypothesis is necessarily one-sided,
as we wish to discard only interventions for which the response is less than the
threshold, and the level of significance should be allocated as such.
A Type II error is still the failure to reject a false null hypothesis; in the context of
futility, this means that the statistical test fails to conclude that the treatment is futile
despite a treatment response which is less than the threshold. Again, the commonly
used term “false negative” reflects the statistical framework but not our conclusions
about the intervention; although the response is less than the specified threshold, the
intervention is not declared futile. The consequence is that an ineffective therapy will
be moved forward for a definitive efficacy evaluation. As our objective is to discard
ineffective therapies, we want to limit the chance of this error; however, because
additional testing is required before declaring the intervention efficacious, our
tolerance for the Type II error can be greater than that for the Type I error.
There is a great efficiency to this single-arm approach in terms of sample size
savings. Let us re-envision the previously described concurrently controlled superi-
ority design as a single-arm futility design. Assume, as before, that the literature
suggests a good outcome proportion of 0.5 associated with the control; further
assume that 10 percentage points can be assumed to be the minimum worthwhile
improvement required to warrant further investigation. The statistical hypothesis,
then, is written as shown:
H0: πtx ≥ 0.6
HA: πtx < 0.6
Calculating the sample size in order to achieve 80% power for declaring futility
when there is no improvement associated with the intervention (i.e., when πtx = 0.5
under the alternative hypothesis), we find that a total sample size of 110 subjects is
required to evaluate futility using a one-sided 0.10 level of significance. This is in
stark contrast to the superiority approach, which was underpowered to detect
clinically relevant improvements in good outcome with a sample size of 300
subjects.
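The 110-subject figure can be reproduced with the usual one-sided, single-arm normal approximation. A minimal sketch in Python, assuming unpooled null and alternative variances (the chapter does not state the exact formula used):

```python
from math import sqrt, ceil
from statistics import NormalDist

def n_single_arm_futility(pi0, piA, alpha=0.10, power=0.80):
    """Sample size for the one-sided single-arm test of H0: pi >= pi0
    versus HA: pi < pi0, powered at the assumed true response piA."""
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(power)
    num = (z_a * sqrt(pi0 * (1 - pi0)) + z_b * sqrt(piA * (1 - piA))) ** 2
    return ceil(num / (pi0 - piA) ** 2)

print(n_single_arm_futility(0.6, 0.5))  # 110, matching the text
```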
The advantages to this single-arm approach are self-evident. All subjects will
receive the intervention of interest, which may increase a potential participant’s
willingness to enroll. Perhaps more importantly, the trial can be appropriately
powered at what might be considered a typical Phase II sample size. Comparison
of the outcome proportion against a fixed value yields very real savings in terms of
the sample size. This fixed value can be derived from the literature available on
outcomes associated with the control intervention, often referred to as historical
control data. Decision-making based on the use of such historical control data is also
considered to be the primary drawback of the single-arm approach.
Chalmers et al. (1972) summarized the arguments against randomization, both
practical and ethical, but ultimately concluded that randomization is necessary to
evaluate efficacy and toxicity, even in the early stages of evaluation. Without a
concurrent control, one cannot be sure that the historical response would apply to the
enrolled population. Estey and Thall (2003) describe this problem as “treatment-
trial” confounding. Clinical management, imaging availability, or outcome ascer-
tainment may have changed over time, thus altering outcomes. Subtle differences in
eligibility criteria may result in slightly different populations across trials; such
differences in baseline characteristics, whether known or unknown, may impact
outcomes. When historical data are used for comparison, the observed effect reflects
a combination of these trial-specific effects and the true treatment effect, and the
specific contribution of each to the observed effect cannot be determined (Estey and
Thall 2003). It has also been suggested that outcome is altered simply because of
participation in a trial, a phenomenon sometimes referred to as the Hawthorne effect
and sometimes referred to as the placebo effect. Because of these limitations, a
concurrently controlled design may be preferred.
Pocock (1976) argues that "acceptable" historical control data, where it exists, should
not be ignored; he presents a case for formally incorporating both randomized and
historical controls and describes the associated statistical inference. His definition of
acceptability is based on six conditions related to consistency and comparability of
the treatment, eligibility criteria, evaluation, baseline characteristics, investigators,
and trial conduct. The historical control data typically available would not meet most
of the six conditions.
To address concern over the applicability of the specified reference value in a
single-arm design, Herson and Carter (1986) proposed the use of a calibration
control, a small group of randomized concurrent controls. The calibration control
arm is not directly compared to the intervention arm but used to evaluate the
relevance of the historical control data in the current study. The utility of the
calibration control arm is demonstrated by the NINDS NET-PD Investigators in
the case study below.
Case Study: Creatine and Minocycline in Early Parkinson Disease

In the NET-PD futility trials, μ denotes the mean change in total Unified Parkinson's
Disease Rating Scale (UPDRS) score, and the futility threshold of 7.46 corresponds to
30% less progression than the mean change of 10.65 derived from historical control
data (Parkinson Study Group 1989):

H0: μ ≤ 7.46
HA: μ > 7.46
The single-arm futility analysis was conducted as planned, and there was not
sufficient evidence to declare either creatine or minocycline futile. However, the
mean change observed in the calibration control arm (8.39) was less than anticipated
based on the historical control data (10.65), and as a result, the futility threshold was
not consistent with 30% less progression than control. However, the investigators
had planned for this possibility during the design phase. A series of prespecified
sensitivity analyses were undertaken using the calibration control data to update the
historical control response in various ways, and the conclusions were not substan-
tively altered. This example demonstrates the potential concern over evaluating
futility using historical control data and highlights the potential utility of a concur-
rent control, whether for calibration, as described above, or for direct comparison, as
introduced below.
Concurrently Controlled Futility Design

In the concurrently controlled futility design, subjects are randomized to the
intervention or to a concurrent control, and the futility threshold δ is applied to the
difference in response proportions:

H0: πtx − πctrl ≥ δ
HA: πtx − πctrl < δ

For the hypothetical intracerebral hemorrhage example, with a minimum worthwhile
improvement of 10 percentage points, the hypotheses become:

H0: πtx − πctrl ≥ 0.10
HA: πtx − πctrl < 0.10
Calculating the sample size in order to achieve 80% power for declaring futility
when there is no improvement associated with the intervention (i.e., when
πtx πctrl ¼ 0 under the alternative hypothesis), we find that a total sample size of
451 subjects is required to evaluate futility using a one-sided 0.10 level of signifi-
cance. Although larger than one might expect for a Phase II trial, this sample size
would yield only 57% power to detect an absolute 10% improvement in the two-
tailed superiority design and is a dramatic increase over the 110 subjects required for
the single-arm futility design. However, the inclusion of a concurrent control group
for direct comparison with the intervention avoids the pitfalls associated with the use
of historical control data to derive the futility threshold and allows for concurrent
estimation of the treatment effect.
Case Study: Deferoxamine in Intracerebral Hemorrhage

The i-DEF trial of deferoxamine mesylate in intracerebral hemorrhage (Selim et al.
2019) set the futility threshold at 12 percentage points:

H0: πtx − πctrl ≥ 0.12
HA: πtx − πctrl < 0.12
In order to evaluate futility with 80% power when the two treatments have the
same proportion, using a one-sided 0.10 level of significance, 253 subjects are
required. The sample size was inflated to account for loss to follow-up, consent
withdrawal, etc., resulting in a maximum sample size of 294 subjects. At the
conclusion of the trial, the observed good outcome rate in the control arm was
slightly higher than anticipated (34%, vs. the anticipated 28%). The power of the
trial to declare futility is affected by the discrepancy between the observed control
outcome and the assumed control outcome, particularly in binary outcome studies.
However, the futility threshold is not dependent on the assumed control response,
and the statistical test for futility is based on the observed response in both groups.
Therefore, the concurrent control approach allows the design to compensate, to some
extent, for the drawbacks of the single-arm approach, albeit with a larger sample
size.
Analysis
Recall the sample size calculation for comparing two independent proportions in a
superiority design. Let z_{1−α/2} and z_{1−β} represent the corresponding quantiles from
the standard normal distribution. Assume the true control response to be πctrl, and let
πtx be derived from the minimum clinically important improvement (ε) over the
assumed control response, such that πtx − πctrl = ε. The sample size required to
achieve power (1 − β), using a two-sided α level of significance, and assuming equal
allocation to the treatment arms, is defined according to the formula below:

$$n = \frac{2\left(z_{1-\alpha/2}+z_{1-\beta}\right)^{2}\left[\pi_{\mathrm{ctrl}}\left(1-\pi_{\mathrm{ctrl}}\right)+\pi_{\mathrm{tx}}\left(1-\pi_{\mathrm{tx}}\right)\right]}{\varepsilon^{2}}$$
The sample size calculation for the futility design follows the same algebraic
formulation but reflects the key components of the futility design:
Using the same notation as above, and letting δ represent the futility threshold, the
sample size required to achieve power (1 − β) for evaluating the futility hypothesis is
defined as shown below:

$$n = \frac{2\left(z_{1-\alpha}+z_{1-\beta}\right)^{2}\left[\pi_{\mathrm{ctrl}}\left(1-\pi_{\mathrm{ctrl}}\right)+\pi_{\mathrm{tx}}\left(1-\pi_{\mathrm{tx}}\right)\right]}{\left(\varepsilon-\delta\right)^{2}}$$
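Transcribing the futility formula directly reproduces the 451-subject total quoted earlier for πctrl = πtx = 0.5, δ = 0.10, one-sided α = 0.10, and 80% power. A minimal sketch in Python:

```python
from math import ceil
from statistics import NormalDist

def n_total_futility(pi_ctrl, pi_tx, delta, alpha=0.10, power=0.80):
    """Total sample size (equal allocation) from the futility formula above;
    epsilon = pi_tx - pi_ctrl is the improvement assumed for the power target."""
    z_a = NormalDist().inv_cdf(1 - alpha)   # one-sided alpha
    z_b = NormalDist().inv_cdf(power)
    eps = pi_tx - pi_ctrl
    var = pi_ctrl * (1 - pi_ctrl) + pi_tx * (1 - pi_tx)
    return ceil(2 * (z_a + z_b) ** 2 * var / (eps - delta) ** 2)

print(n_total_futility(0.5, 0.5, 0.10))  # 451, matching the text
```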
Another distinguishing feature of the futility calculation is the effect size for
which the trial is powered. In the superiority setting, a trial is designed to achieve
adequate power to declare superiority under the assumption that some minimum
clinically relevant difference ε exists. In the futility setting, however, a trial is
designed to achieve adequate power to declare futility under the assumption that
any potential improvement in outcomes is less than an effect which is minimally
worthwhile from a clinical perspective. A scenario where there is no improvement
associated with treatment, for instance, would be considered truly futile, and so one
might assume ε = 0 for the power calculation. One might instead wish to target a
scenario where there is a small but clinically uninteresting improvement in out-
comes. In either case, the trial would have more than adequate power to declare
futility if the treatment decreases good outcomes.
The formulas provided here assume equal allocation to each treatment arm but are
easily modified to allow for an unequal allocation ratio.
We previously mentioned that the design parameters of the superiority design
could be revised in order to improve operating characteristics. Considering instead
a one-sided superiority hypothesis (i.e., HA: πtx − πctrl > 0), a total sample size of
451 subjects yields 81% power to detect a 10% absolute improvement under a one-
sided 0.10 level of significance. Given that the one-sided superiority design and the
futility design have approximately equivalent operating characteristics under con-
sistent assumptions, Levin (2012) argues that the futility approach is more consis-
tent with Phase II objectives and the possible conclusions are more concrete. As
described in the table below, the form of the statistical hypothesis has a very real
implication for the resulting inference. You may recall from introductory statistics
that there are only two plausible hypotheses, and we retain our belief in the null
hypothesis, unless the data overwhelmingly contradict it, leading us to believe in
the alternative hypothesis. In the one-sided superiority setting, a nonsignificant test
result leads to a statement that “there is insufficient evidence to conclude that the
intervention is better” and therefore requires that we accept as plausible that the
intervention does not have a positive effect. In the futility setting, however, a
nonsignificant test result leads to a statement that “there is insufficient evidence to
conclude that the treatment effect is less than δ” and allows us to accept as
plausible that the effect of the intervention is at least minimally worthwhile. In
either case, the confidence interval may be used to evaluate which effect sizes
remain plausible based on the available data, but in the futility setting, the decision-
making process does not rely on such post hoc evaluations of the resulting
confidence interval.
Interim Analysis
When an interim analysis is planned, its operating characteristics should be evaluated during the study
design phase. The usual alpha- and beta-spending functions applicable to superiority
designs are also applicable in the context of futility designs, but the function which is
considered optimal for superiority may not be so for futility. The O’Brien-Fleming
spending function (1979) spends alpha more conservatively than its Pocock coun-
terpart (1977), which means that it is more difficult to terminate under O’Brien-
Fleming. This may be desirable in the context of the superiority design, where the
consequence of such termination is to declare the intervention efficacious, thereby
making it available as a treatment option to the target population. In the context of
the futility design, however, one might wish to be more liberal in terms of early
stopping, given that the consequence of such termination is to declare that the
intervention is not sufficiently promising to warrant further study.
Consider a concurrently controlled futility design, sized at 451 subjects in order to
achieve 80% power to declare futility against a 10% absolute improvement, with a
prespecified interim analysis to be conducted after 50% of subjects have completed
the primary follow-up period. The O’Brien-Fleming boundary would call for termi-
nation only if the estimated treatment effect were greater than 3.6 percentage points
in the wrong direction (in favor of the control arm); under the alternative hypothesis
of no difference between the arms, the trial would have a 30% likelihood of
terminating early to declare the intervention futile. The Pocock boundary, on the
other hand, would call for termination if the estimated treatment effects were greater
than 0.2 percentage points in the wrong direction; under the alternative hypothesis of
no difference, the trial would have a 49% likelihood of terminating early to declare
the intervention futile.
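The 30% and 49% figures can be checked approximately from the quoted effect-estimate thresholds. A minimal sketch in Python; the interim per-arm size of about 113 (half of 451 with equal allocation) and the simple normal approximation are assumptions:

```python
from math import sqrt
from statistics import NormalDist

n_per_arm = 113                        # ~50% of the 451-subject trial, per arm
se = sqrt(2 * 0.5 * 0.5 / n_per_arm)   # SE of the difference; both true rates 0.5

# Effect-estimate thresholds quoted in the text for stopping at the interim
for rule, threshold in [("O'Brien-Fleming", -0.036), ("Pocock", -0.002)]:
    p_stop = NormalDist().cdf(threshold / se)
    print(f"{rule}: P(stop early for futility) ~ {p_stop:.2f}")  # ~0.29 and ~0.49
```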
When considering interim analysis, investigators should be aware that treatment
effect estimates can be unstable with small sample sizes and tend to stabilize over
time as outcomes accrue. Decision-making in the interim, when the effect estimates
may yet be unstable, could lead to termination on a random high or low. Bassler et al.
(2010) conducted a systematic review and meta-analysis and concluded that trials
which are terminated early for benefit yield biased estimates of treatment effect.
Early termination yields a smaller than anticipated sample size and a correspond-
ingly imprecise estimate of the treatment effect. In the superiority design, a precise
estimate of the treatment effect is desirable, especially when the intervention is
shown to be efficacious. One could argue, however, that concern over the impreci-
sion of the effect estimate can be overcome if there is truly overwhelming evidence
of benefit. In the futility design, a precise estimate of the treatment effect may not be
as important in the face of a futility declaration.
Sequential Futility Designs
One can easily envision the futility design as a second stage in a sequential early
phase trial, following either a dose-finding or dose-selection stage. Levy et al. (2006)
developed a two-stage selection and futility design to sequentially select the dose of
Coenzyme Q10 and subsequently to evaluate the futility of the selected dose, in
ALS. The trial design is briefly described here; the interested reader is referred to
Levy et al. (2006) for details. The first (selection) stage was designed and conducted
according to statistical selection theory, with the sample size determined in order to
yield a high probability that the superior dose would be selected. Subjects were
randomly allocated to one of three treatment arms (one of two active doses or a
concurrent placebo), and at the conclusion of this stage, the preferred active dose was
selected. The second stage was designed and conducted according to the futility
design. Subjects were randomly allocated to one of two treatment arms (the active
dose selected in the first stage or a concurrent placebo). At the conclusion of the
second stage, the futility analysis compared the active dose selected in the first to the
concurrent placebo, using the subjects randomized to those arms in both stage 1 and
stage 2. Levy et al. (2006) note that a bias is introduced into the final futility
evaluation because the best dose was selected in the first stage and that same data
are used in the second stage analysis, and their methodology includes an appropriate
bias correction.
Summary and Conclusion

The futility design can be used to provide a clear "no go" signal for evaluating
whether an intervention shows sufficient promise to warrant confirmatory testing.
The single-arm futility design yields dramatic sample size savings, but the need to
derive a fixed reference value can be difficult. This drawback can be overcome using
either a calibration control or a concurrent control, to directly compare against the
intervention. The statistical hypotheses, as stated in the futility design, are more in
keeping with the Phase II objective than the hypotheses of the superiority design.
Because the alternative hypothesis is used to describe futility, a statistically signif-
icant finding indicates that the intervention does not warrant confirmatory efficacy
testing, whereas a nonsignificant finding suggests that the intervention should be
moved forward for further evaluation.
Key Facts
The phase II trial is often used to evaluate whether an intervention has sufficient
efficacy signal to warrant confirmatory testing. The typical design, a superiority
design powered to detect large effect sizes, can be criticized as an underpowered
Phase III trial. The futility design reverses the statistical hypotheses, in order to weed
out interventions which do not warrant further testing. The design provides a clear
“no go” signal as to whether an intervention should be moved to Phase III efficacy
evaluation.
Cross-References
References
Bassler D, Briel M, Montori VM, Lane M, Glasziou P, Zhou Q, Heels-Ansdell D, Walter SD,
Guyatt GH, The STOPIT-2 Study Group (2010) Stopping randomized trials early for benefit and
estimation of treatment effects: systematic review and meta-regression analysis. J Am Med
Assoc 303:1180–1187
Brown SR, Gregory WM, Twelves CJ, Buyse M, Collinson F, Parmar M, Seymour MT, Brown JM
(2011) Designing phase II trials in cancer: a systematic review and guidance. Br J Cancer
105:194–199
Chalmers TC, Block JB, Lee S (1972) Controlled studies in clinical cancer research. NEJM
287:75–78
Chen X, Wang K (2016) The fate of medications evaluated for ischemic stroke pharmacotherapy
over the period 1995–2015. Acta Pharm Sin B 6:522–530
Djulbegovic B, Kumar A, Soares HP, Hozo I, Bepler G, Clarke M, Bennett CL (2008) Treatment
success in cancer: new cancer treatment successes identified in phase 3 randomized controlled
trials conducted by the National Cancer Institute-Sponsored Cooperative Oncology Groups,
1955–2006. Arch Intern Med 168(6):632–642
Estey EH, Thall PF (2003) New designs for phase 2 clinical trials. Blood 102:442–448
Garrett-Mayer E (2006) The continual reassessment method for dose-finding studies: a tutorial.
Clin Trials 3:57–71
Herson J (1979) Predictive probability early termination plans for phase II clinical trials. Biometrics
35:775–783
Herson J, Carter SK (1986) Calibrated phase II clinical trials in oncology. Stat Med 5:441–447
Hwang TJ, Carpenter D, Lauffenburger JC, Wang B, Franklin JM, Kesselheim AS (2016) Failure of
investigational drugs in late-stage clinical development and publication of trial results. JAMA
Intern Med 176:1826–1833
IMS Investigators (2004) Combined intravenous and intra-arterial recanalization for acute ischemic
stroke: the Interventional Management of Stroke Study. Stroke 35:904–911
Kaufmann P, Thompson JL, Levy G, Buchsbaum R, Shefner J, Krivickas LS, Katz J, Rollins Y,
Barohn RJ, Jackson CE, Tiryaki E, Lomen-Hoerth C, Armon C, Tandan R, Rudnicki SA,
Rezania K, Sufit R, Pestronk A, Novella SP, Heiman-Patterson T, Kasarskis EJ, Pioro EP,
Montes J, Arbing R, Vecchio D, Barsdorf A, Mitsumoto H, Levin B, QALS Study Group (2009)
Phase II trial of CoQ10 for ALS finds insufficient evidence to justify phase III. Ann Neurol
66:235–244
Kidwell CS, Liebeskind DS, Starkman S, Saver JL (2001) Trends in acute ischemic stroke trials
through the 20th century. Stroke 32:1349–1359
Levin B (2012) Chapter 8: Selection and futility designs. In: Ravina B, Cummings J, McDermott
MP, Poole M (eds) Clinical trials in neurology. Cambridge University Press, Cambridge
Levin B (2015) The futility study – progress over the last decade. Contemp Clin Trials 45:69–75
Levy G, Kaufmann P, Buchsbaum R, Montes J, Barsdorf A, Arbing R, Battista V, Zhou X,
Mitsumoto H, Levin B, Thompson JLP (2006) A two-stage design for a phase II clinical trial
of coenzyme Q10 in ALS. Neurology 66:660–663
NINDS NET-PD Investigators (2006) A randomized, double-blind, futility clinical trial of creatine
and minocycline in early Parkinson disease. Neurology 66:664–671
NINDS NET-PD Investigators (2007) A randomized clinical trial of coenzyme Q10 and GPI-1485
in early Parkinson disease. Neurology 68:20–28
O’Brien PC, Fleming TR (1979) A multiple testing procedure for clinical trials. Biometrics
35:549–556
Palesch YY, Tilley BC, Sackett DL, Johnston KC, Woolson R (2005) Applying a phase II futility
study design to therapeutic stroke trials. Stroke 36:2410–2414
Parkinson Study Group (1989) Effect of deprenyl on the progression of disability in early
Parkinson’s disease. N Engl J Med 321:1364–1371
Pocock SJ (1976) The combination of randomized and historical controls in clinical trials. J Chronic
Dis 29:175–188
Pocock SJ (1977) Group sequential methods in the design and analysis of clinical trials. Biometrika
64:191–199
Sacks LV, Shamsuddin HH, Yasinskaya YL, Bouri K, Lanthier ML, Sherman RE (2014) Scientific
and regulatory reasons for delay and denial of FDA approval of initial applications for new
drugs, 2000–2012. JAMA 311:378–384
Selim M, Foster LD, Moy CS, Xi G, Hill MD, Morgenstern LB, Greenberg SM, James ML, Singh
V, Clark WM, Norton C, Palesch Y, Yeatts SD, on behalf of the iDEF Investigators (2019)
Deferoxamine mesylate in patients with intracerebral haemorrhage (i-DEF): a multicenter,
placebo-controlled, randomized, double-blind phase 2 trial. Lancet Neurol 18(5):428–438
Tilley BC, Palesch YY, Kieburtz K, Ravina B, Huang P, Elm JJ, Shannon K, Wooten GF,
Tanner CM, Goetz GC, on behalf of the NET-PD Investigators (2006) Optimizing the ongoing
search for new treatments for Parkinson disease: using futility designs. Neurology 66:628–633
Interim Analysis in Clinical Trials
59
John A. Kairalla, Rachel Zahigian, and Samuel S. Wu
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1084
Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1084
Types of Data Used in Interim Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1086
Applications of Interim Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1087
Methods of Interim Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1089
Non-comparative Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1089
Comparative Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1090
Planning Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1092
Oversight and Maintaining Integrity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1094
Applications and Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1096
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1098
Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1099
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1100
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1100
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1100
Abstract
Modern randomized controlled trials often involve multiple periods of data
collection separated by interim analyses, where the accumulated data is analyzed
and findings are used to make adjustments to the ongoing trial. Various endpoints
can be used to influence these decisions, including primary or surrogate outcome
data, safety data, administrative data, and/or new external information. Example
uses of interim analyses include deciding if there is evidence that a trial should be
stopped early for safety, efficacy, or futility or if the treatment allocation ratios
J. A. Kairalla (*) · S. S. Wu
University of Florida, Gainesville, FL, USA
e-mail: johnkair@ufl.edu; sw45@ufl.edu
R. Zahigian
Vertex Pharmaceuticals, Boston, MA, USA
e-mail: [email protected]
should be modified to optimize trial efficiency and better align the risk-benefit
ratio. Additionally, a decision could be made to lengthen or shorten a trial based
on observed information. To avoid unwanted bias, studies known as adaptive
design clinical trials pre-specify these decision rules in the study protocol.
Extensive simulation studies are often required during study planning and
protocol development in order to characterize operating characteristics and
validate testing procedures and parameter estimation. Over time, researchers
have gained a better understanding of the strengths and limitations of employing
interim analyses in their clinical studies. In particular, with proper planning and
conduct, adaptive designs incorporating interim analyses can provide great ben-
efits in flexibility and efficiency. However, an increase in infrastructure for
development and planning is needed to successfully implement adaptive designs
and interim analyses and allow their potential advantages to be achieved in
clinical research.
Keywords
Adaptive design · Early stopping · Flexible design · Futility · Interim analysis ·
Interim monitoring · Group sequential · Safety monitoring · Nuisance parameter ·
Sample size
Introduction
This chapter describes the concept of interim analysis (IA, also generally referred to
as interim monitoring) in clinical trials and how they are used to enhance and
optimize the conduct of clinical studies. A specific focus is placed on the use of
interim analyses in a class of clinical trials known as adaptive designs (ADs). The
chapter begins with an overview including definitions, a brief history, and motiva-
tions. It then describes the type of data used in IAs to inform decision-making and
various possible applications. Possible study adjustments based on IA data include
sample size re-estimation (SSR), early stopping, safety monitoring, and treatment
arm modification. Following descriptions of planning considerations, oversight, and
results reporting, various examples are summarized. Discussion topics include
highlighting the emergence of Bayesian methodology in clinical trials with IAs
and describing some logistical barriers that must be addressed in order for clinical
research to benefit from interim decision-making.
Evidence of efficacy and safety of new interventions is usually provided through the
conduct and analysis of randomized controlled trials (RCTs). Traditionally, RCTs are
largely inflexible: many design components such as meaningful treatment effects,
outcome variability, patient population, and primary endpoint are specified and fixed
before trial enrollment begins. The trial is then sized and conducted with statistical
power (such as 80%) and an allowable type I error rate (such as 5% for 2-sided tests)
in mind for the given set of study assumptions, with analysis conducted once all of
the information has been collected. However, if study assumptions are incorrectly
specified, the trial may produce inaccurate or ambiguous results, with significant
time and resources largely wasted. Additionally, ignoring accruing safety informa-
tion during study conduct could lead to important ethical concerns. For these
reasons, various forms of IAs are included in many modern trial plans. The US
Food and Drug Administration (FDA) defines an IA as “any examination of data
obtained from subjects in a trial while that trial is ongoing. . . [including] . . .baseline
data, safety outcome data, pharmacokinetic, pharmacodynamic or other biomarker
data, or efficacy outcome data” (FDA 2018). The idea of IAs was described in the
1967 Greenberg Report (Heart Special Project Committee 1988), which highlighted
the potential benefits to stopping a trial early for efficacy or futility. The Greenberg
Report also illustrated the necessity of an independent Data Monitoring Committee
(DMC, also known as Data Safety and Monitoring Board or similar) to evaluate
interim data and provide recommendations. In a trial with IAs, the information that is
collected partway through a trial is used to inform a trial’s future in some manner.
This added flexibility can be very appealing to researchers, stakeholders and spon-
sors of the trial, regulatory agencies, and study participants. Accumulating data can
inform about efficacy, event rates, variability, accrual rates, protocol violations,
dropout rates, and other useful study elements. Using this information, various
study decisions can be made, including closing a study for safety, efficacy, or futility,
updating intervention allocation ratios, altering treatment dosing and regimens,
changing primary endpoints, re-estimating the sample size, or altering the study
population of interest.
A well-known fact of repeated testing of hypotheses is a potentially inflated
overall type I error rate: each time the data is evaluated, there is an additional chance
of making a false-positive conclusion (Armitage et al. 1969). In 1977, Pocock
proposed repeated significance testing with equally sized groups of sequentially
evaluated subjects using a fixed, but reduced nominal significance level to control
the overall type I error rate (Pocock 1977). This began the development and
implementation of a popular class of study designs called group sequential methods
(GSMs).
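The inflation from repeated looks is easy to demonstrate by simulation. A minimal sketch in Python: five equally spaced looks at an unadjusted two-sided 0.05 boundary (the look schedule and the effect-free outcome model are illustrative):

```python
import random
from statistics import NormalDist, mean

def rejects_under_null(n_per_look, looks, crit):
    """One simulated trial with no true effect: test the accumulating mean
    of N(0, 1) outcomes at each look; True if any look crosses the boundary."""
    data = []
    for _ in range(looks):
        data.extend(random.gauss(0, 1) for _ in range(n_per_look))
        z = mean(data) * len(data) ** 0.5   # z-statistic for H0: mean = 0
        if abs(z) > crit:
            return True
    return False

random.seed(2024)
crit = NormalDist().inv_cdf(0.975)          # 1.96, nominal 0.05 at every look
hits = sum(rejects_under_null(25, 5, crit) for _ in range(20000))
print(hits / 20000)                         # ~0.14, far above the nominal 0.05
```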
An alternative approach comes from the idea of partitioning a trial into distinct
stages separated by IAs. Each stage is analyzed separately, and the data is combined
using a pre-specified combination function (Bauer and Kohne 1994). This method
can be implemented for both ADs and flexible designs, which allow for planned or
unplanned study modifications between stages, including early stopping, trial exten-
sions, and many other stagewise modifications, without type I error rate inflation
(Proschan and Hunsberger 1995).
Adaptive designs (ADs) are an important class of RCTs that incorporate IAs in
which accumulating data is used to inform how the study should proceed using only
pre-specified modification rules (FDA 2018). A beneficial property of this rules-
based approach is that study operating characteristics (e.g., power, type I error rate,
expected sample size) can be exhaustively explored under various scenarios before
trial implementation by undertaking sensitivity analyses based either on known
theory or simulation studies (Kairalla et al. 2012). An implementation of ADs that
has experienced much attention and methodological development over the last
30 years is sample size re-estimation (SSR). Most SSR procedures aim to stabilize
the study power by updating study planning assumptions using observed data, such
as using an interim estimate of the treatment effect (Cui et al. 1999), or modifying the
sample size based on an updated nuisance parameter: a planning parameter, such as
variance, that is not of primary concern but that affects the statistical properties of a
study (Wittes and Brittain 1990). SSR procedures may or may not include early
stopping features in addition to repowering a trial. Methodological development for ADs as a broad category is extensive, and they are implemented with increasing frequency. Years of statistical development and discussion preceded the release of regulatory guidance documents in Europe and the United States (EMA 2007; FDA 2018).
When correctly implemented, IAs can lead to improvements in resource and
statistical efficiency since there is potentially a higher chance of correctly detecting
a treatment effect if one exists or stopping a trial early to save resources if a
conclusion is clear. There are also ethical benefits for early stopping and safety
monitoring using IAs. These advantages motivate the remainder of this chapter as
they highlight the potential benefits to consider when planning a clinical trial using
IAs.
Nuisance Parameters: IAs can evaluate the accumulated data and use it to modify initial assumptions. The most
common nuisance parameters are estimates of variance in continuous outcome
settings and estimates of control group event rates in binary outcome settings.
While most methods use comparative information by taking advantage of group
assignment (such as using residual variance), non-comparative information can also
be used to incorporate nuisance parameter updates (Gould and Shih 1992).
Safety: If an intervention is not proven to be relatively safe, it will not be approved by the
FDA or other regulatory bodies regardless of its efficacy for a given endpoint. Early
phase clinical trials have safety endpoints, such as dose-limiting toxicities, as
primary outcomes. Confirmatory trials also include safety and adverse event mon-
itoring, with both explicit rules and ad hoc safety considerations taken into account
at IAs and at final evaluations.
Study and Surrogate Endpoints: Generally, the main study endpoint in a late-
phase RCT is intervention efficacy. Decisions made at IAs often involve comparative
outcome data among the different treatment arms. In this case, a decision to
terminate a trial is reflected directly in the trial’s efficacy outcome, e.g., a study
could be stopped if there is significant evidence that a new treatment is superior to
standard therapy. A surrogate endpoint is a response variable that is assumed to correlate directly with the primary endpoint. With scientific rationale and research
justification, it may be possible to use this surrogate outcome as a short-term
substitute for the primary outcome of interest at IAs to reduce study timelines. For
example, minimal residual disease is a quantifiable measure of residual blood cancer that is correlated with relapse risk; it has been considered as a possible surrogate for event-free survival, which would otherwise require waiting until enough patients have relapsed or experienced other events to conduct an IA.
External Information: Information gathered outside of an enrolling trial (e.g.,
from a similar trial) may alter knowledge of expected study outcomes or safety
information. Rather than permanently stopping the ongoing trial, changes can be
made at interim periods. Unplanned IAs may result, often with study amendments
justifying the changes and resulting statistical properties. In order to maintain
validity, it is important that internal comparative data are not considered when
making such trial modifications. Some designs (e.g., flexible designs) can handle
this naturally by analyzing the sequential cohorts separately.
Early stopping, sample size re-estimation, and adaptive seamless designs are applications of IAs in a
confirmatory setting that will be discussed.
Safety Monitoring: Accumulating data can inform investigators about potential
safety concerns of a particular intervention or dose. Adaptations at IAs are often
planned with both efficacy and safety in mind. For example, exploratory dose-finding trials plan adaptations based on safety and toxicity outcomes. Additionally, in
confirmatory trials, interventions will not gain regulatory approval unless they
have been shown to have an acceptable risk-benefit ratio. It is necessary in the
planning phase to consider the minimum amount of data required in order to obtain
sufficient safety information; this must be accounted for when planning GSMs (and
other IA plans) that may stop early for efficacy (FDA 2018).
Futility Monitoring: Futility stopping in a RCT is an appealing option that can
improve overall clinical research efficiency by stopping a trial when there is statis-
tical evidence that a trial is unlikely to show efficacy if allowed to continue. Futility
rules can be binding or nonbinding; as futility stopping does not increase the type I
error rate, both can be appropriate so long as they are accounted for transparently in
the statistical analysis (FDA 2018). In non-binding cases, results can be summarized
and presented with recommendations to the DMC, which makes decisions regarding
trial alterations or continuations. One class of futility monitoring rules is based on
repeatedly testing the alternative hypothesis at a fixed significance level (such as 0.005) and stopping for futility if the alternative hypothesis is rejected at any point
(Anderson and High 2011; Fleming et al. 1984). An alternative approach is stochas-
tic curtailment based on conditional power arguments (Lachin 2005). Here, evidence
to stop for futility is based on a low probability of correctly detecting a statistically
significant result at the end of the study, given the current data.
Once interim data is collected, it is used to inform trial modifications that can better
achieve study goals. Methodological results and analytic descriptions of IAs in RCTs
are extensive and will not be comprehensively reviewed here. Additionally, compli-
cated and novel designs are frequently proposed that do not fall neatly into one of the categories below. However, some key design classes are highlighted to summarize
the various approaches to using interim data in RCTs.
Non-comparative Designs
Non-comparative (blinded) designs use interim data without regard to treatment assignment, most commonly to update nuisance parameter estimates from pooled data. To account for the fact that the pooled variance estimate is not independent of the
treatment effect, adjustments are made based on the planned treatment effect (Friede
and Kieser 2011; Gould and Shih 1992). Another application of non-comparative
ADs is in looking at outcome data (such as event rates) for a particular biomarker
group in order to assess and optimize an enrichment strategy (FDA 2018).
Comparative Designs
Adaptations that are made based on comparative data, or data that uses information
about treatment assignment, can affect the overall type I error rate more drastically
than adaptations based on non-comparative data and thus must be justified and
accounted for in statistical methods (FDA 2018). There are many described ADs
using comparative data, with some broad classes summarized here.
Group Sequential Methods: GSM clinical trials involve several prospectively
planned IAs on sequentially enrolled groups of subjects and involve a decision
about whether a trial should stop early based on observed interim treatment
outcomes. At an IA, a trial can be stopped for efficacy or futility, and the type I
error inflation inherent in repeated uncorrected significance testing is controlled through appropriately constructed stopping bounds. As mentioned in the “Background and Moti-
vation” section, the first proposed bounds involved a fixed, but adjusted nominal
significance level at all testing points (Pocock 1977). Other notable GSM stopping
bounds include the popular O’Brien-Fleming bounds that start conservative and
become more liberal as a study accrues more information (O’Brien and Fleming
1979) and more flexible GSMs that utilize α-spending functions to allow flexibility in the number and timing of analyses (Lan and DeMets 1983). GSMs have ethical
and efficiency advantages by reducing expected sample size versus fixed sample
designs, as well as versus other forms of IAs such as adaptive combination tests and
SSRs (Jennison and Turnbull 2006; Tsiatis and Mehta 2003). However, several
issues must be considered during planning and before the decision is made to stop
early. For one, when using flexible bounds such as those described by Lan and
DeMets (1983), the decision to perform an analysis should be based on calendar time or fractions of available information rather than on observed trends.
Additionally, efficacy stopping should be rule-based and involve transparent
reporting that includes the stopping bounds considered at each IA. Finally, it is
important to consider how much additional precision, as well as secondary outcome
and safety information, is lost by stopping a trial early. To account for this, one
approach is to allow the first IA only after a minimum fraction of the planned sample
has been evaluated.
Adaptive Combination Tests: In RCTs with IAs, outcomes for independent
cohorts of participants can be evaluated across time, with standardized test statistics
calculated separately in each cohort. By combining these test statistics in a pre-
determined way, type I error rate can be controlled regardless of whether or not the
trial is following pre-defined adaptation rules (Bauer and Kohne 1994; Proschan and
Hunsberger 1995). P-values or independent test statistics from different stages can then be combined using, for example, Fisher’s product criterion (Bauer and Kohne 1994) or a weighted inverse-normal combination function with prespecified stage weights.
Planning Considerations
Limitations: Although ADs with IAs provide many advantages, there are limitations
that must be considered. For example, terminating a trial early is appealing to
sponsors because it saves money and resources and can result in effective treatments
being available more quickly. However, evidence collected from a smaller trial is not
as precise or reliable; a larger trial allows for more information to be collected on
subgroups as well as secondary endpoints and important safety data.
Studies with ADs, and IAs more generally, are not a cure for inadequate planning;
in fact, they generally require much more up-front planning than fixed sample
studies. This planning process will likely be lengthier and involve more complicated
logistical considerations, offsetting some of the time advantages. The added effi-
ciency and flexibility must justify the increase in study complexity and the accom-
panying difficulties in interpretability. Derived analytic methods and numeric justifications (e.g., simulations) must be used in ADs to avoid bias and type I error inflation; these safeguards may be compromised if unplanned IAs arise. Rigidity of an AD implementation plan is a difficulty in practice for complicated RCTs involving
hundreds or thousands of subjects across many sites. Any deviations must be
documented and subsequent study properties ascertained as well as possible given
the actual procedures followed.
Additionally, timing should be considered when contemplating the usefulness of
IAs in RCTs. Designs work best when there is a predictable accrual rate and
outcomes are known relatively quickly. If accrual is fast and outcomes are not known until years of follow-up are completed, then any efficiency advantages of the study design will be undermined by the fact that full enrollment is complete
before IA results are known. It is important that the design fits the setting and
expected outcome time frames, and the usefulness and feasibility of an IA plan
must be carefully considered when comparing potential trial designs in the planning
stage (Bhatt and Mehta 2016).
Estimation Bias: Much of the focus when discussing statistical methods for IAs is on hypothesis testing and control of the type I error rate. However, any publication or
results reporting when a trial is complete would include information about the
observed treatment effect, a figure which will be widely cited for a high-profile
study. Biased treatment effects from naïve estimates are a concern for any IA plan,
especially those involving early stopping. Large fluctuations of early-stage estimated
treatment effects could induce stopping, leading to bias and possible overestimation
(Bassler et al. 2010). This is also true for estimation of secondary endpoints that are
not involved directly in stopping rules but are correlated with the primary endpoint
(FDA 2018). Methods can be incorporated prospectively into a trial plan to adjust for
potential estimation bias for some designs, but this is an area of research not as well
understood or developed as control of type I error rates (Shimura 2019). The extent
of potential bias should be explored methodologically or via simulations, with bias
corrections considered and estimates and confidence intervals presented with inter-
pretational caution (FDA 2018; Kimani et al. 2015).
Information Sharing and Operational Bias: Comparative results at IAs during
trial conduct should be carefully guarded. However, even revealing study decisions
based on comparative interim data to those involved with the conduct or manage-
ment of a trial can lead to substantial bias and unpredictable trial complications. For
example, if a statistical plan is known in detail, and a study design changes in a
transparent manner such as increasing the sample size to a particular number, then it
may be possible for investigators to speculate, infer, or back-calculate treatment
effect results. Among other issues, this could compromise study integrity by affect-
ing enrollment and retention for patients currently enrolled and cause hesitancy among
sponsors to further support the trial. To limit this bias, the DMC (or other body tasked with evaluating the interim data) should include statistical expertise and a clear understanding of the specified design, and should be independent from those directly involved with the conduct of the trial (Bhatt and Mehta 2016;
FDA 2018). Data coordinating centers are useful in creating separation between
those directly conducting a trial on a day-to-day level and those responsible for
analyzing and reporting IA findings to the DMC. Additionally, a study could
consider reporting the details of AD algorithms somewhere other than a public
study protocol, such as in a DMC charter.
Role of Simulations: Often, ADs, despite their pre-specified analysis plan, involve
complicated components, with hard-to-discern operating characteristics under pos-
sible true conditions, such as treatment effects, event rates, and nuisance parameters.
In order to justify the validity and advantages of a design, extensive simulation
studies can be conducted. Simulation studies use clinical knowledge, programming,
and computing technology to create a virtual clinical trial framework that incorpo-
rates all IA rules, including SSR, early stopping, and treatment allocation changes.
Simulations can provide information about expected study duration and can quantify power, expected sample sizes, and potential biases. By conducting a sensitivity
analysis, researchers can justify their design and explore optimization of study
components such as critical value thresholds and sample sizes (FDA 2018; Pallmann
et al. 2018).
Generalizability: After adaptation, the results of a trial may not be generalizable
to the original study population. For example, in a trial using enrichment strategies,
demographics may change over the course of the study based on selected biomarker
groups, and study results may be restricted to a particular subgroup. Additionally, a flexible design with study changes at IA, such as to the sample size, participant entry criteria, or primary study endpoint, may lead to stagewise results that are not similar enough to support a broad interpretation of the resulting hypothesis test. Careful
consideration is necessary when interpreting study conclusions for trials with IAs
(Pallmann et al. 2018).
Sponsors and regulators may reach agreement for certain details (such as exact SSR procedures) to be excluded from the
publicly available protocol and documented separately (e.g., in a DMC charter).
Data Monitoring Committee: DMCs are composed of some combination of statisticians, epidemiologists,
pharmacists, ethicists, patient advocates, and others who are responsible for
overseeing IAs in RCTs (Ellenberg et al. 2003). To maintain trial integrity, it is
important that the DMC is both intellectually and financially independent from those
conducting and sponsoring the trial. The DMC should ideally be involved in
protocol development and approval to ensure that the entire team shares a common understanding and that the committee understands the design and its responsibilities. At IAs, the
DMC considers participant recruitment and compliance to treatment, intervention
safety, quality of study conduct, proper measurement of response, and the primary
and secondary outcome results. It is recommended that the DMC be unblinded when conducting or reviewing IAs, and its recommendations should follow the previously agreed-upon protocol whenever possible. Interim data should be reviewed for
intervention efficacy and safety signals, recruitment rates, and subgroup indications.
Open reports containing pooled, aggregate information (such as enrollment rates and
serious adverse events) can be shared more broadly, and confidential closed reports
with specific efficacy information are generated separately for closed DMC review.
The DMC should then pass recommendations to a blinded trial steering committee,
whose role is to oversee the trial conduct (Pallmann et al. 2018).
Interactions with the FDA: In addition to reviewing protocol information and
study design properties, the FDA often reviews marketing applications that highlight
the results of a completed RCT. These can include new drug applications (NDAs) or
biologic license applications (BLAs). In an AD setting, the FDA’s primary concerns
of safety and efficacy are coupled with complicated design components that may not
be readily understood without further communication. As a result, applications with ADs are often reviewed with greater scrutiny than those with nonadaptive designs. As
described previously, simulations and other justifications are required to rationalize
the advantages of a complicated analysis plan with the chosen study parameters. It is
also important that the FDA is able to see that the overall type I error rate is
controlled despite repeated IAs. Unless patient safety is at risk, results from IAs in
ongoing trials are generally not shared with the FDA until the conclusion of the trial
(FDA 2018).
Reporting: In the United States, most RCTs that involve human volunteers are
required to be registered through the National Library of Medicine at ClinicalTrials.
gov in order to provide transparency and give the public, patients, caregivers, and
clinical researchers access to trial information. ClinicalTrials.gov is a database that
summarizes publicly available information and results for registered domestic and
international clinical trials. Additionally, a clinical trial should meet the minimum
reporting standards outlined in the CONSORT guideline, last updated in 2010
(Moher et al. 2010). This guideline helps investigators worldwide provide complete,
transparent reporting with regard to trial design, participant recruitment, statistical
methods, and results. The guideline specifically mentions the necessity of reporting
IAs (item 7b), regardless of whether or not pre-specified rules are used in decision-
making. Investigators are required to report how many interim looks the DMC completed, along with their purpose and the statistical methods implemented.
Group Sequential Trials: Consider the Beta-Blocker Heart Attack Trial (Beta-
Blocker Heart Attack Study Group 1981), a large, multicenter, double-blind RCT enrolling patients with recent myocardial infarction. The primary aim was to
compare mortality in patients taking the active medication, propranolol, versus a
placebo. After the independent DMC determined a clear treatment benefit, the trial
was terminated 9 months early. The interim data revealed 7% mortality in the
propranolol group (135 deaths), compared to 9.5% mortality in the placebo group
(183 deaths). In making their recommendation, the DMC ensured no potential
confounding existed due to baseline demographics, study compliance, and unantic-
ipated side effects. They were able to conclude that the outcome was unlikely to
change if the study continued and, for ethical reasons, it was appropriate to dissem-
inate the trial information as quickly as possible. The study is notable as one of the
first major trials to incorporate relatively new GSM monitoring. The rules, based on
O’Brien-Fleming-type stopping bounds, were not originally part of the trial design
but were incorporated early in the trial implementation. Other examples of pre-
specified GSM trials that stopped early include the Herceptin Adjuvant Trial,
designed to test the effect of trastuzumab after adjuvant chemotherapy in those
diagnosed with HER2-positive breast cancer, and the EXAMINE trial, which studied
the use of alogliptin in patients who were considered to be high risk for cardiovas-
cular disease (Piccart-Gebhart et al. 2005; White et al. 2013). Finally, a GSM design
combined with information-based SSR was utilized in a multinational study of
vitamin D and chronic kidney disease, termed the PRIMO study. The protocol
allowed for interim nuisance parameter estimates to modify uncertain design
assumptions, and the interim treatment effect was used to influence efficacy-based
decision rules. Ultimately, it was found at IA that no SSR was necessary to achieve at
least 85% statistical power (Pritchett et al. 2011).
Changing Hypotheses: The EXAMINE trial had an additional adaptive feature in
which the hypothesis to be tested could vary from superiority to non-inferiority. The
protocol would have allowed the trial to proceed for an additional 100 primary events if,
at interim, the probability of detecting superiority, assuming it existed, was above
20%. The IA found that the conditional power was less than 20%, and the study was
terminated early, declaring non-inferiority (Bhatt and Mehta 2016; White et al.
2013).
Potential Uncertainty and Bias: Consider the recently completed TOTAL trial, which studied thrombectomy as an adjunct to traditional percutaneous coronary intervention (PCI) for ST-segment elevation myocardial infarction (STEMI). At an IA
(n = 2,791), the combined intervention group had an observed lower death rate than those with only PCI (p = 0.025), along with no significant evidence of a difference in stroke occurrence; one could deduce from this information that thrombectomy was beneficial. At the final analysis, however, there was no significant mortality benefit, and stroke risk was higher in the thrombectomy group, illustrating the risk of overinterpreting interim data (Jolly et al. 2018).
Multi-arm Multi-stage Designs: An adaptive multi-arm, multi-stage cluster-randomized trial in Malawi compared standard of care to interventions offering secondary distribution of HIV self-testing coupled with different incentives (Choko et al. 2019). At the IA, enrollment was discontinued in
groups not showing improvement over the standard of care (p-value >0.2); in stage
2, randomized enrollment continued equally for the remaining groups. Results
showed that secondary distribution of HIV self-testing increased the number of
males being tested for HIV, and with incentives, men were more likely to access
care and prevention services.
Sample Size Re-estimation: The CHAMPION PHOENIX trial, which evaluated
the effects of cangrelor on ischemic complications of percutaneous coronary inter-
vention, incorporated SSR to adjust the trial if observed relative risks differed from
the assumed rates (Leonardi et al. 2012). The IA was performed after 70% of patients
had completed a short follow-up. The sample size was to be increased if the DMC
found the interim results to be in a “promising zone” (as opposed to being clearly
favorable or unfavorable) (Bhatt and Mehta 2016). Ultimately, the sample size was
not increased since the results were “favorable” at IA. In another example, the CARISA trial was a double-blind, three-group parallel trial to determine whether
ranolazine improves treadmill exercise duration of patients with severe chronic
angina (Chaitman et al. 2004). An IA based on an updated standard deviation
using aggregate data was scheduled after half of the patients were followed for
12 weeks. This “internal pilot”-based SSR allowed the study to maintain stable
statistical power despite incorrect initial assumptions.
Enrichment Designs: In a two-stage adaptive enrichment study to test rizatriptan
for the treatment of acute migraines in people aged 6–17, the first stage randomized
participants at a 20:1 ratio of placebo to intervention (Ho et al. 2012). To enrich the
sample by excluding false responders, any patients who noted a quick improvement
in migraine symptoms after the first stage were dropped. Of the remaining non-
responders, those who took the active treatment in stage 1 were allocated to placebo
in stage 2, whereas those assigned to the placebo in stage 1 were randomized equally
to rizatriptan and placebo in stage 2. Ultimately, efficacy of rizatriptan was shown,
and the drug is now approved by the FDA for acute migraine treatment in this age
group.
Discussion
Two discussion points are worth further consideration: the development and poten-
tial of ADs using Bayesian methodology and the evolving need for infrastructure in
the implementation of RCTs incorporating novel and complicated IAs.
Bayesian Methods: Statistical methods using the Bayesian framework com-
bine prior information with new information to update posterior distributions of
interest. While the use of Bayesian methods in early phase ADs has been
accepted for years (Garrett-Mayer 2006), interest in their potential use in confir-
matory ADs has ramped up considerably over the last decade (Berry et al. 2010;
Brakenhoff et al. 2018). To those in the field, the “learn as you go” nature of ADs
seems like a natural fit for Bayesian reasoning.
IAs in clinical trials are powerful tools that, when properly employed, greatly benefit clinical efficiency, ethics, and the chance of a successful trial. Pre-specified ADs in
particular (including GSMs) have the advantage of known possible decisions and
enumerated study operating characteristics being available for scrutiny before a trial
begins. Flexible designs incorporating unplanned analyses while controlling type I
error rate can also be useful when unanticipated situations occur during trial conduct.
As design understanding and ease of implementation catch up to methodological
development, advanced IA designs will benefit health research and patient outcomes
in the decades ahead.
References
Anderson J, High R (2011) Alternatives to the standard Fleming, Harrington, and O’Brien futility
boundary. Clin Trials 8(3):270–276. https://doi.org/10.1177/1740774511401636
Armitage P, McPherson C et al (1969) Repeated significance tests on accumulating data. J R Stat
Soc Ser A 132(2):235–244. https://doi.org/10.2307/2343787
Bassler D, Briel M et al (2010) Stopping randomized trials early for benefit and estimation of
treatment effects: systematic review and meta-regression analysis. JAMA 303(12):1180–1187.
https://doi.org/10.1001/jama.2010.310
Bauer P, Kohne K (1994) Evaluation of experiments with adaptive interim analyses. Biometrics 50
(4):1029–1041. https://doi.org/10.2307/2533441
Berry S, Carlin B et al (2010) Bayesian adaptive methods for clinical trials. CRC Press, Boca Raton
Berry S, Broglio K et al (2013) Bayesian hierarchical modeling of patient subpopulations: efficient
designs of Phase II oncology clinical trials. Clin Trials 10(5):720–734. https://doi.org/10.1177/
1740774513497539
Beta-Blocker Heart Attack Study Group (1981) The beta-blocker heart attack trial. JAMA 246
(18):2073–2074
Bhatt D, Mehta C (2016) Adaptive designs for clinical trials. N Engl J Med 375(1):65–74.
https://doi.org/10.1056/NEJMra1510061
Brakenhoff T, Roes K et al (2018) Bayesian sample size re-estimation using power priors. Stat
Methods Med Res. https://doi.org/10.1177/0962280218772315
Brannath W, Koenig F et al (2007) Multiplicity and flexibility in clinical trials. Pharm Stat J Appl
Stat Pharm Ind 6(3):205–216. https://doi.org/10.1002/pst.302
Burman C, Sonesson C (2006) Are flexible designs sound? Biometrics 62(3):664–669. https://doi.
org/10.1111/j.1541-0420.2006.00626.x
Chaitman B, Pepine C et al (2004) Effects of ranolazine with atenolol, amlodipine, or diltiazem on
exercise tolerance and angina frequency in patients with severe chronic angina: a randomized
controlled trial. JAMA 291(3):309–316. https://doi.org/10.1001/jama.291.3.309
Chen Y, Gesser R et al (2015) A seamless phase IIb/III adaptive outcome trial: design rationale and
implementation challenges. Clin Trials 12(1):84–90. https://doi.org/10.1177/
1740774514552110
Choko A, Corbett E et al (2019) HIV self-testing alone or with additional interventions, including
financial incentives, and linkage to care or prevention among male partners of antenatal care
clinic attendees in Malawi: an adaptive multi-arm, multi-stage cluster randomized trial. PLoS
Med 16(1). https://doi.org/10.1371/journal.pmed.1002719
Coffey C, Levin B et al (2012) Overview, hurdles, and future work in adaptive designs: perspectives
from a National Institutes of Health-funded workshop. Clin Trials 9(6):671–680. https://doi.org/
10.1177/1740774512461859
Cui L, Hung H et al (1999) Modification of sample size in group sequential clinical trials.
Biometrics 55(3):853–857. https://doi.org/10.1111/j.0006-341X.1999.00853.x
Ellenberg S, Fleming T et al (eds) (2003) Data monitoring committees in clinical trials: a practical
perspective. Wiley, Chichester
European Medicines Agency (2007) Reflection paper on methodological issues in confirmatory
clinical trials planned with an adaptive design. Retrieved from http://www.ema.europa.eu
Fleming T, Harrington D et al (1984) Designs for group sequential tests. Control Clin Trials 5
(4):349–361. https://doi.org/10.1016/S0197-2456(84)80014-8
Food and Drug Administration (2018) Adaptive designs for clinical trials of drugs and biologics:
guidance for industry. Retrieved from https://www.fda.gov
Friede T, Kieser M (2011) Blinded sample size recalculation for clinical trials with normal data and
baseline adjusted analysis. Pharm Stat 10(1):8–13. https://doi.org/10.1002/pst.398
Garrett-Mayer E (2006) The continual reassessment method for dose-finding studies: a tutorial. Clin
Trials 3(1):57–71. https://doi.org/10.1191/1740774506cn134oa
Gould A, Shih W (1992) Sample size re-estimation without unblinding for normally distributed
outcomes with unknown variance. Commun Stat Theory Methods 21(10):2833–2853.
https://doi.org/10.1080/03610929208830947
Heart Special Project Committee (1988) Organization, review and administration of cooperative
studies (Greenberg report): a report from the Heart Special Project Committee to the National
Advisory Council, May 1967. Control Clin Trials 9:137–148
Ho T, Pearlman E et al (2012) Efficacy and tolerability of rizatriptan in pediatric migraineurs: results
from a randomized, double-blind, placebo-controlled trial using a novel adaptive enrichment
design. Cephalalgia 32(10):750–765. https://doi.org/10.1177/0333102412451358
Hommel G (2001) Adaptive modifications of hypotheses after an interim analysis. Biom J 43(5):581–
589. https://doi.org/10.1002/1521-4036(200109)43:5<581::AID-BIMJ581>3.0.CO;2-J
Hung H, O’Neill R et al (2006) A regulatory view on adaptive/flexible clinical trial design. Biom
J 48(4):565–573. https://doi.org/10.1002/bimj.200610229
Jennison C, Turnbull B (2006) Efficient group sequential designs when there are several effect sizes
under consideration. Stat Med 25(6):917–932. https://doi.org/10.1002/sim.2251
Jolly S, Gao P et al (2018) Risks of overinterpreting interim data: lessons from the TOTAL trial
(thrombectomy with PCI versus PCI alone in patients with STEMI). Circulation 137
(2):206–209. https://doi.org/10.1161/CIRCULATIONAHA.117.030656
Kairalla J, Coffey C et al (2012) Adaptive trial designs: a review of barriers and opportunities. Trials
13(1):145. https://doi.org/10.1186/1745-6215-13-145
Kimani P, Todd S et al (2015) Estimation after subpopulation selection in adaptive seamless trials.
Stat Med 34(18):2581–2601. https://doi.org/10.1002/sim.6506
Krams M, Lees K et al (2003) Acute stroke therapy by inhibition of neutrophils (ASTIN). Stroke 34
(11):2543–2548. https://doi.org/10.1161/01.STR.0000092527.33910.89
Lachin J (2005) A review of methods for futility stopping based on conditional power. Stat Med 24
(18):2747–2764. https://doi.org/10.1002/sim.2151
Lan K, DeMets D (1983) Discrete sequential boundaries for clinical trials. Biometrika 70:659–663.
https://doi.org/10.1093/biomet/70.3.659
Leonardi S, Mahaffey K et al (2012) Rationale and design of the Cangrelor versus standard therapy
to achieve optimal Management of Platelet Inhibition PHOENIX trial. Am Heart J 163
(5):768–776. https://doi.org/10.1016/j.ahj.2012.02.018
Mehta C, Pocock S (2011) Adaptive increase in sample size when interim results are promising: a
practical guide with examples. Stat Med 30(28):3267–3284. https://doi.org/10.1002/sim.4102
Moher D, Hopewell S et al (2010) CONSORT 2010 explanation and elaboration: updated guide-
lines for reporting parallel group randomised trials. J Clin Epidemiol 63(8):e1–e37. Retrieved
from www.consort-statement.org
O’Brien P, Fleming T (1979) A multiple testing procedure for clinical trials. Biometrics 35
(3):549–556. https://doi.org/10.2307/2530245
Pallmann P, Bedding A et al (2018) Adaptive designs in clinical trials: why use them, and how to
run and report them. BMC Med 16(1):29. https://doi.org/10.1186/s12916-018-1017-7
Piccart-Gebhart M, Procter M et al (2005) Trastuzumab after adjuvant chemotherapy in HER2-
positive breast cancer. N Engl J Med 353(16):1659–1672. https://doi.org/10.1056/
NEJMoa052306
Pocock S (1977) Group sequential methods in the design and analysis of clinical trials. Biometrika
64(2):191–199. https://doi.org/10.1093/biomet/64.2.191
Pritchett Y, Jemiai Y et al (2011) The use of group sequential, information-based sample size re-
estimation in the design of the PRIMO study of chronic kidney disease. Clin Trials 8
(2):165–174. https://doi.org/10.1177/1740774511399128
Proschan M (2009) Sample size re-estimation in clinical trials. Biom J 51(2):348–357.
https://doi.org/10.1002/bimj.200800266
Proschan M, Hunsberger S (1995) Designed extension of studies based on conditional power.
Biometrics 51(4):1315–1324. https://doi.org/10.1016/0197-2456(95)91243-6
Savitz J, Teague T et al (2018) Treatment of bipolar depression with minocycline and/or aspirin: an
adaptive, 2 × 2 double-blind, randomized, placebo-controlled, phase IIA clinical trial. Transl
Psychiatry 8(1):27. https://doi.org/10.1038/s41398-017-0073-7
Shimura M (2019) Reducing overestimation of the treatment effect by interim analysis when
designing clinical trials. J Clin Pharm Ther 44(2):243–248. https://doi.org/10.1111/jcpt.12777
Stallard N, Todd S (2010) Seamless phase II/III designs. Stat Methods Med Res 20(6):626–634.
https://doi.org/10.1177/0962280210379035
Tsiatis A (2006) Information-based monitoring of clinical trials. Stat Med 25(19):3236–3244.
https://doi.org/10.1002/sim.2625
Tsiatis A, Mehta C (2003) On the inefficiency of the adaptive design for monitoring clinical trials.
Biometrika 90(2):367–378. https://doi.org/10.1093/biomet/90.2.367
White W, Cannon C et al (2013) Alogliptin after acute coronary syndrome in patients with type 2
diabetes. N Engl J Med 369(14):1327–1335. https://doi.org/10.1056/NEJMoa1305889
Wittes J, Brittain E (1990) The role of internal pilot studies in increasing the efficacy of clinical
trials. Stat Med 9(1–2):65–72. https://doi.org/10.1002/sim.4780090113
Part VI
Advanced Topics in Trial Design
60 Bayesian Adaptive Designs for Phase I Trials
Michael J. Sweeting, Adrian P. Mander, and Graham M. Wheeler
Contents
Introduction 1106
Escalation with Overdose Control (EWOC) 1108
  Example 1: Dose-Escalation Cancer Trial 1110
  Varying the Feasibility Bound 1111
  Toxicity-Dependent Feasibility Bounds 1111
  Software 1112
Time-to-Event Endpoints 1112
  Example 2: Dose Escalation of Cisplatin in Pancreatic Cancer 1113
  Software 1114
Toxicity Grading 1114
  Ordinal Toxicity Gradings 1115
  Toxicity Score Approach 1115
  Software 1116
Dual Endpoints 1116
  The EffTox Design 1116
  Example 3: The Matchpoint Trial 1118
  Other Approaches for Joint Modeling of Efficacy and Toxicity 1119
M. J. Sweeting (*)
Department of Health Sciences, University of Leicester, Leicester, UK
Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
e-mail: [email protected]; [email protected]
A. P. Mander
Centre for Trials Research, Cardiff University, Cardiff, UK
e-mail: [email protected]
G. M. Wheeler
Imperial Clinical Trials Unit, Imperial College London, London, UK
Cancer Research UK & UCL Cancer Trials Centre, University College London, London, UK
e-mail: [email protected]
Abstract
Phase I trials mark the first experimentation of a new drug or combination of
drugs in a human population. The primary aim of a cancer phase I trial is to seek
a safe dose or range of doses suitable for phase II experimentation. Bayesian
adaptive designs have long been proposed to allow safe dose escalation and
dose finding within phase I trials. There are now a vast number of designs
proposed for use in phase I trials, though widespread application of these designs
is still limited. More recent designs have focused on the incorporation of
multiple sources of information into dose-finding algorithms to improve trial
safety and efficiency. This chapter reviews some of the papers that extend the
simple dose-escalation trial design with a binary toxicity outcome. Specifically,
the chapter focuses on five key topics: (1) overdose control, (2) use of partial
outcome follow-up, (3) grading of toxicity outcomes, (4) incorporation of both
toxicity and efficacy information, and (5) dual-agent or dose-scheduling
designs. Each extension is illustrated with an example from a real-life trial
with reference to freely available software. These extensions open the way to a
broader class of phase I trials being conducted, leading to safer and more
efficient trials.
Keywords
Dose finding · Dose escalation · Phase I trial design · Toxicity · CRM
Introduction
Phase I trials mark the first experimentation of a new drug in a human population. A
primary objective is to identify tolerable doses while ensuring the trial is safe,
acknowledging the necessary balance of risk versus benefit for participants. In
oncology phase I trials, cytotoxic anticancer drugs may have severe toxicity at
high doses, yet at low doses little efficacy is expected from the drug. A goal of
such trials is therefore to minimize the number of patients allocated to ineffective or
excessively toxic doses, and efficient trial designs are required to achieve this and to
meet ethical considerations (Jaki et al. 2013).
Phase I trials are conducted as dose-escalation studies, where the dose of the drug
under consideration can be adapted as new patients are sequentially recruited into the
trial, using dose and outcome data from previously enrolled patients. Designs are
often inherently Bayesian in nature since decisions about dose escalation must be
made early in the trial when few or no results are available, and thus prior beliefs of
the dose-toxicity relationship (and corresponding uncertainty) are often needed. The
key quantity in most phase I dose-escalation trials is the maximum tolerated dose
(MTD). This is often defined as the dose that has a probability of dose-limiting
toxicity (DLT) that is equal to a prespecified target toxicity limit (TTL), which is
commonly chosen in cancer trials to be between 20% and 33% (Le Tourneau et al.
2009). A DLT is defined as a drug-induced toxic effect or severe adverse event that is
considered unacceptable due to its severity or irreversibility, thus preventing an
increase in the dose of the treatment. This definition of the MTD assumes that
there is an underlying continuous dose-toxicity relationship, and is central to most
model-based phase I designs. An alternative set of designs, called rule-based
designs, define the MTD based on the observed proportion of patients in the trial
that experience a DLT at a dose level. These designs are not considered in this
chapter.
It has been over 30 years since the seminal publication of the continual
reassessment method (CRM) (O’Quigley et al. 1990), which was proposed as a
model-based adaptive design for dose-escalation phase I trials. The CRM in its
simplest form is a one-parameter dose-toxicity model that uses previous dose and
DLT outcomes to assign new patients to dose levels as they enter the trial and aims to
estimate the MTD. The CRM, and most designs for phase I trials, is based on the
assumption of monotonicity, whereby the probability of observing a DLT increases
with dose. Since the key interest is in estimating the MTD, a model with a single
parameter is sufficient for local estimation of dose response (i.e., if focus is on a
single point estimate) (O’Quigley et al. 1990). However, phase I trials may require
more complex designs that consider other features, such as limiting the chance of
severe overdosing, using partial data from patients who are still under follow-up for
DLTs, using toxicity outcomes based on graded responses rather than a dichotomous
outcome (DLT or no DLT), considering both toxicity and efficacy outcomes, and
designing trials where two drugs are to be administered and their dose levels adapted
in combination.
The focus of this chapter is to provide a broad overview of some of the more
advanced issues in model-based (Bayesian) adaptive designs for phase I trials and
key considerations that have led to these designs being proposed. The chapter is not
intended to be all-encompassing, but should provide the reader with a flavor of some
of the methodological developments in the area that extend the CRM approach, and
to highlight practical considerations for researchers wishing to apply these methods.
Examples from real-life trials are given throughout the chapter, along with recommendations of freely available software for applying the methods. For a more
in-depth discussion of the CRM and some of its earlier extensions, readers should
refer to the ▶ chapter 53, “Dose-Finding and Dose-Ranging Studies” by Conaway
and Petroni in section ▶ “Basics of Trial Design” of this book. While this chapter
covers some recent designs for dual-agent phase I trials, a more comprehensive
discussion of designs for drug combination dose finding is given in section
▶ “Basics of Trial Design” of this book by Tighiouart.
Escalation with Overdose Control (EWOC)
The CRM was the first model-based adaptive design for phase I dose escalation
studies and has been implemented in both Bayesian and frequentist frameworks
(O’Quigley et al. 1990; O’Quigley and Shen 1996). The design makes use of the
dose and toxicity data accumulating as the trial progresses to make dose selection
decisions, giving it a significant advantage over traditional rule-based designs such
as the 3 + 3 method (Iasonos et al. 2008; Le Tourneau et al. 2009). Nevertheless, a
number of modifications have been proposed since the original design to counter
safety concerns about possible overdosing. These include rules that sometimes
override model recommendations including always starting at the lowest dose,
avoiding dose-skipping when escalating, and treating more than one patient at
each dose level (Faries 1994; Korn et al. 1994; Goodman et al. 1995; Piantadosi
et al. 1998).
An alternative approach to control overdosing is to modify the CRM dose-finding
algorithm. After each patient, the CRM estimates the posterior distribution of the
MTD and uses the middle of the distribution (e.g., the mean or median) to recom-
mend the dose to administer to the next patient. However, at least early on in the trial,
the posterior mean or median MTD estimate may fluctuate wildly, leading to some patients receiving doses far above the true MTD. Overdosing can also occur if the
prespecified model is incorrect (Le Tourneau et al. 2009). To overcome this problem
Babb et al. (1998) developed the Escalation With Overdose Control (EWOC) design,
which modifies the CRM so that it recommends the α quantile of the MTD
distribution to the next patient, where α < 0.5. The quantile, α, is known as the
feasibility bound and governs the predicted chance of overdosing in the trial. For
each successive patient the predicted probability of overdosing is α, whereas for the
CRM using the median of the MTD distribution the predicted probability is 0.5. Low
values of α will result in more cautious escalation; the trade-off for this cautious
escalation is that the dose sequence allocated through the trial will generally take
longer to converge to the true MTD. In notation, let $F_n(x) = P(\mathrm{MTD} \le x \mid \mathcal{D}_n)$ denote the probability that the MTD is less than or equal to dose $x$ given the data $\mathcal{D}_n$ collected from the previous $n$ patients, namely the doses allocated $x_1, \ldots, x_n$ and the corresponding $n$ DLT outcome indicator variables, $y_1, \ldots, y_n$. The EWOC design selects the dose $x_{n+1}$ for patient $n+1$ such that
$$F_n(x_{n+1}) = \alpha.$$
Equivalently, this dose minimizes the posterior expected loss under the asymmetric loss function
$$l_\alpha(x, \gamma) = \begin{cases} \alpha(\gamma - x), & x \le \gamma \\ (1-\alpha)(x - \gamma), & x > \gamma, \end{cases}$$
where $\gamma$ is the true MTD. Hence a higher penalty is given to overdosing, and this implies that treating a patient $\delta$ units above the MTD is $(1-\alpha)/\alpha$ times worse than treating them $\delta$ units below the MTD.
In practice, with a discrete set of dose levels $d_1, \ldots, d_K$, the EWOC design selects the dose that is within a certain tolerance, $T_1$, of the EWOC target dose $x_{n+1} = F_n^{-1}(\alpha)$ and for which the predicted probability of the MTD being less than the dose is within a certain tolerance, $T_2$, of the feasibility bound. For patient $n+1$ the next recommended dose is therefore
$$\max\left\{d_1, \ldots, d_K : d_i - x_{n+1} \le T_1 \ \text{and} \ F_n(d_i) - \alpha \le T_2\right\}.$$
A dose-toxicity model often used with the EWOC method is the two-parameter logistic model, where
$$\text{logit}\{\pi(x)\} = \beta_0 + \beta_1 x,$$
$\pi(x)$ is the probability of a DLT, and $x$ is the dose, either on the original dose scale or standardized. For example, given a reference dose $x_R$ and using $\log(x/x_R)$ as a standardized dose, the intercept $\beta_0$ has the interpretation of being the log-odds of toxicity at the reference dose (see, e.g., Neuenschwander et al. 2008). By placing a bivariate normal prior distribution on the parameters $\beta_0$ and $\log(\beta_1)$, we ensure a monotonically increasing dose-toxicity relationship, since $\beta_1$ (the slope) is forced to be positive. An alternative parameterization, originally proposed in the EWOC formulation (Babb et al. 1998), is to define $\rho_0 = \pi(x_{\min}) = p(\mathrm{DLT} \mid \text{dose} = x_{\min})$ as the probability of a DLT at the lowest dose, $x_{\min}$, and $\gamma$ as the MTD. Then it can be shown that
$$\text{logit}(\rho_0) = \beta_0 + \beta_1 x_{\min}$$
and
$$\text{logit}(\theta) = \beta_0 + \beta_1 \gamma,$$
where $\theta$ is the TTL. The rationale for the re-parameterization is that it may be easier to specify prior distributions for $\gamma$ and $\rho_0$, which can then be translated to priors for $\beta_0$ and $\beta_1$ (using MCMC, for example). In a phase I trial of 5-fluorouracil (5-FU), Babb et al. (1998) propose independent Uniform($x_{\min}$, $x_{\max}$) and Uniform(0, $\theta$) prior distributions for $\gamma$ and $\rho_0$, respectively, which forces the MTD to lie in the prespecified dose range. In further investigations by Tighiouart et al. (2005), a joint prior for $\gamma$ and $\rho_0$ with a negative correlation structure was found to perform well and generally resulted in a safer trial. An issue with this parameterization and choice of priors is that the MTD has prior (and hence posterior) probability 1 of lying between $x_{\min}$ and $x_{\max}$. One solution, proposed by Tighiouart et al. (2018), is to reparameterize the EWOC model in terms of $\rho_0$ and $\rho_1$, the probabilities of DLT at the minimum and maximum doses, respectively.
Fig. 1 Posterior distribution of the maximum tolerated dose (MTD) from Example 1 after five cohorts of patients have been recruited, with the $\alpha = 0.25$ and $0.5$ quantiles, $F_n^{-1}(0.25)$ and $F_n^{-1}(0.5)$, marked
Example 1: Dose-Escalation Cancer Trial

Figure 1 shows the posterior distribution of the MTD after five cohorts of patients have been recruited, where the dose axis is truncated to doses of 40 mg and below. The potential doses that can be tested within the trial are shown as points on the x-axis. At a feasibility bound of $\alpha = 0.25$, the inverse cumulative distribution function of the MTD, denoted $F_n^{-1}(0.25)$ on the figure, is 18.2 mg. To choose from the discrete dose levels, suppose we set strict thresholds such that the next dose is not more than 1 mg above $F_n^{-1}(0.25)$ and the probability that the MTD is below the next dose is not more than $\alpha + 0.05 = 0.30$; that is, we set $T_1 = 1$ and $T_2 = 0.05$. Dose 20 mg does not satisfy the first criterion, and therefore the recommended next dose would be 15 mg, which satisfies both constraints. This contrasts with a recommended next dose of 25 mg if the median of the distribution, $F_n^{-1}(0.5)$, is used, which would correspond to the sixth cohort receiving the same dose as the fifth cohort.
Varying the Feasibility Bound

Different proposals have been made for choosing the final dose recommended for phase
II study at the end of an EWOC trial (Babb et al. 1998; Berry et al. 2010). One
potentially undesirable feature from choosing a central estimate from the posterior
MTD distribution (e.g., mean, median, or mode) is that the estimate may be larger
than any dose experimented on in the trial. It may also be undesirable to choose a dose
that would be given if a new patient were recruited into the trial, based on the feasibility
bound, since the final recommended dose is then acknowledged to have posterior
probability of (1 – α) being less than the MTD. An alternative approach originally
proposed by Babb and Rogatko (2001) and later by Chu et al. (2009) is to vary the
feasibility bound as the trial progresses; specifically increasing the bound until it reaches
0.5, at which point the EWOC method would behave like a CRM (with decisions based
on the posterior median). The rationale is that early on in the trial there is a lot of
uncertainty as to the value of the MTD and hence there is more chance of administering
doses that are much greater than the MTD. While, once a number of patients have been
recruited, the magnitude of overdosing will be less and hence the feasibility bound can
be raised. This hybrid approach should therefore converge quicker to the MTD than the
traditional EWOC method while also ensuring that the recommended phase II dose
coincides with the central estimate from the MTD distribution.
Toxicity-Dependent Feasibility Bounds

Increasing the feasibility bound during the trial is often done using a step-wise procedure. However, it is possible that the approach can lead to incoherence; that is, despite the most recent patient experiencing a DLT, the recommendation may be to treat the next patient at a higher dose (Wheeler et al. 2017). While both the
unmodified CRM and EWOC approaches have been shown to be coherent (the latter for $n \ge 2$) (Cheung 2005; Tighiouart and Rogatko 2010), coherence violations may occur using the EWOC approach with an increasing feasibility bound (Wheeler 2018). To overcome this issue, Wheeler et al. (2017) introduced a toxicity-dependent
feasibility bound that guarantees coherence, with the bound increasing as a function of the number of non-DLT responses observed.
Software
EWOC-type designs can be fitted using a number of software packages. The bcrm
package in R allows the user to fit EWOC-type designs by specifying the quantile of
the MTD distribution that should be used for dose-escalation decisions (Sweeting
et al. 2013). The package allows users to conduct a trial interactively or to investigate
operating characteristics via simulation. However, the package only allows specifi-
cation of prior distributions on the regression parameters of the two-parameter
logistic model. Alternatively, the ewoc package, also in R (Diniz 2018), is specifi-
cally designed for EWOC designs, allowing the user to explicitly set priors for (the
probability of DLT at the minimum dose), and γ (the MTD). Users are limited,
however, to independent Beta prior distributions for ρ0 and γ, or priors can be placed
on ρ0 and ρ1, as proposed by Tighiouart et al. (2018). Finally, a Graphic User
Interface application by Dinart et al. (2020) is available to download, allowing
users to run and simulate EWOC trials with minimal programming experience
(https://github.com/ddinart/GUIP1).
Time-to-Event Endpoints
Many designs for dose-escalation studies require that for a patient’s DLT outcome to
be included in dose-escalation decision making, a patient must be observed until the
end of the DLT observation period, or until a DLT occurs, whichever time point is
first. In practice, patients may be recruited to trials whilst other patients are receiving
treatment. Therefore, complete outcomes will not be available for all patients, even
though a decision on dose allocation for the next patients is required. To accommo-
date these situations, partial DLT observations may be used to estimate DLT risks at
each dose level, conditional on the absence of a DLT up to the current time. This also
offers the benefit of reducing the overall trial duration.
Cheung and Chappell (2000) proposed an adaptation to the CRM design to
accommodate partial DLT outcomes, known as the Time-to-Event CRM (TITE-
CRM). Under the TITE-CRM, the likelihood for the single model parameter a is
weighted according to the proportion of each patient’s DLT window for which a DLT
has not been observed. That is, for patients 1, . . ., n, let xi and yi,t be the dose given
and current DLT outcome at time t for patient i, and let Δn be the set of all data for
patients 1, . . ., n. The likelihood for parameter a is defined as
$$L(a \mid \Delta_n, t) = \prod_{i=1}^{n} \{\pi(x_i; a)\}^{y_{i,t}} \{1 - w_{i,t}\,\pi(x_i; a)\}^{1 - y_{i,t}},$$
where wi,t = 0 if a DLT has been observed by time t (i.e., yi,t = 1) and wi,t = (t − ti,0)/T if yi,t = 0 and t ≤ T + ti,0, where ti,0 is the time at which patient i started treatment and T is the length of the DLT observation window. As t increases, the contribution of patient i to the likelihood, in the absence of a DLT, increases. The rest of the trial
design process is similar to that of the CRM.
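As a minimal illustration, the weighted likelihood above can be computed directly in R; the empiric "power" working model and all numbers below are assumptions for illustration, not values from any trial discussed here:

# Weighted TITE-CRM likelihood under the empiric model pi(x; a) = skeleton_x^exp(a)
tite_lik <- function(a, skeleton, dose, y, w) {
  p <- skeleton[dose]^exp(a)          # DLT risk at each patient's assigned dose
  prod(p^y * (1 - w * p)^(1 - y))     # weights only affect DLT-free patients
}
skeleton <- c(0.10, 0.15, 0.20, 0.25)  # assumed prior guesses of DLT risk per dose
dose <- c(1, 1, 2, 2)                  # assigned dose levels
y    <- c(0, 1, 0, 0)                  # current DLT outcomes
w    <- c(1, 0, 0.6, 0.3)              # w is immaterial when y = 1
tite_lik(0, skeleton, dose, y, w)

This likelihood would then be combined with the prior on a to update dose-escalation decisions, exactly as in the standard CRM.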
Extensions to the TITE-CRM have also been proposed. Braun et al. (2003)
extended the TITE-CRM to adapt the length of schedule (which they refer to as
dose) both between and within patients, in order to identify the Maximum Tolerated
Cumulative Dose (cumulative, as the schedule may change when a patient is on
treatment, and it is the total length of administration that is of interest to the
investigators). Braun (2006) also generalized the TITE-CRM approach to borrow
information on the timing of toxicity across patients. Furthermore, Mauguen et al.
(2011) and Tighiouart et al. (2014) have combined the TITE approach with the
EWOC trial design, thus allowing for overdose control methods to be used in dose
escalation studies with partial observations over a patient’s DLT window.
A similar approach was employed by Ivanova et al. (2016) in their Rapid
Enrollment Design (RED); rather than using the weighting structure of Cheung
and Chappell (2000), they proposed that a patient who has been followed up for a proportion t/T of their DLT window without a DLT being observed has experienced
1 – t/T of a temporary DLT. As t increases, the weighting ascribed to the patient’s
DLT risk goes down, and this in turn updates the likelihood. The subtle difference
between these two approaches is as follows: in the TITE-CRM, a patient who has
completed 70% of their DLT window without having a DLT is included as 0 DLTs
out of 0.7 patients; in the RED, the same patient would be included as 0.3 DLTs out
of one patient. TITE endpoints have also been included in the design of combination
therapy dose-escalation studies (Wages et al. 2013; Wheeler et al. 2019).
Muler et al. (2004) report the results of a phase I trial with the objective of
identifying the MTD of cisplatin when given with fixed doses of gemcitabine and
radiotherapy in patients with pancreatic cancer. The investigators planned to inves-
tigate four dose levels (20, 30, 40, and 50 mg/m2), with 30 mg/m2 as the starting dose
and the target toxicity level chosen as 20%. Dose-escalation decisions were
recommended using the TITE-CRM design, which used a one-parameter logistic
model, with an exponential prior distribution on the model parameter. DLT was
defined as either Grade 4 thrombocytopenia, Grade 4 neutropenia lasting more than
7 days, or any other adverse event of at least grade 3, and the DLT observation
window was 9 weeks from start of treatment. Prior to the study starting, the skeleton
DLT probabilities at each dose were chosen to be 10%, 15%, 20%, and 25%,
respectively.
Figure 2 shows the entry and follow-up times for the 18 patients who were
considered evaluable for toxicity. Patients 1 through 4 were allocated to the starting dose, with sufficient DLT-free follow-up observed before patient 5 was allocated to 40 mg/m2
(on June 26, 2000). Patient 9 was allocated to 50 mg/m2 on November 27, 2000.
Patients 11 and 12 experienced DLTs during follow-up at 50 mg/m2 leading to
patient 16 receiving the lower dose of 40 mg/m2 and patient 18 receiving 30 mg/
m2 following a further DLT in patient 17. At the end of the trial, the 40 mg/m2 dose
had an expected posterior DLT probability of 0.204 (95% credible interval 0.064–
0.427).
Software
Toxicity Grading
Much of the literature in dose-finding trial designs focuses on the binary DLT
outcome in order to identify the MTD. Toxicities are usually graded using the
Common Terminology Criteria for Adverse Events (CTCAE) published by the US
National Cancer Institute (http://evs.nci.nih.gov/ftp1/CTCAE/About.html) and in
turn the dose-limiting toxicities will be tailored to the trial in question. Toxicities
are grouped under different System Organ Classes, and each toxicity (e.g. “nausea,”
“dermatitis,” “neutrophil count decreased”) is graded from 0 (no toxicity) to 5 (death related to the adverse event).
A number of phase I designs have been proposed in the literature that incorporate
ordinal toxicity outcomes. Approaches have used either a proportional odds
(PO) model (Van Meter et al. 2011; Tighiouart et al. 2012), a continuation ratio
(CR) model (Van Meter et al. 2012), or a multinomial extension of the CRM power
model (Iasonos et al. 2011) to account for the ordinal toxicity outcome. The PO
model relies on the assumption that the odds of a more severe toxicity grade relative
to any less severe toxicity is constant among all possible toxicity grades (Van Meter
et al. 2012). That is, the odds that the toxicity grade is ≥2 versus <2 is the same as the odds that the grade is ≥3 versus <3, and so on. Meanwhile, the CR method models the
probability that the toxicity is at level g given it is greater than or equal to g but relies
on its own assumption of homogeneity of grade-specific dose effects (Cole and
Ananth 2001). However, with these assumptions, the models can focus on estimating
just one quantile of interest, namely the dose that gives the target probability of
observing grades of toxicity that define a DLT. Information from non-DLT grades is
used to refine the estimation of the relationship between dose and the common odds
ratio. To avoid assumptions imposed by the PO or CR methods, a nonparametric
approach has been proposed using a multidimensional isotonic regression estimator
(Paul et al. 2004). This allows nonparametric estimation of quantiles for each
toxicity grade subject to order constraints and based on a corresponding set of
prespecified probabilities for each grade.
There are several other approaches of note that collapse the ordinal toxicities into a
single equivalent toxicity score (between 0 and 1) such as a beta regression model
(Potthoff and George 2009) or a quasi-Bernoulli likelihood approach (Yuan et al.
2007; Ezzalfani et al. 2013). The latter uses a standard CRM model but requires a
clinically meaningful toxicity score to be assigned to each grade of toxicity.
Another approach uses an ordinal probit regression with a latent variable, for each
toxicity type under consideration (Bekele and Thall 2004; Lee et al. 2010). The
probability that the toxicity is at a given level (grade) g = 0, . . ., G is then modeled
using the probit model and G – 1 cutoff parameters. Bekele and Thall (2004) used a
multivariate ordinal probit regression approach that allowed multiple toxicity types
(myelosuppression, dermatitis, liver toxicity, nausea/vomiting, and fatigue), each
graded, to be modeled simultaneously with correlation. The authors then quantified
the severity of each toxicity type and grade by eliciting numerical weights. For each
dose under consideration the posterior expected probability of each toxicity type and
grade was multiplied by its associated severity weight and the sum of these across types
and grades gave the overall total toxicity burden (TTB) for that dose. Dose escalation
then proceeded by assigning the next patient the dose with TTB closest to a prespecified
target TTB (elicited through a set of scenario analyses with the oncologists).
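A minimal sketch of the total toxicity burden calculation in R; the probabilities and severity weights below are illustrative stand-ins for posterior expected values and elicited weights:

# TTB(dose) = sum over toxicity types and grades of
#             P(type t at grade g | dose) * severity weight w[t, g]
ttb <- function(prob, w) sum(prob * w)            # prob, w: types x grades matrices
prob <- matrix(c(0.30, 0.10, 0.25, 0.05), 2, 2)   # posterior expected probabilities
w    <- matrix(c(1.0, 2.5, 0.5, 3.0), 2, 2)       # elicited severity weights
ttb(prob, w)
# The next patient is assigned the dose whose TTB is closest to the target TTB.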
Software
The R package ordcrm allows the user to fit both the ordinal PO and CR CRM
models.
Dual Endpoints
The focus for most dose-escalation designs is purely on toxicity, and the common
assumption is that as dose increases, so do both the risk of toxicity and the efficacy
of the drug. However, it may be more prudent to model the dose-efficacy relationship
as well as dose-toxicity. Efficacy may even plateau after a certain dosage, and
therefore, increasing a dose with no increase in efficacy but a potential increase in
toxicity would be unwise. Therefore, many approaches have been proposed in order
to jointly model dose-efficacy and dose-toxicity outcomes.
Thall and Cook (2004) proposed what has come to be known as the EffTox Design, a
Bayesian approach that models the efficacy and toxicity risks per dose, and uses the
trade-off between toxicity and efficacy to select dose levels for new patients.
Specifically, logistic functions are assumed for the dose-toxicity and dose-efficacy curves, that is,

$$\text{logit}\,\pi_T(x) = \beta_{T,0} + \beta_{T,1} x$$

and

$$\text{logit}\,\pi_E(x) = \beta_{E,0} + \beta_{E,1} x + \beta_{E,2} x^2,$$

while the joint probability of toxicity outcome a and efficacy outcome b is

$$\pi_{a,b} = \pi_T^a (1 - \pi_T)^{1-a}\, \pi_E^b (1 - \pi_E)^{1-b} + (-1)^{a+b}\, \pi_T (1 - \pi_T)\, \pi_E (1 - \pi_E)\, \frac{e^{\phi} - 1}{e^{\phi} + 1},$$

which for patients 1, ..., n, gives the likelihood

$$L(\beta_E, \beta_T, \phi \mid \Delta_n) = \prod_{i=1}^{n} \pi_{a,b}(x_i)^{[a = a_i,\, b = b_i]}.$$
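To make the joint model concrete, here is a minimal R sketch (not the authors' code) of the Gumbel-type joint probability above; the argument values are illustrative assumptions:

# Joint probability of toxicity outcome a and efficacy outcome b, given the
# marginal risks pT and pE and the association parameter phi
pi_ab <- function(a, b, pT, pE, phi) {
  assoc <- (exp(phi) - 1) / (exp(phi) + 1)
  pT^a * (1 - pT)^(1 - a) * pE^b * (1 - pE)^(1 - b) +
    (-1)^(a + b) * pT * (1 - pT) * pE * (1 - pE) * assoc
}
# Sanity check: the four joint probabilities sum to 1
sum(sapply(0:1, function(a) sapply(0:1, function(b) pi_ab(a, b, 0.2, 0.5, 1))))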
As per Thall and Cook (2006) and Brock et al. (2017), prior beliefs on the
efficacy and toxicity at each dose must be elicited from the clinicians, along with
the prior Effective Sample Size (ESS). It is then possible to transform these prior
beliefs and ESS onto the model parameters {βT,0, βT,1, βE,0, βE,1, βE,2, ϕ} using
specialist software (EffTox Software, MD Anderson, https://biostatistics.
mdanderson.org/softwaredownload/SingleSoftware.aspx?Software_Id=2; or the
R package trialr). Thall et al. (2014) show how different prior effective sample
sizes affect the operating characteristics of the EffTox design, including the
probability of selecting each dose as the optimum dose and the probability of
terminating the trial early.
The key step under the EffTox approach is to define a utility function that reflects
the trade-offs between efficacy and toxicity that the trial team are willing to accept.
To do this, three target trade-offs are specified: π1 = (πT,1, 1), where πT,1 is the maximum toxicity level at which efficacy is guaranteed; π2 = (0, πE,2), where πE,2 is the minimum efficacy level at which toxicity is guaranteed not to occur; and π3 = (πT,3, πE,3), an intermediate target between the two marginal targets π1 and π2 that, with a contour fitted through all three target trade-offs, will provide a suitably steep contour to encourage escalation to doses that are estimated to have substantially higher efficacy probabilities with only a limited increase in toxicity risk (Yuan et al. 2017; Brock et al. 2017). Thall et al. (2014) use Lp norms to model the utility contours, specifically
$$u(\pi_T, \pi_E) = 1 - \left[ \left( \frac{1 - \pi_E}{1 - \pi_{E,2}} \right)^p + \left( \frac{\pi_T}{\pi_{T,1}} \right)^p \right]^{1/p},$$
where p determines the extent of the curvature of the contours. The utility function
u allows us to evaluate the desirability/utility of a dose level based on its estimated
probability of toxicity and efficacy. The value of p is obtained by solving u(πT, πE) = 0,
which denotes the neutral contour. We may then recommend a dose level for the next
patient that maximizes this utility, subject to any other constraints one may wish to use
in the trial. For example, if we have target minimum efficacy $\underline{\pi}_E$ and target maximum toxicity $\overline{\pi}_T$, then for selected cutoffs pE and pT, only doses that satisfy the following constraints are available for recommendation:

$$\Pr\left(\pi_E(x) \geq \underline{\pi}_E\right) > p_E$$

and

$$\Pr\left(\pi_T(x) \leq \overline{\pi}_T\right) > p_T.$$
Brock et al. (2017) described how they designed the Matchpoint trial, a dose-finding
study of Ponatinib plus chemotherapy in patients with chronic myeloid leukemia in
blastic transformation phase, using the EffTox design. The aim of the study was to
identify the dose of Ponatinib that produced a minimum efficacy response rate of
45%, with an acceptable toxicity level of at most 40%. Four doses were considered:
15 mg every second day, 15 mg daily, 30 mg daily, and 45 mg daily. Clinicians
specified prior toxicity and efficacy probabilities as shown in Table 1, and with the
help of the trial team, chose cutoffs for admissible doses of pE = 0.03 and pT = 0.05. These low thresholds ensured that doses would remain admissible even under weak beliefs about the efficacy and toxicity probabilities.
For their three target trade-off points, the team chose three points in the toxicity-
efficacy space that they felt had equal utility, and solved simultaneous equations to
identify what π1, π2, and π3 would be. This resulted in π1 = (0, 0.40) and π2 = (0.70, 1), giving p = 2.07. The resultant utility curves for different utility/desirability levels are shown in Fig. 3; the neutral contour is shown in blue and yields an interior target point of π3 = (0.4, 0.5). Any other point lying on this curve could be selected
as an interior target point. A trial-and-error approach was used to select the ESS
based on the operating characteristics from simulation studies, similar to the
approach of Thall et al. (2014). For the Matchpoint trial, the investigators set the
ESS as 1.3 to obtain prior distributions on their model parameters.
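As a minimal check of this elicitation, the reported value of p can be reproduced in R from the utility function above, using the Matchpoint targets πE,2 = 0.40 and πT,1 = 0.70 and the interior point (πT, πE) = (0.4, 0.5); this is a sketch, not the trial's software:

# EffTox utility with the Matchpoint trade-off targets
utility <- function(pT, pE, p, piE2 = 0.40, piT1 = 0.70) {
  1 - (((1 - pE) / (1 - piE2))^p + (pT / piT1)^p)^(1 / p)
}
# p solves u(0.4, 0.5) = 0, i.e., the interior point lies on the neutral contour
uniroot(function(p) utility(pT = 0.4, pE = 0.5, p), interval = c(1, 10))$root
# approximately 2.07, as reported for the trial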
The Matchpoint trial is currently ongoing, so results are not available. However,
Brock et al. (2017) provided results of simulation studies to assess their trial design
across six scenarios with different dose-efficacy and dose-toxicity relationships.
Figure 4 shows the probability of selecting each dose (or not recommending a
dose due to safety) across the six scenarios for their proposed design, along with the true toxicity and efficacy probabilities in each scenario.

Table 1 Dose levels and prior probabilities for the Matchpoint trial

Dose level       1                      2            3            4
Ponatinib dose   15 mg every other day  15 mg daily  30 mg daily  45 mg daily
Prior Pr(Eff)    0.20                   0.30         0.50         0.60
Prior Pr(Tox)    0.025                  0.05         0.10         0.25

Fig. 3 Utility contours elicited for the Matchpoint trial. Green circles show the three trade-off points used to fit contours. Blue line shows the neutral contour fitted through trade-off points
Additional Bayesian approaches for joint modeling efficacy and toxicity outcomes
have been proposed in the literature. Thall and Russell (1998) describe a propor-
tional odds approach to dose-escalation where there are three measurable outcomes
concerning adverse events and the onset of Graft versus Host Disease (GvHD): no
severe toxicities and no GvHD (outcome 1), no severe toxicities and only moderate
GvHD (outcome 2), and either severe toxicity or severe GvHD (outcome 3). The aim
is to find the dose that has an expected probability of outcome 2 of at least 50%, but
with the expected probability of outcome 3 being no greater than 10%. A parsimo-
nious modeling approach is used, whereby γj(x) = ℙ(Outcome > j), and

$$\gamma_0(x) = 1,$$

$$\gamma_1(x) = \frac{\exp(\mu + \alpha + \beta x)}{1 + \exp(\mu + \alpha + \beta x)},$$

$$\gamma_2(x) = \frac{\exp(\mu + \beta x)}{1 + \exp(\mu + \beta x)}.$$
Fig. 4 Simulation results for the EffTox design proposed for the Matchpoint Trial. Gray shaded
area shows the set of points that are more desirable than the elicited neutral contour. Areas of points
are proportional to the probability of choosing that point as the MTD. Scenario orderings for No
dose chosen results are 1 (black), 2 (red), 3 (orange), 4 (purple), 5 (blue), and 6 (brown)
Model parameters are updated by standard Bayesian methods, and then the dose with the highest probability of outcome 2 (i.e., moderate GvHD and no severe toxicity) is chosen, subject to the constraint that γ2(x) ≤ 0.10. A related approach by
Braun (2002) extended the CRM to jointly model the probabilities of severe toxicity
and disease progression.
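A minimal R sketch of the Thall and Russell selection rule described above, with assumed parameter values standing in for posterior expected quantities:

# P(outcome 2) = gamma1 - gamma2 under the proportional odds model above
select_dose <- function(doses, mu, alpha, beta) {
  g1 <- plogis(mu + alpha + beta * doses)  # P(Outcome > 1)
  g2 <- plogis(mu + beta * doses)          # P(Outcome > 2), i.e., outcome 3
  p2 <- g1 - g2                            # P(outcome 2)
  ok <- g2 <= 0.10                         # safety constraint
  doses[ok][which.max(p2[ok])]
}
select_dose(doses = 1:5, mu = -3, alpha = 2, beta = 0.5)  # assumed values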
Zhang et al. (2006) proposed a continuation-ratio approach for jointly modeling
efficacy and toxicity. Specifically, given dose, three probabilities are of interest:
ψ 0(x), the probability of no efficacy and no DLT; ψ 1(x), the probability of efficacy
and no DLT; ψ 2(x), the probability of DLT, regardless of efficacy status. These
probabilities are then modeled to allow toxicity to increase monotonically with dose,
but for ψ 1(x) to be non-monotonic with dose:
$$\log\left( \frac{\psi_1(x)}{\psi_0(x)} \right) = \alpha_1 + \beta_1 x$$

and

$$\log\left( \frac{\psi_2(x)}{1 - \psi_2(x)} \right) = \alpha_2 + \beta_2 x.$$
Then, with constraints on the model parameters, specifically α1 > α2 and β1, β2 > 0, the above equations can be solved to give expressions directly for ψ0(x), ψ1(x), and ψ2(x). Given a target toxicity level θ, the dose-finding algorithm is based on two decision functions, δ1(x) and δ2(x), in which a weight λ controls the toxicity risk of dose x relative to its efficacy. The dose x* for the next patient is that which satisfies δ1(x*) = 1 and δ2(x*) = maxx∈Ξ {δ2(x)}, where Ξ is the dose range (or set of doses) under consideration. Other approaches
using the continuation ratio model have been published since, including those for
combination therapy trials (Mandrekar et al. 2007, 2010).
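The algebraic step from the two equations to the three probabilities is short, and is sketched here in R (the parameter values are illustrative assumptions):

# Solve the continuation-ratio equations: psi2 comes from the second equation,
# psi1/psi0 = exp(a1 + b1*x) from the first, and psi0 + psi1 + psi2 = 1
cr_probs <- function(x, a1, b1, a2, b2) {
  psi2 <- plogis(a2 + b2 * x)
  r <- exp(a1 + b1 * x)
  psi0 <- (1 - psi2) / (1 + r)
  c(psi0 = psi0, psi1 = r * psi0, psi2 = psi2)
}
cr_probs(x = 2, a1 = -1, b1 = 0.8, a2 = -4, b2 = 0.9)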
Dragalin and Fedorov (2006) proposed using optimal design theory for dose-
finding studies with joint endpoint data. For a joint probability model py,z (x), where
x is the dose, y is the binary efficacy outcome, and z is the binary toxicity outcome,
the authors suggest either a Gumbel-type bivariate logistic regression, such as that
used in the EffTox design, or a Cox bivariate binary model. In both cases, an analytical
expression for the Fisher Information Matrix (FIM) is obtained. A common choice of
optimization is the D-optimality criterion, which chooses the dose for the next
individual patient that maximizes the determinant of the FIM. An optimal design
allows the trial to obtain as much information as possible about the joint probability
model. However, the optimal dose may not always be a safe dose. Therefore the
range of doses from which the optimal dose is chosen can be restricted to doses
within the therapeutic range (above the posterior estimate of the minimum effective
dose and below the posterior estimate of the MTD). Other constraints for defining
admissible doses are also explored by Dragalin and Fedorov (2006).
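The decision rule itself is simple to sketch in R, assuming a function that returns the Fisher information matrix for a candidate dose; fim_toy below is a hypothetical stand-in, not a real bivariate-model information matrix:

# D-optimal next dose within an admissible range: maximize det(FIM)
next_dose_Dopt <- function(doses, admissible, fim) {
  scores <- sapply(doses, function(x) det(fim(x)))
  doses[admissible][which.max(scores[admissible])]
}
fim_toy <- function(x) matrix(c(1 + x, 0.2, 0.2, 1 + 0.5 * x^2), 2, 2)  # made up
next_dose_Dopt(doses = 1:5, admissible = c(TRUE, TRUE, TRUE, FALSE, FALSE), fim_toy)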
The use of D-optimality and an admissible dose range aims to blend together two
goals in drug development: doing what is best for the population (by learning as
much as possible about the dose-efficacy and dose-toxicity relationships) and doing
what is best for the patient (by giving them the dose that has a controlled toxicity risk
but some efficacy benefit). Optimal design-theoretic approaches for dose-finding
studies have also been proposed by others (Pronzato 2010; Padmanabhan et al.
2010a; Padmanabhan et al. 2010b; Dragalin et al. 2008).
After exploring different endpoints and joint modeling of efficacy and toxicity
outcomes for single-agent dose-escalation designs, a natural progression for research
and application of such designs was into trials where two or more treatment-related
quantities were to be adapted. Since many treatment plans are formed from
1 → 2 → 3 → 4 → 5 → 6
1 → 2 → 3 → 5 → 4 → 6
1 → 3 → 2 → 4 → 5 → 6
1 → 3 → 2 → 5 → 4 → 6
1 → 3 → 5 → 2 → 4 → 6.
$$\psi(m) = \frac{L_m(\Delta_n)\, f(m)}{\sum_{l=1}^{M} L_l(\Delta_n)\, f(l)}.$$
We may then choose order m* = arg maxm=1,...,M ψ(m) to be the best guess of
the true simple order, and apply the CRM for single-agent phase I trials to the dose
combinations under this specified ordering. Alternatively, we may randomly select
an ordering by using the ψ(m) as selection probabilities; this may be beneficial if two
or more orderings have the same or very similar posterior weightings. An extension
of this approach including efficacy outcomes has also been proposed (Wages and
Conaway 2014).
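A minimal R sketch of these posterior ordering weights; the per-ordering likelihood values are made up for illustration:

# psi(m) for M = 5 candidate orderings, given per-ordering likelihoods L_m
# and a uniform prior f(m) over orderings
lik <- c(0.012, 0.019, 0.008, 0.015, 0.010)
f <- rep(1 / 5, 5)
psi <- lik * f / sum(lik * f)
which.max(psi)                          # best-guess ordering m*
sample(seq_along(psi), 1, prob = psi)   # or randomly select an ordering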
Other approaches model the entire dose-toxicity surface. Gasparini (2013) describes
several models for dose-toxicity surfaces, which include logistic-type and copula-
type models employed by Thall et al. (2003), Wang and Ivanova (2005), Yin and
Yuan (2009a), and Yin and Yuan (2009b). Further to these, extensions of the EWOC
designs for the combination therapy setting have also been proposed for dual-agent
phase I dose-escalation studies (Jimenez et al. 2018; Tighiouart et al. 2017; Diniz
et al. 2017; Tighiouart 2018); we do not cover these here as they are discussed in
other areas of this book.
Bailey et al. (2009) used an extension of the logistic model to conduct a dose-
escalation study of nilotinib plus imatinib in adult patients with imatinib-resistant
gastrointestinal stromal tumors. Five doses of nilotinib {100, 200, 400, 600, 800}mg
and two doses of imatinib {600, 800}mg were considered, though patients could
also be given nilotinib alone (so an imatinib dose of 0 mg). For each dose level, the probabilities that the posterior DLT risk is an underdose (i.e., ℙ(π(x) ∈ [0, 0.20))), in the target range (i.e., ℙ(π(x) ∈ [0.20, 0.35))), an excessive toxicity (i.e., ℙ(π(x) ∈ [0.35, 0.60))), or an unacceptable toxicity (i.e., ℙ(π(x) ∈ [0.60, 1])) are computed. These probability masses are used for dose-escalation decisions.
The model in this trial was a four-parameter logistic-type model. For dose a of
nilotinib and dose b of imatinib, the probability of DLT at dose combination (a, b) is

$$\text{logit}(\pi(a, b)) = \log(\alpha) + \beta \log\left( \frac{a}{a_R} \right) + \gamma_1 [b \geq 600] + \gamma_2 [b \geq 800],$$
where aR is a reference dose level for nilotinib, and [·] denotes the indicator
function, taking value 1 if true and 0 otherwise. Using suitable priors on {α, β, γ1,
γ2}, the trial proceeds like most others in this chapter; after dose and DLT status data
are collected, Bayesian methods are used to update the posterior DLT risks per dose,
and also to calculate the four aforementioned interval probabilities. Then, the dose
for the next patient is that which has the largest probability of being in the target
range (i.e., xn+1 = argmaxx∈χ ℙ(π(x) ∈ [0.20, 0.35))), subject to the constraint that ℙ(π(x) ∈ [0.35, 1]) ≤ 0.25. In addition to this, it was possible for patients to be
dosed at combinations that had a smaller target interval probability than π(xn + 1)
based on additional clinical data, or if xn + 1 was a combination where one of the
drugs was increased by more than 100% of the current highest level.
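A minimal sketch of such interval-probability decisions in R, with pi_draws standing in for MCMC samples from the logistic model's posterior DLT risk at each open combination (all values illustrative):

# Columns = dose combinations; rows = posterior draws of pi(x)
set.seed(2)
pi_draws <- matrix(rbeta(3000 * 3, c(2, 3, 5), c(12, 10, 9)), ncol = 3, byrow = TRUE)
target   <- colMeans(pi_draws >= 0.20 & pi_draws < 0.35)  # mass in [0.20, 0.35)
overdose <- colMeans(pi_draws >= 0.35)                    # mass in [0.35, 1]
admissible <- overdose <= 0.25
which.max(replace(target, !admissible, -Inf))             # next combination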
To begin, seven patients were recruited to (800, 0) and one had a DLT. Based on
the posterior interval probabilities and dose escalation rules, three patients were
treated at (200, 800), none of whom had DLTs; during this time an additional two
patients received (800, 0) and neither experienced DLTs. Figure 6 shows the trial progress and allocation of patients to different dose combinations throughout the trial, with Bayesian inference and evaluation of the model performed in five stages.

Fig. 6 Trial progress of nilotinib plus imatinib study. Evaluations (Eval) denote when a decision was made to open up a new dose combination to recruitment based on estimates of target interval and overdose interval probabilities

During the course of this trial, several key changes were made. Firstly, severe skin rash was added to the definition of DLT, which meant that among the five patients who received the (800,800) combination, four were classed as having experienced DLTs; this would previously have been only one patient under the old DLT definition.
Secondly, the investigators agreed to open up a 400 mg dose of imatinib after
observing the four DLTs at (800,800). This meant that a) the dose-toxicity model
had to be modified to include an additional parameter, so the model became
$$\text{logit}(\pi(a, b)) = \log(\alpha) + \beta \log\left( \frac{a}{a_R} \right) + \gamma_0 [b \geq 400] + \gamma_1 [b \geq 600] + \gamma_2 [b \geq 800],$$
and b) the prior distributions for γ1 and γ2 needed to be modified. Once this was
completed, posterior estimates for DLT probabilities and dosing interval probabili-
ties were recalculated. A further 16 patients were recruited to the new (800,400)
combination, and three of these patients experienced DLT. At the end of the study,
the (800,400) dose, which had the largest probability mass in the target range of
[0.20, 0.35) and satisfied the aforementioned overdose constraint, was selected as the
maximum tolerated dose combination. Figure 7 shows the posterior probability mass
for target or overdosing at each dose combination.
Choosing a model for the above designs requires careful consideration, and appro-
priate priors need to be chosen. Furthermore, for a true dose-toxicity surface with a
very asymmetric shape (i.e., increasing one drug adds little to DLT risk, but
increasing the other drug adds a lot), it may be difficult to obtain reliable estimates
of DLT risks under some models. In light of this, Bayesian approaches have been proposed where a model for the dose-toxicity surface is not required.

Fig. 7 Summary of posterior probabilities of target (green circles) and overdosing (red circles) per dose combination. Area of circles is proportional to probability. Combinations with red circles have ℙ(π(x) ∈ [0.35, 1]) > 0.25, so the target probability is not shown. Combinations with green circles all have ℙ(π(x) ∈ [0.35, 1]) ≤ 0.25
Lin and Yin (2017) proposed a Bayesian Optimal Interval (BOIN) design for
combination trials, whereby the dose for the next patient is either increased,
decreased, or maintained if the expected probability of DLT at the current dose
falls within some predefined intervals. Let θ be the target toxicity level, πjk be the
DLT risk at dose combination ( j, k), and ΔL and ΔU be the lower and upper limits for
the target DLT interval. If π̂jk, equal to the number of DLTs at combination (j, k) divided by the number of patients at (j, k), is less than θ − ΔL, then the next patient receives either combination (j, k + 1) or (j + 1, k), whichever combination maximizes ℙ(πjk ∈ (θ − ΔL, θ + ΔU) | Δn). If π̂jk is greater than θ + ΔU, then the next patient receives either combination (j, k − 1) or (j − 1, k), whichever combination maximizes ℙ(πjk ∈ (θ − ΔL, θ + ΔU) | Δn). Otherwise, the next patient receives the
current dose.
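A minimal sketch of one such decision in R, assuming independent Beta(1, 1) priors for the candidate combinations (the priors are an illustrative assumption, and boundary handling is omitted):

# Escalation step at combination (j, k) = (1, 1)
theta <- 0.30; dL <- 0.05; dU <- 0.05
n <- matrix(c(3, 3, 0, 6, 3, 0), nrow = 2, byrow = TRUE)  # patients per (j, k)
x <- matrix(c(0, 1, 0, 2, 1, 0), nrow = 2, byrow = TRUE)  # DLTs per (j, k)
p_target <- function(j, k)  # P(pi_jk in (theta - dL, theta + dU) | data)
  diff(pbeta(c(theta - dL, theta + dU), 1 + x[j, k], 1 + n[j, k] - x[j, k]))
j <- 1; k <- 1
if (x[j, k] / n[j, k] < theta - dL) {  # escalate: better of (j, k+1), (j+1, k)
  cand <- list(c(j, k + 1), c(j + 1, k))
  cand[[which.max(sapply(cand, function(d) p_target(d[1], d[2])))]]
}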
Mander and Sweeting (2015) proposed the Product of Independent beta Proba-
bilities dose Escalation design (PIPE), which focused on identifying the Maximum
Tolerated Contour (MTC) that divides safe and unsafe doses. A working model is
first set up whereby each combination is given an independent beta prior distribu-
tion, that is, for combination ( j, k), πjk ~ Beta(ajk, bjk). With these independent priors
and trial data Δn, the posterior distribution for each combination is also a beta distribution, and it is easy to calculate the probability that the combination is less than the TTL, pjk = ℙ(πjk ≤ θ | ajk, bjk, Δn). The PIPE design then considers each
possible contour that can divide a dose-toxicity surface into safe and unsafe combi-
nations and that satisfies the assumption of monotonicity. Using the working model,
the probability of each contour being the MTC can then be calculated. For contour
Cm, let Cm[j, k] = 1 if combination (j, k) is above the contour (i.e., unsafe) and Cm[j, k] = 0 if it is below (i.e., safe). Then

$$\mathbb{P}(\mathrm{MTC} = C_m \mid \Delta_n) = \prod_{j=1}^{J} \prod_{k=1}^{K} p_{jk}^{\,1 - C_m[j,k]} \left(1 - p_{jk}\right)^{C_m[j,k]}.$$
The contour that maximizes the above expression is selected as the MTC, and
dose-escalation decisions may be made with the MTC as a guide. Software to
implement the PIPE designs is available in the R package pipe.design.
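A minimal R sketch of the MTC probability calculation for a toy 2 × 2 grid; the priors and data are illustrative, and in practice the pipe.design package enumerates all monotone contours:

theta <- 0.30
n <- matrix(c(3, 3, 3, 0), 2, 2); x <- matrix(c(0, 1, 2, 0), 2, 2)  # toy data
p <- matrix(pbeta(theta, 1 + x, 1 + n - x), 2, 2)  # p_jk under Beta(1, 1) priors
contour_prob <- function(C) prod(p^(1 - C) * (1 - p)^C)
# e.g., a contour declaring only combination (2, 2) unsafe:
contour_prob(matrix(c(0, 0, 0, 1), 2, 2))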
Most of the methods for dual-agent phase I trials may be applied directly to dose-
schedule finding studies, simply by treating the different administration schedules as
another treatment that is varied. However, there are other approaches that consider
the subtleties of dose-schedule finding studies. Braun et al. (2007) used time-to-
toxicity outcomes to adaptively select dose-schedule combinations where schedules
of lower frequency/intensity are nested within more intense ones, with a motivating
example of using 5-azacitidine to treat patients with acute myeloid leukemia.
Meanwhile, O’Quigley and Conaway (2011) extended the CRM approach so that
the skeleton values varied according to which schedule patients were to be treated
with.
References
Babb JS, Rogatko A (2001) Patient specific dosing in a cancer phase I clinical trial. Stat Med
20(14):2079–2090
Babb J, Rogatko A, Zacks S (1998) Cancer phase I clinical trials: efficient dose escalation with
overdose control. Stat Med 17(10):1103–1120
Bailey S, Neuenschwander B, Laird G, Branson M (2009) A Bayesian case study in oncology phase
I combination dose-finding using logistic regression with covariates. J Biopharm Stat 19(3):
469–484
Bekele BN, Thall PF (2004) Dose-finding based on multiple toxicities in a soft tissue sarcoma trial.
J Am Stat Assoc 99(465):26–35
Berry SM, Carlin BP, Lee JJ, Muller P (2010) Bayesian adaptive methods for clinical trials.
Boca Raton, FL: Chapman and Hall/CRC Press
Braun TM (2002) The bivariate continual reassessment method. Extending the CRM to phase I
trials of two competing outcomes. Control Clin Trials 23(3):240–256
Braun TM (2006) Generalizing the TITE-CRM to adapt for early- and late-onset toxicities. Stat
Med 25(12):2071–2083
Braun TM, Levine JE, Ferrara JLM (2003) Determining a maximum tolerated cumulative dose:
dose reassignment within the TITE-CRM. Control Clin Trials 24(6):669–681
Braun TM, Thall PF, Nguyen H, de Lima M (2007) Simultaneously optimizing dose and schedule
of a new cytotoxic agent. Clin Trials 4(2):113–124
Brock K, Billingham L, Copland M, Siddique S, Sirovica M, Yap C (2017) Implementing the
EffTox dose-finding design in the matchpoint trial. BMC Med Res Methodol 17(1)
Cheung YK (2005) Coherence principles in dose-finding studies. Biometrika 92(4):863–873
Cheung YK, Chappell R (2000) Sequential designs for phase I clinical trials with late-onset
toxicities. Biometrics 56(4):1177–1182
Chu P-L, Lin Y, Shih WJ (2009) Unifying CRM and EWOC designs for phase I cancer clinical
trials. J Stat Plan Inference 139(3):1146–1163
Cole SR, Ananth CV (2001) Regression models for unconstrained, partially or fully constrained
continuation odds ratios. Int J Epidemiol 30(6):1379–1382
Dinart D, Fraisse J, Tosi D, Mauguen A, Touraine C, Gourgou S, Le Deley MC, Bellera C, Mollevi
C (2020) GUIP1: a R package for dose escalation strategies in phase I cancer clinical trials.
BMC Med Inform Decis Mak 20(134). https://doi.org/10.1186/s12911-020-01149-3
Diniz MA (2018) ewoc: Escalation with overdose control. R package version 0.2.0
Diniz MA, Li Q, Tighiouart M (2017) Dose finding for drug combination in early cancer
phase I trials using conditional continual reassessment method. J Biom Biostat 8(6)
Dragalin V, Fedorov V (2006) Adaptive designs for dose-finding based on efficacy–toxicity
response. J Stat Plan Inference 136(6):1800–1823
Dragalin V, Fedorov V, Wu Y (2008) Adaptive designs for selecting drug combinations based on
efficacy–toxicity response. J Stat Plan Inference 138(2):352–373
Ezzalfani M, Zohar S, Qin R, Mandrekar SJ, Deley M-CL (2013) Dose-finding designs using a
novel quasi-continuous endpoint for multiple toxicities. Stat Med 32(16):2728–2746
Faries D (1994) Practical modifications of the continual reassessment method for phase I cancer
clinical trials. J Biopharm Stat 4(2):147–164
Gasparini M (2013) General classes of multiple binary regression models in dose finding problems
for combination therapies. J R Stat Soc: Ser C: Appl Stat 62(1):115–133
Goodman SN, Zahurak ML, Piantadosi S (1995) Some practical improvements in the continual
reassessment method for phase I studies. Stat Med 14(11):1149–1161
Harrington JA, Wheeler GM, Sweeting MJ, Mander AP, Jodrell DI (2013) Adaptive designs for
dual-agent phase I dose-escalation studies. Nat Rev Clin Oncol 10(5):277–288
Hirakawa A, Sato H, Daimon T, Matsui S (2018) Dose finding for a combination of two agents. In:
Modern Dose-Finding Designs for Cancer Phase I Trials: Drug Combinations and Molecularly
Targeted Agents, pp 9–40. Tokyo: Springer
Iasonos A, Wilton AS, Riedel ER, Seshan VE, Spriggs DR (2008) A comprehensive comparison of
the continual reassessment method to the standard 3 + 3 dose escalation scheme in phase I dose-
finding studies. Clin Trials 5(5):465–477
Iasonos A, Zohar S, O’Quigley J (2011) Incorporating lower grade toxicity information into dose
finding designs. Clin Trials J Soc Clin Trials 8(4):370–379
Ivanova A, Wang Y, Foster MC (2016) The rapid enrollment design for phase I clinical trials. Stat
Med 35(15):2516–2524
Jaki T, Clive S, Weir CJ (2013) Principles of dose finding studies in cancer: a comparison of trial
designs. Cancer Chemother Pharmacol 71(5):1107–1114
Jimenez JL, Tighiouart M, Gasparini M (2018) Cancer phase I trial design using drug combinations
when a fraction of dose limiting toxicities is attributable to one or more agents. Biom J 61
(2):319–332
Korn EL, Midthune D, Chen TT, Rubinstein LV, Christian MC, Simon RM (1994) A comparison of
two phase I trial designs. Stat Med 13(18):1799–1806
Le Tourneau C, Lee JJ, Siu LL (2009) Dose escalation methods in phase I cancer clinical trials.
JNCI J Nat Cancer Inst 101(10):708–720
Lee SM, Cheng B, Cheung YK (2010) Continual reassessment method with multiple toxicity
constraints. Biostatistics 12(2):386–398
Lin R, Yin G (2017) Bayesian optimal interval design for dose finding in drug-combination trials.
Stat Methods Med Res 26(5):2155–2167
Mander AP, Sweeting MJ (2015) A product of independent beta probabilities dose escalation design
for dual-agent phase I trials. Stat Med 34(8):1261–1276
Mandrekar SJ, Cui Y, Sargent DJ (2007) An adaptive phase I design for identifying a biologically
optimal dose for dual agent drug combinations. Stat Med 26(11):2317–2330
Mandrekar SJ, Qin R, Sargent DJ (2010) Model-based phase I designs incorporating toxicity and
efficacy for single and dual agent drug combinations: methods and challenges. Stat Med 29(10):
1077–1083
Mauguen A, Le Deley MC, Zohar S (2011) Dose-finding approach for dose escalation with
overdose control considering incomplete observations. Stat Med 30(13):1584–1594
Muler JH, McGinn CJ, Normolle D, Lawrence T, Brown D, Hejna G, Zalupski MM (2004) Phase I
trial using a time-to-event continual reassessment strategy for dose escalation of cisplatin
combined with gemcitabine and radiation therapy in pancreatic cancer. J Clin Oncol 22(2):
238–243
Neuenschwander B, Branson M, Gsponer T (2008) Critical aspects of the Bayesian approach to
phase I cancer trials. Stat Med 27(13):2420–2439
O’Quigley J, Conaway M (2011) Extended model-based designs for more complex dose-finding
studies. Stat Med 30(17):2062–2069
O’Quigley J, Shen LZ (1996) Continual reassessment method: a likelihood approach. Biometrics
52(2):673
O’Quigley J, Pepe M, Fisher L (1990) Continual reassessment method: a practical design for phase
1 clinical trials in cancer. Biometrics 46(1):33
O’Quigley J, Iasonos A, Bornkamp B (2017) Handbook of methods for designing, monitoring, and
analyzing dose-finding trials. Boca Raton, FL: Chapman and Hall/CRC Press
Padmanabhan SK, Dragalin V (2010a) Adaptive Dc-optimal designs for dose finding based on a
continuous efficacy endpoint. Biom J 52(6):836–852
Padmanabhan SK, Hsuan F, Dragalin V (2010b) Adaptive penalized D-optimal designs for dose
finding based on continuous efficacy and toxicity. Stat Biopharm Res 2(2):182–198
Paul RK, Rosenberger WF, Flournoy N (2004) Quantile estimation following non-parametric phase
I clinical trials with ordinal response. Stat Med 23(16):2483–2495
Piantadosi S, Fisher JD, Grossman S (1998) Practical implementation of a modified continual
reassessment method for dose-finding trials. Cancer Chemother Pharmacol 41(6):429–436
Potthoff RF, George SL (2009) Flexible phase I clinical trials: allowing for nonbinary toxicity
response and removal of other common limitations. Stat Biopharm Res 1(3):213–228
Pronzato L (2010) Penalized optimal designs for dose-finding. J Stat Plan Inference 140(1):
283–296
Riviere MK, Le Tourneau C, Paoletti X, Dubois F, Zohar S (2014) Designs of drug-combination
phase I trials in oncology: a systematic review of the literature. Ann Oncol 26(4):669–674
Sweeting M, Mander A, Sabin T (2013) Bcrm: Bayesian continual reassessment method designs for
phase I dose-finding trials. J Stat Softw 54(13)
Thall PF (2010) Bayesian models and decision algorithms for complex early phase clinical trials.
Stat Sci 25(2):227–244
Thall PF, Cook JD (2004) Dose-finding based on efficacy-toxicity trade-offs. Biometrics 60(3):
684–693
Thall PF, Cook JD (2006) Using both efficacy and toxicity for dose-finding. In: Statistical methods
for dose-finding experiments, pp 275–285. New York: Wiley
Thall PF, Russell KE (1998) A strategy for dose-finding and safety monitoring based on efficacy
and adverse outcomes in phase I/II clinical trials. Biometrics 54(1):251–264
Thall PF, Millikan RE, Mueller P, Lee S-J (2003) Dose-finding with two agents in phase I oncology
trials. Biometrics 59(3):487–496
Thall PF, Herrick RC, Nguyen HQ, Venier JJ, Norris JC (2014) Effective sample size for computing
prior hyperparameters in Bayesian phase I-II dose-finding. Clin Trials 11(6):657–666
Tighiouart M (2018) Two-stage design for phase I-II cancer clinical trials using continuous dose
combinations of cytotoxic agents. J R Stat Soc: Ser C: Appl Stat 68(1):235–250
Tighiouart M, Rogatko A (2010) Dose finding with escalation with over-dose control (EWOC) in
cancer clinical trials. Stat Sci 25(2):217–226
Tighiouart M, Rogatko A, Babb JS (2005) Flexible Bayesian methods for cancer phase I clinical
trials. Dose escalation with overdose control. Stat Med 24(14):2183–2196
Tighiouart M, Cook-Wiens G, Rogatko A (2012) Escalation with over-dose control using ordinal
toxicity grades for cancer phase I clinical trials. J Probab Stat 2012:1–18
Tighiouart M, Liu Y, Rogatko A (2014) Escalation with overdose control using time to toxicity for
cancer phase I clinical trials. PLoS One 9(3):e93070
Tighiouart M, Li Q, Rogatko A (2017) A Bayesian adaptive design for estimating the maximum
tolerated dose curve using drug combinations in cancer phase I clinical trials. Stat Med 36(2):
280–290
Tighiouart M, Cook-Wiens G, Rogatko A (2018) A Bayesian adaptive design for cancer phase I
trials using a flexible range of doses. J Biopharm Stat 28(3):562–574
Van Meter EM, Garrett-Mayer E, Bandyopadhyay D (2011) Proportional odds model for dose-
finding clinical trial designs with ordinal toxicity grading. Stat Med 30(17):2070–2080
Van Meter EM, Garrett-Mayer E, Bandyopadhyay D (2012) Dose-finding clinical trial design for
ordinal toxicity grades using the continuation ratio model: an extension of the continual
reassessment method. Clin Trials 9(3):303–313
Wages NA, Conaway MR (2014) Phase I/II adaptive design for drug combination oncology trials.
Stat Med 33(12):1990–2003
Wages NA, Conaway MR, O’Quigley J (2011a) Continual reassessment method for partial order-
ing. Biometrics 67(4):1555–1563
Wages NA, Conaway MR, O’Quigley J (2011b) Dose-finding design for multi-drug combinations.
Clin Trials 8(4):380–389
Wages NA, Conaway MR, O’Quigley J (2013) Using the time-to-event continual reassessment
method in the presence of partial orders. Stat Med 32(1):131–141
Wages NA, Ivanova A, Marchenko O (2016) Practical designs for phase I combination studies in
oncology. J Biopharm Stat 26(1):150–166
Wang K, Ivanova A (2005) Two-dimensional dose finding in discrete dose space. Biometrics 61(1):
217–222
Wheeler GM (2018) Incoherent dose-escalation in phase I trials using the escalation with overdose
control approach. Stat Pap (Berl) 59(2):801–811
Wheeler GM, Sweeting MJ, Mander AP (2017) Toxicity-dependent feasibility bounds for the
escalation with overdose control approach in phase I cancer trials. Stat Med 36(16):2499–2513
Wheeler GM, Sweeting MJ, Mander AP (2019) A Bayesian model-free approach to combination
therapy phase I trials using censored time-to-toxicity data. J R Stat Soc: Ser C: Appl Stat
68(2):309–329
Yin G, Yuan Y (2009a) Bayesian dose finding in oncology for drug combinations by copula
regression. J R Stat Soc: Ser C: Appl Stat 58(2):211–224
Yin G, Yuan Y (2009b) A latent contingency table approach to dose finding for combinations of two
agents. Biometrics 65(3):866–875
Yuan Z, Chappell R, Bailey H (2007) The continual reassessment method for multiple toxicity
grades: a Bayesian quasi-likelihood approach. Biometrics 63(1):173–179
Yuan Y, Nguyen HQ, Thall PF (2017) Bayesian designs for phase I–II clinical trials. Boca Raton,
FL: Chapman and Hall/CRC Press
Zhang W, Sargent DJ, Mandrekar S (2006) An adaptive dose-finding design incorporating both
toxicity and efficacy. Stat Med 25(14):2365–2383
Zhou Y (2009) Adaptive designs for phase I dose-finding studies. Fundam Clin Pharmacol 24
(2):129–138
Adaptive Phase II Trials
61
Boris Freidlin and Edward L. Korn
Contents
Introduction 1134
Interim Monitoring 1134
Phase II/III Trial Designs 1136
Adaptations Related to Biomarkers 1138
Sample Size Reassessment 1139
Outcome-Adaptive Randomization 1140
Adaptive Pooling of Outcome Results 1141
Summary 1142
Key Facts 1142
Cross-References 1142
References 1142
Abstract
Phase II trials are designed to obtain preliminary efficacy information about a new
therapy in order to assess whether the new therapy should be tested in definitive
(phase III) trials. Adaptive trial designs allow the design of a trial to be changed
during its conduct, possibly using accruing outcome data. Adaptations to
phase II trials considered in this chapter include formal interim monitoring,
phase II/III trial designs, adaptations related to biomarker subgroups, sample
size reassessment, outcome-adaptive randomization, and adaptive pooling of
outcome results across patient subgroups. Adaptive phase II trials allow for the
possibility of trials reaching their conclusions earlier, with more patients being
treated with therapies that have activity for them.
Keywords
Biomarkers · Futility monitoring · Interim monitoring · Outcome-adaptive
randomization · Phase II/III · Sample size reassessment
Introduction
Interim Monitoring
is doing worse than the standard treatment arm by any amount. For example, in a
120-patient randomized trial to detect an improvement in response rates from
10% to 30%, the trial would stop when the first 60 patients have been evaluated if
the observed response rate was lower in the experimental arm than the control
arm. With a time-to-event endpoint, Wieand futility monitoring is performed
when one-half of the required number of events for the final analysis is observed.
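A minimal sketch of this futility rule in R (the response counts are illustrative):

# Wieand rule at the halfway look: stop if the experimental arm's observed
# response rate is lower than the control arm's by any amount
wieand_stop <- function(resp_exp, n_exp, resp_ctl, n_ctl)
  (resp_exp / n_exp) < (resp_ctl / n_ctl)
wieand_stop(resp_exp = 4, n_exp = 30, resp_ctl = 6, n_ctl = 30)  # TRUE: stop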
Unlike phase III trials, phase II trials generally do not include the possibility of
stopping (and discontinuing enrollment) early for positive results (efficacy stop-
ping), e.g., superiority of the experimental treatment over the standard treatment
in a randomized phase II trial. This is because it is useful to get more experience
with the treatment to inform phase III design (and patients are meanwhile not
being given an ineffective experimental therapy). However, it could be useful for
phase II trials to allow for the possibility of early reporting of positive results as
the trial continues to completion, especially if the trial is relatively large
or expected to take a long time to complete (e.g., due to the rarity of the disease).
For example, using a version of the Fleming approach (Fleming 1982) to the
single-arm two-stage design above, one can report the first-stage (12 patients)
results for efficacy with three or more responses as well as stopping the trial for
futility with zero responses. For randomized studies, it often would not be
acceptable to continue randomization once a positive result is reported. However,
in studies with time-to-event endpoints, where the outcome data requires non-
trivial follow-up to mature, it might still be useful to conduct an efficacy analysis
for potential early reporting once all patients have been enrolled and are off the
randomized treatments.
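A sketch of the two-stage reporting rule just described in R, with the thresholds as stated above:

# First-stage look after 12 patients: report efficacy if >= 3 responses,
# stop for futility if 0 responses, otherwise continue to the second stage
stage1_decision <- function(responses) {
  if (responses >= 3) "report efficacy"
  else if (responses == 0) "stop for futility"
  else "continue to stage 2"
}
stage1_decision(4)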
Multi-arm randomized phase II trials with multiple experimental arms being
compared to a standard treatment arm are efficient designs as compared to
performing separate randomized phase II trials for each experimental treatment
(because of the shared standard-treatment arm) (Freidlin et al. 2008). Futility
monitoring for each experimental arm/control arm comparison may increase
the efficiency further, allowing individual experimental arms to be closed early
(increasing the accrual rate on the remaining open arms). An example of such a
trial is SWOG 1500 (Pal et al. 2017), which has three experimental arms and
a standard treatment arm for metastatic papillary renal carcinoma. The trial design
included Wieand futility monitoring rules, and two of the experimental arms stopped
early for futility.
There is also the possibility of having a multi-arm trial with a “master protocol”
that accommodates adding new experimental treatment arms when they become
available for phase II testing. An example of such a trial is ISPY-2 (Barker et al.
2009), which is testing neoadjuvant experimental treatments for women with locally
advanced breast cancer. In addition to being less efficient than having all treatments
available for testing at the same time (because results from patients on an experi-
mental arm can only be compared to results from patients on the standard-treatment
arm who were randomized contemporaneously), trials with master protocols present
major logistical challenges to execute (Cecchini et al. 2019).
continue to the phase III stage of the trial. However, in many clinical settings, a time-
to-event endpoint (e.g., progression-free survival) is considered to be a more reliable
phase II measure of clinical activity (than response) for deciding whether to proceed
to the phase III trial. This leads to another key decision in the trial design: Should one
continue to accrue while waiting for the data from the phase II patients to mature, or
should one suspend accrual during this time? The advantage of suspending accrual is
that no additional patients will have been unnecessarily accrued in the event the
phase II analysis suggests not proceeding to phase III. In particular, in some settings,
all or practically all the phase III patients will be accrued before the data on the phase
II patients are mature enough to make a decision, negating any efficiency in using
a phase II/III design. The disadvantage of suspending accrual is that it will take
longer to get the phase III results (assuming the phase II analysis is positive),
especially if it takes time to ramp up accrual after the suspension. In addition, with
a long suspension, one can question the generalizability of the trial results because
the patient population may have changed over time. However, a changing patient
population can be of theoretical concern for any long trial (whether or not there is an
accrual suspension), so this is not a reason to avoid an accrual suspension in a phase
II/III trial (Freidlin et al. 2018).
An example of a phase II/III design with accrual suspension is given by RTOG-
1216 (Zhang et al. 2019) that compared radiation+docetaxel and radiation
+docetaxel+cetuximab arms to a standard radiation+cisplatin arm in advanced
head and neck cancer. The phase II design was based on randomly assigning 180
patients between the 3 arms (60 patients per arm), targeting for each of the two
experimental versus control comparisons a progression-free survival hazard ratio
of 0.6 (with 80% power at a one-sided 0.15 significance level). The phase II
analysis for each experimental versus control comparison was scheduled when a
total of 56 progression-free survival events were observed for the two arms. If at
least one of the experimental arms was significantly positive, the study would
proceed to the phase III portion with a total of 460 patients randomized between the
best experimental arm and the control arm, targeting an overall survival hazard
ratio of 0.67. The study was designed to suspend accrual after the 180-patient
phase II portion finished accrual until the phase II analyses were performed
requiring 56 events for each of the two experimental versus control comparisons.
The protocol projected that the phase II analysis would take place approximately
1.3 years after completion of phase II accrual (but it actually took slightly over 2
years for the phase II data to mature).
An example of a phase II/III trial without an accrual suspension is given by
GOG0182-ICON5 for advanced-stage ovarian cancer (Bookman et al. 2009). This
trial had four experimental arms that were compared to a standard treatment arm.
The phase II endpoint was progression-free survival, and the phase III endpoint
was overall survival. The trial accrued 4312 patients rapidly before the phase II
analyses determined no difference between the experimental arms and the standard
treatment arm (there was also no difference in overall survival). If an accrual
suspension of 15 months had been used, the total sample size would have been
1740 (Korn et al. 2012).
one to adapt the plan for a future phase III trial. This can be done informally by
examining experimental arm versus standard arm comparisons in the biomarker-
positive and biomarker-negative subgroups from a completed trial. However, this
post hoc approach may not work because of insufficient sample sizes in one or both
subgroups. A formal approach to this problem has been proposed (Freidlin et al.
2012), which requires a larger sample size than a single randomized phase II trial, but
less total sample size than performing separate randomized phase II trials in the
biomarker-positive and biomarker-negative subgroups.
Sample size reassessment is a potential adaptation to the sample size of a trial based
on promising interim results from the trial. Although mostly used for phase III trials,
it has been recommended and used for phase II trials (Wang et al. 2012; Campbell
et al. 2014; Meretoja et al. 2014). In the phase II setting, a particular implementation
(Chen et al. 2004) initially starts with a plan to enroll a fixed number of patients to
target a potentially optimistic treatment effect with 90% power at a one-sided
significance level of 10%. When the information from the first-half of the trial
becomes available, the design examines the (one-sided) p-value. If this p-value is
less than 0.18 but greater than 0.06, then the sample size is increased up to twice the
original sample size using a formula that depends on this p-value, with p-values
closer to 0.18 leading to larger increases. The idea is that one can gain power to reject
the null hypothesis when the interim results appear promising by increasing the
sample size.
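The structure of this rule is easy to sketch in R; note that the actual increase formula of Chen et al. (2004) is not reproduced in the text, so the mapping from p-value to new sample size below is a hypothetical placeholder (monotone in p, capped at twice the original size), not the published formula:

# Promising-zone skeleton: increase n only when 0.06 < p < 0.18
reassess_n <- function(p_interim, n0) {
  increase <- function(p) ceiling(n0 * (1 + (p - 0.06) / 0.12))  # hypothetical
  if (p_interim > 0.06 && p_interim < 0.18) min(2 * n0, increase(p_interim)) else n0
}
reassess_n(p_interim = 0.12, n0 = 84)  # returns 126 under this placeholder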
Sample size reassessment is controversial (Burman and Sonesson 2006; Emerson
et al. 2011; Freidlin and Korn 2017; Mehta 2017). The issue is whether sample size
reassessment is a reasonable approach to ensuring an adequately powered trial.
A simple numerical example illustrates the issues: Consider a randomized phase II
trial targeting a response rate of 45% for the experimental therapy versus 20% for the
standard treatment arm. With a standard design (maximum of 92 randomized
patients with Wieand futility monitoring after 46 patients), the trial would have
90% power for a one-sided significance level of 10% (Green et al. 2016). With the
same level and power and using sample size reassessment, the initial sample size
can be set to 84, with a possible increase to 168 (based on the interim results after 42
patients have been evaluated, including Wieand futility monitoring at that time).
Theoretically, if a sponsor of a trial initially had resources only for an 84-patient trial
but not a 92-patient trial, then sample size reassessment would allow the sponsor to
obtain additional resources to increase the sample size based on promising interim
results. However, as is shown in Table 1, the sample size reassessment design is
inferior to the standard design as, on average, it would require more patients (in some
cases nontrivially increasing the sample size and duration of the trial). Note that in
addition to the efficiency issues, sample size reassessment designs raise some
integrity concerns when used with time-to-event outcomes (Freidlin and Korn 2017).
Table 1 Comparison of standard design with sample size reassessment design for comparing response rates of 45% versus 20% (90% power at 0.1 significance level)

Sample size                                Standard design (with Wieand futility monitoring)   Sample size reassessment (with Wieand futility monitoring)
Minimum                                    46                                                   42
Maximum                                    92                                                   168
Average under null hypothesis              72                                                   73
Average under the alternative hypothesis   91                                                   95
Outcome-Adaptive Randomization
randomization trial for the pairwise comparisons between the experimental arms and
standard treatment arms (Korn and Freidlin 2017). The equal randomization trial
would be expected to have an overall DCR of 45% (54/120), less than the observed
rate of 48% (89/186) with the outcome-adaptive randomization, a slight plus for
outcome-adaptive randomization. However, equal randomization would have
resulted in a shorter trial, with fewer patients, and with fewer patients having a bad
outcome (66 versus 97) (Korn and Freidlin 2017).
Note that the outcome-adaptive randomization should not be confused with
statistical techniques designed to balance the distribution of baseline covariates
between the study arms (Pocock and Simon 1975). Unlike outcome-adaptive
randomization that uses study outcome to change the probability of treatment
assignment, these methods use prespecified baseline characteristics of the accruing
patients to modify the randomization and are not controversial.
ineffective in subgroups where it does work (Freidlin and Korn 2013). Vigorous
methodologic research in refining adaptive pooling designs is continuing (Chu and
Yuan 2018; Cunanan et al. 2019). It may not be possible to have a design that allows
the use of observed data to guide the pooling across subgroups without inflating the
design error rates (Kopp-Schneider et al. 2019).
Summary
Key Facts
Cross-References
▶ Biomarker-Guided Trials
▶ Futility Designs
▶ Interim Analysis in Clinical Trials
▶ Multi-arm Multi-stage (MAMS) Platform Randomized Clinical Trials
References
Allen CE, Laetsch TW, Mody R, Irwin MS, Lim MS, Adamson PC, Seibel NL, Parsons DW,
Cho YJ, Janeway K, on behalf of the Pediatric MATCH Target and Agent Prioritization
Committee (2017) Target and Agent Prioritization for the Children’s Oncology Group –
National Cancer Institute Pediatric MATCH Trial. J Natl Cancer Inst 109:djw274
Barker AD, Sigman CC, Kelloff GJ, Hylton NM, Berry DA, Esserman LJ (2009) I-SPY 2: an
adaptive breast cancer trial design in the setting of neoadjuvant chemotherapy. Clin Pharmacol
Ther 86:97–100
Bookman MA, Brady MF, McGuire WP, Harper PG, Alberts DS, Friedlander M, Colombo N,
Fowler JM, Argenta PA, De Geest K, Mutch DG, Burger RA, Swart AM, Trimble EL, Accario-
Winslow C, Roth LM (2009) Evaluation of new platinum-based treatment regimens in
advanced-stage ovarian cancer: a phase III trial of the Gynecologic Cancer Inter Group. J Clin
Oncol 27:1419–1425
Bretz F, Schmidli H, König F, Racine A, Maurer W (2006) Confirmatory seamless phase II/III
clinical trials with hypothesis selection at interim: general concepts. Biom J 48:623–634
Burman CF, Sonesson C (2006) Are flexible designs sound? Biometrics 62:664–669
Byar DP, Simon RM, Friedewald WT, Schlesselman JJ, DeMets DL, Ellenberg JH, Gail MH, Ware
JH (1976) Randomized clinical trials – perspectives on some recent ideas. N Engl J Med
295:74–80
Campbell BC, Mitchell PJ, Yan B, Parsons MW, Christensen S, Churilov L, Dowling RJ, Dewey H,
Brooks M, Miteff F, Levi C, Krause M, Harrington TJ, Faulder KC, Steinfort BS, Kleinig T,
Scroop R, Chryssidis S, Barber A, Hope A, Moriarty M, McGuinness B, Wong AA, Coulthard
A, Wijeratne T, Lee A, Jannes J, Leyden J, Phan TG, Chong W, Holt ME, Chandra RV, Bladin
CF, Badve M, Rice H, de Villiers L, Ma H, Desmond PM, Donnan GA, Davis SM, EXTEND-IA
Investigators (2014) A multicenter, randomized, controlled study to investigate EXtending the
time for Thrombolysis in Emergency Neurological Deficits with Intra-Arterial therapy
(EXTEND-IA). Int J Stroke 9:126–132
Cecchini M, Rubin EH, Blumenthal GM, Ayalew K, Burris HA, Russell-Einhorn M, Dillon H,
Lyerly HK, Reaman GH, Boerner S, LoRusso PM (2019) Challenges with novel clinical trial
designs: master protocols. Clin Cancer Res 25:2049–2057
Chen YH, DeMets DL, Lan KK (2004) Increasing the sample size when the unblinded interim result
is promising. Stat Med 23:1023–1038
Chu Y, Yuan Y (2018) Bayesian basket trial design using a calibrated Bayesian hierarchical model.
Clin Trials 15:149–158
Cuffe RL, Lawrence D, Stone A, Vandemeulebroecke M (2014) When is a seamless study
desirable? Case studies from different pharmaceutical sponsors. Pharm Stat 13:229–237
Cunanan KM, Iasonos A, Shen R, Gönen M (2019) Variance prior specification for a basket trial
design using Bayesian hierarchical modeling. Clin Trials 16:142–153
Emerson SS, Levin GP, Emerson SC (2011) Comments on ‘Adaptive increase in sample size when
interim results are promising: a practical guide with examples’. Stat Med 30:3285–3301
Fleming TR (1982) One-sample multiple testing procedure for phase II clinical trials. Biometrics
38:143–151
Freidlin B, Korn EL (2009) Monitoring for lack of benefit: a critical component of a randomized
clinical trial. J Clin Oncol 27:629–633
Freidlin B, Korn EL (2013) Borrowing information across subgroups: is it useful? Clin Cancer Res
19:1326–1334
Freidlin B, Korn EL (2017) Sample size adjustment designs with time-to-event outcomes: a caution.
Clin Trials 14:597–604
Freidlin B, Korn EL, Gray R, Martin A (2008) Multi-arm clinical trials of new agents: some design
considerations. Clin Cancer Res 14:4368–4371
Freidlin B, McShane LM, Korn EL (2010) Randomized clinical trials with biomarkers: design
issues. J Natl Cancer Inst 102:152–160
Freidlin B, McShane LM, Polley MY, Korn EL (2012) Randomized phase II trials designs with
biomarkers. J Clin Oncol 30:1–6
Freidlin B, Korn EL, Abrams JS (2018) Bias, operational bias, and generalizability in phase II/III
trials. J Clin Oncol 36:1902–1904
Green S, Benedetti J, Smith A, Crowley J (2016) Clinical trials in oncology, 3rd edn. CRC Press,
New York
Hey SP, Kimmelman J (2015) Are outcome-adaptive allocation trials ethical? (and Commentary).
Clin Trials 12:102–127
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1146
Types of Biomarker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1147
Prognostic Biomarkers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1147
Predictive Biomarkers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1148
The Life Course of a Biomarker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1149
Discovery and Analytical Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1149
Clinical Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1150
Clinical Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1150
Biomarker-Guided Trial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1150
Nonadaptive Biomarker-Guided Trial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1151
Single-Arm Designs Including All Patients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1151
Enrichment Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1152
Marker-Stratified Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1153
Hybrid Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1153
Biomarker-Strategy Design with Biomarker Assessment in the Control Arm . . . . . . . . . . . . 1154
Biomarker-Strategy Design Without Biomarker Assessment in the Control Arm . . . . . . . . 1155
Biomarker-Strategy Design with Treatment Randomization in the Control Arm . . . . . . . . . 1156
Reverse Marker-Based Strategy Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1157
A Randomized Phase II Trial Design with Biomarker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1158
L. C. Brown (*)
MRC Clinical Trials Unit, UCL Institute of Clinical Trials and Methodology, London, UK
A. L. Jorgensen
Department of Health Data Science, University of Liverpool, Liverpool, UK
M. Antoniou
F. Hoffmann-La Roche Ltd, Basel, Switzerland
J. Wason
Population Health Sciences Institute, Newcastle University, Newcastle upon Tyne, UK
MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
Abstract
This chapter describes the field of precision or stratified medicine and the role that
clinical trials play in the development and validation of markers, particularly
biomarkers, to inform management of patients. We begin by defining various
types of biomarker and describe the life cycle of a biomarker in terms of
discovery, analytical validation, clinical validation, and clinical utility. We pro-
vide a detailed overview of the many types of biomarker-guided trial designs that
have been described in the literature and then summarize the analytical methods
that are often used for biomarker-guided trials. Much of the research process for
biomarker-guided trials does not differ markedly from that for non-biomarker-guided
trials, but particular attention must be given to selecting the most appropriate trial
design for the research question being investigated. We hope that this chapter helps
with decisions on trial design and analysis in the biomarker-guided setting.
Keywords
Biomarker · Stratified · Precision · Personalized · Validation · Prognostic ·
Predictive · Subgroup · Interaction
Introduction
Types of Biomarker
Various types of biomarkers are used for different clinical applications. A
comprehensive and detailed description of the different types of biomarkers is
provided in the FDA-NIH Working Group, Biomarkers, Endpoints and Other Tools
(BEST) guidelines (FDA-NIH Working Group 2018), in which eight distinct types
of biomarker are defined. These are summarized in Table 1.
Biomarkers can be measured using binary, categorical, ordinal, or continuous
data, and appropriate statistical methods are required to ensure they are analyzed
correctly (both for getting robust results from biomarker assays and analyzing how
the biomarker data is associated with the prognosis or treatment effect in a trial). For
the purposes of this chapter, the most commonly used biomarkers in clinical trials are
prognostic and predictive biomarkers, and most of the chapter is dedicated to these
types. Some biomarkers demonstrate both prognostic and predictive qualities, so
when designing a biomarker-guided trial it is important to be aware of any existing
data that describe the discriminatory performance of the biomarker in question,
whether it be prognostic, predictive, or both.
Prognostic Biomarkers
event rate in the control arm will differ between strata and this may influence the
sample size needed for each subgroup. Prognostic biomarkers can potentially
also be useful for enriching trials or informing a treatment strategy that avoids
toxic treatment in patients that have, for example, a very good or very poor
prognosis (Freidlin and Korn 2014).
Predictive Biomarkers
The development and testing of a biomarker ideally follow a typical life course (see
Fig. 1). However, this life course is often iterative rather than sequential, and a return
to earlier stages in the development is often required as more data become available,
biomarkers are refined, and/or new, possibly better, biomarkers emerge.
biomarker, and this is an important consideration for the design and sample size
calculations for the subsequent trial.
Clinical Validity
Clinical Utility
Once a biomarker has passed through all the stages of analytical and clinical valida-
tion, it may be necessary to test the utility of the biomarker-guided treatment approach
against one that does not use the biomarker to make clinical decisions. These are
typically large and expensive trials as they are attempting to measure the real-world
effectiveness of biomarker-guided management including reliability, feasibility, and
acceptability for patients. If the biomarker testing is complex and expensive, then it
may be important to confirm that use of a biomarker-guided approach does indeed lead
to better outcomes for patients at a reasonable cost. As a result, cost-effectiveness
outcomes can become important in these clinical utility trials.
these factors are defined at the outset and remain fixed for the trial duration. This can
be problematic when there is uncertainty surrounding assumptions made at the
design stage. There is generally more potential for uncertainty when designing a
biomarker-guided trial. For example, new biomarkers or targeted treatments may
come to light once the trial is underway; the predictive ability of a potentially
promising biomarker may be lower than expected; and there may be uncertainty
regarding biomarker prevalence at the outset. Hence an adaptive design, which
allows adaptations based on accumulating data, can be an attractive alternative due
to its flexibility. However, while offering more flexibility, adaptive designs are more
complex and can raise both practical and methodological challenges which need
careful consideration.
A summary of the various adaptive and nonadaptive biomarker-guided trial
designs is provided below. More in-depth discussion of the various design options,
together with an overview of their methodology, guidance on sample size calcula-
tions, other statistical considerations, and their advantages and disadvantages in
various situations, is provided in two review articles by Antoniou et al. (2016, 2017)
and as part of the DELTA2 guidance for sample size calculations (Cook et al. 2018).
Further guidance is also provided in an online tool available at
www.bigted.org.
Nonadaptive Biomarker-Guided Trial Designs

Single-Arm Designs Including All Patients
Such designs include the whole study population, with all patients prescribed the
same experimental treatment irrespective of biomarker status and with no comparison
to a control treatment (see Fig. 2). These trial designs can be useful for initial
identification and/or validation of a biomarker since they allow the association
between biomarker status and the efficacy or safety of the experimental treatment to
be tested. Their aim is not to estimate the treatment effect, nor the clinical utility of a
biomarker in a definitive way, but to identify whether the biomarker is sufficiently
promising to proceed to a more definitive biomarker-guided randomized controlled trial.
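As a sketch of how such a single-arm trial might be analyzed, the example below tests the association between biomarker status and response with Fisher's exact test; the response counts are hypothetical and chosen purely for illustration.

```python
from scipy.stats import fisher_exact

# hypothetical single-arm results, cross-classified by biomarker status:
#                 responders  non-responders
table = [[18, 12],  # biomarker-positive patients
         [6, 24]]   # biomarker-negative patients

odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.1f}, p = {p_value:.4f}")
```

A strong association of this kind would support taking the biomarker forward into a randomized biomarker-guided design.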
Enrichment Designs
example, where there is prior evidence that the experimental treatment is not beneficial
for them or is likely to cause them harm. However, when it remains unclear whether or
not biomarker-negative individuals will benefit from the novel treatment, the enrich-
ment design is not appropriate and alternative designs, which also assess effectiveness
in the biomarker-negative individuals, should be considered.
Marker-Stratified Designs
Hybrid Designs
In hybrid design trials, the entire population is first screened for biomarker status
and all individuals enter the trial. However, only biomarker-positive patients are
randomly assigned either to the experimental or the control treatment, while all
biomarker-negative patients receive the control treatment.

Biomarker-Strategy Design with Biomarker Assessment in the Control Arm
In this trial design, the entire study population is tested for its biomarker status. Next,
patients irrespective of their biomarker status are randomized either to the
biomarker-based strategy arm or to the non-biomarker-based strategy arm. In the
biomarker-based strategy arm, biomarker-positive patients receive the experimental
treatment, whereas biomarker-negative patients receive the control treatment.
Patients who are randomized to the non-biomarker-based strategy arm receive the
control treatment irrespective of biomarker status (see Fig. 6).
This approach is useful when the aim is to test the hypothesis that a treatment
approach taking biomarker status into account is superior to that of the standard of
care – that is, the clinical utility of the biomarker. Further, the biomarker-based
strategy arm does not necessarily need to be limited to one experimental treatment –
in principle, a marker-based strategy involving many biomarkers and many possible
treatments could be tested. This type of design can inform researchers whether the
biomarker is prognostic, since both biomarker positive and negative patients are
exposed to the control treatment. However, it cannot definitively answer the question
Fig. 6 Schematic for biomarker-strategy design with biomarker assessment in the control arm. “R”
refers to randomization of patients
Biomarker-Strategy Design Without Biomarker Assessment in the Control Arm
Here, patients are again randomized between testing strategies (i.e., biomarker-based
strategy and non-biomarker-based strategy) but the design differs in terms of timing of
biomarker evaluation. More precisely, first, patients are randomized to either the
biomarker-based strategy or to the non-biomarker-based strategy, and biomarkers are
evaluated only in patients who are assigned to the biomarker-based strategy arm.
Patients found to be biomarker-positive are then given the experimental treatment with
biomarker-negative patients given the control treatment. Again, those randomized to
the non-biomarker-based strategy receive the control treatment (see Fig. 7).
This design is useful in situations where it is either not feasible or ethical to test
the biomarker in the entire population due to several logistical (e.g., specimens not
submitted), technical (e.g., assay failure), or clinical reasons (e.g., tumor inacces-
sible); thus, biomarker status is obtained only in patients who are randomized to
the biomarker-based strategy arm. However, biomarker-positive and biomarker-
Fig. 7 Schematic
for biomarker-strategy design
without biomarker assessment
in the control arm. “R” refers
to randomization of patients
Fig. 8 Schematic for biomarker-strategy design with treatment randomization in the control arm.
“R” refers to randomization of patients
Reverse Marker-Based Strategy Design
Here, patients are randomized either to the biomarker-based strategy arm or the
reverse biomarker-based strategy arm. As in the previous three biomarker-strategy
designs, patients who are allocated to the biomarker-strategy arm receive the exper-
imental treatment if they are biomarker-positive whereas biomarker-negative
patients receive the control treatment. By contrast, patients who are randomly
assigned to the reverse biomarker-based strategy arm receive the control treatment
if they are biomarker-positive, whereas biomarker-negative patients receive the
experimental treatment (see Fig. 9).
Fig. 9 Schematic for reverse marker-based strategy design. “R” refers to randomization of patients
This design is recommended in cases where prior evidence indicates that both
experimental and control treatment are effective in treating patients, but the optimal
strategy has not yet been identified. The design enables the evaluation of an
interaction between the biomarker and different treatments. Additionally, it allows
estimation of the effect size of the experimental treatment compared to control
treatment for each biomarker-defined subgroup separately. Also, there is no chance
that the same treatment will be allocated to biomarker-positive patients in both arms
or to biomarker-negative patients in both arms. This is a problem in the other types of
biomarker-based strategy designs where there will be patients with the same bio-
marker status having the same treatment in both trial arms.
It is important to note that all biomarker-strategy designs will need a larger
sample size compared with marker-stratified designs.
Fig. 10 Schematic for a randomized phase II trial design with biomarkers. “R” refers to random-
ization of patients. CI refers to the confidence interval. Uncolored boxes refer to the first stage of the
trial and colored boxes refer to the second stage of the trial. Different stages refer to the analysis and
not to the trial design
given on the type of phase III trial design to be used (enrichment, marker-stratified, or
traditional with no biomarker). If, however, the treatment effect is not found to be
significant in the biomarker-positive subgroup at the interim analysis, the experimental
treatment is compared to control in the entire study population. If the overall
treatment effect is found significant at a prespecified level of significance, a traditional
design with no biomarker assessment is recommended for phase III. Otherwise, it is
recommended that no phase III trial be undertaken for the experimental treatment
(see Fig. 10).
Adaptive Designs
Adaptive Signature Design
The adaptive signature design was proposed for settings where a biomarker signa-
ture, defined as a set of biomarkers the combined status of which is used to stratify
patients into subgroups, is not known at the outset, and allows the development and
evaluation of a biomarker signature within the trial. Generally, this approach is
useful when there is no available biomarker at the start of the trial or when there
Fig. 11 Schematic for adaptive signature design
Outcome-Based Adaptive Randomization Design
This design can be useful when the biomarkers are either putative or unknown at the
beginning of a phase II trial, and also when there are multiple targeted treatments and
biomarkers to be considered. It aims to test simultaneously both biomarkers and
Fig. 12 Schematic for outcome-based adaptive randomization design. “R” refers to randomization
of patients
treatments while providing more patients with effective therapies according to their
biomarker profiles (see Fig. 12).
The trial begins with the assessment of patients' biomarker status. Within each
biomarker subgroup, patients are initially randomized equally to one or more
experimental arms or a control arm. The design permits the allocation ratio between
treatment arms to be modified over time so that the arm(s) with the best observed
response rate receive(s) a higher proportion of randomized patients. This modification
of the allocation ratio is informed by the outcome data accumulated within each
biomarker subgroup at each interim analysis. For example, when the data accrued so
far suggest that a particular treatment is superior to the others, the ratio is modified so
that a higher proportion of patients is allocated to that treatment.
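A minimal sketch of this mechanism, assuming a Bayesian beta-binomial model and a Thompson-sampling-style allocation rule (one common implementation, not the specific method of any trial discussed here), is shown below; the response rates and the update-after-every-patient schedule are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

true_response = {"control": 0.20, "experimental": 0.45}  # assumed truth, unknown to the design
successes = {arm: 0 for arm in true_response}
failures = {arm: 0 for arm in true_response}

def prob_experimental_better(n_draws=5000):
    """Posterior P(experimental rate > control rate) under Beta(1+s, 1+f) posteriors."""
    draws = {arm: rng.beta(1 + successes[arm], 1 + failures[arm], n_draws)
             for arm in true_response}
    return (draws["experimental"] > draws["control"]).mean()

for _ in range(200):
    # the allocation probability adapts to the accumulating outcome data
    arm = "experimental" if rng.random() < prob_experimental_better() else "control"
    response = rng.random() < true_response[arm]
    successes[arm] += response
    failures[arm] += not response

print("responses:", successes)
print("non-responses:", failures)
```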
Fig. 13 Schematic for adaptive threshold enrichment design. “R” refers to randomization of
patients
treatment within the biomarker-positive subgroup; and (iii) if the interim result is
negative, then the accrual stops due to futility in the biomarker-positive subgroup
and the trial is closed without showing a treatment benefit; if the result is “promis-
ing” for the specific biomarker-positive subgroup, then the study continues with this
specific biomarker-positive subgroup and accrual also begins for biomarker-negative
patients. Thus, the trial continues with patients randomized from the entire popula-
tion. A “promising” result in the biomarker-positive subgroup at the interim stage is
claimed when the estimated treatment effect is above a particular prespecified
threshold (see Fig. 13).
Fig. 14 Schematic for adaptive patient enrichment design. “R” refers to randomization of patients
Adaptive Parallel Simon Two-Stage Design
This design allows the efficacy of a novel treatment, which possibly differs in the
biomarker-positive subgroup compared to the biomarker-negative subgroup, to be
tested. It requires a predefined biomarker with well-established prevalence (see
Fig. 15). The design begins with a first stage, which entails two parallel phase II
studies, one in the biomarker-positive and the other in the biomarker-negative
subgroup. Next, if activity is not observed in either biomarker subgroup during the
first stage, the trial stops; if activity of the experimental treatment is observed during
the first stage of the study for both the biomarker-positive and biomarker-negative
subgroups, additional patients from the general patient population are enrolled into
the second stage; if results of the first stage suggest that activity is limited to
biomarker-positive patients, the second stage continues with the recruitment of
additional biomarker-positive patients only. This design may improve the efficiency
of a trial, as it allows early identification that a particular experimental treatment
is beneficial in a specific biomarker-defined subgroup.
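The first-stage decision logic can be summarized in a few lines of code. In the sketch below the stage-1 activity boundary (at least 4 responses among 15 patients) is a hypothetical Simon-style threshold, not taken from a published design, and the activity-only-in-biomarker-negatives branch is handled symmetrically.

```python
def subgroup_active(responses, min_responses=4):
    """Hypothetical stage-1 activity rule for one of the parallel
    single-arm studies (e.g., at least 4 responses among 15 patients)."""
    return responses >= min_responses

pos_active = subgroup_active(responses=6)  # biomarker-positive stage 1: 6/15
neg_active = subgroup_active(responses=2)  # biomarker-negative stage 1: 2/15

if pos_active and neg_active:
    print("stage 2: enroll additional patients from the general population")
elif pos_active:
    print("stage 2: enroll additional biomarker-positive patients only")
elif neg_active:
    print("stage 2: enroll additional biomarker-negative patients only")
else:
    print("stop the trial: no activity observed in either subgroup")
```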
Fig. 15 Schematic for adaptive parallel Simon two-stage design. “R” refers to randomization of
patients
Multi-Arm Multi-Stage (MAMS) Design
This design as originally proposed was not for biomarker-guided trials but rather was
aimed at testing multiple experimental treatments against a control treatment in the
same trial. However, it is also useful in a biomarker-guided context since it allows
patients to be allocated to a trial of a particular experimental treatment, based on their
biomarker status.
The first stage of a MAMS trial (the phase II stage) involves biomarker stratifi-
cation into one of a number of separate comparisons with each comparing an
experimental treatment with a control treatment. The comparison within which a
patient is included depends on their biomarker status, for example, patients positive
for biomarker 1 may be randomized in comparison 1 to either control or experimen-
tal treatment 1 while patients positive for biomarker 2 may be randomized into
comparison 2 to either control or experimental treatment 2. At the end of this first
stage, an interim analysis is undertaken within each comparison, comparing each
experimental treatment with the control treatment. Depending on the outcome of the
interim analysis, accrual of patients in a comparison either continues to the second
stage of the trial or the accrual of additional patients stops within that comparison
(see Fig. 16).
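A toy version of the interim step, assuming a binary outcome and a simple prespecified continuation rule (an observed response-rate advantage of more than 10 percentage points; real MAMS designs use formal stopping boundaries), might look as follows.

```python
# interim (responses, patients) per arm within each biomarker-defined comparison
interim = {
    "comparison 1 (biomarker 1 positive)": {"control": (8, 40), "experimental": (16, 40)},
    "comparison 2 (biomarker 2 positive)": {"control": (9, 40), "experimental": (10, 40)},
}
margin = 0.10  # illustrative continuation threshold, not a formal stopping boundary

for name, arms in interim.items():
    rate = {arm: r / n for arm, (r, n) in arms.items()}
    advantage = rate["experimental"] - rate["control"]
    decision = "continue accrual to stage 2" if advantage > margin else "stop accrual"
    print(f"{name}: observed advantage {advantage:+.2f} -> {decision}")
```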
Fig. 16 Schematic for multi-arm, multi-stage (MAMS) design. “R” refers to randomization of
patients
For many biomarkers, measurement can be relatively easy, but for some of the more
complex laboratory-driven and imaging biomarkers, quality assurance is necessary
to demonstrate that the biomarker test is reliable and repeatable, particularly between
laboratories and between any investigators who are responsible for making
Fig. 17 Sources of statistical uncertainty when exploring the role of a biomarker in a stratified trial
A biomarker-strategy design tests the hypothesis that using the biomarker to guide
treatment will result in superior outcomes to not using it. As discussed above,
biomarker-strategy designs can take various forms. The experimental arm may
allocate between a number of treatments depending on the results of the biomarker
test or may just choose between treatment or nontreatment. The control arm may
allocate all patients to one standard treatment or may randomize patients between
treatments.
In all cases, the primary analysis of a biomarker strategy will compare outcomes
between the experimental and control arm. Thus, methods of analysis will be similar
to traditional two-arm RCTs. As biomarker-strategy trials are often assessing the
effectiveness of implementing the strategy in routine practice, the primary analysis
should generally be by intention to treat, with patients analyzed within their ran-
domized groups. It is likely that some patients in the experimental arm may not
follow the specified treatment strategy due to practical issues. In this case, a
per-protocol analysis could be useful to determine whether an idealized version of
the biomarker strategy, in which no errors or delays in assessing the biomarker and
no deviations from the recommended treatment occur, has a larger
advantage. This may be useful in determining if further refinements of the biomarker
test could be useful. If a per-protocol analysis is required, then it is important to
prespecify the definition of per-protocol clearly in the statistical analysis plan.
One issue with analysis of biomarker strategy designs is that there may be
heterogeneity of outcome within each arm which may violate assumptions made
by the analysis. If, for example, different types of patients are allocated to different
treatments within each arm, then the assumptions of some statistical tests that data
from within each arm is similarly distributed will not be true. Analyses that adjust or
stratify by the biomarker status may be more appropriate, but still would assume that
the mean treatment effect (i.e., the difference in effect of being on the experimental
arm compared with the control arm) is the same for different types of patients on
some scale. It is important to be clear about the assumptions of the analysis and to
ensure the results are robust to them.
The gold standard design for testing whether a biomarker is predictive is the marker
stratified design. The statistical analysis will generally focus on: (1) the interaction
effect between biomarker and treatment on outcome and (2) the marginal effect of
the treatment. The primary analysis should be the question that the study was
powered to test.
Estimating and testing the interaction effect can be performed by fitting a suitable
regression model. This model should include parameters for: (1) the marginal effect
of treatment arm; (2) the marginal effect of biomarker status; and (3) the interaction
between treatment arm and biomarker status. For instance, with a normally distributed
outcome, a suitable linear model is:

Y_i = α + βT_i + γB_i + δT_iB_i + ε_i

where Y_i is the outcome for patient i, T_i indicates treatment arm, B_i indicates
biomarker status, and ε_i is a normally distributed error term.
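For illustration, this model can be fitted with standard regression software. The sketch below simulates a marker-stratified trial in Python and fits the interaction model with statsmodels; all effect sizes are invented for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 400
df = pd.DataFrame({
    "treat": rng.integers(0, 2, n),   # T_i: treatment arm (1 = experimental)
    "marker": rng.integers(0, 2, n),  # B_i: biomarker status (1 = positive)
})
# invented truth: the treatment works mainly in biomarker-positive patients
df["Y"] = (1.0 + 0.2 * df["treat"] + 0.5 * df["marker"]
           + 0.8 * df["treat"] * df["marker"] + rng.normal(0, 1, n))

fit = smf.ols("Y ~ treat + marker + treat:marker", data=df).fit()
print(fit.params)                    # estimates of alpha, beta, gamma, delta
print(fit.pvalues["treat:marker"])   # test of the interaction term delta
```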
Biomarker-guided trial designs have developed considerably over the last 15 years.
The trial design features described in this chapter provide a summary of what is
available and the selection of particular design features will depend upon the research
question being investigated. We have aimed to provide a high-level summary of issues
to consider but other designs will no doubt emerge in the coming years. We would
recommend further reading of the extensive literature in this field (Renfro et al. 2016;
Freidlin et al. 2010; Buyse et al. 2011; Mandrekar and Sargent 2009; Freidlin
2010; Stallard et al. 2014; Tajik et al. 2013; Simon 2010; Gosho et al. 2012; European
Medicines Agency 2015; Eng 2014; Baker 2014; Freidlin et al. 2012; Freidlin and
Simon 2005; Wang et al. 2009; Karuri and Simon 2012; Jones and Holmgren 2007;
McShane et al. 2009; Parmar et al. 2008; Wason and Trippa 2014).
Acknowledgments This work is based on research arising from UK’s Medical Research Council
(MRC) grants MC_UU_00004/09, MC_UU_12023/29, and MC_UU_12023/20.
References
Aiken LS, West SG (1991) Multiple regression: testing and interpreting interactions. Sage,
Newbury Park
Antoniou M, Jorgensen AL, Kolamunnage-Dona R (2016) Biomarker-guided adaptive trial designs
in phase II and phase III: a methodological review. PLoS One 11(2):e0149803. https://doi.org/
10.1371/journal.pone.0149803
Antoniou M, Kolamunnage-Dona R, Jorgensen AL (2017) Biomarker-guided non-adaptive trial
designs in phase II and phase III: a methodological review. J Pers Med 7(1). https://doi.org/10.
3390/jpm7010001
Baker SG (2014) Biomarker evaluation in randomized trials: addressing different research ques-
tions. Stat Med 33:4139–4140
Buyse M, Michiels S, Sargent DJ, Grothey A, Matheson A, de Gramont A (2011) Integrating
biomarkers in clinical trials. Expert Rev Mol Diagn 11:171–182
Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM (2006) Measurement error in nonlinear
models: a modern perspective, 2nd edn. Chapman Hall/CRC, Boca Raton
Cook JA et al (2018) DELTA2 guidance on choosing the target difference and undertaking and
reporting the sample size calculation for a randomised controlled trial. BMJ 363. https://doi.org/
10.1136/bmj.k3750
Eng KH (2014) Randomized reverse marker strategy design for prospective biomarker validation.
Stat Med 33:3089–3099
European Medicines Agency. Reflection Paper on Methodological Issues Associated with
Pharmacogenomic Biomarkers in Relation to Clinical Development and Patient Selection.
Available online: http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guide
line/2011/07/WC500108672.pdf (Accessed on 10 Oct 2015)
FDA-NIH Working Group, Biomarkers, Endpoints and Other Tools (BEST) guidelines, updated
May 2018
Freidlin B (2010) Biomarker-adaptive clinical trial designs. Pharmacogenomics 11(12):1679–1682
Freidlin B, McShane LM, Korn EL (2010) Randomized clinical trials with biomarkers: design
issues. J Natl Cancer Inst 102:152–160
Freidlin B, Simon R (2005) Adaptive signature design: an adaptive clinical trial design for
generating and prospectively testing a gene expression signature for sensitive patients. Clin
Cancer Res 11(21):7872–7878
Freidlin B, McShane LM, Polley M-YC, Korn EL (2012) Randomized phase II trial designs with
biomarkers. J Clin Oncol 30:3304–3309
Freidlin B, Korn EL (2014) Biomarker enrichment strategies: matching trial design to biomarker
credentials. Nat Rev Clin Oncol 11:81–90
Gosho M, Nagashima K, Sato Y (2012) Study designs and statistical analyses for biomarker
research. Sensors 12:8966–8986
ICH GCP guidelines: https://www.ich.org/products/guidelines/efficacy/efficacy-single/article/
integrated-addendum-good-clinical-practice.html
Jones CL, Holmgren E (2007) An adaptive Simon two-stage Design for Phase 2 studies of targeted
therapies. Contemp Clin Trials 28(5):654–661
Kaplan R, Maughan T, Crook A, Fisher D, Wilson R, Brown L, Parmar M (2013)
Evaluating many treatments and biomarkers in oncology: a new design. J Clin Oncol 31(36):
4562–4568
Karuri SW, Simon R (2012) A two-stage Bayesian design for co-development of new drugs and
companion diagnostics. Stat Med 31(10):901–914
Lung-MAP: Master Protocol for Lung Cancer. https://www.cancer.gov/types/lung/research/lung-
map (Accessed 16 Jan 2021)
Mandrekar SJ, Sargent DJ (2009) Clinical trial designs for predictive biomarker validation:
theoretical considerations and practical challenges. J Clin Oncol 27(24):4027–4034
McShane LM, Hunsberger S, Adjei AA (2009) Effective incorporation of biomarkers into phase II
trials. Clin Cancer Res 15(6):1898–1905
Parmar MKB, Barthel FMS, Sydes M, Langley R, Kaplan R, Eisenhauer E et al (2008) Speeding up
the evaluation of new agents in cancer. J Natl Cancer Inst 100(17):1204–1214
Pennello G (2013) Analytical and clinical evaluation of biomarkers assays: when are biomarkers
ready for prime time? Clin Trials 10:666–676
Renfro LA, An MW, Mandrekar SJ (2016) Clinical trial designs incorporating predictive bio-
markers. Cancer Treat Rev 43:74–82
Simon R (2010) Clinical trial designs for evaluating the medical utility of prognostic and predictive
biomarkers in oncology. Pers Med 7:33–47
Stallard N, Hamborg T, Parsons N, Friede T (2014) Adaptive designs for confirmatory clinical
trials with subgroup selection. J Biopharm Stat 24:168–187
Tajik P, Zwinderman AH, Mol BW, Bossuyt PM (2013) Trial designs for personalizing cancer care:
a systematic review and classification. Clin Cancer Res 19:4578–4588
Wang S-J, Hung HMJ, O’Neill RT (2009) Adaptive patient enrichment designs in therapeutic trials.
Biom J Biometrische Zeitschrift 51(2):358–374
Wason JMS, Trippa L (2014) A comparison of Bayesian adaptive randomization and multi-stage
designs for multi-arm clinical trials. Stat Med 33(13):2206–2221
Diagnostic Trials
63
Madhu Mazumdar, Xiaobo Zhong, and Bart Ferket
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1172
Diagnostic Trial Type I: Evaluating Diagnostic Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1173
Assessment of Diagnostic Accuracy of Single Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1173
Comparing Diagnostic Accuracy of Multiple Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1174
Definitions of Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1175
Sensitivity and Specificity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1176
Positive and Negative Predictive Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1177
Receiver Operating Characteristics Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1177
The Area Under the ROC Curve (AUC) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1178
Sample Size Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1179
Reporting Diagnostic Trials for Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1180
Diagnostic Trial Type II: Diagnostic Randomized Clinical Trials for Assessment of Clinical
Effectiveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1182
Test-Treatment Trial Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1182
Evaluating a Single Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1183
Randomized Controlled Trial (RCT) of Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1183
Random Disclosure Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1184
Evaluating Multiple Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1186
Explanatory Versus Pragmatic Approaches for Test-Treatment Trials . . . . . . . . . . . . . . . . . . . . . . . . 1191
Reporting of Test-Treatment Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1191
Statistical Analysis and Sample Size Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1192
Economic Analysis in Test-Treatment Trials and Decision Models . . . . . . . . . . . . . . . . . . . . . . . 1192
Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1193
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1194
M. Mazumdar (*)
Director of Institute for Healthcare Delivery Science, Mount Sinai Health System, NY, USA
X. Zhong · B. Ferket
Icahn School of Medicine at Mount Sinai, New York, NY, USA
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1194
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1194
Abstract
The term diagnostic trial is generally used in two different ways. A diagnostic
trial type I describes studies that evaluate accuracy of diagnostic tests in detecting
disease or its severity. Primary endpoints for these studies are generally test
accuracy outcomes measured in terms of sensitivity, specificity, positive predic-
tive value, negative predictive value, and area under the receiver operating
characteristics curves. Although establishing an accurate diagnosis or excluding
disease is a critical first step to manage a health problem, medical decision-
makers generally rely on a larger evidence base of empirical data that includes
how tests impact patient health outcomes, such as morbidity, mortality, functional
status, and quality of life. Therefore, the diagnostic trial type II evaluates the
value of test results to guide or determine treatment decisions within a broader
management strategy. Typically, differences in diagnostic accuracy result in
differences in delivery of treatment, and ultimately affect disease prognosis and
patient outcomes. As such, in the diagnostic trial type II, the downstream
consequences of tests followed by treatment decisions are evaluated together in
a joint construct. These diagnostic randomized clinical trials or test-treatment
trials are considered the gold standard of proof for the clinical effectiveness or
clinical utility of diagnostic tests. In this chapter, we define the variety of accuracy
measures used for assessing diagnostic tests, summarize guidance on sample size
calculation, and bring attention to the importance of more accurate reporting of
study results.
Keywords
Diagnostic trial type I · Diagnostic trial type II · Test-treatment trial · Sensitivity ·
Specificity · Positive predictive value · Negative predictive value · Area under the
receiver operating characteristics curves
Introduction
Diagnostic tests (such as genetic or imaging tests) are health interventions used to
determine the existence or severity of a disease (Sun et al. 2013; Huang et al. 2017).
The development and introduction process of diagnostic tests is analogous to that of
other health technologies such as therapeutic drugs, and similarly the purpose of
diagnostic trials can be categorized according to different research development
phases, varying from exploratory studies to evaluation of clinical impact
(Pepe 2003). The field of diagnostic trials has grown tremendously in the last
40 years (Zhou et al. 2009).
The term diagnostic trial is generally used in two distinct ways in the literature.
The first, here labeled as diagnostic trial type I, is used for studies covering earlier
assess both test results and disease status after enrollment within a cohort study. Both
study designs, the case-control and the cohort design are subject to a variety of
biases. The two most frequently encountered forms of bias are spectrum bias (in
case-control studies) and verification bias (in cohort studies). Spectrum bias in case-
control studies occurs when subjects with more severe disease than generally
observed are selected as cases and healthier subjects are selected as controls. In
order to avoid an overestimation of diagnostic accuracy that results from such an
induced difference in case-mix between the study and target population, both cases
and controls should be randomly selected. Verification bias in cohort studies occurs
when the likelihood of obtaining true disease status depends on the results of the
diagnostic test. For example, invasive or expensive gold standard tests are oftentimes
solely or more frequently performed in subjects with positive test results. The
problem of verification bias in type I diagnostic trials is equivalent to missing
outcome data in cohort studies looking at exposure-outcome relationships. As
such, statistical inference about diagnostic accuracy would still be possible but
oftentimes requires the missingness at random (MAR) assumption. Solutions for
verification bias are available under MAR using, for example, the inverse of the
propensity of verification conditional on test results and other predictors of verifi-
cation (de Groot et al. 2011; Braga et al. 2012; Bruni et al. 2014; Kosinski and
Barnhart 2003). Verification bias often happens in studies in which it is not feasible
to obtain diagnostic results from the “gold standard” on subjects thought to be at low
risk. Thompson et al. (2005) studied the operating characteristics of prostate-specific
antigen (PSA), in which prostate biopsy, the gold standard, was only recommended
to men with PSA greater than 4.0 ng/ml or abnormal rectal examination results
(Thompson et al. 2005). Harel and Zhou (2006) found that multiple imputations
could help correct the verification bias by assuming that the missing information of
prostate biopsy would not relate to the true prostate cancer status, but may depend on
the values of PSA or rectal examination among some other variables (Harel and
Zhou 2006). As such, under the MAR assumption, we can still obtain asymptotic
unbiased estimates of diagnostic test performance results (e.g., sensitivity and
specificity) with full information maximum likelihood.
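To make the weighting idea concrete, the sketch below simulates verification that depends only on the test result (so MAR holds) and contrasts the naive sensitivity estimate among verified subjects with the inverse-probability-of-verification-weighted estimate. The verification propensities are known here by construction; in practice they would be estimated, e.g., by logistic regression on the test result and other predictors of verification.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

disease = rng.random(n) < 0.3                      # true status per the gold standard
test_pos = np.where(disease, rng.random(n) < 0.9,  # true sensitivity 0.90
                    rng.random(n) < 0.2)           # false-positive rate 0.20
p_verify = np.where(test_pos, 0.9, 0.3)            # positives verified more often (MAR given test)
verified = rng.random(n) < p_verify

# naive estimate restricted to verified subjects is biased upward here
sens_naive = (test_pos[verified] & disease[verified]).sum() / disease[verified].sum()

# weight each verified subject by 1 / P(verified | test result)
w = 1.0 / p_verify[verified]
sens_ipw = (w * (test_pos[verified] & disease[verified])).sum() / (w * disease[verified]).sum()
print(f"naive sensitivity {sens_naive:.3f}, IPW-corrected {sens_ipw:.3f}, true 0.900")
```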
procedure. In the uncontrolled setting, patients may undergo one of the two pro-
cedures due to a variety of reasons, such as lower cost, higher comorbidity, or
hospital policy. These reasons might not be documented if the study is of observa-
tional nature, hampering the necessary adjustment for confounding. Yet, even if a
diagnostic test is considered accurate in such a study after controlling for
confounding, it will not be clear whether the result is influenced by unobserved
confounders. In RCTs, patients undergo diagnostic procedures (i.e., experimental vs.
reference) according to the results of randomization. The randomization mechanism
rules out the potential impact of unmeasured confounders and thus helps to protect
the conclusion from biases (Braga et al. 2012).
For example, cervical cancer is one of the leading causes of cancer-related mortality
in sub-Saharan Africa. Visual inspection with acetic acid (VIA) is the standard test in
this setting, but visual inspection with Lugol’s iodine (VILI) is also a commonly
recommended diagnostic technique for detecting cervical cancer. Huchko et al. (2015)
conducted a randomized clinical trial to compare the diagnostic accuracy of VILI with
VIA among HIV-infected women in western Kenya (Huchko et al. 2015). The trial
enrolled 654 women, who were randomized to undergo either VILI or VIA with
colposcopy (1:1 ratio). Any lesion suspicious for cervical intraepithelial neoplasia 2 or
greater (CIN2+) was then biopsied as the gold standard for determining true disease
status. To maximize the statistical power in a two-arm RCT, the randomization ratio is
usually set at 1:1, so the numbers of patients undergoing the different procedures are
equal. However, randomization ratios other than 1:1 can also be used and might be
preferred for practical reasons, such as reducing costs or enhancing the feasibility of
recruitment into or execution of an RCT.
Paired Trial
When diagnostic tests do not interfere with each other and can be done in the same
study subject, a trial with a paired design might provide a more efficient alternative.
For example, Ahmed et al. (2017) reported a diagnostic trial with a paired design
comparing two imaging tests for prostate cancer (Ahmed et al. 2017). Men with high
serum prostate-specific antigen (PSA) usually undergo transrectal ultrasound-guided
prostate biopsy (TRUS-biopsy), which can cause side effects such as bleeding, pain,
and infection. Multi-parametric magnetic resonance imaging (MP-MRI) might allow
avoiding these side effects and improve diagnostic accuracy. To test this idea, 576
men were enrolled and underwent an MP-MRI followed by a TRUS-biopsy. At the
end of the study, a template prostate mapping (TPM) biopsy was conducted for each
patient, and the result was adopted as true disease status (gold standard). Diagnostic
comparison was made on the paired results for the competing tests.
Definitions of Accuracy
Table 1 Underlying statistics for evaluation of a diagnostic test with binary outcomes

                          True disease (gold standard)
Test results              Disease             No disease
  Positive                True positive       False positive
  Negative                False negative      True negative
defined by the gold standard (e.g., TPM biopsy) and rows indicating the results of
either experimental or reference procedure (e.g., MP-MRI or TRUS-biopsy). A
diagnostic test that leads to a high proportion of positive results among patients
with true disease, and a high proportion of negative results among patients without
true disease, indicates good diagnostic accuracy.
Table 2 Diagnostic results of MP-MRI and impact of change in disease prevalence: PROMIS trial

(A) Original MP-MRI with 40% prevalence
              Disease   No disease   Total
  Negative    17        141          158
  Positive    213       205          418
  Total       230       346          576

(B) Prevalence rate increased from 40% to 60%
              Disease   No disease   Total
  Negative    24        94           118
  Positive    322       136          458
  Total       346       230          576
Positive and Negative Predictive Values

Two other measures of accuracy commonly used in diagnostic trials are positive and
negative predictive values. Positive predictive value (PPV) answers the question
"How likely is it that a patient has the true disease given a positive result?" Negative
predictive value (NPV) answers the question "How likely is it that a patient does not
have the disease given a negative result?" In the example, 418 and 158 patients were
diagnosed as positive and negative, respectively, based on MP-MRI results. Thus,
the PPV was 0.51 (=213/418) and the NPV was 0.89 (=141/158). Unlike sensitivity
and specificity, PPV and NPV are affected by disease prevalence. Positive test
results are more likely in a high-prevalence population than in a low-prevalence
population (Trevethan 2019). If the disease prevalence of prostate cancer increases
from 40% to 60%, for the same diagnostic procedure with a sensitivity of 0.93 and a
specificity of 0.41, the PPV would increase from 0.51 to 0.70 (=322/458) and the
NPV would decrease from 0.89 to 0.80 (=94/118). Conversely, when disease
prevalence decreases, the PPV of a diagnostic procedure will decrease and its NPV
will increase, while the sensitivity and specificity remain constant (Table 2). PPV
and NPV are important because posttest probabilities eventually determine the
clinical impact of subsequent treatment and the test-treatment strategy as a whole.
When the PPV of a diagnostic procedure is high, more benefit can be expected from
an efficacious treatment, whereas when the NPV is high, less harm can be expected
from foregoing treatment.
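These relationships are easy to verify numerically. The sketch below recomputes the quantities from Table 2(A) and then applies Bayes' theorem to obtain the predictive values at a different prevalence.

```python
def diagnostic_summary(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and NPV from a 2x2 table."""
    return {"sens": tp / (tp + fn), "spec": tn / (tn + fp),
            "ppv": tp / (tp + fp), "npv": tn / (tn + fn)}

# counts from Table 2(A): MP-MRI at 40% prevalence
s = diagnostic_summary(tp=213, fp=205, fn=17, tn=141)
print({k: round(v, 2) for k, v in s.items()})  # sens 0.93, spec 0.41, ppv 0.51, npv 0.89

def predictive_values(sens, spec, prev):
    """PPV and NPV at an arbitrary prevalence via Bayes' theorem."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

ppv, npv = predictive_values(s["sens"], s["spec"], prev=0.60)
print(round(ppv, 2), round(npv, 2))  # roughly 0.70 and 0.80, as in Table 2(B)
```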
Receiver Operating Characteristics Curve

When the diagnostic test has a continuous scale, researchers may face multiple
possible cutoff points, and each cutoff point leads to a pair of sensitivity and specificity
values. For example, Park et al. (2004) reported a study in which 70 patients with
solitary pulmonary nodules underwent plain chest radiography to determine whether
the nodules were benign or malignant. Chest radiographs were interpreted according
to a five-point scale: 1-definitely benign, 2-probably benign, 3-possibly malignant, 4-
probably malignant, and 5-definitely malignant. Thus, a positive result in this study
could be defined using four possible cutoff points (2, 3, 4, and 5), each yielding a
different pair of sensitivity and specificity values. Note that sometimes a low score
relates to a positive test, for example, a lower cycle threshold (Ct)
Fig. 1 Operating points, empirical and smooth ROC curves in the radiograph study
due to its advantage in summarizing the variation of TPR and FPR across different
possible cutoff points. The accuracy of a diagnostic procedure in the ROC context is
widely measured by area under the ROC curve (AUC). The AUC provides the
average value of TPR given all the possible values of FPR. Considering both the
ranges of TPR and FPR are (0, 1), the AUC can take any value between 0 and 1. The
practical value of the AUC is reflected by a value that ranges from 0.5 (area under the
chance diagonal) to 1 (area under a ROC with perfect diagnostic ability). A higher
value of the AUC indicates better overall diagnostic performance. As with TPR and
FPR, the AUC is independent of disease prevalence. Considering that the AUC of a
diagnostic procedure from a trial is estimated based on a random sample, appropriate
statistical inference is necessary for making a conclusion, and the uncertainty around
the AUC is typically handled by a certain level of confidence interval (e.g., 95%).
We can estimate the AUC by using parametric (McClish 1989; Metz 1978) and
empirical methods (McClish 1989; Metz 1978; Obuchowski and Bullen 2018). Zhou
et al. (2009) reviewed the performances of both parametric and empirical estimators
of AUC (Zhou et al. 2009). When diagnostic procedures are evaluated based on
continuous (e.g., biomarker) or quasi-continuous (e.g., a percent-confidence scale
with range 0–100%) measurements, both empirical and parametric estimators per-
form well, and the bias is negligible. When outcomes are discrete (e.g., the five-point
radiography scale in the study above), the empirical method sometimes underestimates
the AUC. On the other hand, parametric methods rely on distributional assumptions
and sometimes perform poorly in small diagnostic trials.
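The empirical AUC reduces to the Mann-Whitney probability that a randomly chosen diseased subject receives a higher score than a randomly chosen non-diseased subject, counting ties as one half. A self-contained sketch, using made-up ratings on a five-point scale like the one in the radiograph study, is below.

```python
import numpy as np

def empirical_auc(scores_diseased, scores_healthy):
    """Empirical AUC via the Mann-Whitney statistic, counting ties as 1/2."""
    d = np.asarray(scores_diseased)[:, None]
    h = np.asarray(scores_healthy)[None, :]
    return (d > h).mean() + 0.5 * (d == h).mean()

# hypothetical five-point ratings (not the data from the study cited above)
diseased = [3, 4, 4, 5, 5, 5, 2, 4]
healthy = [1, 1, 2, 2, 3, 1, 2, 3]
print(f"empirical AUC = {empirical_auc(diseased, healthy):.2f}")
```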
Table 3 Statistical software for sample size calculation under a specific design

Software procedure | Design of diagnostic trial
PASS: Proportions/test for one-sample sensitivity and specificity | Observational study design for comparing sensitivity and specificity of a new diagnostic procedure to an existing standard procedure
PASS: Proportions/test for paired sensitivities and specificities | Matched-pair design for comparing sensitivities/specificities of two diagnostic procedures
PASS: Proportions/test for two independent sensitivities and specificities | Observational study design or RCT for comparing sensitivities/specificities of two diagnostic procedures between two independent samples
PASS: Proportion/confidence intervals for one-sample sensitivity | Observational study design for estimating a single sensitivity using confidence intervals
PASS: Proportion/confidence intervals for one-sample specificity | Observational study design for estimating a single specificity confidence interval
PASS: Proportion/confidence intervals for one-sample sensitivity and specificity | Observational study design for estimating both sensitivity and specificity confidence intervals, based on a specified sensitivity and specificity, interval width, confidence level, and prevalence
PASS: AUC-based test for one ROC curve | Observational study design for comparing the ROC curve of a new diagnostic procedure to a standard procedure
PASS: ROC/test for two ROC curves | Matched-pair design for comparing the AUCs of two diagnostic procedures
PASS: ROC/confidence intervals for the AUC | Observational study design for estimating a specified width of a confidence interval for the AUC

AUC area under the curve, RCT randomized clinical trial, ROC receiver operating characteristics.
Observational study includes cohort and case-control designs
sample size formula that can be applied to trials with a paired design (Simel et al.
1991; Beam 1992). A variety of formulations for different settings, offering improved
efficiency, have been provided by many authors (Flahault et al. 2005; Fosgate 2009;
Kumar and Indrayan 2011; Li and Fine 2004; Liu et al. 2005; Obuchowski 1998;
Steinberg et al. 2009). The statistical software PASS implements most of these
methods and is easy to use. Table 3 summarizes these procedures and the design
corresponding to each procedure.
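As a flavor of what these calculations involve, the sketch below implements one standard normal-approximation formula for the number of subjects needed so that a confidence interval for sensitivity has a prespecified half-width; dividing by prevalence converts the required number of diseased subjects into a total cohort size. This is a generic textbook formula, not the specific routine implemented in PASS.

```python
import math

def n_for_sensitivity(sens, half_width, prevalence, z=1.96):
    """Total cohort size so that a 95% confidence interval for sensitivity
    has the given half-width (normal approximation). Only diseased subjects
    contribute to sensitivity, hence the division by prevalence."""
    n_diseased = z**2 * sens * (1 - sens) / half_width**2
    return math.ceil(n_diseased / prevalence)

# e.g., anticipated sensitivity 0.90, CI half-width 0.05, prevalence 30%
print(n_for_sensitivity(sens=0.90, half_width=0.05, prevalence=0.30))  # 461
```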
Reporting Diagnostic Trials for Accuracy

Several surveys have shown that studies evaluating diagnostic accuracy often fail to
transparently describe core elements of design and analysis, including how the
cohort was selected and what design parameters the sample size was based on, as
well as to comprehensively describe the study findings and how they will impact
clinical practice (Korevaar et al. 2014, 2015; Lijmer et al. 1999). They also found that
the recommendations from these studies are often unnecessarily generous and
There are circumstances when diagnostic accuracy results from type I diagnostic
trials are considered sufficient to extrapolate about net health benefits. Yet, further
empirical evidence is often needed about how the test affects longer-term outcomes.
One way to better evaluate the potential utility of diagnostic tests is to investigate
how well test results match with future patient outcomes by determining the prog-
nostic value and/or the ability to modify treatment effects (predictive value). The
latter should be generally assessed in a randomized setting with testing performed at
baseline in all patients prior to the randomization to treatment(s) (Lijmer and
Bossuyt 2009). However, diagnostic tests are seldom used on their own, independent
of treatment. Test results generally guide or determine treatment decisions as part of
a broader management strategy. Thus, differences in diagnostic accuracy will likely
result in differences in delivery of treatment, which will ultimately affect disease
prognosis and patient outcomes. As such, the downstream consequences of tests
followed by treatment decisions should be evaluated together. The diagnostic
randomized clinical trial or the so-called test-treatment trial design is considered
the gold standard to provide such proof for the clinical effectiveness or clinical utility
of diagnostic tests (Ferrante di Ruffano et al. 2012, 2017). Sometimes consequences
beyond those for the health of patients need to be considered as well, including those
for use of resources in the healthcare sector and/or society. The goal is then for the
test-treatment trial to also provide evidence for changes in efficiency of care by
evaluation of economic outcomes, i.e., cost-effectiveness or cost-utility analysis. In
this chapter, a clinical perspective is taken for discussing the concepts of designing
test-treatment trials, although the broader healthcare sector and societal perspectives
are briefly discussed as well.
The design of the trial depends on the application type and the certainty about the
diagnostic accuracy of the test(s), as well as the added value of disclosing test
results and treatment effectiveness.
This design can be used for comparing health outcomes following a new or
established diagnostic test to health outcomes from a comparator no test strategy
(in the most extreme case defined as treat all or treat none). As the comparator
strategy does not rely on testing, the randomization concerns the decision whether to
perform the testing or not. This trial can answer the question whether it would be
beneficial to avoid treatment in those who test negative, when the comparator is a
treat all strategy, or to offer treatment to those who test positive, when the compar-
ator is a treat none strategy. This scenario is illustrated in Fig. 3. The no test strategy
is oftentimes defined as usual care in which diagnostic and therapeutic interventions
following randomization are not protocolized by the investigators. For example, a
randomized trial was conducted in low-risk pregnant women to evaluate whether
routine ultrasonography in the third trimester improves severe adverse perinatal
outcomes compared with usual care (Henrichs et al. 2019). Routine ultrasonography
was associated with a higher antenatal detection of small for gestational age fetuses,
higher incidence of induction of labor, and lower incidence of augmentation of labor.
However, it did not significantly improve severe adverse perinatal and maternal
peripartum outcomes.
Fig. 3 Trial of testing with treatment as comparator (adapted from Lijmer and Bossuyt 2009)
Fig. 4 Random disclosure trial, with treatment as comparator (adapted from Lijmer and Bossuyt 2009)
For many medical decision problems, the question is not whether to test or not to
test, but which test or which combination of tests to use. Conceptually, the trial
design concerning such research questions is equivalent to designing a trial for a
single test as outlined above, with some modifications.
Fig. 5 Trial comparing two different tests (Lijmer and Bossuyt 2009)
In general, results from pragmatic trials are considered more generalizable to current
practice than those from explanatory trials. Pragmatic test-treatment trials enroll
patients within a standard care setting and allow both clinical decision-making at the
discretion of the clinician and patient nonadherence to recommended care (Ferrante
di Ruffano et al. 2017). The PROMISE trial enrolled patients with stable chest pain
who were predominantly at high cardiovascular risk and for whom noninvasive
cardiovascular testing was considered necessary in an outpatient setting (Douglas
et al. 2015). The data coordinating center ensured all study sites had experienced
staff and used diagnostic procedures in agreement with guidelines. Local physicians
made all clinical management decisions at their discretion based on the test results in
both the functional testing and CTA study arms. Follow-up visits were scheduled at
60 days and at 6-month intervals after randomization. Clinical events were adjudi-
cated in a blinded fashion by an independent committee.
Likewise, the SCOT-HEART trial was conducted within a standard care outpa-
tient setting and management following diagnostic testing was done at the discretion
of the clinician in both study arms (Newby et al. 2018). There were, however, no
trial-specific visits planned, and routinely collected data on events were used from
the Information and Statistics Division and the electronic Data Research and Inno-
vation Service of the National Health Service (NHS) Scotland. There was no formal
event adjudication committee involved in the study, and study end points were
classified using diagnostic codes and procedural codes from discharge records.
Thus, both the PROMISE and SCOT-HEART trials can be considered pragmatic,
although the SCOT-HEART trial may be considered more so.
Nonetheless, findings from any trial (pragmatic or less pragmatic in nature) are more
likely to be translated into clinical practice when the protocol is published and
provides transparent and detailed information on all test-treatment pathways that
should be followed (e.g., decision trees or flow diagrams). Results should include
data on the diagnoses that were made, as well as data on how the test results may
have impacted subsequent clinical decision-making and outcomes. For example,
treatment decisions should be reported with stratification by test results to show the
extent to which clinical decisions were guided by the recommended test-treatment
protocols (Ferrante di Ruffano et al. 2017). The Template for Intervention Descrip-
tion and Replication (TIDieR) checklist and guide can be used to improve the
reporting of test-treatment trials.
Statistical analysis plans and sample size estimations depend on the design of test-
treatment trials and outcome type (e.g., binary, continuous, time-to-event). For two-
arm designs, as in single test and comparative test trials, the statistical analysis is
relatively straightforward and preferably performed using the intention-to-treat
principle. Sample size and power formulas depend on disease prevalence, sensitiv-
ities, and specificities of test(s), and response rate of the treatment(s). For the paired
design used in discordant test results trials, formulas are more complicated, because
overall treatment response rates depend on how many patients had discordant test
results. This in turn is a function of the total number of patients, disease prevalence,
sensitivities, and specificities (Hooper et al. 2013; Lu and Gatsonis 2013).
When new tests are evaluated against conventional tests or standard care,
it is critical for decision-makers to consider whether replacing existing care by
implementing the new test strategy would be cost-effective. Contemporary clinical
trials therefore frequently include a secondary economic analysis or analysis that
integrates clinical effectiveness and cost implications, i.e., cost-utility or cost-effective-
ness analysis. For example, the PROMISE trialists conducted an economic sub-study in
which costs of the initial outpatient testing strategies were estimated from administra-
tive data, costs of hospitalizations were estimated from uniform billing claims data, and
physician fees were assessed based on reimbursement rates (Mark et al. 2016).
The preferred health outcome used in a more comprehensive cost-effectiveness
analysis is the quality-adjusted life year (QALY), which combines morbidity and
mortality by considering both generic health-related quality of life and survival time.
When the cost-effectiveness analysis is conducted alongside an RCT, average
cumulative QALYs can be estimated as the integral of quality of life utility, repeatedly
scored on a scale from 0 (death) to 1 (perfect health), for each participant during
the trial follow-up period (Glasziou et al. 1998). Utility scores are generally derived
from generic quality of life questionnaires that can be mapped to community
preference weights obtained by standard gamble or time trade-off methods. Costs
can be estimated using a micro-costing approach or a more indirect gross-costing
approach, as was done in the PROMISE trial.
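A minimal sketch of this QALY estimator, assuming hypothetical assessment times and utility scores, and using the trapezoidal rule as one simple approximation to the integral:

```python
import numpy as np

def qalys(times_years, utilities):
    """Cumulative QALYs for one participant: the area under the
    utility curve (0 = death, 1 = perfect health) over follow-up,
    approximated by the trapezoidal rule."""
    t = np.asarray(times_years, dtype=float)
    u = np.asarray(utilities, dtype=float)
    return float(np.sum((u[1:] + u[:-1]) / 2 * np.diff(t)))

# Hypothetical participant assessed at baseline, 6, 12, and 24 months
print(round(qalys([0.0, 0.5, 1.0, 2.0], [0.70, 0.80, 0.85, 0.82]), 3))
# about 1.62 QALYs accrued over 2 years of follow-up
```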
Table 4 Key differences between test-treatment trials and decision models for evaluating utility of medical tests (adapted from Bossuyt et al. (2012), PMID 22730450)

Test-treatment trial: Can compare few competing test-treatment strategies
Decision model: Can compare multiple competing test-treatment strategies

Test-treatment trial: Can evaluate a limited number of effectiveness and safety outcomes
Decision model: Can evaluate all relevant effectiveness and safety outcomes based on multiple sources

Test-treatment trial: Restricted by a limited time horizon
Decision model: Lifetime horizon is possible

Test-treatment trial: Uses empirical data; correlations can be observed and accounted for
Decision model: Assumptions about model structure and parameters need to be made
For example, in the Netherlands, a cost-
utility analysis was performed parallel to a randomized controlled trial to determine
the cost-effectiveness of early referral for MR imaging by general practitioners
versus usual care alone in patients with traumatic knee symptoms. QALYs and
costs were estimated over the trial duration from a healthcare and societal perspec-
tive. Results from this analysis showed that MR imaging referral was more costly
(mean costs €1109 vs €837) and less effective (mean QALYs 0.888 vs 0.899). Thus,
usual care was deemed the dominant strategy for patients with traumatic knee
symptoms (van Oudenaarde et al. 2018).
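The dominance logic behind this conclusion can be stated in a few lines. In the sketch below the function name is hypothetical, and the cost and QALY figures are those reported above for the MR imaging trial, with usual care as the reference strategy:

```python
def incremental_summary(cost_new, cost_ref, qaly_new, qaly_ref):
    """Incremental cost and effectiveness of a new strategy versus a
    reference; the new strategy is dominated if it costs more and
    yields fewer QALYs (no ICER is then reported)."""
    d_cost, d_qaly = cost_new - cost_ref, qaly_new - qaly_ref
    if d_cost > 0 and d_qaly < 0:
        return d_cost, d_qaly, "dominated by the reference strategy"
    return d_cost, d_qaly, f"ICER = {d_cost / d_qaly:.0f} per QALY"

# Costs 1109 vs 837 euros; QALYs 0.888 vs 0.899 (usual care as reference)
print(incremental_summary(1109, 837, 0.888, 0.899))
```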
Test-treatment trials are, however, often limited by the number of strategies that can
be evaluated, follow-up duration, and ability to evaluate outcomes with low event rates
(e.g., radiation risk) or large variability (cost data). For these reasons, decision models
are increasingly used to assess the cost-effectiveness of diagnostic tests by combining
and linking data from multiple sources. These sources can include evidence regarding
diagnostic accuracy, disease prevalence, immediate and future adverse event rates,
treatment efficacy, and quality of life. Similar to a cost-effectiveness analysis conducted
alongside the trial, decision models generally analyze outcomes relevant to the
healthcare sector and society. As such, preference weights and costs from multiple
components (formal and informal health care, and non-health care sectors) are assigned
to the modeled tests, treatments, events, and health states. Key differences in charac-
teristics of empirical test-treatment trials and decision models are listed in Table 4.
Decision models can also be used to extrapolate cost-effectiveness outcomes
beyond the trial duration, when it is expected that the trial follow-up is too short to
capture all potential future benefits and harms. For example, in the economic analysis
of the PROMISE trial, a parametric model was used to extrapolate costs from 90 days
to 3 years (Mark et al. 2016). The analysis showed that CTA and conventional
diagnostic testing resulted in similar costs through 3 years of follow-up.
in the target patient population. Key questions to ask are how good is a diagnostic
test at providing the desired answers concerning these outcomes, and what rules of
evidence should be used to judge the value of new tests. The ultimate determinant is
whether the clinical intervention imposed based on the diagnostic test result truly
helped to improve a relevant clinical metric for patients. The encompassing field of
diagnostic trials helps address these questions through the various designs pre-
sented in this chapter. Researchers must weigh the strengths and weaknesses of each
of these designs, compute sample sizes with an eye toward feasibility, and report all
results transparently to ensure that the new information obtained is useful for clinical
practice and future studies.
Key Facts
• The field of diagnostic trials has grown tremendously in the last 40 years. Two
types of diagnostic trials have emerged: (I) studies that estimate and compare
accuracy of diagnostic procedures and (II) studies that estimate and compare
effectiveness of a treatment pathway triggered by specific results of diagnostic
tests.
• Each type of diagnostic trial has a variety of design options, and related sample
size computation and software tools are now available.
• Guidelines for reporting designs and results of diagnostic trials remain underused.
Cross-References
References
Ahmed HU, El-Shater Bosaily A, Brown LC, Gabe R, Kaplan R, Parmar MK, Collaco-Moraes Y et
al (2017) Diagnostic accuracy of multi-parametric MRI and TRUS biopsy in prostate cancer
(PROMIS): a paired validating confirmatory study. Lancet 389(10071):815–822. https://doi.org/
10.1016/s0140-6736(16)32401-1
Beam CA (1992) Strategies for improving power in diagnostic radiology research. AJR Am J
Roentgenol 159(3):631–637. https://doi.org/10.2214/ajr.159.3.1503041
Begg CB, Greenes RA (1983) Assessment of diagnostic tests when disease verification is subject to
selection bias. Biometrics 39(1):207–215
Bossuyt PM, Reitsma JB, Linnet K, Moons KG (2012) Beyond diagnostic accuracy: the clinical
utility of diagnostic tests. Clin Chem 58(12):1636–1643. https://doi.org/10.1373/clinchem.2012.
182576
Braga LH, Farrokhyar F, Bhandari M (2012) Confounding: what is it and how do we deal with it?
Can J Surg 55(2):132–138. https://doi.org/10.1503/cjs.036311
Bruni L, Barrionuevo-Rosas L, Albero G, Serrano B, Mena M, Gómez D, Muñoz J, Bosch FX, de
Sanjosé S (2014) Human papillomavirus and related diseases report. L’Hospitalet de Llobregat:
ICO Information Centre on HPV and Cancer
Cardoso F, Piccart-Gebhart M, Van’t Veer L, Rutgers E (2007) The MINDACT trial: the first
prospective clinical validation of a genomic tool. Mol Oncol 1(3):246–251. https://doi.org/10.
1016/j.molonc.2007.10.004
Colli A, Fraquelli M, Casazza G, Conte D, Nikolova D, Duca P, Thorlund K, Gluud C (2014) The
architecture of diagnostic research: from bench to bedside–research guidelines using liver
stiffness as an example. Hepatology 60(1):408–418. https://doi.org/10.1002/hep.26948
de Groot JA, Bossuyt PM, Reitsma JB, Rutjes AW, Dendukuri N, Janssen KJ, Moons KG (2011)
Verification problems in diagnostic accuracy studies: consequences and solutions. BMJ 343:
d4770. https://doi.org/10.1136/bmj.d4770
Douglas PS, Hoffmann U, Patel MR, Mark DB, Al-Khalidi HR, Cavanaugh B, Cole J et al (2015)
Outcomes of anatomical versus functional testing for coronary artery disease. N Engl J Med 372
(14):1291–1300. https://doi.org/10.1056/NEJMoa1415516
Faraggi D, Reiser B (2002) Estimation of the area under the ROC curve. Stat Med 21(20):3093–
3106. https://doi.org/10.1002/sim.1228
Ferrante di Ruffano L, Dinnes J, Taylor-Phillips S, Davenport C, Hyde C, Deeks JJ (2017) Research
waste in diagnostic trials: a methods review evaluating the reporting of test-treatment interven-
tions. BMC Med Res Methodol 17(1):32. https://doi.org/10.1186/s12874-016-0286-0
Ferrante di Ruffano L, Hyde CJ, McCaffery KJ, Bossuyt PM, Deeks JJ (2012) Assessing the value
of diagnostic tests: a framework for designing and evaluating trials. BMJ 344:e686. https://doi.
org/10.1136/bmj.e686
Fosgate GT (2009) Practical sample size calculations for surveillance and diagnostic investigations.
J Vet Diagn Investig 21(1):3–14. https://doi.org/10.1177/104063870902100102
Flahault A, Cadilhac M, Thomas G (2005) Sample size calculation should be performed for design
accuracy in diagnostic test studies. J Clin Epidemiol 58(8):859–862. https://doi.org/10.1016/j.
jclinepi.2004.12.009
Glasziou PP, Cole BF, Gelber RD, Hilden J, Simes RJ (1998) Quality adjusted survival analysis
with repeated quality of life measures. Stat Med 17(11):1215–1229. https://doi.org/10.1002/
(sici)1097-0258(19980615)17:11<1215::aid-sim844>3.0.co;2-y
Gluud C, Gluud LL (2005) Evidence based diagnostics. BMJ 330(7493):724–726
Hajian-Tilaki KO, Hanley JA, Joseph L, Collet JP (1997) A comparison of parametric and
nonparametric approaches to ROC analysis of quantitative diagnostic tests. Med Decis Mak
17(1):94–102. https://doi.org/10.1177/0272989x9701700111
Harel O, Zhou XH (2006) Multiple imputation for correcting verification bias. Stat Med 25(22):
3769–3786. https://doi.org/10.1002/sim.2494
Henrichs J, Verfaille V, Jellema P, Viester L, Pajkrt E, Wilschut J, van der Horst HE, Franx A, de
Jonge A (2019) Effectiveness of routine third trimester ultrasonography to reduce adverse
perinatal outcomes in low risk pregnancy (the IRIS study): nationwide, pragmatic, multicentre,
stepped wedge cluster randomised trial. BMJ 367:l5517. https://doi.org/10.1136/bmj.l5517
Hooper R, Díaz-Ordaz K, Takeda A, Khan K (2013) Comparing diagnostic tests: trials in people
with discordant test results. Stat Med 32(14):2443–2456. https://doi.org/10.1002/sim.5676
Huchko MJ, Sneden J, Zakaras JM, Smith-McCune K, Sawaya G, Maloba M, Bukusi EA, Cohen
CR (2015) A randomized trial comparing the diagnostic accuracy of visual inspection with
acetic acid to visual inspection with Lugol’s iodine for cervical cancer screening in HIV-infected
women. PLoS One 10(4):e0118568. https://doi.org/10.1371/journal.pone.0118568
Huang EP, Lin FI, Shankar LK (2017) Beyond correlations, sensitivities, and specificities: a
roadmap for demonstrating utility of advanced imaging in oncology treatment and clinical
trial design. Acad Radiol 24(8):1036–1049. https://doi.org/10.1016/j.acra.2017.03.002
Obuchowski NA (1998) Sample size calculations in studies of test accuracy. Stat Methods Med Res
7(4):371–392. https://doi.org/10.1177/096228029800700405
Ogilvie JC, Douglas Creelman C (1968) Maximum-likelihood estimation of receiver operating
characteristic curve parameters. J Math Psychol 5(3):377–391
Obuchowski NA, Bullen JA (2018) Receiver operating characteristic (ROC) curves: review of
methods with applications in diagnostic medicine. Phys Med Biol 63(7):07tr01. https://doi.org/
10.1088/1361-6560/aab4b1
Park SH, Goo JM, Jo CH (2004) Receiver operating characteristic (ROC) curve: practical review
for radiologists. Korean J Radiol 5(1):11–18. https://doi.org/10.3348/kjr.2004.5.1.11
Pepe MS (2003) The statistical evaluation of medical tests for classification and prediction.
Oxford University Press, Oxford
Sackett DL, Haynes RB (2002) The architecture of diagnostic research. BMJ 324(7336):539–541.
https://doi.org/10.1136/bmj.324.7336.539
Simel DL, Samsa GP, Matchar DB (1991) Likelihood ratios with confidence: sample size estimation
for diagnostic test studies. J Clin Epidemiol 44(8):763–770. https://doi.org/10.1016/0895-4356
(91)90128-v
Steinberg DM, Fine J, Chappell R (2009) Sample size for positive and negative predictive value in
diagnostic research using case-control designs. Biostatistics 10(1):94–105. https://doi.org/10.
1093/biostatistics/kxn018
Sun F, Schoelles KM, Coates VH (2013) Assessing the utility of genetic tests. J Ambul Care
Manage 36(3):222–232. https://doi.org/10.1097/JAC.0b013e318295d7e3
Swets JA (1986) Indices of discrimination or diagnostic accuracy: their ROCs and implied models.
Psychol Bull 99(1):100–117
Thompson IM, Ankerst DP, Chen C, Scott Lucia M, Goodman PJ, Crowley JJ, Parnes HL, Coltman
CA (2005) Operating characteristics of prostate-specific antigen in men with an initial PSA level
of 3.0 ng/ml or lower. JAMA 294(1):66–70
Trevethan R (2019) Response: commentary: sensitivity, specificity, and predictive values: founda-
tions, pliabilities, and pitfalls in research and practice. Front Public Health 7:408. https://doi.org/
10.3389/fpubh.2019.00408
van Oudenaarde K, Swart NM, Bloem JL, Bierma-Zeinstra SMA, Algra PR, Bindels PJE, Koes BW
et al (2018) General practitioners referring adults to MR imaging for knee pain: a randomized
controlled trial to assess cost-effectiveness. Radiology 288(1):170–176. https://doi.org/10.1148/
radiol.2018171383
Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J (2004) Sources of variation
and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 140(3):189–
202. https://doi.org/10.7326/0003-4819-140-3-200402030-00010
Walsh SJ (1997) Limitations to the robustness of binormal ROC curves: effects of model mis-
specification and location of decision thresholds on bias, precision, size and power. Stat Med 16
(6):669–679. https://doi.org/10.1002/(sici)1097-0258(19970330)16:6<669::aid-sim489>3.0.
co;2-q
Zhou X-H, McClish DK, Obuchowski NA (2009) Statistical methods in diagnostic medicine. John
Wiley & Sons
Designs to Detect Disease Modification
64
Michael P. McDermott
Contents
Introduction 1200
Standard Single-Period Designs 1201
Two-Period Designs 1202
  Withdrawal Design 1202
  Delayed Start Design 1204
  Assumptions 1206
  Eligibility Criteria 1208
  Duration of Follow-up Periods 1209
Statistical Considerations for Two-Period Designs 1210
  Primary Analyses 1210
  Strategies for Accommodating Missing Data 1212
  Sample Size Determination 1214
Summary and Conclusion 1215
Cross-References 1216
References 1216
Abstract
Designing a trial to determine whether or not an intervention has modified the
underlying course of the disease is straightforward for certain conditions, such as
cancer, in which it is possible to directly measure the disease course. For many
other diseases, the disease course is latent, and one must rely on indirect measures
such as clinical symptoms to quantify the effects of interventions. In this case, it is
difficult with conventional trial designs to determine the extent to which
the treatment is modifying the disease course as opposed to merely alleviating
the symptoms of the disease. This distinction has become critically important in
M. P. McDermott (*)
Department of Biostatistics and Computational Biology, University of Rochester Medical Center,
Rochester, NY, USA
e-mail: [email protected]
Keywords
Two-period design · Withdrawal design · Delayed start design · Disease-
modifying effect · Symptomatic effect · Alzheimer’s disease · Parkinson’s
disease · Missing data · Noninferiority
Introduction
For many diseases, it is not possible to directly observe the underlying disease
process. Instead, clinical symptoms and/or function or, in some cases, even labora-
tory or biological markers might serve as indirect measures of this process. Exam-
ples of such conditions include diabetic peripheral neuropathy, depression, anemia,
osteoporosis, and neurodegenerative disease (e.g., Alzheimer’s disease, Parkinson’s
disease, and Huntington’s disease). In recent years, interest has increased substan-
tially in the problem of designing clinical trials to determine whether a treatment has
modified the underlying course of the disease or has merely exerted its effect on
disease symptoms.
This heightened interest in trial designs that can detect disease modification has
been most prominent in the area of neurodegenerative disease, specifically in
Alzheimer’s disease (AD) and Parkinson’s disease (PD). Although a wide variety
of effective treatments have been developed for these conditions, none have been
conclusively shown to modify the underlying course of the disease, and most are
believed to only alleviate disease symptoms. The discovery of a treatment that either
slows, halts, or even reverses underlying disease progression has been termed “the
highest priority in PD research” (Olanow et al. 2008). The aging of the population
has raised grave concerns regarding the global public health crisis posed by AD
(Cummings 2017) and PD (Dorsey and Bloem 2018), exacerbating the need for
disease-modifying treatments.
The term disease modification implies that the treatment has an enduring effect on
the course of the underlying disease. Modifications of a key pathological feature of
the disease, such as tau and β-amyloid protein levels in the brain in AD (Kaye 2000)
Standard Single-Period Designs
It has been suggested by some that standard parallel group designs can be used to
infer a disease-modifying effect of an intervention by examining whether the pattern
of group differences in mean responses on a suitable clinical rating scale diverges
over time (Guimaraes et al. 2005; Vellas et al. 2008). For example, if the pattern
of change over time is linear in each treatment group, a group difference in the rate of
change (slope) would indicate an effect of treatment on the underlying progression of
the disease. The trouble with this interpretation is that such results are also
Two-Period Designs
Withdrawal Design
In a seminal paper, Leber (1996) formally proposed the use of two-period designs to
attempt to distinguish between the symptomatic and disease-modifying effects of an
intervention. In the withdrawal design, participants are randomly assigned to receive
either active treatment or placebo in the first period (Period 1) and followed for a
fixed length of time. In the second period (Period 2), those who were receiving active
treatment are switched to placebo (A/P group), and those who were receiving
placebo remain on placebo (P/P group) (Fig. 1). Period 1 is chosen to be sufficiently
long to permit the emergence of a measurable disease-modifying effect of the
treatment. Period 2 is chosen to be long enough to eliminate (or “wash out”) any
symptomatic effect of the treatment from Period 1; the two periods do not have to be
of equal length. The purpose of the withdrawal maneuver is to determine whether
any portion of the treatment effect that is apparent at the end of Period 1 persists after
withdrawal of treatment, i.e., to distinguish between the short-term symptomatic
effect and the long-term disease-modifying effect. In theory, any difference in mean
response at the end of Period 2 in favor of the A/P group can be attributed to a
disease-modifying effect of the treatment.
A key assumption of the withdrawal design is the adequacy of the length of the
withdrawal period (Period 2). Consider, for example, the Early vs. Late L-dopa in
Fig. 2 Illustration of the results of a trial of a disease-modifying treatment using a delayed start
design. In this design, participants are randomly assigned to receive either active (A) or placebo (P)
treatment in Period 1 followed by active treatment for all participants in Period 2. The notation “P/
A” indicates the group that received placebo treatment in Period 1 followed by active treatment in
Period 2. The plotted points are mean changes from baseline in the 13-item Alzheimer’s Disease
Assessment Scale-Cognitive Subscale (ADAS-Cog 13) score, where positive changes indicate
worsening. Disease modification is supported by a persisting difference in mean response between
the P/A and A/A groups at the end of Period 2, with evidence that the group difference in mean
response is not continuing to decrease over time near the end of this period
trials, one for the 1 mg/day dosage and one for the 2 mg/day dosage of rasagiline. The
trial unexpectedly produced conflicting results (Olanow et al. 2009). While the 1 mg/
day dosage yielded a pattern of mean UPDRS total scores over time that would be
expected from a drug that had at least a partial disease-modifying effect, the 2 mg/day
dosage did not demonstrate evidence of a disease-modifying effect as the delayed start
(P/A) group “caught up” to the early start (A/A) group in terms of mean response
during Period 2, as measured by the UPDRS total score.
Like the withdrawal design, the delayed start design has the problem that there is
no blinding with respect to the treatment received during Period 2. Again, one could
add a third randomized group to the study in which participants remain on placebo
throughout the trial (P/P) to address this problem, with relatively few participants
assigned to this group since it would have no value in distinguishing between the
disease-modifying and symptomatic effects of the treatment (McDermott et al.
2002). The addition of this third group, which would never receive active treatment,
might make it more difficult to recruit participants in the trial.
Assumptions
Simplified statistical models for the withdrawal and delayed start designs can be used
to illustrate the assumptions that each of these designs requires. Suppose that a
normally distributed outcome variable Y is measured on each participant at the end of
Period 1 (Y1) and at the end of Period 2 (Y2). A typical analysis of data from this
design would incorporate the additional longitudinal data collected and would likely
include certain covariates such as enrolling center and the baseline value of the
outcome variable, but these will be ignored here for simplicity. Additional details
regarding these models are described elsewhere (McDermott et al. 2002).
The models for the mean responses at the end of each period for the withdrawal
and delayed start designs are provided in Table 1. At the end of Period 1, participants
receiving placebo (i.e., those in the P/P and P/A groups) have a mean response μ1,
but participants receiving active treatment (i.e., those in the A/P and A/A groups)
have a mean response that also includes a treatment effect that is assumed to be a sum
of two components: a symptomatic effect (θS) and a disease-modifying effect (θD).
The data at the end of Period 1 can only be used to estimate the total treatment effect,
θS + θD, in that period; they cannot distinguish between these two components. In the
withdrawal design, for example, the difference in mean response between the A/P
and P/P groups would estimate θS + θD. Similarly, in the delayed start design, the
difference in mean response between the A/A and P/A groups would also estimate
θS + θD. The data from Period 2 are used to attempt to distinguish between the
symptomatic and disease-modifying components of that effect.
In the withdrawal design, participants who received placebo in both periods (P/P)
have a mean response μ2 at the end of Period 2. For the A/P group, which had active
treatment withdrawn in Period 2, it is assumed that the disease-modifying effect
acquired from active treatment during Period 1 is retained at the end of Period 2, but
that any symptomatic effect acquired during Period 1 disappears by the end of Period
2. The mean response in this group at the end of Period 2 is, therefore, μ2 + θD.
In the delayed start design, the P/A group receives active treatment in Period 2;
therefore, the mean response in this group at the end of Period 2 is μ2 + λT, i.e.,
Table 1 Statistical models for mean responses in the withdrawal and delayed start designs

Design         Group                    End of Period 1    End of Period 2
Withdrawal     P/P                      μ1                 μ2
               A/P                      μ1 + θS + θD       μ2 + θD
               Difference (A/P – P/P)   θS + θD            θD
Delayed start  P/A                      μ1                 μ2 + λT
               A/A                      μ1 + θS + θD       μ2 + θD + δT
               Difference (A/A – P/A)   θS + θD            θD + δT – λT

Group indicates the Period 1/Period 2 treatment assignments, with P = placebo and A = active
θS = symptomatic effect acquired during Period 1
θD = disease-modifying effect acquired during Period 1
λT = total treatment effect (symptomatic + disease-modifying) acquired during Period 2 by the P/A group
δT = total treatment effect (symptomatic + disease-modifying) acquired during Period 2 by the A/A group
is augmented by a total treatment effect λT acquired during this period that could
consist of both symptomatic and disease-modifying components. Note, however,
that λT is not necessarily equal to θS + θD since the total treatment effect acquired
during Period 2 might not be the same as that acquired during Period 1. In the A/A
group, the mean response at the end of Period 2 is μ2 + θD + δT; it is assumed that this
group retains the disease-modifying effect (θD) and loses the symptomatic effect (θS)
acquired during Period 1 but also acquires a total treatment effect δT during Period 2
that might differ from that acquired by the P/A group.
The important assumptions of the withdrawal and delayed start designs are
illustrated by this simple model for the mean responses: (1) Period 1 is of sufficient
duration to permit the emergence of a measurable disease-modifying effect θD; (2)
the disease-modifying effect θD acquired during Period 1 persists at least through the
end of Period 2, but presumably longer; (3) Period 2 is of sufficient duration for the
symptomatic effect from Period 1 (θS) to completely disappear by the end of Period
2; and (4) withdrawal of active treatment does not modify (e.g., hasten) the disease
process in some way.
It can be seen from Table 1 that in the withdrawal design, the difference in
observed mean response between the A/P and P/P groups at the end of Period 2
will be an unbiased estimate of θD, the disease-modifying effect, under the assumed
statistical model. In the delayed start design, however, the difference in observed
mean response between the A/A and P/A groups at the end of Period 2 will not be an
unbiased estimate of θD under this model unless λT = δT, i.e., unless the total
treatment effect acquired during Period 2 is the same for the P/A and A/A groups.
The assumption, therefore, is that the total (symptomatic + disease-modifying) effect
of treatment received in Period 2 is the same regardless of whether or not the
participant received treatment during Period 1. Because of this assumption, it is
important to ensure that the duration of Period 2 is sufficient to allow the symptom-
atic effect of the treatment to become fully apparent in the P/A group.
Although the assumption that λT = δT is necessary in the delayed start design to
interpret the difference in observed mean response between the A/A and P/A groups
at the end of Period 2 as the magnitude of the disease-modifying effect of the
treatment, it is not a testable assumption in this design. The assumption could be
tested, however, using data from a complete two-period design (McDermott et al.
2002), i.e., a combination of the withdrawal and delayed start designs that includes
all four treatment arms (P/P, P/A, A/P, A/A) (Fig. 3). In this design, an unbiased
estimate of λT is the difference in observed mean response between the P/A and P/P
groups at the end of Period 2 (Table 1). Similarly, an unbiased estimate of δT is the
difference in observed mean response between the A/A and A/P groups at the end of
Period 2 (Table 1). A test of the null hypothesis λT = δT, then, could be based on the
difference between these estimates. If one were comfortable with the assumption of
λT = δT, a pooled estimate of θD could be formed from the withdrawal and delayed
start components of the design (McDermott et al. 2002). Such a design would also
promote blinding. Issues regarding allocation of participants to the different treat-
ment arms of a complete two-period design, including recruitment, dropout, and
statistical efficiency, are discussed in detail elsewhere (McDermott et al. 2002).
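A small simulation can make these contrasts concrete. The sketch below (all numerical values are illustrative assumptions) generates end-of-Period-2 responses under the Table 1 model for the four arms of a complete two-period design and recovers θD, λT, and δT from the corresponding group differences.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative true parameter values for the Table 1 model
mu2, theta_D = 10.0, 1.5
lam_T, delta_T = 3.0, 3.0        # equal here, so lambda_T = delta_T holds
sd, n = 4.0, 500                 # residual SD and participants per arm

# Mean response at the end of Period 2 for each arm (Table 1)
means = {"P/P": mu2,
         "P/A": mu2 + lam_T,
         "A/P": mu2 + theta_D,
         "A/A": mu2 + theta_D + delta_T}
y = {arm: rng.normal(m, sd, n) for arm, m in means.items()}

# Withdrawal contrast: unbiased for theta_D
print("theta_D:", y["A/P"].mean() - y["P/P"].mean())
# Period 2 effects: lambda_T from P/A - P/P and delta_T from A/A - A/P
print("lambda_T:", y["P/A"].mean() - y["P/P"].mean())
print("delta_T:", y["A/A"].mean() - y["A/P"].mean())
# Delayed start contrast equals theta_D only when lambda_T = delta_T
print("delayed start:", y["A/A"].mean() - y["P/A"].mean())
```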
Fig. 3 Illustration of the results of a trial of a disease-modifying treatment using a complete two-
period design, which can be viewed as the combination of the withdrawal and delayed start designs.
In this design, participants are randomly assigned to receive either active (A) or placebo (P)
treatment in Period 1 followed by either active or placebo treatment during Period 2. The notation
“A/P” indicates the group that received active treatment in Period 1 followed by placebo treatment
in Period 2; similar notation is used for the other three groups
A slight variation on this design was used in two randomized trials of pegaptanib
sodium for the treatment of age-related macular degeneration (Mills et al. 2007). The
complete two-period design was also presented for a trial of propentofylline in AD
(Whitehouse et al. 1998), although the results of this trial do not seem to have been
published.
Eligibility Criteria
might not be feasible unless any symptomatic effect associated with the treatment is
expected to disappear rapidly. In AD, a duration of 18 months is typically used for
Period 1 (Liu-Seifert et al. 2015). In diseases that are not progressive or have no
known effective treatment, Huntington’s disease being an example of the latter,
longer period durations might be feasible.
Primary Analyses
The primary analyses for withdrawal and delayed start designs aim to address three
scientific hypotheses in support of disease modification: (1) that there is an overall
effect of the treatment during Period 1; (2) that there remains a difference between
the groups (the A/P and P/P arms in the withdrawal design or the P/A and A/A arms
in the delayed start design) in Period 2; and (3) that the group differences in mean
responses near the end of Period 2 are not continuing to decrease over time.
Several authors have advocated for a comparison of the mean responses at the end
of Period 1 between those receiving active treatment and those receiving placebo to
address the first hypothesis (Liu-Seifert et al. 2015; McDermott et al. 2002; Zhang et
al. 2011); others have suggested a comparison of average slopes during this period,
perhaps including only time points after which the symptomatic effect or any
placebo effects are thought to have fully emerged (Bhattaram et al. 2009; Xiong et
al. 2014). For example, in the ADAGIO trial, the analyses involved comparisons of
the average slopes between the rasagiline and placebo groups in Period 1, where the
slopes were based on data from Week 12 to Week 36 (Olanow et al. 2008; Olanow et
al. 2009). The rationale for this strategy seems to be that increasing separation of the
active treatment and placebo groups over time with respect to mean response would
be expected in a trial of a disease-modifying agent. Although this strategy might be
more powerful if the assumption of a linear trajectory of response over time holds, it
should only be of interest in Period 1 to determine whether or not the treatment
groups differ with regard to mean response at the end of this period and not to
speculate about the mechanism of the treatment effect; the latter is addressed in the
second hypothesis. Also, this strategy requires strong assumptions concerning the
time point after which the symptomatic effect of the treatment has fully emerged
(Week 12 in ADAGIO) and linearity of the trajectory of response over time, which
might be problematic (Holford and Nutt 2011) and ended up being a main point of
contention as rasagiline was being considered for a disease-modification claim by
the Food and Drug Administration (Li and Barlas 2017).
There is consensus in the literature concerning the key analyses to address the
second hypothesis, namely, that these should involve group comparisons of the mean
responses at the end of Period 2 (Bhattaram et al. 2009; Liu-Seifert et al. 2015;
McDermott et al. 2002; Zhang et al. 2011). The analyses for the third hypothesis
should address the issue of whether or not the group differences in mean response
near the end of Period 2 are continuing to decrease over time. Decisions are required
as to how to quantify the evolution of the group difference in mean response over
time as well as which time points to include in the analysis. In the ADAGIO trial, the
investigators followed the recommendation of Bhattaram et al. (2009) to compare
the slopes of the two groups during Period 2, assuming a linear trajectory of response
over time in each group. They used the data from Weeks 48–72 in this comparison
because it was thought that the symptomatic effect of rasagiline would appear within
12 weeks of its initiation in the delayed start group at Week 36 (Olanow et al. 2009).
Evidence for disease modification would be supported by a finding that the group
differences in mean response are not continuing to decrease over time. Therefore, it
is appropriate to formulate the hypothesis testing problem as one involving non-
inferiority. Let βP/A be the slope (Weeks 48–72) in the delayed start (P/A) group, and
let βA/A be the corresponding slope in the early start (A/A) group. The following
statistical hypotheses were specified in the ADAGIO trial:

H0: βA/A − βP/A ≥ δ versus H1: βA/A − βP/A < δ,

where δ > 0 is a pre-specified noninferiority margin for the difference in slopes, so
that rejection of H0 indicates that the group difference in mean response is not
decreasing at a rate exceeding δ. A related formulation expresses the hypotheses in
terms of the proportion of the Period 1 treatment effect that is preserved at the end
of Period 2:

H0: Δ2 ≤ 0.5Δ1 versus H1: Δ2 > 0.5Δ1,

where Δ1 and Δ2 are the group differences in mean response at the end of Period 1
and at the end of Period 2, respectively. Rejection of H0 would imply that at least
50% of the total treatment effect observed during Period 1 is preserved after Period 2.
A problem with this approach is that it does not address the issue of whether the
group differences in mean response are decreasing over time.
To adequately test the hypothesis that the group differences in mean response are
not decreasing over time, more frequent evaluations might be required in the latter
part of Period 2. The frequency of evaluations will depend on the disease and
treatment being studied. In the context of AD, Zhang et al. (2011) suggested monthly
evaluations in the final 3 months of Period 2; however, evaluations so close together
in time might not allow the slopes during this period to be estimated with sufficient
precision, and there could be problems with feasibility as well (Liu-Seifert et al.
2015).
Both Bhattaram et al. (2009) and Zhang et al. (2011) propose testing the three null
hypotheses of interest in sequence: (1) no group difference in average slopes (or
mean responses) in Period 1; (2) no group difference in mean response at the end of
Period 2; and (3) group differences in mean response are decreasing over time at a
rate that is greater than the specified noninferiority margin. Each hypothesis is tested
at a pre-specified significance level (say 5%), and one proceeds to test the next
hypothesis in the sequence if and only if the previous hypothesis is rejected. If one
takes the position, however, that all three null hypotheses would have to be rejected
in order for the treatment to be considered disease-modifying, then this would be an
example of reverse multiplicity (Offen et al. 2007) whereby the overall probability of
a false-positive result will be less than the significance level used for each of the three
tests, so correction for multiple testing would not be required. Of course, if it is
desired to make an efficacy claim about the treatment using data from Period 1 alone,
regardless of the mechanism of this effect, then an appropriate adjustment for
multiplicity would be necessary (D’Agostino Sr 2009).
Strategies for Accommodating Missing Data

Compared to standard clinical trial designs, the problem of missing data can be
exacerbated in two-period designs due to the long duration of follow-up and the fact
that the evidence concerning potential disease modification is derived from the data
acquired during Period 2. The 2010 National Research Council (NRC) report on The
Prevention and Treatment of Missing Data in Clinical Trials (National Research
Council 2010) has led to increased attention to how missing data are handled in
clinical trials. In particular, the report highlighted the shortcomings of simplistic
methods such as carrying forward the last available observation (LOCF) and so-
called complete case analyses that omit cases with missing data (Mallinckrodt et al.
2017; National Research Council 2010) and promoted the use of more principled
methods such as those based on direct likelihood, multiple imputation, and inverse
probability weighting (Molenberghs and Kenward 2007).
Most of the literature on the analysis of data from two-period designs favors the
use of so-called “mixed model repeated measures” (MMRM) analyses (Mallinckrodt
et al. 2008) that treat time as a categorical variable and use maximum likelihood to
64 Designs to Detect Disease Modification 1213
estimate model parameters (e.g., mean treatment group responses at each individual
time point) using all available data, including all observed data from participants
who prematurely withdraw from the trial (Li and Barlas 2017; Liu-Seifert et al. 2015;
Zhang et al. 2011). Linear or nonlinear mixed effects models (Molenberghs et al.
2004) that specify a functional form for the relationship between response and time
can also be used for this purpose and might be more efficient than the MMRM
strategy if the specified functional form is (approximately) correct, but this could be
a strong assumption in practice. Multiple imputation can also be a useful strategy in
this setting (Little and Yau 1996; Schafer 1997).
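Whatever imputation model is used, the pooling step of multiple imputation follows Rubin's rules: the pooled estimate is the average of the m per-imputation estimates, and its total variance adds the between-imputation variance B, inflated by (1 + 1/m), to the average within-imputation variance W. A minimal sketch with made-up estimates and variances:

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Combine results from m imputed data sets using Rubin's rules:
    pooled estimate = mean of estimates; total variance = W + (1 + 1/m)*B,
    where W is the mean within-imputation variance and B is the
    between-imputation variance of the estimates."""
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    m = len(est)
    W = var.mean()
    B = est.var(ddof=1)
    return est.mean(), W + (1 + 1 / m) * B

# Hypothetical group-difference estimates and variances from m = 5 imputations
qbar, total_var = rubin_pool([1.4, 1.6, 1.5, 1.3, 1.7],
                             [0.20, 0.22, 0.19, 0.21, 0.20])
print(qbar, total_var ** 0.5)    # pooled estimate and its standard error
```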
These methods all rely on the “missing at random” (MAR) assumption
concerning the missing data mechanism, namely, that the missingness depends
only on observed outcomes in addition to covariates, but not on unobserved out-
comes (Little and Rubin 2002). The reasonableness of this untestable assumption
depends on the clinical setting but also on the estimand of interest (International
Conference on Harmonization 2017; National Research Council 2010). The
estimand is the population quantity to be estimated in the trial and requires specifi-
cation of four elements: the target population, the outcome variable, the handling of
post-randomization (intercurrent) events, and the population-level summary for the
outcome variable. Key among these elements in the context of missing data is the
handling of intercurrent events such as discontinuation of study medication, use of
rescue medication, and use of an out-of-protocol treatment. The need for additional
treatment is particularly important for trials in PD, for which there are many
available effective treatments, but applies to AD and other conditions as well.
There are a number of options for dealing with this issue, including (1) withdrawing
the participant from the trial; (2) moving the participant directly into Period 2; and
(3) allowing the participant to receive additional treatment while continuing partic-
ipation in the trial. The second of these options only applies to participants who
require treatment in Period 1 and would not apply in the case of a withdrawal design.
The third option is consistent with strict adherence to the intention-to-treat principle
and might be sensible in a trial with a pragmatic aim, but it is not appealing in a trial
that aims to evaluate the disease-modifying effect of a treatment using a two-period
design, an aim that is explanatory or mechanistic.
In the ADAGIO trial, participants who were followed for at least 24 of the
scheduled 36 weeks in Period 1 were allowed to proceed directly into Period 2 if
judged by the enrolling investigator to require additional anti-parkinsonian medica-
tion. While this allows information to be obtained in these participants on the
mechanism of the effect of the treatment, the time scale for follow-up becomes
compressed for these participants, the implications of which are not entirely clear.
Also, if the active treatment has a beneficial effect regardless of its mechanism, the
early initiation of Period 2 might occur preferentially in those receiving placebo
during Period 1, which could complicate interpretation of the results. ADAGIO
participants who required additional treatment in Period 2 were withdrawn from the
trial at that time. Only participants who had at least one follow-up evaluation after
the start of Period 2 were included in the primary analyses of Period 2 data. Even
though participant retention in ADAGIO was quite good (Olanow et al. 2009),
Sample Size Determination
The considerations for sample size determination that are unique to two-period designs
are the specification of the effect size for disease modification (θD) to be detected at the
end of Period 2 and the noninferiority margin for the third hypothesis that the group
differences in mean response are not decreasing over time. The effect size specified for
sample size determination in ADAGIO was chosen to be 1.8 points for the UPDRS
total score (Olanow et al. 2009), which was criticized by some as not representing a
clinically important effect (Clarke 2008). This group difference, however, must be
interpreted in the proper context: it is the benefit attributable to disease modification
that would accrue over the duration of Period 1, i.e., 36 weeks. This is a very short time
period relative to the expected duration of the disease. If this effect is truly due to
disease modification, it would be expected to continue to accrue over time, possibly
over many years. The observed effect of the 1 mg/day dosage of rasagiline (1.7 points
over 36 weeks) represents a 38% reduction in the change from baseline (Olanow et al.
2009); if this truly represents disease modification, an effect of this magnitude would
arguably be of major clinical importance. In a two-period design, the choice of effect
size for sample size determination should be based on a realistic expectation of the
magnitude of a disease-modifying effect that could accrue over a follow-up period that
is brief relative to the disease course and might not be very large.
The sample size required to determine whether the group difference in mean
responses is not continuing to decrease appreciably over time near the end of Period
2 could be quite large depending on the choice for the noninferiority margin;
considerations for choosing this margin are discussed in the ADAGIO example
above. Assumptions such as the time points included in this analysis and the residual
variability around the slopes would have to be carefully considered in the calcula-
tion. Additional factors that need to be considered in the sample size calculation
include intercurrent events (e.g., participant withdrawal and noncompliance) and
misdiagnosis (if applicable). Given the complexities that these considerations intro-
duce, the technique of simulation can be highly useful in assessing the required
sample size under a variety of design assumptions.
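As an illustration of that simulation approach, the sketch below estimates the power to detect a persisting group difference θD at the end of Period 2 using a simple two-sample t-test. The effect size of 1.8 points echoes the ADAGIO discussion above; the standard deviation, the per-arm sample sizes, and the use of an unadjusted t-test are assumptions made purely for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)

def power_end_of_period2(n_per_arm, theta_D=1.8, sd=10.0,
                         nsim=2000, alpha=0.05):
    """Simulated power for detecting a group difference theta_D in
    mean response at the end of Period 2 with a two-sample t-test."""
    hits = 0
    for _ in range(nsim):
        a = rng.normal(theta_D, sd, n_per_arm)   # early start (A/A) arm
        b = rng.normal(0.0, sd, n_per_arm)       # delayed start (P/A) arm
        hits += ttest_ind(a, b).pvalue < alpha
    return hits / nsim

for n in (300, 500, 800):
    print(n, power_end_of_period2(n))   # power rises from ~0.6 to ~0.95
```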
Summary and Conclusion

There is great interest in developing interventions that can modify the course of
neurodegenerative diseases and other diseases in a meaningful way. The develop-
ment of reliable and valid methods to measure the underlying course of these
diseases is urgently needed, and this is a highly active area of research. In the
meantime, clinical trials in these conditions have to rely on rating scales, functional
measures, or other instruments to indirectly measure disease status. In this setting,
two-period designs represent a potentially attractive option to distinguish between
effects of interventions that are enduring (disease-modifying) vs. those that are short-
term/reversible (symptomatic).
Two-period designs are associated with several limitations, including uncer-
tainty regarding the required durations of the two periods; the assumption in the
delayed start design that the total (symptomatic + disease-modifying) effect of
treatment received in Period 2 is independent of whether or not the participant
received treatment during Period 1; potential difficulties with recruitment and
retention, particularly for the withdrawal design; potential compromise of
blinding; requirements of large sample sizes; the need for effective ancillary
treatments in some cases; and the problem of how to address the issue of missing
data from subjects who cease participation in the trial. Another limitation, in the
context of enrolling trial participants with relatively mild disease, is that the
outcome measure might lack sensitivity to assess disease-modifying effects, espe-
cially if there is a large symptomatic component to the effect of the intervention
(Olanow et al. 2009).
As discussed above, many of the assumptions of two-period designs cannot be
verified directly and need to be informed by knowledge of the intervention acquired
outside of the trial. Also, it will likely be difficult with a two-period design to discern
the mechanisms of interventions with a very slow onset and/or offset of a symptom-
atic effect (Holford and Nutt 2011; Ploeger and Holford 2009).
So far, the withdrawal and delayed start designs to detect disease modification
have been used mainly in the context of neurodegenerative disease. Definitive
demonstration of the disease-modifying effect of an intervention has not been
achieved to date with these designs, and the experience in the ADAGIO trial
illustrates some of the difficulties in achieving this goal. Additional experience
with these designs and the development of strategies to address their limitations will
eventually determine their usefulness in detecting the disease-modifying effects of
interventions.
Cross-References
References
Ahlskog JE, Uitti RJ (2010) Rasagiline, Parkinson neuroprotection, and delayed-start trials: still no
satisfaction? Neurology 74:1143–1148
Athauda D, Foltynie T (2016) Challenges in detecting disease modification in Parkinson’s disease
clinical trials. Parkinsonism Relat Disord 32:1–11
Bhattaram VA, Siddiqui O, Kapcala LP, Gobburu JV (2009) Endpoints and analyses to discern
disease-modifying drug effects in early Parkinson’s disease. AAPS J 11:456–464
Carpenter JR, Roger JH, Kenward MG (2013) Analysis of longitudinal trials with protocol
deviation: a framework for relevant, accessible assumptions, and inference via multiple impu-
tation. J Biopharm Stat 23:1352–1371
Clarke CE (2004) A “cure” for Parkinson’s disease: can neuroprotection be proven with current trial
designs? Mov Disord 19:491–498
Clarke CE (2008) Are delayed-start design trials to show neuroprotection in Parkinson’s disease
fundamentally flawed? Mov Disord 23:784–789
Cummings JL (2009) Defining and labeling disease-modifying treatments for Alzheimer’s disease.
Alzheimers Dement 5:406–418
Cummings J (2017) Disease modification and neuroprotection in neurodegenerative disorders.
Transl Neurodegener 6:25. https://doi.org/10.1186/s40035-017-0096-2
D’Agostino RB Jr (1998) Propensity score methods for bias reduction in the comparison of a
treatment to a non-randomized control group. Stat Med 17:2265–2281
D’Agostino RB Sr (2009) The delayed-start study design. N Engl J Med 361:1304–1306
Dorsey ER, Bloem BR (2018) The Parkinson pandemic – a call to action. JAMA Neurol 75:9–10
Emery P, Breedveld FC, Hall S, Durez P, Chang DJ, Robertson D, Singh A, Pedersen RD, Koenig
AS, Freundlich B (2008) Comparison of methotrexate monotherapy with a combination of
methotrexate and etanercept in active, early, moderate to severe rheumatoid arthritis (COMET):
a randomised, double-blind, parallel treatment trial. Lancet 372:375–382
Guimaraes P, Kieburtz K, Goetz CG, Elm JJ, Palesch YY, Huang P, Ravina B, Tanner CM, Tilley
BC (2005) Non-linearity of Parkinson’s disease progression: implications for sample size
calculations in clinical trials. Clin Trials 2:509–518
Holford N (2015) Clinical pharmacology = disease progression + drug action. Br J Clin Pharmacol
79:18–27
Holford NHG, Nutt JG (2011) Interpreting the results of Parkinson’s disease clinical trials: time for
a change. Mov Disord 26:569–577
International Conference on Harmonization (2017) ICH E9 (R1) addendum on estimands and
sensitivity analysis in clinical trials to the guideline on statistical principles for clinical trials:
Step 2b, 16 June 2017
Kaye JA (2000) Methods for discerning disease-modifying effects in Alzheimer disease treatment
trials. Arch Neurol 57:312–314
Sormani MP, Bruzzi P (2013) MRI lesions as a surrogate for relapses in multiple sclerosis: a meta-
analysis of randomised trials. Lancet Neurol 12:669–676
Tang Y (2017) An efficient multiple imputation algorithm for control-based and delta-adjusted
pattern mixture models using SAS. Statist Biopharm Res 9:116–125
The Parkinson Study Group (1989) Effect of deprenyl on the progression of disability in early
Parkinson’s disease. N Engl J Med 321:1364–1371
The Parkinson Study Group (1993) Effects of tocopherol and deprenyl on the progression of
disability in early Parkinson’s disease. N Engl J Med 328:176–183
The Parkinson Study Group (2004) Levodopa and the progression of Parkinson’s disease. N Engl J
Med 351:2498–2508
Vellas B, Andrieu S, Sampaio C, Wilcock G, the European Task Force Group (2007) Disease-
modifying trials in Alzheimer’s disease: a European task force consensus. Lancet Neurol
6:56–62
Vellas B, Andrieu S, Sampaio C, Coley N, Wilcock G, the European Task Force Group (2008)
Endpoints for trials in Alzheimer’s disease: a European task force consensus. Lancet Neurol
7:436–450
Whitehouse PJ, Kittner B, Roessner M, Rossor M, Sano M, Thal L, Winblad B (1998) Clinical trial
designs for demonstrating disease-course-altering effects in dementia. Alzheimer Dis Assoc
Disord 12:281–294
Xiong C, Luo J, Gao F, Morris JC (2014) Optimizing parameters in clinical trials with a randomized
start or withdrawal design. Comput Statist Data Anal 69:101–113
Zhang RY, Leon AC, Chuang-Stein C, Romano SJ (2011) A new proposal for randomized start
design to investigate disease-modifying therapies for Alzheimer disease. Clin Trials 8:5–14
Screening Trials
65
Philip C. Prorok
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1220
Design Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1220
Endpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1223
Sample Size Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1226
Screening Trial Design Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1227
Standard or Traditional Two Arm Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1227
Continuous Screen Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1227
Stop Screen Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1227
Split Screen or Close Out Screen Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1228
Delayed Screen Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1228
Designs Targeting More Than One Intervention and Disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1228
Analysis Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1229
Follow-Up Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1229
Evaluation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1231
Monitoring an Ongoing Screening Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1232
Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1234
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1235
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1235
Abstract
The most rigorous approach to evaluating screening interventions for the early
detection of disease is the randomized controlled trial (RCT). RCTs are major
undertakings requiring substantial resources to enroll and follow large
populations over long time periods. Consequently, it is important that such trials
be carefully conducted to ensure high quality information and scientifically valid
results. The purpose of this chapter is to discuss some of the intricacies of
P. C. Prorok (*)
Division of Cancer Prevention, National Cancer Institute, Bethesda, MD, USA
e-mail: [email protected]
Keywords
Screening · Early detection · Lead time · Length bias · Cancer
Introduction
The term clinical trial often brings to mind the concept of an investigation aimed at
testing a clinical intervention or treatment in a group of individuals. The trial
participants are patients who have been diagnosed with some disease and have
sought treatment to alleviate their condition. Such therapy trials typically involve a
few hundred to perhaps a few thousand patients, last for perhaps a few years, and
seek to improve a clinical outcome such as reduced recurrence rate or improved
survival rate. Many such trials have been performed by cooperative groups and other
organizations in various countries. In contrast, relatively few screening trials have
been conducted due to their size, cost and duration. They generally involve thou-
sands of ostensibly healthy participants followed for many years to determine if the
screening intervention reduces the disease related death rate in the screened popu-
lation. Given these contrasting features of screening trials compared to therapy trials,
and acknowledging many well-known requirements of clinical trials in general, it is
important to give careful thought to a number of key considerations in designing
screening trials.
Design Issues

Informed consent is an initial consideration in screening trial design. Both
prerandomization and postrandomization consent have been used. In postrandomization
consent, participants are chosen from nationwide or regional registration rolls, for example, and
randomly assigned to the trial arms. Those in the control arm receive their usual
medical care, and sometimes are not informed that they are in a trial. Those in the
intervention arm are asked to consent to screening after being randomized (e.g.,
Bretthauer et al. 2016). This approach has the advantage of being population based,
and the participants in the control arm are less likely to undergo the screening
procedure since they are not aware of the study. One disadvantage is that interven-
tion arm participants have to choose to be screened after they are already in the trial,
and invariably some do not, thereby reducing compliance and diluting any effect of
the screening. Further, it may be difficult to obtain information other than vital status
about control arm individuals because they have not agreed to participate. There
might also be ethical concerns about entering individuals into a study which they do
not know about.
Prerandomization consent, on the other hand, requires informed consent from all
participants before randomization into study and control arms (e.g., NLST Research
Team 2011; Prorok et al. 2000). This method may lead to greater compliance in the
screening arm and allows the collection of similar detailed information from both the
study and control arms because all participants agree to be part of the study. A
disadvantage is that it may be more difficult to recruit participants because many
may refuse randomization. There may also be substantial contamination in the
control arm because the controls are aware of the screening tests being used and
could, in theory, seek them elsewhere. This would also dilute any screening effect.
A major issue is the question of whether an available test is ready for evaluation in
a large scale randomized trial, and/or how to choose among several candidate tests.
There are no straightforward scientific answers since a standard set of criteria does
not exist. Hopefully there are preliminary data providing estimates of the key process
measures of the test: sensitivity (the probability of being test positive when disease is
present), specificity (the probability of being test negative when disease is absent),
and positive predictive value (the probability of having disease when the test is
positive). However, these data often emanate from studies involving small numbers
of individuals in a clinical setting, few of whom have preclinical disease that is the
target of a population screening program. Even when appropriate data exist, agreed-
upon threshold values for these parameters that would trigger the decision to
undertake a trial do not exist. It seems clear, however, that for population screening,
particularly for a relatively rare disease such as cancer, there is a requirement for very
high specificity (on the order of 95% or higher) because of low disease prevalence,
while sensitivity need not be so high, although a value of at least 80% is often
deemed preferable.
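As a brief illustration of why low prevalence forces high specificity, the following minimal sketch (not from the chapter; the prevalence figure is a hypothetical placeholder for a relatively rare preclinical disease) computes the positive predictive value from the two test characteristics defined above:

```python
# A minimal sketch (illustrative only) of the relationship between
# sensitivity, specificity, prevalence, and PPV via Bayes' rule.

def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value of a screening test."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

prev = 0.003  # hypothetical preclinical prevalence of 0.3%
for spec in (0.95, 0.99):
    print(f"sensitivity 0.80, specificity {spec:.2f}: "
          f"PPV = {ppv(0.80, spec, prev):.3f}")
# specificity 0.95 -> PPV ~ 0.046; specificity 0.99 -> PPV ~ 0.194
```

At this prevalence, raising specificity from 95% to 99% roughly quadruples the predictive value of a positive test, which is the arithmetic behind the requirement for very high specificity stated above.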
The issue also arises as to the number of screens or screening rounds and the
interval between screens to be used in a trial. The interval between screens is
typically chosen to be 1 or 2 years (e.g., Prorok 1995), although irregular intervals
have been used, but these may be more difficult to implement in practice in terms of
participant compliance. The number of screening rounds depends on the tradeoff
between a sufficient number to produce a statistically valid effect on the primary
outcome measure, if there is one, and the cost of adding additional rounds. Although
some cancer screening trials have involved screening for essentially the entire
follow-up period (Tabar et al. 1992), most have employed an abbreviated screening
period typically involving four or five screening rounds, with a subsequent follow-
up period devoid of screening (e.g., Miller et al. 1981; Shapiro et al. 1988). These
issues can be addressed using mathematical modeling (e.g., NLST Research Team
2011).
As an example, in the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer
screening trial, the initial choice of four annual screens, at baseline plus three annual
re-examinations, was later expanded to six annual screens for PSA testing for
prostate cancer and CA125 testing for ovarian cancer. This was a trade-off between
enough screens to produce an effect versus anticipated resources (Prorok et al. 2000).
Three or four screening rounds were sufficient in some breast cancer screening trials
(e.g., Shapiro et al. 1988; Tabar et al. 1992). The annual interval between screens
was chosen as the most frequent yet practical interval if screening is shown to be
effective. Compared to less frequent screening, an annual interval also increases the
likelihood of detection of a broad spectrum of the preclinical conditions in the
natural history of the cancers under study. A longer interval might allow some
rapidly growing lesions, which might be a source of mortality but which could be
cured if found early, to escape detection.
Another design consideration involves the relationship between study duration,
sample size, and the expected timing of any effect or achievement of a maximal
effect. Sample size and study duration are inversely related. If only these two
parameters were involved, the relationship between follow-up cost versus recruit-
ment and screening cost would determine the design. For example, if follow-up costs
were substantial compared with those of recruitment and screening, a relatively
larger population would be recruited that would be screened and followed for a
shorter period to achieve the desired statistical validity. However, the issue of the
time at which the screening effect (reduction in mortality, see below) may occur must
also be considered. For those cancer screening trials that have demonstrated an
effect, a separation in the mortality rates between the screened and control groups
has often not begun to occur until 4–5 years or more into the study (e.g., Mandel
et al. 1993; Shapiro et al. 1988). Thus, even with a very large sample size, follow-up
may have to continue for many years to observe the full effect of the screening. A
follow-up period of at least 10 years is common (Prorok and Marcus 2010).
For example, in the PLCO trial a minimum of 10 years of follow-up was initially
decided upon to allow sufficient time for any mortality reduction from screening to emerge.
Endpoints
The appropriate and most meaningful endpoint in a screening study is the clinical
event that the screening is aimed at preventing. For major chronic diseases such as
diabetes or cancer the intent of screening is to find the disease in an early phase so
that treatment can be initiated sooner, thereby preventing the most consequential
clinical outcome of such diseases, which is death (e.g., Echouffo-Tcheugui and
Prorok 2014; Prorok 1995). Particularly in cancer screening, the most valid endpoint
is the trial population cancer-specific mortality rate. This is the number of deaths
from the target cancer per unit time per unit population at risk (e.g., Prorok 1995).
The mortality rate provides a combined assessment of the impact of early detection
plus therapy. The unequivocal demonstration of reduction in the cancer mortality
rate for a population offered screening is justification for the cost of a screening
program and fulfills the implicit promise of benefit to those who elect to participate
in the program.
Careful study design and long-term follow-up of large populations are generally
required to obtain an accurate estimate of a mortality reduction. Consequently,
intermediate or surrogate outcome measures have been proposed. There are, how-
ever, critical shortcomings associated with these end points (Prorok 1995). The
shortcomings are a consequence of well-known biases that occur in screening pro-
grams: lead time bias, length bias, and overdiagnosis bias.
If an individual participates in a screening program, his or her disease may be
detected earlier than it would have been in the absence of screening. The amount of
time by which the diagnosis is advanced as a result of screening is called the lead
time. Because of the lead time, the point of diagnosis is advanced and survival as
measured from diagnosis is automatically lengthened for cases detected by screening
even if length of life is not increased. This is referred to as lead-time bias and renders
the case survival endpoint invalid (Prorok 1995).
Any observed increase in survival from time of diagnosis is, at least in part, a reflection of lead time.
For any case of disease that is screen detected, it is impossible to distinguish between
a true increase in survival time and an artificial increase due to lead time because lead
time cannot be directly observed for ethical reasons. Further, there is no universally
accepted procedure to estimate lead time or to adjust survival for lead time. Thus
case survival is not a valid measure of screening effectiveness.
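The effect is easy to reproduce numerically. The following sketch (purely illustrative; all distributions and parameter values are hypothetical) advances the time of diagnosis by a random lead time without changing the time of death:

```python
# A small simulation (not from the chapter) showing how lead time alone
# inflates survival measured from diagnosis even when the date of death
# is unchanged.  Distributions and scales are hypothetical.

import numpy as np

rng = np.random.default_rng(42)
n = 100_000
survival_from_clinical_dx = rng.exponential(scale=3.0, size=n)  # years
lead_time = rng.exponential(scale=2.0, size=n)                  # years

# Screen detection advances diagnosis but, by construction, does not
# postpone death at all.
survival_from_screen_dx = survival_from_clinical_dx + lead_time

print(f"mean survival, clinical diagnosis: "
      f"{survival_from_clinical_dx.mean():.2f} years")
print(f"mean survival, screen detection:   "
      f"{survival_from_screen_dx.mean():.2f} years")
# The ~2-year difference is pure lead-time bias: no life was prolonged.
```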
Furthermore, even if one could adjust for lead time, length bias could still
confound survival comparisons. In comparing survival of cases in two groups, for
example between two subgroups of cases detected by different screening modalities,
cases in one subgroup may have a different distribution of natural histories than the
cases in another subgroup because of a modality-dependent sampling effect. Even if
one could adjust for lead time, any remaining survival difference could simply be a
consequence of the difference in disease natural history between the two subgroups
caused by differing sampling bias. Methodology has been developed to explore the
length bias effect on survival (e.g., Kafadar and Prorok 2009), but no general
methodology exists to either estimate the magnitude of a length bias effect or to
adjust survival for length bias. Approaches to separating the effects of treatment,
lead time, and length bias in certain circumstances have been proposed (Duffy et al.
2008; Morrison 1982).
Stage of disease at diagnosis, or a related prognostic categorization, can also be
used as an early indicator of screening effect, but it can be misleading and is
unsatisfactory as a final end point. The relationship between the magnitude of a
shift in the stage distribution of cases as a result of screening and the magnitude of a
reduction in mortality is not usually known. The detection of in situ or borderline
lesions can also affect the stage distribution but should have little impact if any on
mortality. The problem is most pronounced for stage I or localized cases where lead
time and length bias can lead to slow-growing, even nonprogressive, cases being
detected in stage I in a screened arm to a greater extent than in a control arm. Some
counterpart cases in the control arm may never surface clinically. As a result, the
screened arm will contain a higher proportion of stage I cases even if screening has
no effect on mortality. Or, the magnitude of a real mortality effect could be exag-
gerated by focusing on stage of disease. Thus, a proportional stage shift in a screened
arm can be a sign of early detection, but it is insufficient evidence to conclude that
there is an improvement in disease outcome.
A related measure that can be a reasonable surrogate endpoint in some screening
circumstances is the population incidence rate of advanced-stage disease. The
overall incidence rate or the rate of early-stage disease should increase with screen-
ing, as discussed above, rendering these measures invalid as endpoints. However, if
screening reduces the rate of advanced disease, disease that has metastasized and/or
is likely to lead to death, then it is reasonable to expect that the death rate from the
disease will also be reduced. Whether this is a valid substitute for mortality must be
established in a given setting. Advanced-stage disease must first be defined, then the
relationship between advanced disease and mortality must be established in properly
designed studies. Advanced stage rate is the primary endpoint in a breast cancer
screening trial comparing digital mammography with tomosynthesis (Pisano 2018).
Some screening tests for cancer, such as tests for cervical cancer and colorectal
cancer, do detect true precursor lesions. The subsequent removal of these lesions
then prevents the cancer from ever being clinically diagnosed, and consequently the
incidence rate of the cancer is reduced. The incidence rate is a meaningful endpoint
in such circumstances, but it is important to monitor the mortality rate as well, since
it is possible that cancers that are eliminated are not a major source of cancer deaths,
and so there may not be a direct correspondence between incidence effect and
mortality effect.
Sample Size Calculation

N_C = D / [(Q_C + f Q_S) R_C Y]

where Y is the duration of the trial from entry to end of follow-up in years and R_C is
the average annual disease-specific death rate in the control arm expressed in deaths
per person per year, adjusted for healthy screenee bias.
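The definitions of D, Q_C, Q_S, and f fall on a page lost from this extract, so the sketch below simply evaluates the displayed formula; every input value is a hypothetical placeholder, not a recommended design value.

```python
# A minimal sketch that only evaluates the sample size formula above.
# D, Q_C, Q_S, and f are not defined in the surviving text; the values
# used here are hypothetical placeholders.

def control_arm_size(D, Q_C, Q_S, f, R_C, Y):
    """Evaluate N_C = D / ((Q_C + f * Q_S) * R_C * Y)."""
    return D / ((Q_C + f * Q_S) * R_C * Y)

# Hypothetical example: 200 required deaths, a control-arm death rate of
# 2 per 10,000 person-years, and a 13-year trial from entry to end of
# follow-up.
print(round(control_arm_size(D=200, Q_C=1.0, Q_S=0.0, f=1.0,
                             R_C=2e-4, Y=13)))
# -> roughly 76,923 participants in the control arm
```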
Screening Trial Design Options

Standard or Traditional Two Arm Design

Most screening trials have used a traditional or standard two arm design targeting one
disease and aimed at addressing the basic question of whether the screening interven-
tion results in a reduction in cause-specific mortality. Participants in one arm receive
the screening test for a given disease and those in the other arm serve as a control
(unscreened or usual care) (e.g., Shapiro et al. 1988; Schroder et al. 2014; Yousaf-
Khan et al. 2017). Other standard trials have addressed the effect of adding one
screening modality to another (e.g., Miller et al. 1981). A related three arm design
has been used to compare different frequencies of screening (Mandel et al. 1993).
Several variants of this standard design are now discussed (Etzioni et al. 1995).
Stop Screen Design

The Stop Screen design is similar to the Continuous Screen design, except that
screening is offered for only a limited time in the intervention arm while follow-up
continues. Both arms are followed for the mortality endpoint until the end of the trial.
This is the design of choice when it is anticipated that a long time will be required
before a reduction in mortality can be expected to emerge, and when it would be
expensive or difficult to continue the periodic screening for the entire trial period.
Examples of this design are the Health Insurance Plan (HIP) of Greater New York
Breast Cancer Screening Study (Shapiro et al. 1988), the PLCO trial (Prorok et al.
2000), and the European prostate cancer screening trial (Schroder et al. 2014). As an
illustration, the HIP trial randomized 62,000 women aged 40–64. The intervention
arm was offered four annual screens consisting of two-view mammography and
clinical breast examinations. The screens were offered at entry and for the next
3 years. Women in the control arm followed their usual medical practices. Although
screening ended after 3 years, follow-up continued to year 15. By restricting the
screening period, the Stop Screen design can result in a considerable saving in cost
and effort relative to the Continuous Screen design. Importantly, the Stop Screen
design is the only one that allows a direct assessment of overdiagnosis, provided
compliance is high and follow-up is complete. However, analysis of the Stop Screen
design can be more complex than that of the Continuous Screen design. This is
because the difference in disease-specific mortality between the two arms may be
diluted by deaths that arise from cancers that develop in the intervention arm after
screening stops. (See Analysis section below).
Split Screen or Close Out Screen Design

The Split Screen design is related to the Stop Screen design. The difference is that at the
time the last screen is offered to the intervention arm, a screen is also offered to all
participants in the control arm. The Stockholm Breast Cancer screening trial is an
example of this design (Frisell et al. 1991). Women were randomized to intervention
or control arms. The intervention was single-view mammography at an initial round
then two succeeding rounds performed 24–28 months apart. The control group was
offered a single screen, at approximately 4.5 years after study entry. One potential
advantage of this design is that comparable groups of cancer cases in the two trial arms
can theoretically be identified, which can potentially enhance the analysis (See Analysis
section). A downside is that some of the control arm cases detected by screening may
benefit, and if so, any screening benefit in the intervention arm will be diluted.
Delayed Screen Design

In the Delayed Screen design, periodic screening is offered to control arm partici-
pants starting at some time after the start of the study, then screening continues in
both arms until the end of the intervention period. The UK Breast Cancer Screening
Age Trial followed this design (Moss et al. 2015). Women in the intervention arm
were offered annual screening starting at age 39–41 and continuing to age 47–48,
then at age 50–52 all women in both arms were offered periodic screening as part of
the National Health Care Program. Thus one can assess the impact of starting
periodic screening at age 39–41 relative to waiting until age 50–52. This design is
well suited for the situation where screening is the standard of care beginning at a
certain age, and the research question centers on the marginal benefit of introducing
screening at an earlier age.
Analysis Methods
Follow-Up Analysis
Three endpoint measures are commonly compared between the randomized arms:
(1) the cumulative mortality, that is, the proportion of participants dying from the
disease of interest, (2) the mortality rate, defined as the ratio of the number of
deaths from the disease of interest to the number of person-years at risk of dying of
the disease, and (3) the survival distribution of the population using death from the
disease of interest as the endpoint, with the time of entry into the trial as the time
origin. To assess whether or not the screening intervention is of benefit, either the
difference or the ratio of the intervention and control group mortalities can be used.
The former is a measure of the absolute change in mortality due to the screening,
while the latter is a measure of the relative mortality change due to the screening.
Rate ratios, rate differences, and their confidence intervals can readily be calculated
(e.g., Ahlbom 1993).
Various statistical procedures can be used to test formally for a difference in the
mortality experience between the randomized arms. For the first measure, standard
procedures for comparing two proportions are available, such as Fisher’s exact test.
The cumulative mortality rates can be tested using Poisson methods for comparing
two groups. A test statistic is

Z = (D_C/PY_C − D_S/PY_S) / √(D_C/PY_C² + D_S/PY_S²),

where D_C is the number of deaths from the disease of interest in the control arm
through the time of analysis, D_S is the corresponding number of deaths in the
screened arm, PY_C is the number of person-years at risk of death from the disease
of interest in the control arm through the time of analysis, and PY_S is the
corresponding number of person-years in the screened arm. This statistic has an
approximately standard normal distribution. For comparing the survival distribu-
tions, nonparametric tests such as the logrank test are used. It is important to note that
these analyses involve all individuals randomized to the respective trial arms.
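A minimal sketch of this comparison follows, under the assumption that the statistic above (the usual normal approximation for contrasting two Poisson rates) is the one intended; all counts and person-years are hypothetical.

```python
# A minimal sketch of the Poisson comparison of disease-specific death
# rates shown above; inputs are hypothetical.

from math import sqrt
from statistics import NormalDist

def poisson_rate_z(D_C, PY_C, D_S, PY_S):
    """Z = (D_C/PY_C - D_S/PY_S) / sqrt(D_C/PY_C**2 + D_S/PY_S**2)."""
    return (D_C / PY_C - D_S / PY_S) / sqrt(D_C / PY_C**2 + D_S / PY_S**2)

# Hypothetical deaths and person-years in the control and screened arms.
z = poisson_rate_z(D_C=120, PY_C=400_000, D_S=90, PY_S=400_000)
p_one_sided = 1.0 - NormalDist().cdf(z)
print(f"z = {z:.2f}, one-sided p = {p_one_sided:.4f}")  # z ~ 2.07
```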
Additional approaches that have been used are Cox proportional hazards regres-
sion and Poisson regression. These methods offer the possibility of a more thorough
exploration of screening trial data. Further, with the availability of modern comput-
ing power, randomization tests are an option that should be considered since
these avoid the assumptions required for other procedures (see ▶ Chap. 94,
“Randomization and Permutation Tests”).
Related testing and modeling techniques have been suggested to address the
problem of the optimal timing of a screening trial analysis relative to the appearance
of an effect (e.g., Baker et al. 2002). For several cancer screening trials that have
reported a benefit, a pattern was exhibited where the endpoint rates in the two arms
were roughly equivalent for some initial period after the start of the trial, after
which they separated gradually leading to a statistically significant difference (e.g.,
Shapiro et al. 1988; Schroder et al. 2014; Tabar et al. 1992). This implies that the
proportional hazards assumption often invoked in survival analysis does not hold
and other methods of analysis are required. One possibility would be a method that
in a sense ignores the period where there is no difference in the rates and uses only
data from the period where there is a difference. However, such an approach must
account for multiplicity in the choice of the time point when separation of the rates
begins, and must be done with appropriate statistical methods to obtain the correct
variance of the test statistic (Prorok 1995).
Evaluation Analysis
The follow-up analysis is generally the preferred choice, but the method is subject to
bias in the relative effect of the screening if the effect is diluted (described below)
during follow-up. Evaluation analysis is an attempt to adjust for this.
There are many screening trials in which the intervention arm is offered screening
for a limited time only, with the follow-up continuing thereafter to the end of the
study (e.g., see Stop Screen design above). During the period of follow-up after
screening ceases, those in the intervention arm, as is the case for those in the control
arm throughout the study, follow their usual medical care practices. If the post
screening follow-up period is lengthy, the mortality comparison will be subject to
error relative to a study in which screening continues.
The primary problem is that there can be a dilution of the effect in that the
mortality in both arms will become more alike as time from the end of screening
increases. The dilution can occur when some of those dying of the disease are
individuals whose disease was diagnosed during the post screening period. For
such deaths in the intervention arm, it is unlikely that screening could have any
beneficial impact on their mortality. Hence, their inclusion in the analysis dilutes the
screening effect. However, in the control arm, some cases may correspond to cases in
the intervention arm that were screen-detected and that did benefit from the screen-
ing. If, hypothetically, deaths among these control arm cases of the disease were to
be excluded from the analysis, the screening effect is diluted in that the control arm’s
mortality will be underestimated. Thus, deaths from the disease of interest that occur
among cases diagnosed after screening stops, incorrectly included or excluded, can
result in the observed mortalities of the two randomized arms appearing to be more
similar or dissimilar than they should. This can lead to erroneous conclusions about
the effectiveness of the screening program.
An approach to countering this problem is evaluation analysis (Nystrom et al.
1993). This applies to the Split Screen design. Recall in this design participants in the
control arm are screened once at the time of the last screen in the intervention arm.
The evaluation analysis then includes deaths that occur from randomization through
the end of follow-up, but only those deaths from the target disease that occur among
cases diagnosed from the time of randomization through and including the last
screen, in each arm. If the sensitivity of the screening test is very high, this can
create two groups of cases, one in each arm, that are comparable in terms of their
natural history distributions, and hence their expected mortality outcomes in the
absence of screening. Thus, analysis of the deaths confined only to those arising
from the comparable case groups can theoretically provide an unbiased analysis and
eliminate the dilution. A concern, however, is that most screening tests do not
possess very high sensitivity. Further, it is crucial that the control arm screen be
done exactly at the same time as the last screen in the intervention arm, a circum-
stance unlikely to arise in practice. Otherwise, the case groups will likely not be
comparable and the inference about a mortality effect can be biased (Berry 1998).
In some circumstances comparable case groups can arise naturally. This can
happen in a Stop Screen design when the number of cases in the control arm “catches
up” to that in the screened arm at some point during follow-up after screening stops.
Cases in the comparable groups up to the “catch up” point are then the source of
deaths for the mortality analysis. Deaths among cases diagnosed after this point are
excluded thereby mitigating dilution. This situation occurred in the HIP trial, where
at about 5 or 6 years after randomization the cumulative numbers of breast cancer
cases were very similar in the two arms (Shapiro et al. 1988). The mortality measures
and statistical methods used in the follow-up analysis, appropriately modified, can be
used for this analysis (Prorok 1995). However, successfully determining the appro-
priate “catch up” point to identify case groups for this analysis can be problematic.
Of additional concern is that with modern screening tests, there is the likelihood of
overdiagnosis, so that the control arm will never catch up to the screened arm.
Monitoring an Ongoing Screening Trial

1. Population Characteristics
The demographic, socioeconomic, and risk characteristics of the study partici-
pants, possibly including dietary and occupational histories. These data are useful
for describing the study population and assessing the comparability of the
screened and control arms and may be used in statistical adjustment procedures.
2. Coverage and Compliance
Determination of the proportion offered screening who actually undergo the
initial screening. This can inform the acceptability of the screening procedures
and indicate whether the level is consistent with that assumed in the trial design.
Compliance with each scheduled repeat screen should also be recorded.
3. Test Yield in the Screened Arm
The number of cases found at each screen should be recorded and related to the
interval cases not discovered by screening. This is important for gauging how
successful the screening test is in finding the disease.
4. Contamination
The amount of screening in the control arm outside the trial protocol should be
assessed. This is crucial for ascertaining the potential level of dilution of any
intervention effect. Ideally this would be ascertained at the individual level, but
sometimes sampling of the controls is used. Approaches aimed at minimizing
contamination include cluster rather than individual randomization and post
randomization consent, and methods exist to adjust for contamination in the
analysis (Baker et al. 2002; Cuzick et al. 1997).
Conclusion
As in other areas of research, much has been learned over time from completed and
ongoing screening trials. This chapter is an attempt to convey some of this knowl-
edge. Hopefully this will lead to improved trial design and analysis in the future.
Some additional insights are the following:
1. New screening tests can rapidly become widely used, especially in the U.S., often
without valid scientific evidence of benefit or proper assessment of harm. It is
therefore important to undertake rigorous trials as soon as possible when a new
test becomes available to take advantage of a window of opportunity, before
widespread use precludes establishment of a proper control;
2. Over-diagnosis has been indicated repeatedly, particularly in cancer screening.
This should be expected and accounted for in study design, analysis, and
interpretation;
3. A pilot phase prior to or at the beginning of a trial can be extremely valuable for
testing operational components and evaluating study centers. Although not
discussed in this chapter, pilot studies have been instrumental in several cancer
screening trials (e.g., NLST Research Team 2011; Prorok et al. 2000);
4. Quality assurance of all trial operations is crucial.
As has been stated previously, a screening trial is a major endeavor requiring a long-
term commitment by participants, investigators and funding organizations. If a decision
is made to do such a trial, the necessary resources must be provided for the full study
duration. To accomplish this in the usual climate of resource competition and peer
review can be difficult. One strategy is to commit fully and sequentially to the primary
phases of such a trial, i.e., pilot, recruitment, screening, and follow-up, with funding for
each successive phase contingent on successful completion of the previous phase.
What is clear, however, is that such trials have been successfully conducted, and
that screening interventions for chronic diseases can and should be evaluated
rigorously.
Cross-References
References
Ahlbom A (1993) Biostatistics for epidemiologists. Lewis Publishers, Boca Raton, pp 61–66
Andriole GL et al (2012) Prostate cancer screening in the randomized prostate, lung, colorectal and
ovarian cancer screening trial: mortality results after 13 years of follow-up. J Natl Cancer Inst
104:125–132
Baker SG et al (2002) Statistical issues in randomized trials of cancer screening. BMC Med Res
Methodol 2:11. (19 September 2002)
Berry DA (1998) Benefits and risks of screening mammography for women in their forties: a
statistical appraisal. J Natl Cancer Inst 90:1431–1439
Bretthauer M et al (2016) Population-based colonoscopy screening for colorectal cancer: a ran-
domized trial. JAMA Intern Med 176:894–902
Cuzick J et al (1997) Adjusting for non-compliance and contamination in randomized clinical trials.
Stat Med 16:1017–1029
Duffy SW et al (2008) Correcting for lead time and length bias in estimating the effect of screen
detection on cancer survival. Am J Epidemiol 168:98–104
Echouffo-Tcheugui JB, Prorok PC (2014) Considerations in the design of randomized trials to
screen for type 2 diabetes. Clin Trials 11:284–291
Etzioni RD et al (1995) Design and analysis of cancer screening trials. Stat Meth Med Res 4:3–17
Freedman LS, Green SB (1990) Statistical designs for investigating several interventions in the
same study: methods for cancer prevention trials. J Natl Cancer Inst 82:910–914
Frisell J et al (1991) Randomized study of mammography screening – preliminary report on
mortality in the Stockholm trial. Breast Cancer Res Treat 18:49–56
Kafadar K, Prorok PC (2009) Effect of length biased sampling of unobserved sojourn times on the
survival distribution when disease is screen detected. Stat Med 28(16):2116–2146
Mandel JS et al (1993) Reducing mortality from colorectal cancer by screening for fecal occult
blood. N Engl J Med 328:1365–1371
Miller AB et al (1981) The national study of breast cancer screening. Clin Invest Med 4:227–258
Morrison AS (1982) The effects of early treatment, lead time and length bias on the mortality
experienced by cases detected by screening. Int J Epidemiol 11:261–267
Moss SM et al (2015) Effect of mammographic screening from age 40 years on breast cancer
mortality in the UK age trial at 17 years follow-up: a randomized controlled trial. Lancet Oncol
16:1123–1132
NLST Research Team (2011) The national lung screening trial: overview and study design.
Radiology 258:243–253
Nystrom L et al (1993) Breast cancer screening with mammography: overview of Swedish
randomized trials. Lancet 341:973–978
Pinsky PF et al (2007) Evidence of a healthy volunteer effect in the prostate, lung, colorectal and
ovarian cancer screening trial. Am J Epidemiol 165:874–881
Pisano ED (2018) Is tomosynthesis the future of breast cancer screening? Radiology 287:47–48
Prorok PC (1995) Screening studies. In: Greenwald P, Kramer BS, Weed DL (eds) Cancer
prevention and control. Marcel Dekker, New York, pp 225–242
Prorok PC, Marcus PM (2010) Cancer screening trials: nuts and bolts. Semin Oncol 37:216–223
Prorok PC et al (2000) Design of the prostate, lung, colorectal and ovarian (PLCO) cancer screening
trial. Control Clin Trials 21(6 Suppl):273S–309S
Proschan MA et al (2006) Statistical monitoring of clinical trials. Springer, New York
Schroder FH et al (2014) Screening and prostate cancer mortality: results of the european random-
ized study of screening for prostate cancer (ERSPC) at 13 years follow-up. Lancet
384:2027–2035
Shapiro S et al (1988) Periodic screening for breast cancer: the health insurance plan project and its
sequelae, 1963–1986. The Johns Hopkins University Press, Baltimore
Tabar L et al (1992) Update of the Swedish two-county program of mammographic screening for
breast cancer. Radiol Clin N Am 30:187–210
Welch HG et al (2016) Breast cancer tumor size, overdiagnosis, and mammography screening
effectiveness. N Engl J Med 375:1438–1448
Yousaf-Khan U et al (2017) Final screening round of the NELSON lung cancer screening trial: the
effect of a 2.5 year screening interval. Thorax 72:48–56
Biosimilar Drug Development
66
Johanna Mielke and Byron Jones
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1238
The Stepwise Approach to Biosimilarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1240
Testing for Equivalence in Biosimilar Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1243
Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1245
Step 1: Analytical Similarity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1245
Step 2: Nonclinical Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1245
Step 3: Clinical Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1246
Selected Challenges in Biosimilar Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1247
The Choice of Equivalence Margins in Efficacy Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1247
Interchangeability of Biosimilars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1249
Incorporating Additional Data in Clinical Efficacy Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1251
Operational Challenges in Biosimilar Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1254
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1255
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1256
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1256
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1257
Abstract
Biologics are innovative, complex large molecule drugs that have brought life-
changing improvements to patients in various disease areas like cancer, diabetes,
or psoriasis. Biosimilars are copies of innovative biologics. Their development is
currently a focus of attention because the patents of several important biologics
have expired, making it possible for competing companies to produce their own
biosimilar version of the drug. Although, at first sight, there seems to be some
similarity with the development of generics, which are copies of simple small
molecule drugs, there is an important distinction because of the complexity and
Keywords
Follow-on biologics · Equivalence testing · Totality of the evidence ·
Biosimilarity · Extrapolation · Biologics · Comparability · Analytics ·
Switchability · Historical information
Introduction
Biologics (or large molecule drugs) have revolutionized the treatment of various
diseases and dramatically improved the life of many patients. However, they suffer
from the disadvantage that their costs are very high: it is estimated that the costs of
treatment with biologics are 22 times higher than those of treatment with a
nonbiological drug (Health Affairs Health Policy Brief 2013). That is why the
question of whether biologics should be used as first-line treatment is still
controversial in many disease areas (e.g., see Finckh et al. (2009) for a discussion
in rheumatoid arthritis) and the access of patients to these life-changing products
is often limited.
Previous experience with (small-molecule) nonbiological drugs showed that the
introduction of generics, that is, copies of the originator small-molecule drug,
substantially lowered drug prices and thus improved the access of patients to these
products. Generics are usually developed and produced by a competing company
and can be marketed after the patent of the original drug has expired. The analogues of
generics for biologics are the so-called biosimilars (also known as follow-on bio-
logics). These medical products are developed and approved as copies of already
marketed biologics.
However, while the concepts of generics and biosimilars are comparable, it is
important to note that small molecule drugs and biologics differ substantially
(Crommelin et al. 2005). While small molecules tend to have a well-defined and
stable chemical structure which can be easily identified, biologics are more complex
proteins with heterogeneous structures. In addition, small molecule drugs are chem-
ically engineered, but biologics are grown in living cells: this makes the manufacture
of biologics extremely sensitive to environmental changes (e.g., a small change of
temperature in the manufacturing site might influence the therapeutic effect of the
product). The high complexity of the molecule and the sensitive manufacturing
process makes it, even for the manufacturer of the originator, impossible to produce
an exact copy. That is why, in contrast to generics, which are chemically identical to
the original small molecule drug, biosimilars are only expected to be similar to the
originator product. The high complexity and the difficult characterization of the
molecules combined with the fact that biosimilars are only similar, but not identical
to the originator product, lead to a higher uncertainty if the therapeutic effect of the
case study that illustrates the stepwise approach using the development program of
the biosimilar Zarxio. Then, in section “Selected Challenges in Biosimilar Devel-
opment” selected challenges related to the design and analysis of biosimilar clinical
trials are discussed. Conclusions are presented in section “Summary and
Conclusion.”
[Figure: the stepwise approach to biosimilarity. Step 1: analytical studies; Step 2: nonclinical studies; Step 3: PK/PD and therapeutic equivalence studies, assessed together as the totality of the evidence]
molecule drugs. In fact, several assessments are performed in order to check different
characteristics of the molecule (so-called quality attributes) and it is assumed that if
all of these assessments indicate equivalence, then the molecules themselves are also
sufficiently similar. The structural characteristics, for example, the identity of the
primary sequence of amino acids, are analyzed with techniques like peptide mapping
or mass spectrometry. Also the biological activity needs to be comparable and, for
that, often bioassays are used which can assess comparability in terms of binding and
functionality (Schiestl et al. 2014). The way statistics can support the comparability
claims in analytical studies is still highly controversial: the FDA published (and
subsequently withdrew) a draft guideline (FDA 2018) that discussed the value of
statistics for the establishing of comparability. The FDA suggested using a risk-
based approach where the type of statistical methodology depends on the criticality
of the quality attribute. That is, for an attribute which is assumed to be strongly
related to the clinical outcome (e.g., the results of a bioassay which is assessing the
binding of the molecule to a target which is imitating the mechanism of action),
stricter criteria for comparability are applied compared to a quality attribute which is
assumed not to be critical for the therapeutic effect.
After all analytical studies are conducted, the FDA recommends classifying the
obtained comparability into one of four categories (FDA 2016): (1) insufficient
analytical similarity, (2) analytical similarity with residual uncertainty, (3) tentative
analytical similarity, and (4) fingerprint-like analytical similarity. In categories (1)
and (2), the sponsor needs to conduct additional analytical studies and/or to adjust
the manufacturing process. Categories (3) and (4) allow a sponsor to proceed to the
next step of biosimilar development. Dependent on the amount of residual uncer-
tainty, selective animal and clinical studies might be sufficient. Therefore, providing
a higher level of evidence (e.g., fingerprint-like analytical similarity instead of
tentative analytical similarity) might reduce the number of required studies in the
following steps. On the other hand, demonstrating fingerprint-like similarity might
be challenging or even impossible in some cases.
In Step 2, studies in animals are conducted. The main aim of the animal studies is
to establish the toxicology profile of the proposed biosimilar. In some cases, also the
PK and PD profiles in animals of the biosimilar are compared to the originator.
However, it is clearly emphasized that the inclusion of animal PK and PD studies
does not obviate the need for clinical studies in humans. If there is no
relevant animal species, additional in vitro studies might be appropriate, for exam-
ple, with human cells. The extent of the required studies highly depends on the
success of the analytical studies which were performed in Step 1. This is stated in the
regulatory document issued by the FDA (2015b): “If comparative structural and
functional data using the proposed product provide strong support for analytical
similarity to a reference [originator] product, then limited animal toxicity data may
be sufficient to support initial clinical use of the proposed product.”
After Step 2 is completed successfully, the proposed biosimilar is used for the first
time in humans (Step 3). The “FDA expects a sponsor to conduct comparative
human PK and PD studies (if there is a relevant PD measure(s)) and a clinical
immunogenicity assessment” and, if this evidence is not sufficient for removing
residual uncertainties, also comparative clinical trials are required (FDA 2015b). The
aim of PK and PD equivalence studies is to confirm comparable exposure (PK) of the
proposed biosimilar and the originator and, if possible, to show that the way the drug
affects the body is sufficiently similar (PD). The results of PD studies can only be
considered an important piece of evidence if there exists a well-established PD
marker which can serve as a surrogate for the clinical outcome. In these cases, the
PK/PD studies may be seen as a more sensitive step for detecting potential differ-
ences between the biosimilar and the originator than clinical comparability studies.
For example, for the approval of Zarxio (Sandoz) the pharmacology studies reduced
the need for clinical comparability studies (Holzmann et al. 2016, see for details also
section “Case Study”).
The design and analysis of PK/PD studies are comparable to the studies which are
conducted for the showing of bioequivalence of generics. The preferred study design
(FDA 2016) for products with a short half-life (the time until half of the drug is
eliminated from the body) is a crossover design. Often two-period, two-treatment
crossover designs are used where subjects first take the biosimilar and then the
originator or vice versa. These studies have the advantage that each subject acts as
his or her own control, which reduces the variability and allows for smaller sample
sizes (Jones and Kenward 2014). In the case of a long half-life, parallel groups
designs are also acceptable (FDA 2016). The study population should consist of
healthy volunteers, if possible. This is expected to reduce the variability since
patients often have confounding factors (e.g., comorbidity). However, if this is not
feasible due to ethical reasons (e.g., known toxicology) or if a PD marker can only be
assessed in patients (e.g., in diabetes), then patients are preferred.
The analysis of PK/PD data is, compared to the other steps in biosimilar devel-
opment, standardized and leaves only a small degree of flexibility for the sponsor: a
response of the drug in the blood over time is measured after the drug is injected. For
PK analysis, the response of interest is the concentration of the drug in the blood, for
PD, it might be a well-established PD marker. Measures like the area under the
response vs. time curve (AUC) and the maximum response over time (Cmax) are
reported for each subject. The aim is to show that the ratio of the mean values of the
originator product and the proposed biosimilar for each of these measures as a
percentage lies within 80% and 125% with a prespecified confidence level 1 − α,
where commonly α = 0.1 is used. This confidence level corresponds to a one-sided
significance level of 5%, which is typically used for testing for superiority.
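A minimal sketch of this check follows, assuming a parallel-groups PK study with normally distributed log(AUC); the data, sample sizes, and parameter values are hypothetical, and the 90% confidence interval for the geometric mean ratio is compared against the 80–125% limits described above.

```python
# A minimal sketch (illustrative, not a regulatory analysis) of the PK
# equivalence check on AUC; all inputs are hypothetical.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
log_auc_B = rng.normal(loc=5.00, scale=0.25, size=40)  # biosimilar
log_auc_O = rng.normal(loc=5.02, scale=0.25, size=40)  # originator

diff = log_auc_B.mean() - log_auc_O.mean()
se = np.sqrt(log_auc_B.var(ddof=1) / 40 + log_auc_O.var(ddof=1) / 40)
t_crit = stats.t.ppf(0.95, df=78)  # 90% CI, i.e., alpha = 0.1 as above

lo, hi = np.exp(diff - t_crit * se), np.exp(diff + t_crit * se)
verdict = "within 80-125%" if lo >= 0.80 and hi <= 1.25 else "outside 80-125%"
print(f"geometric mean ratio 90% CI: [{100*lo:.1f}%, {100*hi:.1f}%] -> {verdict}")
```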
An assessment of immunogenicity (i.e., the potential to induce an immune
response such as anaphylaxis) is required since, with the current understand-
ing of highly complex molecules, it is not possible to reliably predict the immuno-
genicity purely based on analytical studies. Since immune responses might influence
the treatment effect and the safety profile, it is important to confirm similar immu-
nogenicity. Immunogenicity is mostly assessed as part of the clinical studies (Christl
et al. 2017) and the amount and type of immunogenicity assessment depends on the
active substance.
If the comparability of the products at the PK/PD level has been established, but
there still exist residual uncertainties, then clinical comparability studies in patients
are conducted. These studies should be carefully chosen to target the residual uncer-
tainty. The selection of endpoints, study population, study duration, and study design
need to be scientifically justified. In general, the approach should be selected which is
expected to be most sensitive to detect potential differences between the proposed
biosimilar and the originator (FDA 2015b). Therapeutic equivalence is typically
assessed at one chosen point in time using an equivalence testing approach (see
section “Testing for Equivalence in Biosimilar Trials”), that is, it is confirmed that
the characteristic of interest of the treatment response, for example, the mean value of a
chosen endpoint, after taking the biosimilar is neither smaller nor larger than after
taking the originator. A noninferiority-type test, that is, showing that a chosen
characteristic under treatment with the biosimilar is not larger (or smaller, respectively)
than under treatment with the originator, might be acceptable in specific cases. For
example, one might consider a noninferiority design if a higher response can be ruled
out due to scientific reasons (e.g., saturation of the target with a specific dose, see
Schoergenhofer et al. (2018)). In terms of the study design, mostly parallel groups
designs are conducted which often are combined with an extension period in which the
effect of a single switch from the originator to the biosimilar is studied and the safety
and immunogenicity profile is compared between the switching and nonswitching
group. This type of assessment is explicitly required in the respective guideline (FDA
2015b). Safety and immunogenicity are usually assessed descriptively.
Taking all results into account, by referring to the concept of “totality of the
evidence” (see section “Introduction”), regulators make a decision as to whether
biosimilarity is established or not. Consequently, this means that the failing of one
analysis does not necessarily lead to the failing of the biosimilar development
program, as long as a
scientific justification is provided (for an example of approval with failed “compo-
nents of evidence,” see Mielke et al. 2016). It is important to note that the overall
assessment of biosimilarity is made by appealing to scientific judgment and not by a
quantitative decision-making approach. This makes the decision whether to approve a biosimilar a subjective one. In the recent past, some strategies were
published on how to formalize the decision making process (e.g., the biosimilarity
index by Hsieh et al. 2013), but these suggestions have not yet made it into practice.
Information regarding the provision of clinical evidence for biosimilar approval
in practice can be found in Hung et al. (2017) for the USA, in Mielke et al. (2016,
2018a) for the European Union and in Arato (2016) for Japan.
Testing for Equivalence in Biosimilar Trials

At each stage of development, essentially the same type of hypothesis can be assessed. More formally, let τB be a characteristic
of interest of the biosimilar (e.g., the mean value of log(AUC)) and τO be the same
characteristic of interest of the originator. Then, the aim is to test the hypotheses (Wellek 2010):

$$H_0: |\tau_B - \tau_O| \ge \Delta \quad \text{vs.} \quad H_1: |\tau_B - \tau_O| < \Delta,$$

where Δ is a positive value called the equivalence margin. The choice of the
equivalence margin is discussed in more detail in section “The Choice of Equiva-
lence Margins in Efficacy Trials” and for the time being, it is assumed that an
equivalence margin Δ is provided.
There exist two common ways to test the above-mentioned hypotheses. First, one can split the equivalence hypothesis into two one-sided hypotheses. This approach is commonly known as the two one-sided tests (TOST) approach (Schuirmann 1987). The two sets of hypotheses are given by:
$$H_0^{(1)}: \tau_B - \tau_O \le -\Delta \quad \text{vs.} \quad H_1^{(1)}: \tau_B - \tau_O > -\Delta,$$
$$H_0^{(2)}: \tau_B - \tau_O \ge \Delta \quad \text{vs.} \quad H_1^{(2)}: \tau_B - \tau_O < \Delta.$$
If both $H_0^{(1)}$ and $H_0^{(2)}$ are rejected, the overarching hypothesis $H_0$ is also rejected and equivalence can be claimed. In the following, the test statistics and decision rules for the hypotheses $H_0^{(1)}$ and $H_0^{(2)}$ are illustrated using the example of a normally
distributed endpoint. For that, let τB be the expected value of the biosimilar and τO be the expected value of the originator. The standard deviation of the originator is denoted by σO, and that of the biosimilar by σB. We assume that both standard deviations are equal, that is, σB = σO. In addition, we assume a parallel groups design with n subjects per group. Let ȳO and ȳB be the observed mean values of the originator and the biosimilar, respectively, and let the estimated standard deviations be denoted by σ̂O and σ̂B. The corresponding test statistics are then

$$Z_1 = \frac{(\bar{y}_B - \bar{y}_O) + \Delta}{\sqrt{\hat{\sigma}_B^2/n + \hat{\sigma}_O^2/n}} \qquad \text{and} \qquad Z_2 = \frac{(\bar{y}_O - \bar{y}_B) + \Delta}{\sqrt{\hat{\sigma}_B^2/n + \hat{\sigma}_O^2/n}}.$$
Under the respective null hypotheses, both test statistics follow a t-distribution with 2n − 2 degrees of freedom. Therefore, the overall null hypothesis is rejected if both realizations are larger than the (1 − α)-quantile of a t-distribution with 2n − 2 degrees of freedom. A typical choice for the significance level is α = 0.05.
The second strategy is based on a confidence interval approach: a (1 − 2α)-confidence interval for the difference of the mean values is calculated. If this confidence interval lies entirely within

$$[-\Delta, \Delta],$$

equivalence can be claimed; this decision rule leads to the same conclusion as the TOST procedure.
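A minimal sketch of both strategies for a normally distributed endpoint follows (Python; the data, group size, and margin are hypothetical illustrations, not taken from any submission). By construction, the TOST decision and the (1 − 2α) confidence interval decision coincide:

```python
import numpy as np
from scipy import stats

def tost_equivalence(y_bio, y_orig, delta, alpha=0.05):
    """TOST and (1 - 2*alpha) CI check for equivalence of two means
    (parallel groups with equal group size n, as in the text)."""
    n = len(y_bio)  # assumes both groups have n subjects
    diff = y_bio.mean() - y_orig.mean()
    se = np.sqrt(y_bio.var(ddof=1) / n + y_orig.var(ddof=1) / n)
    df = 2 * n - 2
    t_crit = stats.t.ppf(1 - alpha, df)

    # TOST: reject H0^(1) and H0^(2) if both statistics exceed the t-quantile
    z1 = (diff + delta) / se
    z2 = (-diff + delta) / se
    tost_reject = (z1 > t_crit) and (z2 > t_crit)

    # Equivalent decision: the (1 - 2*alpha) CI must lie within [-delta, delta]
    ci = (diff - t_crit * se, diff + t_crit * se)
    ci_reject = (-delta < ci[0]) and (ci[1] < delta)
    return tost_reject, ci_reject, ci

rng = np.random.default_rng(7)
y_bio = rng.normal(10.0, 2.0, size=50)
y_orig = rng.normal(10.1, 2.0, size=50)
print(tost_equivalence(y_bio, y_orig, delta=1.0))
```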
Case Study
In 2015, Sandoz Inc., a Novartis Company, gained approval from the FDA to market
Zarxio, its biosimilar version of the reference biologic, Neupogen (active substance:
filgrastim). Neupogen is used to treat neutropenia (an abnormally low level of
neutrophils in the blood) which can occur, for example, in cancer patients undergo-
ing chemotherapy. This case study gives a brief description of some of the informa-
tion presented by Sandoz Inc. in its submission to the FDA to gain approval to
market Zarxio. This information is publicly available online (FDA 2015a) and the
summary given below is based directly on that text. The following subsections
briefly describe the contributions to the steps that were illustrated in Fig. 1 and
described in detail in section “The Stepwise Approach to Biosimilarity.”
Analytical similarity was assessed using multiple quality attributes, and some of
them are listed in Table 1. Overall, it was concluded that the proposed biosimilar is
“highly similar” to Neupogen.
EP2006 was compared to Neupogen in animal studies for assessing the pharmacodynamics (PD), toxicity, toxicokinetics, and local tolerance of the products. Of these studies, one was a single-dose tolerance study in rabbits and the other was a 28-day multiple, repeat-dose toxicology study in rats. According to Bewesdorff (2016), these two studies used, respectively, a single group of 24 rabbits and two groups of 60 rats.

Table 1 Quality attributes and methods used to evaluate analytical similarity of EP2006 and US-Licensed Neupogen (partial list, for illustration)

Quality attribute    Method
Primary structure    N-terminal sequencing
                     Peptide mapping with ultraviolet (UV) and mass spectrometry detection
                     Protein molecular mass by electrospray ionization mass spectrometry (ESI-MS)
                     Protein molecular mass by matrix-assisted laser desorption ionization mass spectrometry (MALDI-TOF MS)
                     DNA sequencing of the EP2006 construct cassette
                     Peptide mapping coupled with tandem mass spectrometry (MS/MS)
Bioactivity          Proliferation of murine myelogenous leukemia cells (NFS-60 cell line)
Receptor binding     Surface plasmon resonance
Protein content      RP-HPLC
To assess PK and PD, four studies were reported, labeled as EP06-109, EP06-103,
EP06-105, and EP06-101. Each of these was a 2 × 2 cross-over trial and involved
between 24 and 32 healthy subjects. For the PK assessment, the usual PK parame-
ters, for example, AUC and Cmax, were used and for the PD assessment, the
endpoints were the absolute neutrophil count (ANC) and the increase in CD34+
cell count.
For establishing equivalence of the PK profiles, equivalence was tested using the
usual criteria that the 90% confidence interval for the geometric ratios of the AUC
and Cmax parameters should lie within (80%, 125%), except for study EP06-101
which used the wider margin of (75%, 133%) for Cmax.
In the PD studies, equivalence was assessed using the criterion that the 95%
confidence interval for the ratio of geometric means for AUEC (area under the effect
curve over time) and the maximum ANC should lie within the (80%, 125%) interval,
except for study EP06-103 where the interval was (87.25%, 114.61%) for the
2.5 mcg/kg dose and (86.5%, 115.61%) for the 5 mcg/kg dose. According to the
publicly available information, there were no predefined equivalence criteria for CD34+, and 95% and 90% confidence intervals for the ratio of the parameters (AUEC and maximum CD34+ count) were reported. It is notable that, in general,
the choice of equivalence margins, especially for the PD parameters, is not fixed, but
may depend on the chosen endpoint. This is taken up in subsection “The Choice of
Equivalence Margins in Efficacy Trials” where the choice of margins is discussed.
For the clinical assessment of efficacy and safety, data from two trials were used:
EP06-301 and EP06-302. The latter trial was a double-blind parallel groups trial in
women with histologically proven breast cancer and the treatments were adminis-
tered over six cycles of chemotherapy. The study had four arms: (1) EP2006 (E) given
repeatedly for all cycles, (2) Neupogen (N) given repeatedly for all cycles, (3) E and
N were alternated over the cycles in the order (N,E,N,E,N,E), and (4) E and N were
alternated over the cycles in the order (E,N,E,N,E,N). This design was planned not only to assess similarity but also interchangeability. The concept of interchangeability
is discussed in subsection “Interchangeability of Biosimilars.” The endpoint in this
study was the duration of severe neutropenia. Study EP06-301 was a non-
comparative single arm study in which patients with breast cancer were treated
with chemotherapy and then one day later were given daily EP2006 until neutrophil
recovery.
In January 2015, the expert panel reviewing the Sandoz Inc. application unani-
mously recommended its approval and in March 2015, the FDA gave approval for
the biosimilar to be marketed for all five of the indications approved for Neupogen.
The Choice of Equivalence Margins in Efficacy Trials

The appropriate equivalence margin depends on the active substance, the indication, and the chosen endpoint. That is why a case-by-case decision has to be made for each active substance and endpoint, and the margins are typically determined by negotiation with the regulatory agencies. In the follow-
ing, we discuss experiences with the choice of equivalence margins in applications to
the European Medicines Agency (EMA) instead of experience with the FDA since
the EMA has already approved more than 40 biosimilars and therefore allows for a
broader overview of current practice.
In its guideline (CHMP 2014a), the EMA states only that “comparability margins should be prespecified and justified on both statistical and clinical grounds by using the data of the reference [original] product” and refers to a related guideline on the choice of non-inferiority margins (CHMP 2005). The EMA aims to be transparent in its decision making, which is why it publishes so-called
European public assessment reports (EPARs) which give detailed information on the
provided evidence for approved biosimilars and are accessible by the general public.
Thus, it is possible to analyze the choice of margins in practice where, indeed, the
equivalence margins for the clinical comparability studies were usually prespecified
and only in a few cases were post hoc decisions made (Mielke et al. 2018a).
However, the derivation of the margins was reported in the EPARs in enough detail to be reproducible in only a few cases. In addition, there does not seem to be a standardized strategy for
the determination of equivalence margins. This is illustrated by using the regulatory
applications for Benepali (active substance: etanercept) and Rixathon (active sub-
stance: rituximab).
For Benepali, the chosen endpoint was the ACR20 responder rate (CHMP 2016): a subject is classified as an ACR20 responder if the relative improvement in percentage according to the American College of Rheumatology (ACR) criterion (Felson et al. 1993), when compared to baseline, is larger than 20%. Three historical
studies were identified and combined using a random-effects meta-analysis and a
95% confidence interval for the difference in response rates of the originator vs.
placebo was obtained and is reported as (0.3103, 0.4996). The sponsor decided to aim for preservation of 50% of the effect of the originator vs. placebo and chose Δ = 0.15. For Rixathon, in contrast, the overall response rate to the treatment was
the chosen endpoint. The sponsor used only a single historical trial in a comparable
study population for deriving the equivalence margin. A 95% confidence interval for
the observed add-on effect of the originator was estimated and is given by (0.14,
0.34). The sponsor decided on an equivalence margin of Δ = 0.12, which preserves only about 15% of the add-on effect of the originator.
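The arithmetic behind these two margins can be approximately reproduced from the reported confidence intervals under an effect-preservation rule, Δ = (1 − f) × (lower confidence bound), where f is the fraction of the conservatively estimated originator effect to be preserved. The sketch below is our reading of the reported numbers, not an official calculation:

```python
# Equivalence margin via effect preservation: Delta = (1 - f) * lower CI bound,
# where f is the fraction of the (conservatively estimated) effect to preserve.
def margin(lower_ci_bound, fraction_preserved):
    return (1 - fraction_preserved) * lower_ci_bound

# Benepali: 95% CI (0.3103, 0.4996) for originator vs. placebo, 50% preservation
print(round(margin(0.3103, 0.50), 3))   # ~0.155, close to the chosen 0.15

# Rixathon: 95% CI (0.14, 0.34); margin 0.12 corresponds to ~15% preservation
print(round(1 - 0.12 / 0.14, 3))        # ~0.143, i.e., roughly 15%
```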
It is acknowledged that the richness of historical data was different in the two
situations: while for Benepali three comparable studies were used with, in total, 460
patients enrolled, for Rixathon only one single study with 320 subjects was ana-
lyzed. If only limited data are available, the confidence interval is wider, which generally lowers the equivalence margin for a fixed percentage of effect to be preserved. A lower equivalence margin Δ ultimately makes it more difficult to claim equivalence. Nonetheless, this example shows that the statistical approach for the
choice of the equivalence margins is not standardized yet. Due to the close
connection between the equivalence margin and the test result, it would be beneficial
if more concrete guidance was provided by regulatory agencies, specifically on the
percentage of effect to be preserved.
Negotiation of the equivalence margin with the regulatory authorities by seeking
Scientific Advice is not mandatory in Europe. This is evident in some of the EPARs
in which it is explicitly stated that the EMA did not agree with the chosen margin.
One example is the application of Amgevita (active substance: adalimumab). There,
the sponsor decided to use a margin of (0.738, 1.355) for the risk ratio of ACR20
responders. The EMA (CHMP 2017) was concerned that this margin was too wide
because it “would correspond to an absolute margin of more than –16% on the
additive scale.” It was concluded that “however, in light of the results observed this
does not represent an issue that could compromise the reliability of the study.” It is
unclear how the EMA would have decided had the study results not also supported the tighter margin. Therefore, an early discussion with regulatory
authorities on an acceptable choice of equivalence margins is recommended.
Interchangeability of Biosimilars
The primary efficacy endpoint in therapeutic equivalence studies (see section “The
Stepwise Approach to Biosimilarity”) is usually compared in a parallel groups
design, that is, it is confirmed that patients who repeatedly take the biosimilar and patients who repeatedly take the originator respond comparably to the treatment. The focus is typically on treatment-naive patients, that is, patients without
any relevant pre-treatment prior to the start of the study (FDA 2015b). In practice,
since biosimilars are often developed for chronic diseases, patients might want or
need to switch between the biosimilar and its originator once or even multiple times
during the duration of the treatment. While for the approval as a biosimilar in
Europe, no data on transition from the originator to the biosimilar or vice versa is
required, the FDA recommends assessing the impact on immunogenicity of a single
transition from the originator to the biosimilar (FDA 2015b). However, also in the
USA, usually no data are provided in the biosimilar application by the sponsor on the
impact of multiple switches and single crossovers from the biosimilar to the
originator.
To fill this gap, the FDA has the legal option to approve biosimilars as “inter-
changeable biosimilars.” According to the BPCI Act (FDA 2009), a proposed product is
considered to be interchangeable, if (1) the proposed product is biosimilar to its
originator, (2) it “can be expected to produce the same clinical result as the reference
[originator] product in any given patient” and (3) “for a biological product that is
administered more than once to an individual, the risk in terms of safety or dimin-
ished efficacy of alternating or switching between use of the biological product and
the reference [originator] product is not greater than the risk of using the reference
product without such alternation or switch” where alternating relates to multiple
switches (e.g., biosimilar to originator back to biosimilar). It is important to note that
there is a clear hierarchy between “biosimilarity” and “interchangeability”: a product must first be approved as a biosimilar before an interchangeability designation can be sought.
Fig. 2 FDA’s proposed study design for a separate trial for establishing interchangeability (the comparator arm remains on the originator throughout, while the switching arm alternates between the biosimilar and the originator, with wash-out periods between switches)
In addition, studies might be required which focus on the question of whether patients are able to use the biosimilar device without any additional training.
In discussions among stakeholders, it can be seen that the above guidance is controversial (Barlas 2017). Independently of the guidelines, a collection
of statistical methodologies for the assessment of interchangeability has evolved.
Compared to the recommended approach in the guideline, some of these methodologies are less focused on the mean value and are also sensitive to changes in variability (e.g., Li and Chow 2017) or make better use of all measured data by using
the longitudinal assessments of the patients (e.g., Mielke et al. 2018d).
So far, no interchangeable biosimilar has gained approval. It is important to note that no results of any study have been published revealing that biosimilars cannot be used interchangeably. In contrast, there exist several publica-
tions indicating that switching is not problematic (e.g., Jørgensen et al. 2017;
Benucci et al. 2017). Therefore, it is unclear if the concerns related to interchange-
ability will diminish when more experience with biosimilars in practice is gained or
if the complex clinical studies dedicated to the assessment of interchangeability will
be required in the future.
It should be noted that interchangeability is not a regulatory topic in Europe since
the EMA (2012) clearly states that “the Agency’s evaluations do not include
recommendations on whether a biosimilar should be used interchangeably with its
reference [originator] medicine” and recommends that “for questions related to
switching from one biological medicine to another, patients should speak to their
doctor or pharmacist.” The member states in Europe handle switching and alternat-
ing of biosimilars differently and no joint position is expected to evolve in the near
future (Moorkens et al. 2017).
Typically, by the time clinical efficacy studies are conducted, rich information on the biosimilar and its originator has already been gathered. The proposed biosimilar has
already been evaluated in analytical, animal, and human PK (and possibly PD)
studies. Even more information is available on the originator since this product is
already an established medical product: the sponsor of the originator conducted
several clinical efficacy studies for gaining market authorization and often the
product has additionally been assessed in postmarketing studies. Furthermore, academic institutes or health care providers might also have conducted separate trials.
Therefore, it seems natural to incorporate all available information into the demonstration of similar efficacy. In the following, it is first assumed that historical information is available for the originator only and that the information is of the same type, that is, the focus is on the incorporation of results from historical clinical trials (same endpoint) into the demonstration of equivalent efficacy of the biosimilar and the originator. It should be noted
that historical information in biosimilar trials was already used in practice by
comparing the efficacy outcomes of a single-arm trial to a historical control trial,
for example, in the application for Zarzio in 2008 (CHMP 2008).
In general, the aim of including historical data is to lower the required sample size or, in other words, to increase the power (the probability of claiming equivalence) of the study. However, one also needs to consider the
disadvantages of including all available knowledge in the assessment of equivalent
efficacy: clearly the statistical approach is more complicated, making it more
difficult to analyze the study and communicate the results to a nonstatistical audi-
ence. In addition, in case the data in the new study follows a different distribution
than the data in the historical trials (a so-called prior-data conflict), the Type I error
rate (the probability of false positive decisions) might be higher than the acceptable
nominal level. The potential inflation of the Type I error rate, which represents the patient’s risk that a nonequivalent product will be declared equivalent, is of particular concern for regulatory agencies. The use of historical information is common
practice in some disease areas: for example, in rare diseases where it is challenging
to recruit a sufficient number of patients in randomized trials, the use of prior
information might be required so that the development is feasible at all. Also in
situations in which it is unethical to include a placebo group, a comparison of the
active treatment to historical placebo data has already been used for regulatory
approval. In these situations, a moderate inflation of the Type I error rate is
considered acceptable.
In biosimilar development, the situation is quite different (Mielke et al. 2018c):
biosimilars are not developed for rare diseases; therefore, a sufficient number of
subjects are available. In addition, the randomization of patients to the control group
is not unethical since the control group receives the originator which is often still the
standard of care. Nonetheless, even though the necessity for the inclusion of all
available information is weaker for biosimilars, it is still important to emphasize that
the inclusion of all available data is desirable from a scientific point of view,
especially in the context of “totality of the evidence,” and can also speed up the
development and bring the product earlier to the patient. Therefore, the use of all
information is desirable both for the sponsor and the general public. However, due to
the reasons outlined above, it is expected that the regulatory expectations in terms of
control of the Type I error rate are stricter.
Several approaches for the incorporation of historical information have already
been proposed and an overview can be found in van Rosmalen et al. (2017). For
most approaches, it is possible to adjust the methodology to make it more robust
against a potential prior-data conflict and for limiting the overall Type I error rate. In
the context of biosimilar development, Pan et al. (2017) developed a methodology
which features some tuning parameters for an improved control of the Type I error
rate. Mielke et al. (2018c) proposed not to aim for control of the Type I error rate
over the whole parameter space (e.g., for response rates between 0 and 1 for a binary
endpoint), but to focus instead on scenarios which are realistic in practice (e.g., true
response rates between 0.2 and 0.3). This idea is displayed in Fig. 3: the Type I error
rate and the power are displayed dependent on the rate of a binary characteristic of
interest of the originator in the new study for two hypothetical approaches. One of
these approaches is making use of the historical data (solid lines, via a prior
distribution, for example), while the other (dotted horizontal line) is not
incorporating the historical data. The vertical solid line gives the center of the
historical data (e.g., the mean of the prior distribution): an observed rate on the x-
axis close to this line shows the Type I error rate and the power for situations in
which the data on the originator in the new study approximately match the historical
data. In contrast, if the observed rate of the characteristic of interest is, for example,
at a position on the x-axis of 0.8, this refers to the operating characteristics for a
scenario with a clear prior-data conflict. The proposal of Mielke et al. (2018c) is to
control the Type I error rate only for scenarios with a good to moderate fit of
historical data and data in a new study which is displayed in Fig. 3 as within the
dashed vertical lines. This idea is motivated by the understanding that since the
originator is already an established product, there exists a rich collection of knowl-
edge about the originator which can be used for planning a study which might not be
identical, but at least would provide results similar to those from previous studies.
Outside of the chosen interval, an increased Type I error rate and a lower power are acceptable since one can be reasonably certain that the true rates will not lie outside of the chosen interval. Mielke et al. (2018c) proposed a hybrid Bayes-frequentist method-
ology for binary endpoints which has the above-described operating characteristics.
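The restricted Type I error control described above can be explored by simulation. The sketch below does not implement Mielke et al.'s actual methodology; it uses a deliberately naive borrowing scheme (pooling a fixed number of historical pseudo-observations into the originator arm, with hypothetical rates, margins, and sample sizes) to show how the Type I error of a binary equivalence test behaves near the historical rate and under a prior-data conflict:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def equivalence_test(x_b, n_b, x_o, n_o, delta, alpha=0.05,
                     n_hist=200, p_hist=0.25):
    """Wald-type TOST for a difference of proportions, with the historical
    originator rate pooled in as n_hist pseudo-observations (naive borrowing)."""
    x_o_aug = x_o + p_hist * n_hist          # augment the originator arm
    n_o_aug = n_o + n_hist
    p_b, p_o = x_b / n_b, x_o_aug / n_o_aug
    se = np.sqrt(p_b * (1 - p_b) / n_b + p_o * (1 - p_o) / n_o_aug)
    z = stats.norm.ppf(1 - alpha)
    diff = p_b - p_o
    return (diff + delta) / se > z and (delta - diff) / se > z

def type1_error(p_o_true, delta=0.15, n=150, reps=2000):
    """Place the true difference exactly on the lower margin, so any
    equivalence claim is a false positive."""
    p_b_true = p_o_true - delta
    hits = sum(
        equivalence_test(rng.binomial(n, p_b_true), n,
                         rng.binomial(n, p_o_true), n, delta)
        for _ in range(reps)
    )
    return hits / reps

# Type I error near the historical rate (0.25) vs. under a prior-data conflict:
# the borrowing-induced bias inflates the error as the true rate moves away.
for p_o in (0.25, 0.40, 0.55):
    print(f"true originator rate {p_o:.2f}: Type I error ~ {type1_error(p_o):.3f}")
```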
The incorporation of information gathered during early development (preclinic,
animal, and human PK and PD) is less straightforward. Combest et al. (2014)
proposed constructing an informative prior for the efficacy assessment based on
preclinical data. However, the proposal is rather vague and does not give any
detailed information on the underlying methodology. The main challenge for an
approach like this will, most likely, be the connection of the preclinical assessment
with the clinical result: in contrast to the previously discussed examples where the
same endpoint was measured in the historical study as in the new efficacy study, it is
here necessary to combine completely different pieces of evidence. For example, the
preclinical result may be the result of a bioassay, but the clinical endpoint is a binary
endpoint (responder, nonresponder). Then, one needs to establish a link between the different measurements, that is, how a difference of, for example, 0.2 in the bioassay relates to the chance of being a responder or nonresponder. Often, the
connection between preclinical and clinical results is not known and therefore
even the establishing of equivalence margins for the most critical quality attributes
is not straightforward. That is why the aim of including information from early development in the clinical efficacy studies is interesting but ambitious, and more research is required.
In the previous sections, challenges related to the design and analysis of biosimilar trials were described. However, it is important to emphasize that there are also multiple challenges which are not related to these aspects. Some of these are briefly discussed in the following.
Due to the complex nature of the processes needed to manufacture a biologic, the
batches of the drug that are produced over a given time period may vary in terms of
their exact analytical properties. This is understood by regulators and manufacturers
are expected to run so-called comparability studies at regular intervals to ensure that
critical quality attributes of the biologic are maintained within agreed limits (ICH
2004). Over time, the manufacturer builds up a history of how batches vary over
time, but this knowledge is not available to the developer of a biosimilar. The only
knowledge that the biosimilar developer has comes from analysis of batches of the
original biologic that are purchased on the open market. So, in some sense, the
biosimilar developer has to chase a moving target in terms of showing
analytical similarity (Step 1 in Fig. 1). See Schiestl et al. (2011) and Mielke et al.
(2019) for examples and further discussion on this. See Berkowitz (2017) for a more
complete discussion of issues related to the structural assessment of biosimilarity.
As some biologics have a long half-life, this may preclude the use of cross-over
trials to show equivalence of PK and PD markers for certain active substances.
Therefore, parallel groups trials have to be used and these typically have much larger
sample size requirements compared to those needed for providing evidence of
equivalence of a nonbiologic (generic) drug with its reference using a cross-over
trial.
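The sample size penalty can be made concrete with the standard normal-approximation sample size for TOST at a true difference of zero: a parallel groups design needs roughly n = 2σ²(z₁₋α + z₁₋β/₂)²/Δ² subjects per group, where σ² is the total (between- plus within-subject) variance, whereas a 2 × 2 cross-over replaces the total variance by the typically much smaller within-subject variance and uses each subject as his or her own control. A sketch with illustrative variance components (the numerical values are assumptions, not data from any trial):

```python
import numpy as np
from scipy import stats

def tost_n(var_of_estimator_unit, delta, alpha=0.05, power=0.9):
    """Smallest N with (z_{1-a} + z_{1-b/2}) * sqrt(var_unit / N) <= delta,
    assuming zero true difference (normal approximation)."""
    k = stats.norm.ppf(1 - alpha) + stats.norm.ppf(1 - (1 - power) / 2)
    return int(np.ceil(var_of_estimator_unit * (k / delta) ** 2))

# Illustrative variance components on the log scale (assumed values)
sigma2_between, sigma2_within = 0.20, 0.04
delta = np.log(1.25)  # equivalence margin on the log scale

# Parallel groups: Var(diff) = 2 * (between + within) / n_per_group
total_parallel = 2 * tost_n(2 * (sigma2_between + sigma2_within), delta)

# 2x2 cross-over: Var(diff) = 2 * within / N_total (subjects are own controls)
total_crossover = tost_n(2 * sigma2_within, delta)

print(total_parallel, total_crossover)  # e.g., ~210 vs. ~18 total subjects
```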
Another challenge related to recruitment, mentioned by Weschler (2016), is that
experienced clinical investigators might prefer to be involved in the development of
innovative drugs rather than in the development of copies of existing drugs. This
might limit the number of research centers that are available to take part in a
biosimilar study.
Once the biosimilar has gained regulatory approval and is on the market, a further
challenge is to convince physicians to prescribe the new drug. In addition, patients
need to agree to use the biosimilar instead of the originator product. Surveys suggest that acceptance of biosimilars depends on regulatory consistency, education, and trust (Cazap et al. 2018).
In practice, global development programs are becoming more common. Among the recently approved applications in Europe, many used a strategy with bridging at the
PK level (Mielke et al. 2018a). Herzuma (Celltrion Healthcare, active substance:
trastuzumab) was the first approved product in Europe where bridging studies were
only conducted at the analytical level and no head-to-head comparisons of the EU-
approved originator to the proposed biosimilar in human subjects were performed (i.e., no PK, PD, clinical comparability, or safety studies were conducted).
In this chapter, some key statistical challenges were also highlighted (the choice of equivalence margins, the establishing of interchangeability, and the formal incorporation of additional information into the analysis of efficacy endpoints). These challenges can only serve as a short introduction to statistical issues in biosimilar development, and there exist several other important topics to be considered, for example, the handling of multiplicity (e.g., Mielke et al. 2018b), the use of statistics in preclinical development (e.g., Tsong et al. 2017; Mielke et al. 2019), or the application of advanced statistical tools like network meta-analysis for an improved efficacy assessment (e.g., Messori et al. 2017). With the increasing number of approved biosimilars, the number of tailored statistical methodologies developed specifically for biosimilar development is expected to increase further in the near future.
Key Facts
Cross-References
▶ Cross-over Trials
▶ Essential Statistical Tests
▶ Introduction to Meta-Analysis
▶ Pharmacokinetic and Pharmacodynamic Modeling
▶ Use of Historical Data in Design
Acknowledgements The authors gratefully acknowledge the funding from the European Union’s
Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant
agreement No 633567 and from the Swiss State Secretariat for Education, Research and Innovation
(SERI) under contract number 999754557. The opinions expressed and arguments employed herein
do not necessarily reflect the official views of the Swiss Government.
References
Arato T (2016) Japanese regulation of biosimilar products: past experience and current challenges.
Br J Clin Pharmacol 82(1):30–40
Barlas S (2017) FDA guidance on biosimilar interchangeability elicits diverse views: current and
potential marketers complain about too-high hurdles. Pharm Ther 42(8):509
Benucci M, Gobbi FL, Bandinelli F, Damiani A, Infantino M, Grossi V, Manfredi M, Parisi S,
Fusaro E, Batticciotto A et al (2017) Safety, efficacy and immunogenicity of switching from
innovator to biosimilar infliximab in patients with spondyloarthritis: a 6-month real-life obser-
vational study. Immunol Res 65(1):419–422
Berkowitz SA (2017) Analytical characterization: structural assessment of biosimilarity, Chap 2. In:
Endrenyi L, Declerck P, Chow SC (eds) Biosimilar drug product development. CRC Press, Boca
Raton, pp 15–82
Bewesdorff M (2016) Biosimilars in the U.S. – the long way to their first approval. Master of drug
regulatory affairs, Rheinischen Friedrich-Wilhelms-Universitat Bonn
Blackstone E, Fuhr JP Jr (2017) Biosimilars and biologics. The prospect for competition, Chap 16.
In: Endrenyi L, Declerck P, Chow SC (eds) Biosimilar drug product development. CRC Press,
Boca Raton, pp 413–438
Brennan Z (2016) FDA to hold one advisory committee for each initial biosimilar. https://www.raps.org/regulatory-focus%E2%84%A2/news-articles/2016/9/fda-to-hold-one-advisory-committee-for-each-initial-biosimilar. Accessed 07 June 2018
Cazap E, Jacobs I, McBride A, Popovian R, Sikora K (2018) Global acceptance of biosimilars:
importance of regulatory consistency, education, and trust. Oncologist 23:1188
CHMP (2005) Guideline on the choice of non-inferiority margins. http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2009/09/WC500003636.pdf. Accessed 07 June 2018
CHMP (2008) Zarzio: EPAR – public assessment report. http://www.ema.europa.eu/docs/en_GB/document_library/EPAR_-_Public_assessment_report/human/000917/WC500046528.pdf. Accessed 26 Oct 2015
CHMP (2014a) Guideline on similar biological medicinal products containing biotechnology-derived proteins as active substance: non-clinical and clinical issues (revision 1). http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2015/01/WC500180219.pdf. Accessed 22 Feb 2018
CHMP (2014b) Guideline on similar biological medicinal products (revision 1). http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2014/10/WC500176768.pdf. Accessed 22 Feb 2018
CHMP (2016) Benepali: EPAR – public assessment report. http://www.ema.europa.eu/docs/en_GB/document_library/EPAR_-_Public_assessment_report/human/004007/WC500200380.pdf. Accessed 07 June 2018
CHMP (2017) Amgevita: EPAR – public assessment report. http://www.ema.europa.eu/docs/en_GB/document_library/EPAR_-_Public_assessment_report/human/004212/WC500225231.pdf. Accessed 07 June 2018
Chow SC, Hsieh TC, Chi E, Yang J (2009) A comparison of moment-based and probability-based
criteria for assessment of follow-on biologics. J Biopharm Stat 20(1):31–45
van Rosmalen J, Dejardin D, van Norden Y, Löwenberg B, Lesaffre E (2017) Including historical data in the analysis of clinical trials: is it worth the effort? Stat Methods Med Res. https://www.ncbi.nlm.nih.gov/pubmed/28322129
Webster CJ, Woollett GR (2017) A ‘global reference’ comparator for biosimilar development.
BioDrugs 31(4):279–286
Weise M, Kurki P, Wolff-Holz E, Bielsky MC, Schneider CK (2014) Biosimilars: the science of
extrapolation. Blood 124(22):3191–3196
Wellek S (2010) Testing statistical hypotheses of equivalence and noninferiority, 2nd edn. CRC
Press, London
Weschler B (2016) Biosimilar trials differ notably from innovator studies. Appl Clin Trials. http://www.appliedclinicaltrialsonline.com/biosimilar-trials-differ-notably-innovator-studies
Prevention Trials: Challenges in Design, Analysis, and Interpretation of Prevention Trials
Contents
Introduction
Trial Population
The Disease Process and Identifying a Population at Risk Who Can Benefit from a Preventive Intervention
Components of Intervention
Sustainability of the Behavior Change
The Time Course of the Intervention Within the Disease Process
The Dose
The Duration of “Exposure/Intervention” Needed to Produce Risk Reduction
The Durability of the Impact of the Intervention After It Has Stopped
Outcomes for Prevention Trials
Analysis ITT and Adherence in Prevention Trials
Biomarkers and Other Emerging Areas
Interpreting Prevention Trials
Conclusion
Key Facts
Cross-References
References
Abstract
Designing a prevention trial requires understanding the natural history of the
disease, and the likely length of intervention required to achieve a reduction in
incidence. The population must be suitable to contribute meaningful information to the outcomes under study, the intervention must be appropriate and likely to generate a favorable balance of risks and benefits for the typically disease-free population, and the primary outcome must be biologically plausible and clinically relevant. Given the
relatively long evolution of chronic diseases, prevention trials bring extra pres-
sures on two fundamental issues in the design of the trial: adherence to the
preventive intervention among participants who are otherwise healthy, and
sustained follow-up of trial participants. With growing emphasis on the compo-
sition of the trial participant population reflecting the overall population for
ultimate application of the results, there is the need for additional attention to
recruitment and retention of participants. This is fundamental to planning a
prevention trial. Planning for follow-up after the intervention is completed
helps place the intervention and outcomes in the context of the disease process
but adds complexity to recruitment. Nevertheless, this adds to the insights gained from
prevention trials. Improving risk stratification for identification of eligible partic-
ipants for recruitment to prevention trials can improve efficiency of the trials and
fit prevention trials in the context of precision prevention.
Keywords
Participants · Baseline risk · Diversity · Natural history · Sustainability · Intervention timing · Adherence
Introduction
This chapter considers (1) the trial population; (2) the disease process and population at risk; (3) the components of the intervention; and (4) the outcome. For interventions we consider (a) sustainability of
the behavior change, (b) the time course of the intervention within the disease
process, (c) the dose, (d) the duration of “exposure” needed to effect risk reduction,
and (e) the durability of the impact of the intervention after it has stopped. Issues of
adherence to the intervention and approaches to analysis also impact the inference
from prevention trials when informing changes in policy and practice.
Trial Population
A key assumption for enrolling participants in a trial is that they will contribute
meaningful information to the outcome measures and that they are likely to engage
in the intervention for the duration of the study. Participant recruitment for preven-
tion trials brings added challenges beyond enrolling patients facing acute disease or
major catastrophic outcomes in the near term. In the disease treatment setting, enrolling participants is more straightforward, as adherence to treatment options is likely to be high and the balance of risks and benefits can be conveyed in time frames that relate to the patient's situation at hand. For primary prevention trials, however, we begin enrolling
healthy individuals who may be at risk of future disease and engage them for longer
term adherence to prevention strategies (pills, behaviors, or combinations), with endpoints observed far in the future. This challenge of recruitment to prevention
trials has resulted in a number of trials where the prevalence of baseline behaviors resulted in a population not ideally selected to evaluate the intervention (see the Physicians' Health Study of aspirin to prevent cardiovascular death, with its very low cardiovascular disease incidence (Cairns et al. 1991), or calcium and vitamin D in the Women's Health Initiative, where baseline calcium was above the threshold of benefit as determined from observational studies (Martinez et al. 2008)). On the other hand,
the trials evaluating Tamoxifen for breast cancer prevention used a baseline estimate
of breast cancer risk to identify women at elevated risk and so shift the balance of
risks and benefits for those randomized (Fisher et al. 1998). Similar issues are
discussed in ▶ Chap. 112, “Trials Can Inform or Misinform: The Story of Vitamin A Deficiency and Childhood Mortality.”
Eligibility and enrollment of the study population may also limit the application
of results beyond the trial. This issue is not limited to primary prevention of course,
but is also demonstrated with exclusion based on older age, or presence of major
comorbidities, that limit generalizability and application of results (see Stoll et al.
(2019), and pragmatic trials (Ware and Hamel 2011)).
Beyond disease severity, risk factor profiles and the like, many clinical trials are
underpopulated with minority participants (Chen et al. 2014). This is due, in part, to
eligibility criteria, and lack of engagement strategies tailored to minorities. Evidence
shows that concerted efforts to modify eligibility to include broader populations of
patients, and use of culturally tailored materials and processes, result in increased
research and trial accruals of minorities and their retention through the duration of
the trial (Warner et al. 2013). Thus, to generate results applicable to the broader
population, design of trials should increase eligibility to populations with multiple comorbidities.

The Disease Process and Identifying a Population at Risk Who Can Benefit from a Preventive Intervention
Epidemiologic and natural history studies often define risk factors and provide input
to models that classify risk of chronic disease. In defining the population for
recruitment to a prevention trial, the aim is to identify those with sufficiently high
risk of disease based on a combination of risk factors so that the benefit from a
preventive intervention will outweigh any possible adverse effects of the interven-
tion. Thus, selection of participants is in part driven by baseline disease risk being
sufficient to generate a research answer in a short funding time frame. For example,
after numerous meetings convened by NIH to discuss trial design for weight loss to
prevent chronic disease, NIDDK chose to move forward with a prevention trial of
intensive lifestyle intervention to prevent or delay development of diabetes (Knowler
et al. 2002).
Numerous examples from prior trials show that healthy volunteers are not
necessarily at sufficient risk to generate endpoints from the intervention. Improving
risk classification for entry to prevention trials is a major imperative. Much work is
ongoing in this area to define at-risk groups more precisely, whether by combining questionnaire risk factors, polygenic risk scores, or metabolomic profiles to differentiate those who might respond to a prevention intervention from those who will not.
Take for example breast cancer, where chemoprevention shows marked differences between prevention of receptor-positive and receptor-negative breast cancer. Overall,
selective estrogen receptor modulators (SERMs) such as Tamoxifen and Raloxifene
have been shown in randomized controlled prevention trials to reduce risk of
preinvasive and invasive breast cancer (Fisher et al. 1998; Martino et al. 2004).
The separation of incidence curves is dramatic and clear within 2 years of initiating
therapy. Like aspirin, SERMs also raise the challenge of risks and benefits of
therapies as well as the limitation of randomized trials to quantify potential harms
that are much less frequent than the primary trial end point. Tamoxifen increases risk
of uterine cancer, a finding confirmed by epidemiologic studies; Raloxifene, which
looks to have a safer profile, does not (Chen et al. 2007). Yet the protection is limited to receptor-positive disease. Thus, identifying and enrolling those who are at highest risk of receptor-positive but not receptor-negative breast cancer could maximize benefits and reduce potential harms.
The emerging field of personalized medicine has been offering possibilities for
improving risk prediction and stratification based on patient-specific demographic,
clinical factors, medical histories, and genetic profile. Models that are tailored for
patients have provided high-quality recommendations for screening that account for individual heterogeneity (Sargent et al. 2005). Traditional approaches usually
aim to investigate evidence of treatment differences by conducting subgroup
analysis based on prior data to gain insights on markers that can better stratify
patients according to their risk level. Other approaches include regression models
which usually include interaction terms between treatments and the covariates in
order to examine whether these interactions are statistically significant as well as
estimating the true (undiluted) benefit of the intervention. Effective classification of
patients can thus be translated into statistical models which aim to minimize the
prediction error, where the optimal risk classifier can lead to the best predicted
outcome.
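As a concrete, purely hypothetical illustration of the interaction-term approach, one can simulate a trial in which the intervention helps only marker-positive participants and fit a logistic model with a treatment-by-marker interaction; the Wald test of the interaction coefficient asks whether the marker stratifies the treatment effect (all names and effect sizes below are assumptions for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(11)
n = 4000
marker = rng.binomial(1, 0.3, n)   # hypothetical high-risk stratum indicator
treat = rng.binomial(1, 0.5, n)    # randomized intervention assignment
# assumed truth: the intervention only helps marker-positive participants
logit = -2.0 + 0.4 * marker - 0.7 * treat * marker
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = sm.add_constant(np.column_stack([treat, marker, treat * marker]))
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
# the last coefficient is the treatment-by-marker interaction; its test asks
# whether the treatment effect differs between marker strata
print(fit.summary(xname=["const", "treat", "marker", "treat_x_marker"]))
```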
Components of Intervention
Interventions for prevention range from a one (or two) time events (some vaccines),
to longer term use of preventive drugs such as aspirin, nutritional supplements, or the
polypill for cardiovascular diseases (Yusuf et al. 2021), and lifestyle changes
(components of diet, physical activity, sun exposure) (Knowler et al. 2002). The
components of the intervention have major implications for the design and cost of
prevention trials and the ultimate interpretation of the results as discussed in the next
sections.
The control or comparison intervention generates similar challenges for designing
and implementing a trial. If usual care is the comparison, as in the Hypertension Detection and Follow-up Program in the 1970s (1979) or the Health Insurance Plan of New York mammography trial (Shapiro et al. 1985), the control arm may seek out
the intervention and dilute the comparison. Similar issues arose in the Women’s
Health Initiative trial discussed below.
Sustainability of the Behavior Change

For lifestyle intervention to change diet, physical activity, or other aspects of our
lifestyle such as transportation and commuting, many issues arise related to sustain-
ability of the intervention and its associated lifestyle changes, and use of approaches
to document adherence.
Adherence to interventions has not been high in long-term primary prevention
trials. The Tamoxifen breast cancer prevention trial (P1) was designed allowing 10% of women per year to discontinue Tamoxifen therapy, though the observed noncompliance was lower: 23.7% of women randomized to Tamoxifen stopped their therapy during the trial vs. 19.7% of the placebo group (Fisher et al. 1998). In the
Women’s Health Initiative evaluation of menopausal hormone therapy, drop out was
42% for estrogen plus progestin and 38% for placebo. This exceeded the design
projections. Of note, women in the placebo group initiated hormone use through
their own clinical providers (10.7% by the sixth year) (Rossouw et al. 2002). Similar
adherence issues apply for diet interventions. In the low-fat diet intervention in the
Women’s Health Initiative, a total of 48,835 postmenopausal women were random-
ized to the dietary intervention (40%) or the comparison group (60%). The
intervention promoted dietary change with a goal of reducing intake of total fat to
20% of energy. Concomitantly, the participants would increase consumption
of vegetables and fruit to at least five servings daily, and also increase their intake of
grains to six servings daily. Estimated adherence in the intervention group was 57%
at year 3, 31% at year 6, and 19% at year 9, substantially lower than the adherence in
the comparison group (Prentice et al. 2006). Given the challenges of sustained
behavior change in otherwise healthy trial participants, adaptations of technology
such as text messaging and more real time feedback have been studied as adjuncts to
motivating and sustaining lifestyle changes (Wolin et al. 2015). There is much
research ongoing to identify the most effective strategies to motivate and sustain
participation and adherence for different populations based on gender, age, and race/
ethnicity. As technology continues to evolve, and access increases, additional
insights should improve the design and interpretation of prevention trials.
One strategy that has been used to improve adherence in trial participants is the active run-in phase before randomization. For example, in the Physicians' Health Study, a randomized trial of aspirin and beta-carotene to prevent heart disease and cancer, an active run-in facilitated identification of those with adverse tolerance of every-other-day aspirin.
Beyond examples such as these, and the continuing research to adapt and improve
approaches to enroll and sustain adherence to interventions over prolonged time
periods, adherence in prevention trials has major implications for design, analysis
and interpretation of trial results as discussed below.
The Time Course of the Intervention Within the Disease Process

Interpreting null results in prevention trials begs the question of whether the inter-
vention was delivered at an appropriate time in the disease process, or whether dose
and duration of the intervention were chosen correctly. First, we consider the timing
of the intervention.
The null RCTs of fiber and fruit and vegetables for prevention of polyp recurrence
amply illustrate Zelen’s concerns about the timing of the preventive intervention in the
disease process. Randomized trials of fiber and fruit and vegetables in the prevention
of colon polyp recurrence have not shown any benefit from increased intake (Alberts
et al. 2000; Schatzkin et al. 2000). Furthermore, in prevention trials addressing
recurrence of polyps, the extent of DNA damage accumulated across the colonic
mucosa at the time the eligibility polyp is detected certainly is not limited to only the
removed polyp. Thus we must ask of RCTs, at what stage in the disease process may
fiber play a role in protecting against colon cancer? Constraints of design in RCTs
usually limit to a narrow time point and defined dose of exposure (and specific
duration), which contrast with the richness of epidemiologic studies that can address
exposure over the life course and relate such exposure to disease risk.
Other nutritional agents have also been tested in chemoprevention trials in the
developed world and in China (Greenwald et al. 2007). Based on evidence documenting that people in Linxian, China, had low intakes of several nutrients, a large randomized trial of nutrient supplementation was conducted there (Blot et al. 1993; see also the discussion of durability below).
The Dose
Often investigators move from treatment trials showing efficacy of an agent on disease outcomes to applying the agent for prevention. This sequence has been followed with aspirin and CHD and with tamoxifen and breast cancer prevention, to name a few. In both heart disease and breast cancer prevention, lower doses
have been chosen for prevention in part to avoid potential adverse events that
accumulate in the healthy population taking a drug to prevent future disease onset.
In breast cancer prevention the dose of Tamoxifen has been reduced to minimize
menopausal symptoms and now shows significant benefits with reduction in breast
cancer events (DeCensi et al. 2019) and also in breast density a marker of breast
cancer risk (Eriksson et al. 2021). If we have more markers of response, we might
shorten the time frame from development of these trials to endpoint ascertainment.
Prevention trials typically are large, expensive, and of long duration, because we are interrupting a slow disease process, as in chronic disease.
The initial Tamoxifen P1 breast cancer prevention trial screened 98,018 women
identifying 57,641 risk eligible women and randomized 13,388 participants to
determine the worth of Tamoxifen in preventing breast cancer in women with
5-year risk above 1.66%. Cumulative incidence through 69 months was 43.4/1000
women in placebo group and 22.0/1000 in the Tamoxifen group (total 175 invasive
cases). This trial cost $64 million (without costs for participant enrollment, follow-up visits, or drug/placebo). Subsequent trials compared Raloxifene and Tamoxifen (Martino et al. 2004), costing $134 million, and investigators secured drug and $30 million from Novartis for the STELLAR trial (a study to evaluate Letrozole and Raloxifene), but NCI withdrew support at the level of $55 million (The Lancet 2007; Parker-Pope 2007).
The need to balance benefits against adverse effects from interventions in
otherwise healthy populations places emphasis on determining the lowest possible
dose to achieve benefit and reduce risk of adverse side effects. An earlier explo-
ration of response at lower doses of possible preventive agents may speed the move
to large scale prevention or phase 3 trials. Lower dose reduces risk of adverse
events in many settings, but a sufficient framework for evaluation of response by
dose, including biomarkers or risk profiles would speed the path to efficient
prevention trials. Promising options in precision-based approaches include pros-
taglandin pathways, BRAF and HLA class 1 antigen expression, among others
(Jaffee et al. 2017).
The Durability of the Impact of the Intervention After It Has Stopped

Drawing on examples from the breast cancer prevention trials and the China/Linxian
trial – we see that to address the persistence of a prevention benefit after the cessation
of the intervention requires planned additional follow-up beyond the primary hypoth-
esis of the trial. If clearly defined as secondary hypothesis and analyses, then continued
follow-up of trial participants can answer key questions of duration for the trial
intervention, and further inform evaluation of risks and benefits for prevention.
The additional insight on prevention gained from the precise knowledge of
exposure recorded in the randomized trial includes the added understanding of the
disease process after cessation of a precisely measured intervention. Continued
follow-up of trial participants has shown the durability of the effect of a prevention
agent. In the Linxian trial, factor D, which included selenium, vitamin E, and beta-
carotene, statistically significantly reduced total mortality, total cancer mortality, and
mortality from gastric cancer (Blot et al. 1993). An important question remained,
however: whether the preventive effects of factor D would last beyond the trial
period. The results of the continued follow-up showed that hazard ratios (HRs), as
indicated by moving HR curves, remained less than 1.0 for each of these end points
for most of the follow-up period; 10 years after completion of the trial, the group that
received factor D still showed a 5% reduction in total mortality and an 11% reduction
in gastric cancer mortality (Qiao et al. 2009).
Similar insight on the duration of protection has been provided from continued
follow-up of three tamoxifen trials, which showed benefit after the conclusion of
active therapy (Fisher et al. 2005). The calcium polyp prevention trial also reported
that the protection observed during the trial persisted for up to 5 years after
supplementation ended and may, in fact, have been stronger after, rather than during,
active intervention (Grau et al. 2007). With the exception of smoking cessation,
cessation of exposure to occupational carcinogens, and termination of drug use,
lifestyle factors (diet, energy balance, physical activity, sleep pattern or sun expo-
sure) rarely have a clearly demarcated cessation, thus requiring observational studies
to provide insight on the durability of effects and lag from exposure to disease. For
pharmacologic interventions, on the other hand, long term follow-up is essential to
fully determine risks and benefits (Cuzick 2010).
Outcomes for Prevention Trials

The Hypertension Detection and Follow-up Program, which randomized participants with hypertension to stepped care or referred community care (usual care), showed that 5-year mortality from all causes was significantly lower in the stepped care treatment arm compared to community care (1979).
Evolving trial design and scientific agreement on more proximate endpoints reflect
the evolution of understanding of the underlying disease processes and the priority
for interventions that show benefits exceeding harms.
Debate regarding endpoints has included the focus on mortality reduction vs a
reduction in incidence of disease. For example, the UK Doctors Study was designed to test whether aspirin 500 mg daily reduced incidence of and mortality from stroke, myocardial infarction, or other vascular conditions (Peto et al. 1988). The US Physicians' Health Study evaluated alternate-day aspirin (325 mg) vs. placebo, randomizing 22,071 participants and following them for an average of 57 months
(Steering Committee of the Physicians’ Health Study Research 1988). Incident
myocardial infarction was significantly reduced but mortality was equivalent in
each arm (44 cardiovascular deaths). Subsequent reporting from the Data Monitor-
ing Board demonstrated the futility of continuing the trial for a mortality benefit
(Cairns et al. 1991). They presented data showing cardiovascular mortality substantially
lower than expected from age-comparable population rates, consistent with a
baseline prevalence of current smoking of 12% (Glynn et al. 1994). Treatment of
nonfatal myocardial infarction further complicated interpretation, as demonstrated by
Cook et al. (2002). While disease incidence is the primary endpoint for most
prevention trials, design features and clinical diagnosis and treatment must be
carefully monitored to avoid inducing bias in endpoint ascertainment.
The US FDA defines the endpoints against which drugs and biologics are assessed as
safe and effective, considering both clinical outcomes and surrogate endpoints.
Surrogate endpoints are used when clinical outcomes might take a long time to study
(consider prevention trials of stroke or cervical cancer, for example). The FDA now
publishes a list of acceptable surrogate endpoints for both adult and childhood diseases
(US Food and Drug Administration 2021). The surrogacy criteria are typically very
strict, to avoid apparent benefit when no reduction in clinical disease incidence is
present. Across prevention trials, the importance of outcome choice and rigor of
confirmation applies in a similar manner to these more general issues (see ▶ Chap. 47,
“Ascertainment and Classification of Outcomes”).
Zelen considered the challenges of primary prevention trials in the 1980s and
addressed both compliance and models of carcinogenesis as major impediments to
the use of RCTs to evaluate cancer prevention strategies (Zelen 1988). It is important
to contrast these issues in treatment trials and prevention trials. In treatment trials, we
typically take recently diagnosed patients and offer them, often in a life-threatening
situation, the option to participate in a trial of a new therapy compared with standard
therapy or placebo. Compliance or adherence to therapy is usually very high among
these highly motivated patients and outcomes are generally in a short to mid-term
time frame. In contrast, prevention trials recruit large numbers of healthy
participants, offer them a therapy, and then follow them over many years, since the
chronic diseases being prevented are relatively rare. With substantial nonadherence,
often in the range of 20–40% over the duration of the trial, an intention-to-treat
analysis no longer provides an unbiased estimate of the efficacy of the intervention
itself; rather, it estimates the diluted effect of treatment assignment.
Issues in analysis using the a priori intention-to-treat (ITT) plan (see ▶ Chap. 82,
“Intention to Treat and Alternative Approaches”) and detailed approaches to model-
ing adherence over time in the prevention trial setting call for rigorous detail in the
study protocol. Additional challenges that recur in the prevention trial setting include
dropout and loss to follow-up that may be differential, particularly in settings
such as weight loss trials (Ware 2003). When endpoints such as weight loss or
quality of life may reflect both engagement with the trial and adherence to the
intervention, strategies to maximize endpoint data collection and retain participants
in the study are fundamental to the integrity of the trial results.
Many of the design issues, including population, intervention and control arms,
adherence, and cost constraints, come together to be balanced in the design of the
trial. The protocols for most trials in the last 10 years have been posted online when
the primary trial results are published. For older prevention trials, access to a full
protocol, sample size considerations, and so forth may be harder to locate. The
Women's Health Initiative published its protocol (Writing Group 1998), and the
Diabetes Prevention Program web site at the NIH (NIDDK) provides access to the
study protocol, with extensive details of a less complicated but still three-arm design.
The principal objective of that trial was to prevent or delay development of
Non-Insulin Dependent Diabetes Mellitus (type 2 diabetes) in persons at high risk
with impaired glucose tolerance (Knowler et al. 2002; The Diabetes Prevention
Program Research Group 1999). The protocol is available at http://www.bsc.gwu.edu/dpp.
For prevention trials the challenge is to harness these resources to better stratify or
classify underlying disease risk. An increasing array of technologies allows
non-invasive imaging with increasing precision. Imaging is spatially defined, adapt-
able to a variety of instruments, minimally invasive, and sensitive to capturing
detailed information, and it supports the use of contrast agents. For primary preven-
tion of cancer – including prevention trials – imaging provides information on organ
health, such as sun damage to skin, liver fat or fibrosis, and breast density. For
secondary prevention, imaging identifies early disease in high-risk populations
through such screenings as mammography, colonoscopy, colposcopy, lung com-
puted tomography (CT), dermoscopy, and, in prostate cancer, MRI, which may allow
better stratification so that patients with evidence of indolent disease can forgo
biopsy. For tertiary prevention, imaging is used to monitor a primary tumor
or metastasis. Advanced imaging techniques enable digital pathomics analyses of
cell shape, nucleus texture, stroma patterns, and tissue architecture arrangement.
Much of this is coupled with AI and ML to speed discovery and translation of
applications. The ultimate goal is often delivery of results at point of care, with
immediate decision-making and action. Importantly, point of care can increasingly
be used in under-resourced settings to potentially bridge access gaps and reduce cancer
health disparities. AI/ML methods perform well only when the data set is sufficiently
large, often requiring very large training sets to perform optimally. Bringing
these technologies to the point of care for evaluation of patient eligibility for prevention
trials is a rapidly emerging area of study with much potential to increase the efficiency
of prevention trials.
Interfaces with data science and machine learning in -omics and other applica-
tions beyond imaging are rapidly expanding. Opportunities for application in preci-
sion prevention include the development of conventional analyses as well as AI/ML
approaches to handle disparate data types from imaging, omics, demographics,
lifestyle, and environmental exposures, and to generate actionable information.
Multidimensional data typically combine several lines of evidence, such as
whole-genome sequencing, gene expression, copy number variation, and methyla-
tion, to produce models that can predict patient outcomes. These multidimensional data
can also vary over time (e.g., time-varying factors, markers, and images). This
approach with high-dimensional baseline covariates is being used in the ongoing
NCI Precancer Atlas (PCA); these and other advances in applications require novel
analytic strategies and methods to verify that AI and ML approaches are robust.
Bringing these approaches to risk classification will transform eligibility assessment
for prevention trials with precision approaches in the coming years.
There is great promise in the integration of multidimensional data into cancer risk
prediction. Risk stratification algorithms will be required. This work will build on
the record of methods development and application in cancer prevention for risk
models (both classic statistical models and Bayesian approaches) (Steyerberg 2009).
Strategies to bring multidimensional data to point of care for risk stratification and
precision prevention decision making will need integrated studies of communication
of these approaches and their interpretation (Klein and Stefanek 2007). At the same
time, equitable coverage of populations regardless of socioeconomic status and
race/ethnicity must be ensured.
Prevention trials may offer a range of stress tests for the design and interpretation of
randomized trials. Not only are they often longer in duration, since they aim to reduce
the incidence or onset of disease, but the volunteers willing to enroll may not reflect
the distribution of risk factors in the broader population that motivated the scientific
questions being addressed, adding to the challenge of interpreting and applying
results. Sommer reviews some case studies in the chapter on vitamin A deficiency
(see ▶ Chap. 112, “Trials Can Inform or Misinform: “The Story of Vitamin A
Deficiency and Childhood Mortality””), and many have written critiques of other
prevention trials whose results did not “hold up” as expected from the motivation for
the trial (Martinez et al. 2008; Tanvetyanon and Bepler 2008).
Recent experience with vaccines against COVID-19 demonstrates increasing
public focus on trial design, protocol access, and almost real-time reporting of the
race/ethnic and age composition of participants, holding trialists accountable for
enrolling study populations that reflect the at-risk population. Despite these advances
in the face of the COVID-19 pandemic, there remains much room for improvement in
the recruitment of broader and more diverse populations of participants for prevention
trials in general, and in the application of advancing methods of trial design to bring
timely results for the prevention of chronic diseases.
Conclusion
Prevention trials allow the investigator to evaluate the magnitude of benefit of a
preventive intervention in the context of the natural history of disease development.
Through randomization, prevention trials avoid the self-selection to new uses of thera-
pies or to potentially preventive lifestyle patterns that can be confounded in observa-
tional settings by socioeconomic status, education, and access to prevention and care.
Key Facts
• Prevention trials offer results that remove self-selection bias in the evaluation of
prevention approaches for chronic diseases.
• Choosing the population for inclusion in the trial balances the level of risk, the
duration of trial needed to accrue sufficient endpoints to test the intervention, and
the generalizability of the findings for prevention.
• Long trial duration may exacerbate challenges of adherence for otherwise healthy
populations.
• Extended follow-up beyond the planned trial intervention may add important
detail on trade-offs between risks and benefits.
Cross-References
References
Alberts DS, Martinez ME, Roe DJ, Guillen-Rodriguez JM, Marshall JR, van Leeuwen JB, Reid
ME, Ritenbaugh C, Vargas PA, Bhattacharyya AB, Earnest DL, Sampliner RE (2000) Lack of
Kittles R, Le QT, Lippman SM, Mankoff D, Mardis ER, Mayer DK, McMasters K, Meropol NJ,
Mitchell B, Naredi P, Ornish D, Pawlik TM, Peppercorn J, Pomper MG, Raghavan D, Ritchie C,
Schwarz SW, Sullivan R, Wahl R, Wolchok JD, Wong SL, Yung A (2017) Future cancer
research priorities in the USA: a Lancet Oncology Commission. Lancet Oncol 18(11):e653–
e706. https://doi.org/10.1016/S1470-2045(17)30698-8
Klein WM, Stefanek ME (2007) Cancer risk elicitation and communication: lessons from the
psychology of risk perception. CA Cancer J Clin 57(3):147–167. https://doi.org/10.3322/
canjclin.57.3.147
Knowler WC, Barrett-Connor E, Fowler SE, Hamman RF, Lachin JM, Walker EA, Nathan DM,
Diabetes Prevention Program Research Group (2002) Reduction in the incidence of type
2 diabetes with lifestyle intervention or metformin. N Engl J Med 346(6):393–403. https://
doi.org/10.1056/NEJMoa012512
Limburg PJ, Wei W, Ahnen DJ, Qiao Y, Hawk ET, Wang G, Giffen CA, Wang G, Roth MJ, Lu N,
Korn EL, Ma Y, Caldwell KL, Dong Z, Taylor PR, Dawsey SM (2005) Randomized, placebo-
controlled, esophageal squamous cell cancer chemoprevention trial of selenomethionine and
celecoxib. Gastroenterology 129(3):863–873
Martinez ME, Marshall JR, Giovannucci E (2008) Diet and cancer prevention: the roles of
observation and experimentation. Nat Rev Cancer 8(9):694–703
Martino S, Cauley JA, Barrett-Connor E, Powles TJ, Mershon J, Disch D, Secrest RJ, Cummings
SR (2004) Continuing outcomes relevant to Evista: breast cancer incidence in postmenopausal
osteoporotic women in a randomized trial of raloxifene. J Natl Cancer Inst 96(23):1751–1761
Matheny M, Israni S, Ahmed M, Whicher D (2019) Artificial intelligence in health care: the hope,
the hype, the promise, the peril, NAM special publication. National Academy of Medicine,
Washington, DC
Moons KG, Kengne AP, Grobbee DE, Royston P, Vergouwe Y, Altman DG, Woodward M (2012a)
Risk prediction models: II. External validation, model updating, and impact assessment. Heart
98(9):691–698. https://doi.org/10.1136/heartjnl-2011-301247
Moons KG, Kengne AP, Woodward M, Royston P, Vergouwe Y, Altman DG, Grobbee DE (2012b)
Risk prediction models: I. Development, internal validation, and assessing the incremental value
of a new (bio)marker. Heart 98(9):683–690. https://doi.org/10.1136/heartjnl-2011-301246
Parker-Pope T (2007) Do pills have a place in cancer prevention? Wall Street J 2007:D1
Peto R, Gray R, Collins R, Wheatley K, Hennekens C, Jamrozik K, Warlow C, Hafner B,
Thompson E, Norton S et al (1988) Randomised trial of prophylactic daily aspirin in British
male doctors. Br Med J 296(6618):313–316. https://doi.org/10.1136/bmj.296.6618.313
Prentice RL, Caan B, Chlebowski RT, Patterson R, Kuller LH, Ockene JK, Margolis KL, Limacher
MC, Manson JE, Parker LM, Paskett E, Phillips L, Robbins J, Rossouw JE, Sarto GE, Shikany
JM, Stefanick ML, Thomson CA, Van Horn L, Vitolins MZ, Wactawski-Wende J, Wallace RB,
Wassertheil-Smoller S, Whitlock E, Yano K, Adams-Campbell L, Anderson GL, Assaf AR,
Beresford SA, Black HR, Brunner RL, Brzyski RG, Ford L, Gass M, Hays J, Heber D, Heiss G,
Hendrix SL, Hsia J, Hubbell FA, Jackson RD, Johnson KC, Kotchen JM, LaCroix AZ, Lane DS,
Langer RD, Lasser NL, Henderson MM (2006) Low-fat dietary pattern and risk of invasive
breast cancer: the Women’s Health Initiative Randomized Controlled Dietary Modification
Trial. JAMA 295(6):629–642
Qiao YL, Dawsey SM, Kamangar F, Fan JH, Abnet CC, Sun XD, Johnson LL, Gail MH, Dong ZW,
Yu B, Mark SD, Taylor PR (2009) Total and cancer mortality after supplementation with
vitamins and minerals: follow-up of the Linxian General Population Nutrition Intervention
Trial. J Natl Cancer Inst 101(7):507–518
Rossouw JE, Anderson GL, Prentice RL, LaCroix AZ, Kooperberg C, Stefanick ML, Jackson RD,
Beresford SA, Howard BV, Johnson KC, Kotchen JM, Ockene J (2002) Risks and benefits of
estrogen plus progestin in healthy postmenopausal women: principal results from the Women’s
Health Initiative randomized controlled trial. JAMA 288(3):321–333
Sargent DJ, Conley BA, Allegra C, Collette L (2005) Clinical trial designs for predictive marker
validation in cancer treatment trials. J Clin Oncol 23(9):2020–2027
Schatzkin A, Lanza E, Corle D, Lance P, Iber F, Caan B, Shike M, Weissfeld J, Burt R, Cooper MR,
Kikendall JW, Cahill J (2000) Lack of effect of a low-fat, high-fiber diet on the recurrence of
colorectal adenomas. Polyp Prevention Trial Study Group. N Engl J Med 342(16):1149–1155.
https://doi.org/10.1056/NEJM200004203421601
Shapiro S, Venet W, Strax P, Venet L, Roeser R (1985) Selection, follow-up, and analysis in the
Health Insurance Plan Study: a randomized trial with breast cancer screening. Natl Cancer Inst
Monogr 67:65–74
Steering Committee of the Physicians’ Health Study Research Group (1988) Preliminary report:
findings from the aspirin component of the ongoing Physicians’ Health Study. N Engl J Med
318(4):262–264. https://doi.org/10.1056/NEJM198801283180431
Steyerberg EW (2009) Clinical prediction models. A practical approach to development, validation,
and updating, Statistics for biology and health. Springer, New York. https://doi.org/10.1007/
978-0-387-77244-8
Stoll CRT, Izadi S, Fowler S, Philpott-Streiff S, Green P, Suls J, Winter AC, Colditz GA (2019)
Multimorbidity in randomized controlled trials of behavioral interventions: a systematic review.
Health Psychol 38(9):831–839. https://doi.org/10.1037/hea0000726
Tanvetyanon T, Bepler G (2008) Beta-carotene in multivitamins and the possible risk of lung cancer
among smokers versus former smokers: a meta-analysis and evaluation of national brands.
Cancer 113(1):150–157. https://doi.org/10.1002/cncr.23527
The Diabetes Prevention Program Research Group (1999) The Diabetes Prevention Program.
Design and methods for a clinical trial in the prevention of type 2 diabetes. Diabetes Care
22(4):623–634. https://doi.org/10.2337/diacare.22.4.623
The Lancet (2007) NCI and the STELLAR trial. Lancet 369(9580):2134. https://doi.org/10.1016/
S0140-6736(07)60987-8
US Food and Drug Administration (2021) Table of surrogate endpoints that were the basis of drug
approval or licensure. https://www.fda.gov/drugs/development-resources/table-surrogate-end
points-were-basis-drug-approval-or-licensure
Ware JH (2003) Interpreting incomplete data in studies of diet and weight loss. N Engl J Med
348(21):2136–2137. https://doi.org/10.1056/NEJMe030054
Ware JH, Hamel MB (2011) Pragmatic trials – guides to better patient care? N Engl J Med 364(18):
1685–1687. https://doi.org/10.1056/NEJMp1103502
Warner ET, Glasgow RE, Emmons KM, Bennett GG, Askew S, Rosner B, Colditz GA (2013)
Recruitment and retention of participants in a pragmatic randomized intervention trial at three
community health clinics: results and lessons learned. BMC Public Health 13:192. https://doi.
org/10.1186/1471-2458-13-192
Wolin KY, Steinberg DM, Lane IB, Askew S, Greaney ML, Colditz GA, Bennett GG
(2015) Engagement with eHealth self-monitoring in a primary care-based weight man-
agement intervention. PLoS One 10(10):e0140455. https://doi.org/10.1371/journal.pone.
0140455
Writing Group (1979) Five-year findings of the hypertension detection and follow-up program.
I. Reduction in mortality of persons with high blood pressure, including mild hypertension.
Hypertension Detection and Follow-up Program Cooperative Group. JAMA 242(23):
2562–2571
Writing Group (1998) Design of the Women’s Health Initiative clinical trial and observational
study. The Women’s Health Initiative Study Group. Control Clin Trials 19(1):61–109. https://
doi.org/10.1016/s0197-2456(97)00078-0
Yach D, Hawkes C, Gould CL, Hofman KJ (2004) The global burden of chronic diseases:
overcoming impediments to prevention and control. JAMA 291(21):2616–2622. https://doi.
org/10.1001/jama.291.21.2616
Yusuf S, Joseph P, Dans A, Gao P, Teo K, Xavier D, Lopez-Jaramillo P, Yusoff K, Santoso A,
Gamra H, Talukder S, Christou C, Girish P, Yeates K, Xavier F, Dagenais G, Rocha C,
McCready T, Tyrwhitt J, Bosch J, Pais P, International Polycap Study 3 Investigators (2021)
Polypill with or without aspirin in persons without cardiovascular disease. N Engl J Med 384(3):
216–228. https://doi.org/10.1056/NEJMoa2028220
Zelen M (1988) Are primary cancer prevention trials feasible? J Natl Cancer Inst 80:1442–1444
N-of-1 Randomized Trials
68
Reza D. Mirza, Sunita Vohra, Richard Kravitz, and Gordon H. Guyatt
Contents
Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1280
History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1281
Introduction: Why Conduct an N-of-1 RCT? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1281
Limitations of Informal Trials of Therapy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1282
How N-of-1 RCTs Address the Limitations of Informal Trials of Therapy . . . . . . . . . . . . . . . . . . . 1282
Five Reasons for Conducting N-of-1 RCTs to Improve Patient Care . . . . . . . . . . . . . . . . . . . . . . . . . 1282
N-of-1 RCTs Addressing Treatment Effects in a Group of Patients . . . . . . . . . . . . . . . . . . . . . . . . . . 1283
Determining Appropriateness for an N-of-1 RCT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1284
Designing an N-of-1 RCT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1285
Choosing an Outcome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1285
Trial Length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1286
Randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1287
Collaboration with Pharmacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1287
Advanced Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1287
Interpreting the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1287
Visual Inspection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1288
Nonparametric Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1288
Wilcoxon Signed Rank Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1288
Parametric Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1289
R. D. Mirza
Department of Medicine, McMaster University, Hamilton, ON, Canada
e-mail: [email protected]
S. Vohra
University of Alberta, Edmonton, AB, Canada
e-mail: [email protected]
R. Kravitz
University of California Davis, Davis, CA, USA
e-mail: [email protected]
G. H. Guyatt (*)
McMaster University, Hamilton, ON, Canada
e-mail: [email protected]
Abstract
Single-subject trials have a rich history in the behavioral sciences, but a much
more limited history in clinical medicine. This chapter deals with a particular
single-subject design, the N-of-1 randomized controlled trial (RCT). N-of-1
RCTs are single-patient multiple crossover studies of an intervention and
usually one comparator. Typically, patients undergo pairs of treatment periods;
random allocation determines the order of intervention and comparator arms
within each pair and patients and clinicians are ideally blind to allocation.
Patients and clinicians repeat pairs of treatment periods as necessary to
achieve a convincing result. In the medical sciences, N-of-1 RCTs have
seen limited use, in part due to lack of familiarity and feasibility concerns
that arise in day-to-day clinical practice. Investigators may carry out a number
of N-of-1 RCTs of the same intervention and comparator as part of a formal
research study, aggregating across N-of-1 RCTs to develop population esti-
mates. N-of-1 RCTs have demonstrated their utility in clarifying whether a
clinical intervention is effective or not. Although N-of-1 trials have the
potential for improving patient outcomes, the few small randomized trials
comparing N-of-1 to conventional care have not demonstrated important
benefits.
Keywords
N-of-1 · Single-patient trial · Randomized controlled trial · Crossover trial ·
Personalized medicine
Definition
This chapter deals with a particular type of single-participant experiment, the N-of-1
randomized controlled trial (RCT). N-of-1 RCTs are prospective, single-patient trials
with repeated pairs of intervention and comparator periods in which the order is
randomized and patients and clinicians are ideally blinded with respect to allocation.
We will describe the history of N-of-1 RCTs, as well as the indications, design,
interpretation, reporting, and associated ethical issues.
History
N-of-1 randomized controlled trials (RCTs) can be broken down into two major
categories depending on the underlying purpose. In one, the purpose is to improve
the care of individual patients by carrying out rigorous trials that leave patients and
clinicians confident that a particular treatment is, or is not, beneficial or harmful. By
ensuring applicability to the individual, N-of-1 RCTs represent the highest quality of
evidence.
The second reason for conducting N-of-1 RCTs is to determine the effect of an
intervention in a population. Conducting a series of N-of-1 RCTs allows investiga-
tors to provide an estimate of the proportion of patients who achieve an important
benefit, or who suffer troubling adverse effects, and thus establish the extent of
heterogeneity of response (Stunnenberg et al. 2018). Many patients and clinicians
considering the impact of a treatment are likely to find such a result more informative
than, for example, a mean effect.
How N-of-1 RCTs Address the Limitations of Informal Trials of Therapy
N-of-1 RCTs involve protection against the risks of bias that bedevil informal trials of
therapy. Choosing chronic, stable diseases attenuates the risk of conflating treatment
benefit and natural history. Blinding patients and clinicians to allocation to treatment
versus comparator minimizes biases related to expectation and placebo effects.
Multiple crossover periods control the risk of the misleading impact of transient
third variables, as well as effectively addressing natural history effects (i.e., it is very
unlikely that natural history will correspond closely to the institution and withdrawal
of a beneficial treatment).
from existing parallel group RCTs. A particular strength of N-of-1 RCTs is that they
can address the question of whether benefits extend to such individuals.
Third, some patients have symptoms that lack evidence-based management
options, or are refractory to standard medical management. Determined clinicians
may be tempted to trial off-label interventions to alleviate their patient’s suffering. In
these cases, an N-of-1 RCT allows for objective assessment of untested therapeutic
strategies.
Fourth, patients using a therapy with anticipated benefits may be experiencing
troubling symptoms for which the treatment they are using may, or may not, be
responsible. N-of-1 RCTs can provide definitive evidence confirming the culpability,
or exoneration, of the particular treatment (Joy et al. 2014).
Fifth, sometimes patients remain on a treatment for extended periods and it is
unclear whether there is any ongoing benefit. Given the rise of polypharmacy and the
increased recognition of its risks, the importance of reevaluating medications is
increasingly clear. RCTs often provide only limited data on the long-term efficacy
of a treatment. N-of-1 RCTs can clarify whether or not a medication is providing
ongoing benefit. A good use case is chronic proton pump inhibitor (PPI) therapy in
an asymptomatic patient with a history of gastroesophageal reflux disease.
Soon after their introduction into medicine, proponents suggested N-of-1 RCTs may
hold promise as a tool for efficient early drug development. The proposal addressed
three major questions faced by drug developers before engaging in large, costly
parallel group RCTs. First, does the drug in question show sufficient promise to
justify drug development? Second, what patient population will be most responsive
to the drug? Third, what is the optimal dose to maximize benefit and minimize
adverse effects?
In drug development, these questions are managed by using a combination of
small efficacy studies in conjunction with small studies using nonrepresentative
healthy volunteers examining safety, tolerance, pharmacology, and drug disposition.
The efficacy studies are often unblinded and uncontrolled, instead using historical
reference groups. The data from these studies are of limited value due to bias and
limited power. The problem manifests when trying to use the data during the design
of the first large parallel group RCT. Investigators are forced to gamble on the most
efficacious dose (or doses if they opt for multiple treatment arms) and which
population is most likely to benefit (Guyatt et al. 1990b).
N-of-1 RCTs allow for methodologically robust small-scale studies that can
address whether a drug shows sufficient promise to justify phase 3 testing, which
patient populations are most responsive, and which doses are optimal (classically
phase 1 questions). These principles are
demonstrated in an early N-of-1 RCT examining the role of amitriptyline in fibro-
myalgia (Guyatt et al. 1988). Low-dose amitriptyline is currently a first-line agent for
the treatment of fibromyalgia, but at the time of this N-of-1 RCT series – reported in
1988 – there was only one parallel group RCT suggesting benefit.
Determining Appropriateness for an N-of-1 RCT
For a patient to be deemed appropriate for an N-of-1 RCT, the clinical circumstances
must meet particular requirements. N-of-1 RCTs are useful when uncertainty exists
regarding treatment effect (either benefit or harm). Earlier in this chapter, we
provided examples of circumstances in which such uncertainty is likely to exist.
The N-of-1 RCT requires that specific clinical circumstances be met.
1. The outcome of interest (typically symptoms) should occur frequently, ideally daily.
Intervention period lengths must be tailored to outcome frequency. If the
outcome is infrequent, the requirement for treatment periods sufficiently long
for the outcome to be manifest may make the N-of-1 RCT excessively burden-
some for both patient and clinician. One exception is when treatments are
unusually expensive, in which case clinicians and patients may be particularly
motivated to complete the trial (Kravitz et al. 2008).
2. The condition should be chronic and stable.
Acute symptoms may represent transient conditions that are likely to resolve
spontaneously. By choosing a stable condition in terms of severity and symptoms,
clinicians reduce the random error that may make true treatment effects very
difficult to detect. Stability does not preclude frequently episodic conditions, such
as a child with multiple seizures a day.
3. Interventions should have rapid onset and termination of effect.
Rapid onset ensures that intervention periods can be a reasonable length. An
N-of-1 RCT with selective serotonin reuptake inhibitors would, for instance, be
prohibitively cumbersome given the 4–6 weeks required at a minimum for
treatment effect, and several weeks for tapering to discontinuation. If each
intervention period were 8 weeks in length and there were three pairs of crossover periods,
the total trial length would be at least 48 weeks (sufficient time for spontaneous
resolution of the condition).
Rapid termination of action ensures that treatments effects do not influence
comparator periods, without requiring washout periods. Typically, if there are
residual effects, the treatment periods are lengthened and the patient/physician
team considers only the data after resolution of effects. For instance, if one expects
treatment effects to persist for a week, treatment periods can continue for 2 weeks,
and one can use data only from the second week. Alternatively, a washout period
can be used as a buffer between periods to prevent carryover effects.
Designing an N-of-1 RCT
N-of-1 RCTs represent multiple crossover trials of an intervention and one or more
comparators that, to minimize risk of bias, include randomization in terms of
sequence order. Interventions are typically drugs – but may be nonpharmacologic
or complementary and alternative medicine – compared in one of three ways: drug
versus placebo, drug versus comparator drug, or high dose versus low dose of the
same drug. For optimal rigor, clinicians and patients must be blind to allocation.
Blinding is not always possible (e.g., physical therapy). N-of-1 trials are particularly
amenable to being codesigned by patient and clinician, including with regards to
outcome measure selection. Typically outcomes are symptoms monitored daily.
Some researchers choose to use physiologic and biochemical variables as outcomes,
but the value of such surrogates for inferring patient-important benefit is limited.
Pairs of periods, each pair including one period of the treatment and one of the
comparator, continue until both patient and clinician are satisfied that superiority or
equivalence has been demonstrated. A run-in period may be employed for the same reason as in
other trials: establishing dose tolerability and compliance.
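The within-pair randomization described here is simple to implement. Below is a minimal sketch in Python; the function name and the arm labels are illustrative only, not part of any published N-of-1 protocol.

```python
import random

def nof1_schedule(n_pairs, seed=None):
    """Generate an N-of-1 allocation schedule: within each pair of
    treatment periods, random allocation determines whether the
    intervention (A) or the comparator (B) comes first."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_pairs):
        pair = ["A", "B"]
        rng.shuffle(pair)  # a coin flip for the order within this pair
        schedule.extend(pair)
    return schedule

# Example: a trial with three pairs of treatment periods
print(nof1_schedule(n_pairs=3, seed=1))
```

In practice the pharmacy, rather than the clinician, would ordinarily hold the schedule so that patient and clinician remain blind to allocation.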
Choosing an Outcome
Physiologic outcomes were used in 35% of trials, including clinical tests such as blood
pressure or laboratory tests such as erythrocyte sedimentation rate (Punja et al.
2016). A single N-of-1 RCT can address more than one outcome. Regardless of
the outcome(s) chosen, clinicians should work with patients to identify patient-
important targets prior to starting the trial.
As mentioned earlier in the chapter, the outcome measure is ideally one that can
be measured frequently (e.g., daily) to ensure that there is enough data to analyze
within an intervention period (typically 5–14 days). Physiologic parameters should
be ones that patients can measure themselves at their convenience, such as blood
pressure or blood sugar concentrations. Automated tracking using cell phones and
other monitoring devices is likely to prove increasingly useful for outcome mon-
itoring (Ryu 2012; Kravitz et al. 2018).
Likert scales are widely used outside of N-of-1 research for their simplicity,
allowing for patient familiarity, ease in understanding, and ease in interpretation.
There is evidence to suggest that seven-point scales are more sensitive in detecting
small differences in comparison to fewer response options, and are more convenient
than visual analogue scales (Guyatt et al. 1987; Girard and Ely 2008). If using a
Likert scale to assess symptoms, clinicians should consider including items specif-
ically assessing symptom interference with daily activities. An example of how one
might phrase this is presented below.
Please indicate how much your pain interferes with your everyday activities of
daily living, such as cooking, cleaning, and getting dressed:
1. No interference at all
2. A little interference
3. Some interference
4. Moderate interference
5. Much interference
6. Severe interference
7. I am unable to carry out these activities as a result of the interference
Trial Length
The duration of an N-of-1 RCT will depend on the number of days in each treatment
period, and the number of pairs of periods undertaken. Most often the trial addresses
a single intervention and a single comparator. Typically each treatment period will
range between 5 and 14 days (median: 10 days), the interquartile range of all
captured N-of-1 RCTs in the aforementioned systematic survey by Punja and
colleagues (Punja et al. 2016). In terms of the number of pairs of treatment periods
(one period in which the patient receives the intervention and one with the
comparator), 75% of trials in Punja's survey required between 2 and 5 pairs of
treatment periods (median: 3 pairs).
Based on these numbers, a typical N-of-1 RCT with a pair of treatment periods
will take 20 days (10 days for each of the two arms); with 3 such pairs, the total
duration would be 60 days.
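As a trivial check of this arithmetic, a one-line helper (illustrative only) computes total duration from period length and number of pairs:

```python
def total_trial_days(period_days, n_pairs):
    # Each pair contributes one intervention and one comparator period.
    return 2 * period_days * n_pairs

print(total_trial_days(period_days=10, n_pairs=1))  # 20 days for one pair
print(total_trial_days(period_days=10, n_pairs=3))  # 60 days for three pairs
```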
Randomization
N-of-1 RCTs at their most rigorous are blinded to protect against the bias of patient
and clinician expectations, co-interventions, and placebo effects. To blind effectively
and efficiently, physicians should and often do collaborate with a local pharmacy to
prepare treatments and comparators that are identical in appearance, taste, texture,
and smell. Pharmacists can achieve this goal by crushing the active drugs and
repackaging in capsules. Placebos can be filled with an inert substance.
Collaboration with Pharmacy
Pharmacists can also play a number of other important roles in N-of-1 RCTs.
They can provide input on drug half-life, and thus help determine whether washout
is needed and how long each treatment period needs to be. Certainly, with their
increased scope of practice, Doctors of Pharmacy in particular can design, conduct,
and interpret N-of-1 trials.
Pharmacy technicians can also help, particularly with monitoring drug compliance
by conducting pill counts and assessing whether patients are refilling their medica-
tions at the correct time.
Advanced Techniques
Interpreting the Data
There are a number of options for interpreting the data that depend on the goals of
the trial, the trial design, and the data generated. Broadly, these can be broken
down into statistical methods, which were used in 84% of the trials that Punja
reported; in the other 16%, clinicians and patients used visual inspection alone
(Punja et al. 2016).
Visual Inspection
Using visual inspection of the data, clinicians and patients examine a graph
displaying repeated measures of the outcome of interest with specification of
intervention and control arms. The features suggesting an arm is effective
include: 1) minimal variability within periods; 2) consistency in the magnitude and
direction of the difference between the arm of interest and the comparator arm; and
3) a difference between the arm of interest and its comparator that is large in
comparison to the variability within periods. Review of the evidence collected
after 2 or more pairs of periods can help determine whether to conduct further
pairs.
The rationale for visual inspection is that both clinician and patient can intuitively
assess the components of efficacy – direction, magnitude, and consistency of effect –
in a straightforward manner that may satisfy both and simplify decision-making. The
limitation is the subjective nature of the assessment that can lead to inconsistent and
incorrect inferences. This methodology is appropriate only for individual patient
clinical decision-making rather than using the N-of-1 methodology to make infer-
ences about treatment effects in a population.
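As an illustration of the kind of display used for visual inspection, the sketch below plots hypothetical daily symptom scores by period using matplotlib; the data, labels, and shading are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical daily symptom scores (7-point scale, higher is better)
# for three pairs of 7-day periods.
periods = [
    ("Treatment", [5, 6, 5, 6, 6, 5, 6]),
    ("Placebo", [3, 4, 3, 3, 4, 3, 3]),
    ("Placebo", [4, 3, 3, 4, 3, 4, 3]),
    ("Treatment", [6, 5, 6, 6, 5, 6, 6]),
    ("Treatment", [5, 6, 6, 5, 6, 5, 6]),
    ("Placebo", [3, 3, 4, 3, 3, 4, 3]),
]

day = 0
for label, scores in periods:
    days = range(day, day + len(scores))
    color = "tab:blue" if label == "Treatment" else "tab:gray"
    plt.plot(days, scores, marker="o", color=color)
    # Shade each period so within- and between-period variability
    # can be judged at a glance.
    plt.axvspan(day - 0.5, day + len(scores) - 0.5, color=color, alpha=0.1)
    day += len(scores)

plt.xlabel("Study day")
plt.ylabel("Symptom score (1-7)")
plt.title("N-of-1 RCT: repeated measures by period")
plt.show()
```

In a display like this, consistently higher, tightly clustered scores during the treatment periods correspond to the three features listed above.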
Nonparametric Statistical Tests
Broadly speaking, nonparametric tests are those that do not assume the data are
normally distributed; this makes them more conservative. A number of
nonparametric statistical tests are available. We will focus on the Wilcoxon signed
rank test and a quantitative randomization test.
Wilcoxon Signed Rank Test
The Wilcoxon signed rank test incorporates the direction and relative size of the
treatment differences but does not take their absolute magnitude into account. To
conduct a Wilcoxon signed rank test, the differences within treatment pairs are
ranked by absolute value from smallest to largest (i.e., independent of direction).
The sum of the ranks favoring the treatment is then compared to the sum of the
ranks favoring the comparator; under the null hypothesis, the sums are expected to
be equivalent.
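A sketch of the computation using SciPy, with hypothetical mean symptom scores per period for four treatment pairs (higher scores indicating better function):

```python
from scipy import stats

# One summary value per period; entries at the same index form a pair.
treatment = [5.7, 5.9, 5.4, 6.0]
placebo = [3.4, 3.6, 3.9, 3.3]

# The test ranks the within-pair differences by absolute value and
# compares the sums of ranks favoring each arm.
result = stats.wilcoxon(treatment, placebo)
print(result.statistic, result.pvalue)
```

With only four pairs the smallest attainable two-sided p-value is 0.125, which illustrates why a convincing result may require additional pairs of periods.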
More sophisticated than either of the two previous tests is a pure quantitative
randomization test. This approach assesses not only the direction and relative size
of differences between arms, but also the mean treatment difference. The probability
of a given mean treatment difference is calculated as the proportion of all possible
randomization sequences that would lead to a result at least as extreme as the one
observed. The null hypothesis for this test states that the expected mean treatment
difference is zero.
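A minimal sketch of an exact randomization test, assuming the order within each pair was decided by a fair coin flip so that, under the null hypothesis, each within-pair difference is equally likely to carry either sign (the data are hypothetical):

```python
from itertools import product

def randomization_test(diffs):
    """Two-sided exact randomization test on the mean of the
    within-pair differences: enumerate all possible sign assignments
    and count those at least as extreme as the observed mean."""
    observed = abs(sum(diffs)) / len(diffs)
    extreme = 0
    total = 0
    for signs in product([1, -1], repeat=len(diffs)):
        mean = abs(sum(s * d for s, d in zip(signs, diffs))) / len(diffs)
        extreme += mean >= observed
        total += 1
    return extreme / total

# Within-pair differences (treatment minus placebo) from four pairs
print(randomization_test([2.3, 2.3, 1.5, 2.7]))  # 2/16 = 0.125
```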
Parametric Statistical Tests
Parametric statistical tests, by contrast, assume the data are normally distributed. The
two most commonly used are the analysis of variance (ANOVA) and the Student's
t-test, which is a special case of the ANOVA model. Two factors help guide which
test to use. First, if the trial under consideration compares three or more treatment
arms, then ANOVA is the preferred approach: it allows a single overall analysis
(F-test) across arms, which the t-test does not.
Student’s T-Test
The t-test is only appropriate for N-of-1 RCTs comparing two arms, regardless of
whether the comparator is placebo or an alternative treatment. In the general case, the
Student's t-test can be either paired or unpaired, but in an N-of-1 RCT each
intervention arm is paired by design, and therefore the paired t-test constitutes the
appropriate approach. To conduct the paired t-test, one calculates a single value for
each treatment period. So, for instance, if the patient has completed a daily diary for
7 days, and each day has answered three questions, the score for that period will be
the mean of 21 observations. One then makes the same calculation for the paired
control period and examines the difference in means, which the t-test addresses. The
degrees of freedom for the test are the number of blocks of treatment periods minus
one. The t-test is routinely used for N-of-1 RCTs and is universally included in
statistical packages.
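A sketch of the calculation with SciPy, using one summary value per period as described above (the values are hypothetical; with four pairs the test has 3 degrees of freedom):

```python
from scipy import stats

# One mean score per treatment period and per paired control period
# (each value might itself be the mean of 21 diary observations).
treatment = [5.7, 5.9, 5.4, 6.0]
control = [3.4, 3.6, 3.9, 3.3]

# Paired t-test on the within-pair differences
result = stats.ttest_rel(treatment, control)
print(result.statistic, result.pvalue)
```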
ANOVA
Often the student’s t-test functions as an extension of the ANOVA model and will
provide the same result. There is at least one case in which this is not true. ANOVA
may provide a different result when there is no dependency between one observation
and the next. Under such circumstances the ANOVA can use each individual
observation (in the example above, 7 instead of one observation per period).
Unfortunately, independence of observations will rarely if ever be the case. For
most illnesses, good days tend to run together, as do bad ones.
the analysis should consider the possibility of carryover of treatment effect between
periods. This is true of any crossover trial analysis.
The second method is using conventional meta-analysis techniques. (See
Chap. 8.11 to learn more about meta-analysis.) There are at least two benefits in
meta-analyzing N-of-1 RCTs. The first is to generate more precise estimates of
treatment effects and of predictors of patient response versus nonresponse, sustained
response, and susceptibility to side effects (Lillie et al. 2011). Second, in trials where
N-of-1 methodology is compared to standard of care, meta-analysis can assess
whether the N-of-1 methodology provides benefit over traditional clinical care.
The third method is Bayesian analysis that has been adapted specifically for use in
N-of-1 RCTs (Zucker et al. 2010). What distinguishes Bayesian analysis from other
forms of aggregation is the requisite incorporation of preexisting estimates into the
analysis. Typically, such analyses require prespecification of the population mean
effect and its variance. This is a liability in the many cases where this information is
neither available in the literature nor easily estimated.
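The conjugate normal-normal update below is a minimal sketch of the kind of prespecification described here; it is not the specific method of Zucker et al. (2010), and the prior mean, prior variance, and trial estimate are all invented for illustration.

```python
def posterior(prior_mean, prior_var, est_effect, est_var):
    """Combine a prespecified prior on the population mean treatment
    effect with one N-of-1 trial's estimated effect and its variance."""
    w = prior_var / (prior_var + est_var)  # weight on the new estimate
    post_mean = (1 - w) * prior_mean + w * est_effect
    post_var = 1 / (1 / prior_var + 1 / est_var)  # precisions add
    return post_mean, post_var

# Prior: modest expected benefit; data: one patient's observed effect
print(posterior(prior_mean=1.0, prior_var=0.5, est_effect=2.2, est_var=0.3))
```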
The previous sections have focused on the use of N-of-1 RCTs for either clinical
practice or as part of a research endeavor. If the latter, the issue of how one reports
N-of-1 RCTs for a wider audience (typically in a publication) arises.
As for other types of trials, reporting standards for N-of-1 RCTs exist, maintained
by the Consolidated Standards of Reporting Trials (CONSORT) group. To ensure
optimal reporting for N-of-1 RCTs, CONSORT has published a standardized
25-item checklist (CENT), most recently updated in 2015 (Vohra et al. 2015). The
recommendations were based on CONSORT recommendations for parallel group
RCTs, and address reporting expectations of title and abstract, specifying rationale
and objectives in the introduction, trial design, patient selection, intervention, out-
comes, sample size, randomization, allocation concealment, results,
analyses, discussion, protocol registration, and funding.
Ethics
The following is an example of one of the early N-of-1 RCTs conducted as part of
McMaster’s clinical service. A 34-year-old female with a past medical history
significant for scleroderma was referred for evaluation of treatment for persistent
weakness, in the context of possible myasthenia gravis. Two separate encounters
with specialists revealed electromyographic findings atypical for the disease, and
so the question of whether treatment with pyridostigmine would be of benefit
remained uncertain. This case met our criteria: symptoms occurred daily, the disease
was chronic and stable, there was uncertainty about therapeutic benefit, and
pyridostigmine has rapid onset and termination of effect.
The intervention was pyridostigmine 30 mg by mouth twice daily and was
placebo controlled. Each treatment period was 7 days, and the outcome measure
was daily ratings of weakness and energy levels.
Figure 1 represents the patient’s reported data using a seven-point Likert scale,
where 7 represents the highest level of function, and 1 represents the lowest level of
function. There were four pairs of treatment periods. Unsurprisingly, the patient did
not have 100% adherence to symptom charting.
Visual inspection reveals that the treatment appears consistently better than the
placebo. This is particularly clear in Figs. 2 and 3, which show the mean symptom
score in each treatment period and the differences within each pair, respectively.
A two-tailed paired t-test comparing the differences in symptom scores across the
four pairs of periods was conducted to confirm the visual inspection. The results,
shown in Fig. 4, confirm a clear benefit of treatment.
N-of-1 RCTs are unique among experimental studies in giving physicians the ability
to answer clinical questions for individual patients in a methodologically rigorous
way. Other designs, including parallel group RCTs, observational studies, and meta-
analyses, are all limited to answering questions at the population level. For this
reason, N-of-1 trials have been suggested as the pinnacle of the evidence pyramid.
Aggregation of N-of-1 RCTs by meta-analysis and Bayesian techniques allows for
treatment effect estimates at the population level. By conducting an N-of-1 RCT,
physicians are afforded the opportunity to offer optimal care to the individual
patients whom they serve.
Key Facts
• N-of-1 RCTs are single-patient multiple crossover trials that seek to answer a
clinical question and improve patient care. Multiple N-of-1 RCTs can also inform
treatment effects in a population.
• N-of-1 RCTs constitute the highest quality evidence for a particular patient’s care,
because the evidence is specific to the individual patient (OCEBM Levels of
Evidence Working Group 2011).
• Varied analytic techniques can inform the interpretation of N-of-1 RCTs including
nonstatistical techniques (i.e., visual inspection) and statistical techniques includ-
ing both nonparametric and parametric tests.
Cross-References
▶ Introduction to Meta-Analysis
References
Barsky AJ, Saintfort R, Rogers MP, Borus JF (2002) Nonspecific medication side effects and the
nocebo phenomenon. JAMA 287:622–627
Duan N, Kravitz RL, Schmid CH (2013) Single-patient (n-of-1) trials: a pragmatic clinical decision
methodology for patient-centered comparative effectiveness research. J Clin Epidemiol 66:S21–
S28. https://doi.org/10.1016/j.jclinepi.2013.04.006
Girard TD, Ely EW (2008) Delirium in the critically ill patient. Handb Clin Neurol 90:39–56.
https://doi.org/10.1016/S0072-9752(07)01703-4
Guyatt G, Sackett D, Taylor DW et al (1986) Determining optimal therapy — randomized trials in
individual patients. N Engl J Med 314:889–892. https://doi.org/10.1056/NEJM198604033141406
Guyatt GH, Townsend M, Berman LB, Keller JL (1987) A comparison of Likert and visual
analogue scales for measuring change in function. J Chronic Dis 40:1129–1133
Guyatt G, Sackett D, Adachi J, Roberts R, Chong J, Rosenbloom D, Keller J (1988) A clinician's
guide for conducting randomized trials in individual patients. CMAJ 139(6):497–503
Guyatt GH, Keller JL, Jaeschke R, Rosenbloom D, Adachi JD, Newhouse MT (1990a) The n-of-1
randomized controlled trial: clinical usefulness: our three-year experience. Ann Intern Med
112(4):293–299
Guyatt GH, Heyting A, Jaeschke R et al (1990b) N of 1 randomized trials for investigating new
drugs. Control Clin Trials 11:88–100
Irwig L, Glasziou P, March L (1995) Ethics of n-of-1 trials. Lancet (Lond) 345:469
Joy TR, Monjed A, Zou GY et al (2014) N-of-1 (single-patient) trials for statin-related myalgia. Ann
Intern Med 160:301–310. https://doi.org/10.7326/M13-1921
Kazdin A (2011) Single-case research designs: methods for clinical and applied settings, 2nd edn.
Oxford University Press, New York
Kratochwill TR (ed) (2013) Single subject research: strategies for evaluating change. Academic
Press
Kravitz RL, Duan N, White RH (2008) N-of-1 trials of expensive biological therapies: a third way?
Arch Intern Med 168:1030–1033. https://doi.org/10.1001/archinte.168.10.1030
Kravitz R, Duan N, Eslick I et al (2014) Design and implementation of N-of-1 trials: a user’s guide.
Agency for Healthcare Research and Quality, US Department of Health and Human Services
(2014). 540 Gaither Road. Rockville, MD 20850 www.ahrq.gov
Kravitz RL, Schmid CH, Marois M et al (2018) Effect of mobile device–supported single-patient
multi-crossover trials on treatment of chronic musculoskeletal pain. JAMA Intern Med. https://
doi.org/10.1001/jamainternmed.2018.3981
Lillie EO, Patay B, Diamant J et al (2011) The n-of-1 clinical trial: the ultimate strategy for
individualizing medicine? Per Med 8:161–173. https://doi.org/10.2217/pme.11.7
Mirza RD, Punja S, Vohra S, Guyatt G (2017) The history and development of N-of-1 trials. J R Soc
Med 110:330–340. https://doi.org/10.1177/0141076817721131
Molloy DW, Guyatt GH, Wilson DB et al (1991) Effect of tetrahydroaminoacridine on cognition,
function and behaviour in Alzheimer’s disease. CMAJ 144:29–34
Nonoyama ML, Brooks D, Guyatt GH, Goldstein RS (2007) Effect of oxygen on health quality of
life in patients with chronic obstructive pulmonary disease with transient exertional hypoxemia.
Am J Respir Crit Care Med 176:343–349. https://doi.org/10.1164/rccm.200702-308OC
OCEBM Levels of Evidence Working Group (2011) The Oxford 2011 Levels of Evidence. Oxford
Centre for Evidence-Based Medicine. https://www.cebm.net/index.aspx?o=5653
Punja S, Eslick I, Duan N, Vohra S, the DEcIDE Methods Center N-of-1 Guidance Panel (2014) An
ethical framework for N-of-1 trials: clinical care, quality improvement, or human subjects
research? In: Kravitz RL, Duan N (eds), and the DEcIDE Methods Center N-of-1 Guidance
Panel (Duan N, Eslick I, Gabler NB, Kaplan HC, Kravitz RL, Larson EB, Pace WD, Schmid
CH, Sim I, Vohra S). Design and implementation of N-of-1 trials: a user’s guide. AHRQ
Publication No. 13(14)-EHC122-EF. Agency for Healthcare Research and Quality, Rockville,
Chapter 2, pp. 13–22, January 2014. http://www.effectivehealthcare.ahrq.gov/N-1-Trials.cfm
Punja S, Bukutu C, Shamseer L et al (2016) N-of-1 trials are a tapestry of heterogeneity. J Clin
Epidemiol 76:47–56. https://doi.org/10.1016/j.jclinepi.2016.03.023
Ryu S (2012) Book review: mHealth: new horizons for health through Mobile technologies: based
on the findings of the second global survey on eHealth (global observatory for eHealth series,
volume 3). Healthc Inform Res 18:231. https://doi.org/10.4258/hir.2012.18.3.231
Shamseer L, Sampson M, Bukutu C, Schmid C (2015) CONSORT extension for reporting N-of-1
trials (CENT) 2015: explanation and elaboration. BMJ 350:h1793
Stunnenberg BC, Raaphorst J, Groenewoud HM et al (2018) Effect of Mexiletine on muscle
stiffness in patients with nondystrophic Myotonia evaluated using aggregated N-of-1 trials.
JAMA 320:2344. https://doi.org/10.1001/jama.2018.18020
Vohra S, Shamseer L, Sampson M et al (2015) CONSORT extension for reporting N-of-1 trials
(CENT) 2015 statement. BMJ 350:h1738. https://doi.org/10.1136/BMJ.H1738
Zucker DR, Ruthazer R, Schmid CH (2010) Individual (N-of-1) trials can be combined to give
population comparative treatment effect estimates: methodologic considerations. J Clin
Epidemiol 63:1312–1323. https://doi.org/10.1016/j.jclinepi.2010.04.020
Noninferiority Trials
69
Patrick P. J. Phillips and David V. Glidden
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1298
Hypotheses and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1299
Motivation for NI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1300
Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1300
DISCOVER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1301
STREAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1301
Defining Margin of NI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1302
The 95/95 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1302
Combination Therapies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1303
Public Health Clinical Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1304
Design and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1305
Sample Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1305
How to Design a NI Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1306
Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1308
Choice of Analysis Populations and Estimands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1310
Further Challenges Unique to NI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1311
Assay Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1311
Effect Preservation in Determining the NI Margin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1312
Two Sides of the Same Coin: Superiority Versus NI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1313
Testing for Noninferiority and Superiority in the Same Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1314
Sensitivity of Trial Results to Arbitrary Margin and Control Arm Event Rate . . . . . . . . . . . 1314
Justification of Margin in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1315
Interim Analyses and Data and Safety Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1315
P. P. J. Phillips (*)
UCSF Center for Tuberculosis, University of California San Francisco, San Francisco, CA, USA
Department of Epidemiology and Biostatistics, University of California San Francisco, San
Francisco, CA, USA
e-mail: [email protected]
D. V. Glidden
Department of Epidemiology and Biostatistics, University of California San Francisco, San
Francisco, CA, USA
e-mail: [email protected]
Abstract
In this chapter we provide an overview of non-inferiority trials. We first introduce
two motivating examples and describe scenarios for when a non-inferiority trial is
appropriate. We next describe the procedures for defining the margin of non-
inferiority from both regulatory and public health perspectives and then provide
practical guidance for how to design a non-inferiority trial and analyze the
resulting data, paying particular attention to regulatory and other published
guidelines. We go on to discuss particular challenges unique to non-inferiority
trials including the importance of assay sensitivity, the enigma of effect preser-
vation, switching between non-inferiority and superiority, the interpretation of
results when event rate assumptions are incorrect, and the place of interim
analyses and safety monitoring. We conclude the chapter by addressing alterna-
tive methodologies and innovative perspectives on non-inferiority trials that have
been proposed in an attempt to mitigate these challenges, including Bayesian
approaches, alternative three-arm and pragmatic designs, and methods that
address different treatment durations and the averted infections ratio.
Keywords
Noninferiority · Margin of noninferiority · Assay sensitivity · Effect
preservation · Active control trial · Biocreep
Introduction
Historically, NI trials were a subset of equivalence trials which had the objective
of showing that an investigational intervention was not much worse and not much
better than a control intervention (Wellek 2010). In practice, the dual objectives of
equivalence are less relevant to randomized clinical trials of interventions to improve
human health, apart from studies to demonstrate the bioequivalence of two pharma-
ceutical agents, and this chapter therefore relates exclusively to the NI trial design.
This chapter describes aspects related to the design, conduct, analysis, and interpre-
tation of NI trials, although one could extend many of these ideas to equivalence
trials if needed.
The most common NI trial design is a two-arm trial where the internal comparator
is an active control intervention which usually reflects a standard of care treatment,
and the focus of this chapter is therefore on this two-arm design; other variations on
this design are addressed in section “Alternative Analyses and Designs and Innova-
tive Perspectives on NI Trials.”
Hypotheses and Notation
The most common primary efficacy outcome of a clinical trial relates to the occur-
rence or nonoccurrence of an event of interest, e.g., death, failure, cure, or stable
culture conversion. We therefore consider a treatment effect of the form
θEC = pE − pC, where θEC is the true treatment effect and pE and pC are
the proportions of participants with the event on the investigational and control arms,
respectively (the former is sometimes called the experimental arm), and where the
difference might be calculated on the linear scale (for a risk difference) or the log
scale (for a risk ratio). Although this convention is used here, the discussion in this
chapter can easily be extended to NI trials with other types of primary outcomes,
such as continuous or ordinal.
The one-sided null and alternative hypotheses for a superiority and NI trial are
shown in Table 1. For simplicity, and without loss of generality, we take a negative θEC to correspond to a beneficial effect of the investigational intervention on the outcome of interest (e.g., a reduction in mortality or an increase in cure), and therefore δ > 0; we use this convention throughout the chapter.
In setting the hypotheses of a NI trial alongside those of a superiority trial, the
only difference is in changing the number on the right-hand side of the equations
from 0 to δ; the hypotheses otherwise stay the same. In superiority trials, a minimum
treatment effect that has some analogy to the margin of noninferiority is used for
sample size and power calculations but not for hypothesis testing (see section “Two
Sides of the Same Coin: Superiority Versus NI”).
Table 1 A comparison of null and alternative hypotheses for superiority and NI trials

                           Superiority comparison    NI comparison
Treatment effect measure   pE − pC = θEC             pE − pC = θEC
Null hypothesis            H0: θEC = 0               H0: θEC = δ (δ > 0)
Alternative hypothesis     H1: θEC < 0               H1: θEC < δ
Motivation for NI
NI trials arise as an option in settings where there are one or more effective
treatments for a condition. Typically, a new product is being developed because
there is some hope that it offers superior efficacy, a better safety profile, simpler
administration, lower cost, or other advantages. This new product should be evalu-
ated in a randomized clinical trial. If there is an established treatment for the
condition, it is usually unethical to randomize trial participants to no treatment
(placebo), and an active control design must be adopted. There may be settings
where the condition under study is transient and not serious and a placebo could be
justified.
In either case, the candidate regimen needs to be evaluated in the context of
other effective regimens. The FDA guidance (Food and Drug Administration
Center for Drug Evaluation and Research (CDER) 2016) for such settings lays
out three major alternatives: a study which examines the incremental value of the
new therapy combined with established standard of care compared to standard of
care alone, a placebo-controlled trial of the new therapy among those who are not
candidates for the current standard of care, or an active controlled trial which
randomizes participants among the standard of care regimen and the candidate
intervention. When the first two options are not feasible or ethical, then the third
option, the NI trial, is used.
Because the new regimen may offer substantial advantages over the current
standard of care, it can be enough to show that the new regimen is comparably
effective. The new regimen may not be superior in efficacy, but we want to avoid
introducing it if it is unacceptably worse than the standard of care.
This study objective gives rise to the NI trial design. Specifically, our trial would
have two objectives: to support our claim that the new regimen is superior to the
withholding of the standard of care and that it is not meaningfully less effective than
the standard of care. A major issue here is the choice of the standard of care arm.
NI studies formalize these standards by establishing a margin of NI in a formal
statistical framework to determine when these two objectives are met.
Case Studies
Two case studies are used to illustrate various aspects of NI trials throughout this
chapter and are introduced here.
DISCOVER
Nearly 1.7 million people are infected with HIV yearly, and no vaccine is currently
available. However, there are abundant safe and potent medications which can
suppress HIV replication. In this context, the paradigm of HIV pre-exposure pro-
phylaxis (PrEP) developed. PrEP involves using anti-HIV medication to prevent
HIV acquisition in an HIV negative person. Several randomized trials showed that
daily use of emtricitabine and tenofovir disoproxil fumarate (F/TDF) was a highly
effective PrEP regimen. There is a vigorous pipeline for the development of anti-
HIV drugs and/or delivery systems (e.g., long-acting injection) as candidates
for PrEP.
The DISCOVER study (Mayer et al. 2020) was a randomized double blind active
controlled trial evaluating the efficacy of daily oral emtricitabine and tenofovir
alafenamide (F/TAF) for PrEP. The trial’s primary objective was to show that,
among adults at high risk of acquiring HIV, F/TAF was effective in preventing
incident HIV infection. With a proven, safe, effective, and available regimen (F/TDF),
it is no longer ethical to evaluate future PrEP candidates in trials with a placebo
control. Given that F/TDF is highly effective, it was considered unlikely that F/TAF
would be superior in preventing HIV infections. Instead, the major motivation for the
adoption of F/TAF is that the reformulation should have lesser subclinical effects of
tenofovir on kidney function and bone density. This led the investigators to adopt a NI
objective.
Participants took two pills daily: F/TAF (or matching placebo) and F/TDF
(or matching placebo). HIV infection was diagnosed in 7 and 15 participants on
the F/TAF and F/TDF arms, respectively, yielding a relative incidence of 0.47 (95%
CI: 0.19–1.15). Since the 95% confidence interval excluded the prespecified NI
margin of 1.62, NI was concluded.
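The arithmetic behind this conclusion can be sketched in a few lines of Python. The event counts and margin are from the published report; the assumption of approximately equal person-time per arm is an approximation introduced here (the trial analysis used exact person-years of follow-up), so the interval should be read as illustrative.

```python
import math

# HIV infections on each arm (DISCOVER; Mayer et al. 2020)
events_ftaf, events_ftdf = 7, 15
margin = 1.62  # prespecified NI margin on the incidence rate ratio

# Assuming (approximately) equal person-time per arm, the rate ratio
# reduces to the ratio of event counts
irr = events_ftaf / events_ftdf
se_log_irr = math.sqrt(1 / events_ftaf + 1 / events_ftdf)  # Poisson approximation

z = 1.959964  # two-sided 95% confidence interval
lo = math.exp(math.log(irr) - z * se_log_irr)
hi = math.exp(math.log(irr) + z * se_log_irr)

print(f"IRR = {irr:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
# IRR = 0.47, 95% CI (0.19, 1.15): upper bound < 1.62, so NI is concluded
print("NI demonstrated" if hi < margin else "NI not shown")
```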
STREAM
Tuberculosis kills more people than any other single pathogen: in 2019, 1.2 million
people died from tuberculosis and there were 10 million new cases (World Health Organiza-
tion 2020). When the bacteria develop resistance to the main drug, rifampicin, there
are few treatment options, although new drugs are in development. STREAM Stage
1 was a phase III trial conducted to evaluate a novel 9–11-month regimen for the
treatment of rifampicin-resistant TB and was the first phase III trial to specifically
evaluate any regimen for rifampicin-resistant TB (Nunn et al. 2014). A second stage
of the trial was conducted including regimens with new drugs; for the purposes of
this chapter, we refer to STREAM Stage 1 when referring to the STREAM trial. At
the time that the trial was designed, the standard of care, as recommended in WHO
2011 guidelines (World Health Organization 2011), included a cocktail of 4–7 drugs
given for at least 18 months. A series of nonrandomized interventional cohort studies
in Bangladesh (Van Deun et al. 2010) had found that this 9–11-month regimen
resulted in low rates of treatment failure and relapse, with adequate safety. It was
calculated that the cost of drugs alone for this regimen was approximately USD 270
(Van Deun et al. 2010), only one-tenth of the cost of drugs in the
WHO-recommended regimen (Floyd et al. 2012). Given these and other benefits
to patients and the health system of reducing treatment duration by half, STREAM
was designed as a NI trial. The primary efficacy outcome was a favorable status at
132 weeks, defined by cultures negative for Mycobacterium tuberculosis at
132 weeks and at a previous occasion, with no intervening positive culture or
previous unfavorable outcome. The margin of NI was 10%, with the primary
analysis being a calculation of the absolute risk difference in proportion favorable.
There were 424 participants randomized into the trial (Nunn et al. 2019), with twice
as many allocated to the intervention regimen in order to collect more safety data on
the new regimen. NI was demonstrated in both coprimary modified intention
to treat (mITT) and per protocol (PP) analysis populations. WHO guidelines for the
treatment of rifampicin-resistant TB were changed to include the STREAM regimen
while the trial was ongoing based on external observational cohort data but were
subsequently changed to remove this regimen as a recommended regimen, despite
the trial results, due to concerns with the injectable agent included in the regimen
(World Health Organization 2019).
Defining Margin of NI
Translating a difference between the investigational and control arms into a statement about the former’s effect against no treatment requires a working estimate, θCP, of the control effectiveness compared to no treatment in the current trial.
The 95/95 Method
The 95/95 method (Rothmann and Tsou 2003) uses a meta-analysis of studies of
the control regimen as the starting point for such an estimate – ideally randomized
placebo-controlled trials of the control. From the meta-analysis, θCP is taken as the
upper bound of a two-sided 95% confidence interval (in a setting where θCP > 0
indicates a treatment benefit of the control, the lower bound would be taken). Taking
the value closer to the null than the point estimate, so-called discounting, introduces
conservatism. For example, the DISCOVER trial needed to estimate the effectiveness of F/TDF compared to placebo and used a meta-analysis of three placebo-controlled trials in similar populations to derive a pooled estimate of the (log) relative hazard, the upper bound of whose 95% confidence interval gives θCP = −0.96.
The M1 margin is synonymous with a test of H0: θEP = 0, which translates to H0: θEC + θCP = 0. Thus, the M1 margin compares the treatment contrast between investigational and control against a null of −θCP. In the case of DISCOVER, the M1 margin would be ruling out a log-relative hazard of F/TAF compared to F/TDF > 0.96 (HR > 2.62).
The M2 margin is derived as a tighter (more conservative) margin which ensures that some proportion of the control treatment effect (ρ: 0 < ρ < 1) is preserved by the investigational agent. This functions as a standard for how much worse the investigational agent can be relative to the effect of the control agent. The M2 margin is then a comparison against −ρθCP. Note that this margin is closer to the null and thus requires more evidence to refute. The 95/95 method typically chooses 50% effect preservation, which corresponds to ρ = 0.5. In the case of DISCOVER, the M2 margin with 50% preservation requires ruling out a log-relative hazard of F/TAF compared to F/TDF > 0.96 × 0.5 = 0.48 (HR > 1.62).
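The margin derivation reduces to simple arithmetic on the log hazard ratio scale. The sketch below restates the M1 and M2 calculations just described; the sign convention (negative θCP denoting benefit of the control over placebo) follows this chapter's notation.

```python
import math

# Upper bound (closest to the null) of the 95% CI for the control-vs-placebo
# log hazard ratio, from the meta-analysis of F/TDF trials (a benefit, so negative)
theta_cp = -0.96

m1 = -theta_cp         # M1: rule out log HR(E vs C) > 0.96
rho = 0.5              # fraction of the control effect to be preserved
m2 = -rho * theta_cp   # M2: rule out log HR(E vs C) > 0.48

print(f"M1: log HR {m1:.2f} (HR {math.exp(m1):.2f})")  # ~2.61 (quoted as 2.62,
                                                       # presumably from the
                                                       # unrounded bound)
print(f"M2: log HR {m2:.2f} (HR {math.exp(m2):.2f})")  # 1.62
```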
An alternative approach, known as the synthesis method, derives M1/M2 margins
using the uncertainty in completed trials without applying discounting.
Sample sizes for NI trials would be smaller than under the 95/95 method if there were no discounting and/or if trials were powered on the M1 margin alone.
Discounting is motivated as a hedge against: (i) selection of a nonoptimal control
therapy, (ii) changes in background treatment, and (iii) publication bias in the meta-
analysis. These are particular concerns in mature fields where many randomized
controlled trials have been conducted and where there are many estimates of the
effectiveness and many possible control comparators. While discounting is clearly
sensible in many settings, the 95% confidence interval has been shown to be highly
conservative (Sankoh 2008; Holmgren 1999; James Hung et al. 2003). It uses a
statistical criterion to handle an unquantifiable uncertainty about the control treat-
ment effect which would have been observed if the NI trial included a putative no
treatment arm. The effect preservation criterion, used to develop the M2 margin,
ensures that the conclusion of NI will only be made if a high proportion of the control
treatment effect is retained by the investigational regimen. The value of the “effect
preservation” standards is further discussed in section “Trial Designs to Evaluate
Different Treatment Durations.”
Combination Therapies
The M1/M2 approach is further complicated when the intervention under evaluation
is not just a single agent but is a combination, as is increasingly common in many
disease areas. Where the candidate intervention and the standard of care regimen have common components, the calculated M1 effect, and hence the derived margin of NI, will be smaller than if there were no common components.
An example of this is seen in tuberculosis which is usually treated with a
combination regimen with two drugs (rifampicin and isoniazid) given for 6 months
supplemented by two additional drugs (pyrazinamide and ethambutol) in the first
2 months. NI trials have been conducted to evaluate regimens that have one or two
drugs replaced with novel compounds and are given for shorter durations, com-
monly 4 months instead of 6.
The objective of these trials can be reframed as an evaluation of whether the new
drug(s) have noninferior efficacy to the combined effect of the last
2 months of therapy and of the drugs that were replaced, where each is
added to a standard background therapy of the drugs that are common to both
combination regimens in the first 4 months of treatment.
The FDA draft guidance for developing drugs for the treatment of pulmonary
tuberculosis (Food and Drug Administration Center for Drug Evaluation and
Research (CDER) 2013) provides a worked example where the effect of the last
2 months of therapy (for rifampicin-sensitive disease) is shown to be an absolute
difference of M1 (θCP) of 8.4% (95% CI 4.8%, 12.1%) from two previous trials of
4-month regimens, providing support for a margin of NI of 4.8% using this 95/95
approach for NI trials evaluating one- or two-drug substitutions. A comparable
approach has been used to derive margins of 6% or 6.6% in recent drug-substitution
trials (Dorman et al. 2020; Gillespie et al. 2014; Jindani et al. 2014), using the
M1-type approach, without consideration of the M2.
In contrast, the M1 of the full effect of the entire standard of care regimen is more
like an absolute difference of 50–60% given the high effectiveness of 80–90% of the
standard 6-month regimen compared to an expected 30% cure from untreated
tuberculosis (Tiemersma et al. 2011). For this reason, in trials of new regimens
with only minor or no drugs in common with the standard of care, the M1/M2
approach can be used to justify margins of up to 12%, and consequently much
smaller sample sizes, and still be described as preserving more than 75% (1 − 12%/50% = 0.76) of the treatment effect of the standard of care regimen (Tweed et al.
2021). This incongruity can discourage sponsors from including existing drugs, which are more readily available and have established safety profiles, in novel combination regimens, in favor of only new drugs that are more expensive and have less safety data, often to the detriment of patients and health systems.
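The preserved fraction quoted above is straightforward to verify; the snippet below simply restates the arithmetic on the risk-difference scale.

```python
# Effect preservation implied by a margin on the risk-difference scale:
# fraction preserved = 1 - margin / M1
margin, m1 = 0.12, 0.50   # 12% margin against an M1 of ~50% for the whole regimen
print(1 - margin / m1)    # 0.76, i.e., described as preserving >75% of the effect
```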
Public Health Clinical Criteria
In some contexts, the regulatory statistical criteria are not desirable or feasible, and
the margin of NI has been set on substantive grounds. For instance, the US FDA has
defined a SARS-CoV-2 vaccine to be noninferior (Food and Drug Administration
Center for Biologics Evaluation and Research (CBER) 2020) if θEC < 0.1, where θEC is expressed in terms of the vaccine efficacy of the new vaccine relative to a control.
Design and Analysis
Sample Size
Sample size formulae for NI trials are similar to those for superiority trials (see
▶ Chap. 41, “Power and Sample Size”) with a few differences. In addition to
specifying the event rate (for a time to event end point) or proportion of events
(for a binary end point) expected in the control, a key difference is that, instead of
specifying a minimum clinically important difference between arms that the trial will
be powered to detect, one must specify both the margin of NI and the expected event
rate, or proportion of events, in the intervention arm. It is usually assumed that this
event rate is the same as in the control arm (namely, that the two arms have truly
comparable efficacy). If there is compelling evidence to believe that the intervention arm
will have slightly better efficacy than the control, then this will result in a smaller
sample size, as was done in the STREAM trial, although investigators then run the
risk that the trial will be underpowered if this assumption is incorrect. On the other
hand, it might be prudent to assume that the intervention arm will have slightly lower
efficacy than the control (although within the acceptable margin of NI), although the
disadvantage is that this will greatly inflate the sample size. Considerations for the
choice of type I error rate and power are the same for NI trials as for superiority trials.
An oft-repeated myth is that NI trials are larger than superiority trials. As a broad
statement, this is incorrect – NI trials can be larger or smaller than comparable
superiority trials, depending on the sample size assumptions. It is, however, true that
the sample size of a trial designed to show superiority to placebo, via the indirect
comparison in a NI trial design, is always larger than a superiority trial comparing the
intervention directly with placebo as noted below in section “Effect Preservation in
Determining the NI Margin.”
For a trial comparing proportions, the most commonly used formulae for sample
size calculations come from Farrington and Manning (1990) (using their formula
based on “maximum likelihood” which is more accurate than the approximate
formula based on “observed values”). This method is implemented in many statistical software packages for sample size calculations and is used in the fourth edition of a popular sample size formulae textbook (Machin et al. 2018), although earlier editions used the less accurate approximate formula.
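A sketch of the Farrington–Manning "maximum likelihood" calculation is given below. It follows the published cubic-root solution for the restricted MLEs; the inputs (10% unfavorable outcomes on both arms, a 10% margin, one-sided α of 2.5%, 90% power) are illustrative rather than taken from any trial discussed here, and production work should rely on validated software. For comparison, the simpler "observed values" formula gives roughly 190 per arm for the same inputs.

```python
import math
from scipy.stats import norm

def fm_restricted_mle(p1, p2, delta0, theta=1.0):
    """Restricted MLEs of (p1, p2) under H0: p1 - p2 = delta0.

    Cubic-root solution from Farrington and Manning (1990); theta is the
    allocation ratio n2/n1."""
    a = 1 + theta
    b = -(1 + theta + p1 + theta * p2 + delta0 * (theta + 2))
    c = delta0 ** 2 + delta0 * (2 * p1 + theta + 1) + p1 + theta * p2
    d = -p1 * delta0 * (1 + delta0)
    v = b ** 3 / (3 * a) ** 3 - b * c / (6 * a ** 2) + d / (2 * a)
    u = math.copysign(math.sqrt(b ** 2 / (3 * a) ** 2 - c / (3 * a)), v or 1)
    w = (math.pi + math.acos(min(1.0, max(-1.0, v / u ** 3)))) / 3
    p1_r = 2 * u * math.cos(w) - b / (3 * a)
    return p1_r, p1_r - delta0

def fm_ni_sample_size(p_e, p_c, margin, alpha=0.025, power=0.90, theta=1.0):
    """n on the experimental arm (control arm = theta * n) to test
    H0: p_e - p_c = margin against H1: p_e - p_c < margin, where the
    event is unfavorable and margin > 0."""
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(power)
    p_e_r, p_c_r = fm_restricted_mle(p_e, p_c, margin, theta)
    sd0 = math.sqrt(p_e_r * (1 - p_e_r) + p_c_r * (1 - p_c_r) / theta)  # under H0
    sd1 = math.sqrt(p_e * (1 - p_e) + p_c * (1 - p_c) / theta)          # under H1
    n = ((z_a * sd0 + z_b * sd1) / (margin - (p_e - p_c))) ** 2
    return math.ceil(n)

# 10% unfavorable outcomes assumed on both arms, 10% NI margin
print(fm_ni_sample_size(p_e=0.10, p_c=0.10, margin=0.10))
```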
Regulatory Guidelines
The ICH efficacy guidelines (numbered E1 through E20) provide guidelines on
various aspects of the design, conduct, and reporting of clinical trials, a selection of
which specifically addresses NI trials. Of note, many of the documents were
finalized more than 20 years ago when NI trials were less common and consequently
less well described and understood.
Aside from ICH E3 “Clinical Study Reports” (finalized November 1995), which
briefly notes that evidence in support of assay sensitivity is important in the NI
clinical study report, ICH E9 “Statistical Principles for Clinical Trials” (finalized
February 1998) addresses NI trials in several areas. The document notes “well
known difficulties” associated with NI trials which include the “lack of any measure
of internal validity. . . thus making external validation necessary,” and that “many
flaws in the design and conduct of the trial will tend to bias the results towards a
conclusion of equivalence.”
Other Guidelines
The CONSORT statement was developed to improve the quality and adequacy of
reporting of the results of randomized clinical trials and has undergone regular
updates in addition to extensions to specific types of trials, including NI trials in
2006 (Piaggio et al. 2006) and most recently in 2012 (Piaggio et al. 2012). This
CONSORT statement provides guidelines exclusively for the reporting of NI trials, and
compliance is broadly required by most major medical journals prior to publication
of trial results (http://www.icmje.org/recommendations/browse/manuscript-preparation/preparing-for-submission.html). It was noted that the reporting of NI trials is
particularly poor (Piaggio et al. 2006), and more recent reviews have also come to
the same conclusion (Rehal et al. 2016).
Key aspects of trial reporting of NI trials in the CONSORT statement include
particular rationale for the NI design, statement and justification of the margin of NI,
description of how eligibility criteria and choice of control compare to previous
superiority trials that established efficacy of the control, and clear description of
which among the primary and secondary efficacy and safety outcomes have NI
hypotheses and which have superiority hypotheses.
Other widely accepted guidelines of note are the SPIRIT statement defining
standard protocol items for clinical trials (Chan et al. 2013) and the guidance
document for the content of statistical analysis plans for clinical trials (Gamble
et al. 2017). Extensions of the SPIRIT guidelines for certain types of trials have
been developed, but there is, currently, no extension specifically for NI trials – this is
clearly a document that should be developed, if development is not already under-
way. Neither SPIRIT nor SAP guidance documents directly address NI trials beyond
instructions in the elaboration documents stating that the protocol and the SAP for a
NI trial should describe the framework (superiority or NI) for the primary and
secondary outcomes.
The EQUATOR network (https://www.equator-network.org/) provides an online
repository of guidelines and reference documents related to the reporting of health
research. A search of their database (May 2021) for the words
“inferiority” or “equivalence” yields only the CONSORT extension for NI trials
(described above).
There are three textbooks addressing the methodology of the design, conduct, and
analysis of NI trials published in 2010 (Wellek 2010), 2012 (Rothmann et al. 2012),
and 2015 (Ng 2015), curiously all by the same publisher; we are not aware of others.
Additional review articles relating to NI trials include a guidance document on how
to handle NI trials in the context of systematic reviews (Treadwell et al. 2012) and
some general guidelines on the reporting of NI trials that predate the CONSORT
extension (Gomberg-Maitland et al. 2003).
Analysis
In general, the methods of analysis for NI trials do not significantly depart from those
used for superiority trials. The approach is to calculate a point estimate and confidence interval for the treatment effect, using methods appropriate to the type of outcome and the study objectives, ensuring that the analysis reflects the trial design and the desired level of significance for the confidence interval. Figure 1
shows a plot of confidence intervals against a margin of NI to demonstrate different
outcomes of NI clinical trials with the upper bound denoted by a square since this is
the bound that is the focus of the hypothesis tests. In a superiority trial, if the upper
bound of the confidence interval of the treatment effect is lower than the null value
(0.0 for a difference or log ratio), then this is evidence for superiority; in a NI trial, if
the upper bound of the confidence interval is less than the margin of NI, then this is
evidence for NI. If this condition is not met for a NI trial, the conclusion must be that
there is no evidence of NI, that is, the investigational arm is not noninferior. This is a
somewhat confusing double negative which is sometimes wrongly interpreted as
evidence of inferiority. No evidence of NI is comparable to the situation in a
superiority trial with no evidence of superiority which is not the same as evidence
that the two arms in comparison are equivalent.
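To make the decision rule concrete, the sketch below computes a simple Wald confidence interval for the difference in the proportion with a favorable outcome and compares its upper bound to the margin. The counts are hypothetical (they are not the STREAM results), and the Wald interval is used only for illustration.

```python
from scipy.stats import norm

# Hypothetical counts of favorable outcomes (not STREAM data)
fav_e, n_e = 164, 200   # investigational arm
fav_c, n_c = 165, 200   # control arm
margin = 0.10           # NI margin on the absolute risk difference

p_e, p_c = fav_e / n_e, fav_c / n_c
# With a favorable outcome, "control minus investigational" plays the role of
# theta_EC: positive values mean the investigational arm is doing worse
diff = p_c - p_e
se = (p_e * (1 - p_e) / n_e + p_c * (1 - p_c) / n_c) ** 0.5
upper = diff + norm.ppf(0.975) * se  # upper bound of the two-sided 95% CI

print(f"difference = {diff:.3f}, upper bound = {upper:.3f}")
print("NI demonstrated" if upper < margin else "no evidence of NI")
```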
Fig. 1 Examples of potential outcomes from NI trials labeled with interpretation. The upper bound
is denoted by a square to show that this is the bound used for determination of NI
Inferiority and NI
Although the conclusion of NI relates only to the upper bound of the confidence
interval (denoted with a square in Fig. 1) since the alternative hypothesis is
one-sided, the role of the lower bound of the confidence interval can nevertheless be
a source of confusion. The interpretation of the results of a NI trial needs careful
consideration when the lower bound of the confidence interval exceeds zero. This
would normally be interpreted as evidence that the intervention has inferior efficacy
as compared to the control, although this interpretation is inappropriate in a NI trial.
The margin of NI defines an acceptable margin of reduction in efficacy, and so
the interpretation must be no evidence of NI if the upper bound is not less than the
margin (scenario G), or actual evidence of NI if the upper bound is less than the
margin (scenario D).
Choice of Analysis Populations and Estimands
For a superiority trial, it is widely recommended that the primary analysis should
include all randomized participants in the treatment groups to which they were
allocated; this is regarded as an “intention-to-treat” (ITT) analysis (International
Conference on Harmonisation of Technical Requirements for Registration of Pharma-
ceuticals For Human Use 1998). This “Full Analysis Set,” as it is also sometimes
described, is preferred for superiority trials not only as it yields “estimates of treatment
effects which are more likely to mirror those observed in subsequent practice”
(International Conference on Harmonisation of Technical Requirements for Registra-
tion of Pharmaceuticals For Human Use 1998) but also because it provides a conser-
vative or protective analysis strategy whereby misclassification of outcomes from
participants that have had protocol violations is likely to dilute the treatment effect
thereby reducing the chance of falsely demonstrating superiority. For exactly this
reason, this ITT analysis set may actually increase the chance of falsely demonstrating NI
and is therefore not uniformly accepted as the default choice for the primary analysis
for a NI trial, or as noted in ICH E9 (R1): “its role in such [NI] trials should be
considered very seriously” (International Conference on Harmonisation of Technical
Requirements for Registration of Pharmaceuticals For Human Use 2019).
An alternative analysis population is the modified-ITT (mITT) population with
limited exclusions of randomized participants, usually those that violated eligibility
criteria but were erroneously randomized, provided entry criteria were measured
prior to randomization and all participants recruited undergo equal scrutiny for
eligibility violations (International Conference on Harmonisation of Technical
Requirements for Registration of Pharmaceuticals For Human Use 1998). A more
common alternative is the “Per-Protocol” (PP) population where participants that did
not adequately adhere to the treatments under evaluation or other important aspects
of the trial protocol are excluded from the analysis. This is sometimes described as
an “As-treated” analysis population, although the latter also implies the additional
criterion of analyzing participants according to the treatment they actually received.
How a PP analysis is defined varies greatly between guidelines and published NI
trials (Rehal et al. 2016), and some recommend a limited interpretation and put an
emphasis on causal inference methodology for analysis to overcome limitations of
postrandomization exclusions (Hernan and Robins 2017). A full discussion of
different analysis sets for clinical trials is outside the scope of this chapter; readers
should look at chapter 7.2.
In the past, many have recommended that the (m)ITT and PP analysis populations
should be coprimary (Piaggio et al. 2006; International Conference on Harmonisation
of Technical Requirements for Registration of Pharmaceuticals For Human Use 1998;
Jones et al. 1996; D’Agostino et al. 2003; Committee for Proprietary Medicinal
Products 2002) for NI trials such that it is necessary to demonstrate NI in both analysis
populations in order to declare NI of the regimen. A more recent commentary also
supports this approach (Mauri and D’Agostino 2017). Other authors, however, rec-
ommend relegating a PP analysis to a secondary analysis (Wiens and Zhao 2007).
There is no mention, for example, of a PP analysis in the 2016 FDA guidance on NI
(Food and Drug Administration Center for Drug Evaluation and Research (CDER)
2016), although an “as-treated” analysis had been included in the earlier 2010 draft.
Much of the discussion of different analysis populations has been replaced by the
emphasis on a clear specification of the estimand of interest in the ICH E9
(R1) Addendum (International Conference on Harmonisation of Technical Require-
ments for Registration of Pharmaceuticals For Human Use 2019) (see ▶ Chap. 84,
“Estimands and Sensitivity Analyses”) which includes attributes specifying choice of
analysis populations and also how intercurrent events (events such as treatment
switching or discontinuation that affect or prevent observation of the primary out-
come) are handled in analysis. In this regard, the ICH E9 (R1) addendum providing
regulatory guidelines on specification of estimands addresses this controversy (Inter-
national Conference on Harmonisation of Technical Requirements for Registration of
Pharmaceuticals For Human Use 2019): “estimands that are constructed with one or
more intercurrent events accounted for using the treatment policy strategy present
similar issues for NI and equivalence trials as those related to analysis of the Full
Analysis Set under the ITT principle.” The addendum also recognizes the importance
of the PP-type analyses: “An estimand can be constructed to target a treatment effect
that prioritizes sensitivity to detect differences between treatments, if appropriate for
regulatory decision making.”
Further Challenges Unique to NI
Assay Sensitivity
A concern with any NI trial is the concept of “assay sensitivity.” A trial is said to
have “assay sensitivity” if it would detect the inferiority of an investigational
intervention if it were truly inferior. This is an issue of trial conduct and aligning
the trial context with assumptions. For instance, in the DISCOVER trial, if adherence to the study regimens had been poor, HIV infection rates would have been similar on the two arms regardless of the true relative efficacy of F/TAF, and a truly inferior intervention could have appeared noninferior.
Effect Preservation in Determining the NI Margin
A major reason that the sample size of a NI trial can be large is the required tightness
of an M2 margin when the 95/95 method is used. Snapinn and Jiang (2008) gave an
insightful critique of the M2 (preservation of effect) criterion and constructed scenarios illustrating its limitations.
Two Sides of the Same Coin: Superiority Versus NI
Some authors have pointed out that the distinction between NI and “superiority”
trials is artificial in many contexts. Dunn et al. (2018a) note that in nonregulatory
situations, it can be unclear which regimen is the “control.” Even if one is identified,
the smallest clinically significant difference should govern the choice of the NI
margin and the alternative for superiority. This yields identical sample sizes for
both types of studies. Further, if the data analysis de-emphasizes null hypothesis
significance testing in favor of estimation of treatment contrast with the associated
confidence intervals, then the analysis and conclusions should be identical for NI and
superiority questions. Hence, in many active controlled trials, the distinctions
between NI and superiority are not evident. This critique is not strictly relevant when
a regulatory-derived margin is used or when one regimen has such clear ancillary
advantages that the comparison between the arms is greatly asymmetric.
Testing for Noninferiority and Superiority in the Same Trial
Related to this issue is a document published by the EMA CHMP “Points to consider
on switching between superiority and NI” (Committee for Proprietary Medicinal
Products 2000) that is focused on the relationship between superiority and NI
hypotheses within a single trial. The document is clear that it is acceptable to test
for superiority in a NI trial: “there is no multiplicity argument that affects this
interpretation because. . . it corresponds to a simple closed test procedure,” with
the proviso that “the intention-to-treat principle is given greatest emphasis” in the
superiority analysis which is likely to be different for the NI comparison. In any case,
every analysis plan for a NI trial should include a plan for a superiority test.
The document also notes that it can be appropriate to test for NI in a superiority
trial where superiority has not been demonstrated, provided a margin of NI has been
prespecified in the protocol and that the trial was “properly designed and carried out
in accordance with the strict requirements of a NI trial.” This includes the notion that
“in a NI trial the full analysis set and the PP [per protocol] analysis set have equal
importance and their use should lead to similar conclusions for a robust interpreta-
tion,” which is a departure from what is described in the FDA document which
recognizes the challenges with the ITT analysis, but does not go so far as to describe
a per protocol analysis as of equal importance (see section “Choice of Analysis
Populations and Estimands” above).
Interim Analyses and Data and Safety Monitoring
Most large trials convene an independent data monitoring committee (DMC) to review accumulating safety and efficacy data at regular intervals (see ▶ Chap. 37, “Data and Safety Monitoring
and Reporting”). Aside from adaptive trial designs where any number of features of a
clinical trial could be modified, and review of data quality and trial procedures,
a main task of the DMC will be to recommend whether the trial can continue
or whether it should be stopped before the scheduled end. This latter recommenda-
tion is usually only made when there is sufficient evidence for one of the
following: (1) unacceptable harm to trial participants, (2) overwhelming benefit for
one arm, or (3) lack of benefit of the investigational intervention. In a NI trial,
the consideration of stopping guidelines should be different to those for superiority
trials.
It is unlikely to be appropriate to stop a NI trial early for overwhelming evidence
of NI since the margin of NI is somewhat subjective, and it would normally be better
to continue the trial to get a better estimate of the treatment effect and also to
determine whether the intervention might actually be superior. For this reason, a
superiority comparison is recommended when evaluating evidence for overwhelm-
ing benefit, and one might consider a conditional power approach (Bratton et al.
2012), even in a NI trial (Korn and Freidlin 2018).
It will also be inappropriate to stop a NI trial for lack of benefit since lack
of benefit may still be consistent with a finding of NI. When evaluating
evidence for lack of benefit in a NI trial, the comparison should be against
the margin of NI (effectively evaluating sufficient lack of evidence for NI)
rather than against a null finding as would be usual in a superiority trial
(Bratton et al. 2012).
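For the conditional power calculation mentioned above, one common formulation (a generic sketch, not the specific method of Bratton et al.) uses the B-value decomposition of the test statistic under an assumed drift for the remainder of the trial; all inputs below are illustrative.

```python
import math
from scipy.stats import norm

def conditional_power_ni(z_t, t, theta, z_alpha=norm.ppf(0.975)):
    """P(final statistic Z(1) < -z_alpha | interim Z(t) = z_t).

    t is the information fraction; theta is the assumed mean of Z(1)
    (negative values favor the investigational arm against the margin).
    Uses the standard B-value (Brownian motion) decomposition."""
    b_t = z_t * math.sqrt(t)  # B(1) ~ Normal(b_t + theta * (1 - t), 1 - t)
    return norm.cdf((-z_alpha - b_t - theta * (1 - t)) / math.sqrt(1 - t))

# Interim statistic mildly favoring the investigational arm at 50% information,
# assuming the current trend continues (theta set to the interim-implied drift)
z_t, t = -1.0, 0.5
print(conditional_power_ni(z_t, t, theta=z_t / math.sqrt(t)))
```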
Alternative Analyses and Designs and Innovative Perspectives on NI Trials
Bayesian Approaches to NI
Bayesian approaches are particularly compelling for NI trials and have been adapted
in a variety of ways (Simon 1999; Gamalo-Siebers et al. 2016).
In any NI study, margins are derived directly or indirectly based on historical data.
A Bayesian framework provides an intuitive and formal way to incorporate these
data through prior distributions. A key source of uncertainty is the value of θCP about
which there might be expert opinion or preliminary data – an ideal setting for the use
of prior distributions. Bayesian methods can allow for discounting historical data
through shifting the prior toward a null distribution by using skeptical prior distri-
butions (Kirby et al. 2012) or through power priors (Ibrahim and Chen 2000) which
allow the prior to depend on the historical data through a flexible weight parameter that controls the degree of down-weighting.
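As a minimal sketch of how a prior on θCP enters, the following combines a power-prior-discounted normal summary of the historical control-versus-placebo effect with a normal likelihood for θEC from the current trial, yielding the posterior probability that the investigational arm beats a putative placebo. All numbers are hypothetical and the normal approximations are assumptions made here for brevity.

```python
import math
from scipy.stats import norm

# Historical summary of theta_CP (control vs placebo, log HR scale);
# negative means the control is beneficial. Numbers are hypothetical.
cp_mean, cp_se = -1.40, 0.22
a0 = 0.5                      # power-prior weight: variance inflated by 1/a0
cp_var = cp_se ** 2 / a0

# Current NI trial estimate of theta_EC (investigational vs control)
ec_mean, ec_se = -0.10, 0.30

# theta_EP = theta_EC + theta_CP: sum of (approximately) independent normals
ep_mean = ec_mean + cp_mean
ep_sd = math.sqrt(ec_se ** 2 + cp_var)

# Posterior probability that the investigational arm beats a putative placebo
print(norm.cdf(0.0, loc=ep_mean, scale=ep_sd))
```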
Trial Designs to Evaluate Different Treatment Durations
The evocatively named DOOR/RADAR approach (Evans et al. 2015) was proposed
as “a new paradigm in assessing the risks and benefits of new strategies to optimize
antibiotic use,” particularly to avoid the “complexities of non-inferiority trials” and
consequent large sample sizes when evaluating whether the duration of antibiotic use
can be reduced without a reduction in effectiveness. The idea is to prespecify an
ordinal clinical outcome that combines measures of efficacy and safety and then rank
trial participants by this clinical outcome measure, where those with a similar clinical
outcome are ordered by duration of antibiotic use with shorter duration given a
higher rank. This “Desirability of outcome ranking (DOOR)” is compared between
different antibiotic strategies in a “Response adjusted for duration of antibiotic risk
(RADAR)” superiority comparison, and the sample sizes needed are much smaller
than those of comparable trials powered for NI.
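The ranking itself is mechanical; the sketch below estimates the DOOR probability (the chance that a randomly chosen participant on one strategy has a better outcome rank than one on the other, with ties counted as one half) from hypothetical participant-level data.

```python
import itertools

# Hypothetical (ordinal outcome, days of antibiotics) per participant; a lower
# outcome category is better, and shorter duration breaks ties within category
arm_a = [(1, 7), (1, 10), (2, 7), (3, 14)]
arm_b = [(1, 14), (2, 10), (2, 14), (3, 14)]

def door_probability(a, b):
    """P(participant on arm A has a better DOOR than one on arm B);
    exact ties count as 1/2, the usual win-probability convention."""
    wins = ties = 0
    for x, y in itertools.product(a, b):
        if x < y:            # lexicographic: outcome first, then duration
            wins += 1
        elif x == y:
            ties += 1
    return (wins + 0.5 * ties) / (len(a) * len(b))

print(door_probability(arm_a, arm_b))  # 0.71875 for these made-up data
```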
There has been some uptake of this methodology, but it suffers from replacing the
complexities of NI trials with a host of new complexities and a number of substantial
limitations (Phillips et al. 2016). These include the introduction of a new metric “the
probability of a better DOOR for a randomly selected participant” with no clinical
interpretation, the same tendency as in NI trials whereby poor trial quality may increase the
chance of a false-positive conclusion, and the obscuring of important clinical differences if an
important clinical outcome occurs in only a few trial participants. This latter point
was illustrated by applying DOOR/RADAR to the results of three NI comparisons
from two TB trials that resulted in conclusions of an absence of NI (that were widely
accepted by the clinical community) yet counterintuitively showed clear superiority
in DOOR with p < 0.001 in each case.
A more attractive alternative design for trials evaluating different durations of
therapy involves explicit modeling of the duration-response relationship and selec-
tion of the duration that achieves the desired cure proportion (Quartagno et al. 2018;
Horsburgh et al. 2013).
Three-Arm NI Design
One design that can overcome many of the challenges inherent in NI trials is a three-
arm trial with a single investigational intervention arm compared to an active control
intervention in a NI comparison, and compared to a different control of no treatment
in a superiority comparison. This three-arm trial allows for simultaneous demon-
stration that the investigational intervention is superior to no treatment and is
noninferior to the active control standard of care. This trial design is not possible
in many settings with established treatments where it would be inappropriate to
withhold treatment.
A risk with NI trials designed after a first investigational intervention has been
shown to be noninferior is the phenomenon of biocreep (D’Agostino et al.
2003; Nunn et al. 2008). If each subsequent successful NI trial admits an investigational
intervention “not much worse” than the previous one, the cumulative result can be
an intervention that is considerably worse than the original standard of care
control and consequently not much better than placebo. A three-arm design is useful to
avoid the problem of biocreep. A three-arm NI trial of a second investigational
intervention would include both the first intervention shown to be noninferior as
well as the original standard of care control. The objective would be to demonstrate
NI of the second investigational intervention compared to the original standard of care
control, also allowing an internal randomized comparison between the two investiga-
tional interventions to support decision-making and facilitate informed patient choice.
In the regulatory framework, the M2 margin plays a major role. The M2 margin is
based on some fraction, typically 50%, of effect preservation. While this may seem
conceptually straightforward, measures of effect preservation involve subtle choices;
they depend on the estimands for the control and investigational arms (whether
multiplicative or additive) as well as on what scale effect preservation is judged.
Recent work in the HIV prevention context (Dunn et al. 2018b) has shown this scale
dependency. This work has reinforced that effect preservation is based on a “meta
estimand” and that in some contexts novel meta estimands are both useful and
interpretable, providing information beyond whether a M2 margin has been crossed.
In the context of HIV prevention trials, an alternative measure of effectiveness, the
Averted Infections Ratio (AIR), has been proposed based on the comparison of the
number of averted infections between treatment arms (Dunn et al. 2018b). This
measure is simple to interpret with both clinical and public health relevance and
overcomes limitations of scale dependency.
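Under stated assumptions about the counterfactual placebo incidence, the AIR is a one-line calculation; the incidences below are hypothetical.

```python
def averted_infections_ratio(inc_e, inc_c, inc_p):
    """Infections averted by the experimental arm divided by those averted by
    the active control, both relative to a counterfactual placebo incidence
    (Dunn et al. 2018b)."""
    return (inc_p - inc_e) / (inc_p - inc_c)

# Hypothetical incidences per 100 person-years
print(averted_infections_ratio(inc_e=0.6, inc_c=0.5, inc_p=4.0))  # ~0.97
```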
Summary and Conclusion
NI trials are most compelling when there is an investigational intervention which has
theoretical or known advantages over a control intervention. These might include
safety, simplicity, cost, or other desirable features for patients or health systems. The
key issue in the NI design is that the investigational intervention is a compelling choice
provided it is not much worse than the control intervention. The NI margin quantifies
this magnitude and is the key design parameter in NI trials. The choice of margin should
be prespecified and transparent. In developing it, consideration should be given to the
public health context, and the margin may be derived using data on historic control intervention
effects (when available). If such data are used, consideration must be given to how
applicable they will be to the current trial. At a minimum, margins are chosen to exclude
the possibility of declaring NI if the investigational treatment effect is not superior to no
treatment. Proper trial conduct (avoiding missing/mismeasured data and loss to follow-
up) is essential to preserve assay sensitivity. In addition, sensitivity analyses can also be
useful. Bayesian analyses have numerous advantages for importing historic knowledge
and summarizing conclusions. Clear and complete reporting of key design choices is
lacking in the medical literature of NI trials.
Key Facts
Cross-References
References
Aberegg SK, Hersh AM, Samore MH (2018) Empirical consequences of current recommendations
for the design and interpretation of noninferiority trials. J Gen Intern Med 33(1):88–96
Bratton DJ, Williams HC, Kahan BC, Phillips PP, Nunn AJ (2012) When inferiority meets
non-inferiority: implications for interim analyses. Clin Trials 9(5):605–609
Chan AW, Tetzlaff JM, Gotzsche PC, Altman DG, Mann H, Berlin JA et al (2013) SPIRIT 2013
explanation and elaboration: guidance for protocols of clinical trials. BMJ 346:e7586
Committee for Proprietary Medicinal Products (2000) Points to consider on switching between
superiority and non-inferiority. https://www.ema.europa.eu/en/documents/scientific-guideline/
points-consider-switching-between-superiority-non-inferiority_en.pdf
Committee for Proprietary Medicinal Products (2002) Points to consider on multiplicity issues in
clinical trials. https://www.ema.europa.eu/en/documents/scientific-guideline/points-consider-
multiplicity-issues-clinical-trials_en.pdf
Committee for Proprietary Medicinal Products (2006) Guideline on the choice of the non-inferiority
margin. https://www.ema.europa.eu/en/documents/scientific-guideline/guideline-choice-non-
inferiority-margin_en.pdf
D’Agostino RB Sr, Massaro JM, Sullivan LM (2003) Non-inferiority trials: design concepts and
issues - the encounters of academic consultants in statistics. Stat Med 22(2):169–186
Dorman SE, Nahid P, Kurbatova EV, Goldberg SV, Bozeman L, Burman WJ et al (2020) High-dose
rifapentine with or without moxifloxacin for shortening treatment of pulmonary tuberculosis:
study protocol for TBTC study 31/ACTG A5349 phase 3 clinical trial. Contemp Clin Trials 90:
105938
Dunn DT, Copas AJ, Brocklehurst P (2018a) Superiority and non-inferiority: two sides of the same
coin? Trials 19(1):499
Dunn DT, Glidden DV, Stirrup OT, McCormack S (2018b) The averted infections ratio: a novel
measure of effectiveness of experimental HIV pre-exposure prophylaxis agents. Lancet HIV 5
(6):e329–e334
Evans SR, Rubin D, Follmann D, Pennello G, Huskins WC, Powers JH et al (2015) Desirability of
outcome ranking (DOOR) and response adjusted for duration of antibiotic risk (RADAR). Clin
Infect Dis 61(5):800–806
Farrington CP, Manning G (1990) Test statistics and sample size formulae for comparative binomial
trials with null hypothesis of non-zero risk difference or non-unity relative risk. Stat Med 9(12):
1447–1454
Floyd K, Hutubessy R, Kliiman K, Centis R, Khurieva N, Jakobowiak W et al (2012) Cost and cost-
effectiveness of multidrug-resistant tuberculosis treatment in Estonia and Russia. Eur Respir J
40(1):133–142
Food and Drug Administration Center for Biologics Evaluation and Research (CBER) (2020)
Guidance for industry. Development and licensure of vaccines to prevent COVID-19.
U.S. Department of Health and Human Services. https://www.fda.gov/media/139638/download
Food and Drug Administration Center for Drug Evaluation and Research (CDER) (2013) Guidance
for industry. Pulmonary tuberculosis: developing drugs for treatment, draft guidance.
U.S. Department of Health and Human Services. https://www.fda.gov/media/87194/download
Food and Drug Administration Center for Drug Evaluation and Research (CDER) (2016) Guidance
for Industry. Non-inferiority clinical trials to establish effectiveness. U.S. Department of Health
and Human Services. https://www.fda.gov/media/78504/download
Gamalo-Siebers M, Gao A, Lakshminarayanan M, Liu G, Natanegara F, Railkar R et al (2016)
Bayesian methods for the design and analysis of noninferiority trials. J Biopharm Stat 26(5):
823–841
Gamble C, Krishan A, Stocken D, Lewis S, Juszczak E, Dore C et al (2017) Guidelines for the
content of statistical analysis plans in clinical trials. JAMA 318(23):2337–2343
Gillespie SH, Crook AM, McHugh TD, Mendel CM, Meredith SK, Murray SR et al (2014) Four-
month moxifloxacin-based regimens for drug-sensitive tuberculosis. N Engl J Med 371(17):
1577–1587
Gomberg-Maitland M, Frison L, Halperin JL (2003) Active-control clinical trials to establish
equivalence or noninferiority: methodological and statistical concepts linked to quality. Am
Heart J 146(3):398–403
Hernan MA, Robins JM (2017) Per-protocol analyses of pragmatic trials. N Engl J Med 377(14):
1391–1398
Holmgren EB (1999) Establishing equivalence by showing that a specified percentage of the effect
of the active control over placebo is maintained. J Biopharm Stat 9(4):651–659
Horsburgh CR, Shea KM, Phillips P, Lavalley M (2013) Randomized clinical trials to identify
optimal antibiotic treatment duration. Trials 14(1):88
Ibrahim JG, Chen M-H (2000) Power prior distributions for regression models. Stat Sci 15(1):46–60
International Conference on Harmonisation of Technical Requirements for Registration of Phar-
maceuticals For Human Use (1998) Statistical principles for clinical trials (E9). https://database.
ich.org/sites/default/files/E9_Guideline.pdf
International Conference on Harmonisation of Technical Requirements for Registration of Phar-
maceuticals For Human Use (2000) Choice of control group and related issues in clinical trials
(E10). https://database.ich.org/sites/default/files/E10_Guideline.pdf
International Conference on Harmonisation of Technical Requirements for Registration of Phar-
maceuticals For Human Use (2019) Estimands and sensitivity analysis in clinical trials. E9(R1).
https://database.ich.org/sites/default/files/E9-R1_Step4_Guideline_2019_1203.pdf
James Hung HM, Wang SJ, Tsong Y, Lawrence J, O’Neil RT (2003) Some fundamental issues with
non-inferiority testing in active controlled trials. Stat Med 22(2):213–225
Jindani A, Harrison TS, Nunn AJ, Phillips PP, Churchyard GJ, Charalambous S et al (2014) High-
dose rifapentine with moxifloxacin for pulmonary tuberculosis. N Engl J Med 371(17):1599–1608
Jones B, Jarvis P, Lewis JA, Ebbutt AF (1996) Trials to assess equivalence: the importance of
rigorous methods. BMJ 313(7048):36–39
Kaji AH, Lewis RJ (2015) Noninferiority trials: is a new treatment almost as effective as another?
JAMA 313(23):2371–2372
Kirby S, Burke J, Chuang-Stein C, Sin C (2012) Discounting phase 2 results when planning phase
3 clinical trials. Pharm Stat 11(5):373–385
Korn EL, Freidlin B (2018) Interim monitoring for non-inferiority trials: minimizing patient
exposure to inferior therapies. Ann Oncol 29(3):573–577
Machin D, Campbell MJ, Tan SB, Tan SH (2018) Sample size tables for clinical, laboratory and
epidemiology studies, 4th edn. Wiley, Hoboken
Mauri L, D’Agostino RB Sr (2017) Challenges in the design and interpretation of noninferiority
trials. N Engl J Med 377(14):1357–1367
Mayer KH, Molina JM, Thompson MA, Anderson PL, Mounzer KC, De Wet JJ et al (2020)
Emtricitabine and tenofovir alafenamide vs emtricitabine and tenofovir disoproxil fumarate for
HIV pre-exposure prophylaxis (DISCOVER): primary results from a randomised, double-blind,
multicentre, active-controlled, phase 3, non-inferiority trial. Lancet 396(10246):239–254
Neuenschwander B, Capkun-Niggli G, Branson M, Spiegelhalter DJ (2010) Summarizing historical
information on controls in clinical trials. Clin Trials 7(1):5–18
Ng T-H (2015) Noninferiority testing in clinical trials: issues and challenges. Taylor & Francis/CRC
Press, Boca Raton, xvii, 190 p
Nunn AJ, Phillips PPJ, Gillespie SH (2008) Design issues in pivotal drug trials for drug sensitive
tuberculosis (TB). Tuberculosis 88:S85–S92
Nunn AJ, Rusen I, Van Deun A, Torrea G, Phillips PP, Chiang CY et al (2014) Evaluation of a
standardized treatment regimen of anti-tuberculosis drugs for patients with multi-drug-resistant
tuberculosis (STREAM): study protocol for a randomized controlled trial. Trials 15(1):353
Nunn AJ, Phillips PPJ, Meredith SK, Chiang CY, Conradie F, Dalai D et al (2019) A trial of a
shorter regimen for rifampin-resistant tuberculosis. N Engl J Med 380(13):1201–1213
Phillips PP, Morris TP, Walker AS (2016) DOOR/RADAR: a gateway into the unknown? Clin
Infect Dis 62(6):814–815
Piaggio G, Elbourne D, Altman D, Pocock S, Evans S (2006) Reporting of noninferiority and
equivalence randomized trials: an extension of the CONSORT statement. JAMA 295(10):1152
Piaggio G, Elbourne DR, Pocock SJ, Evans SJ, Altman DG, Group C (2012) Reporting of
noninferiority and equivalence randomized trials: extension of the CONSORT 2010 statement.
JAMA 308(24):2594–2604
Quartagno M, Walker AS, Carpenter JR, Phillips PP, Parmar MK (2018) Rethinking non-inferiority:
a practical trial design for optimising treatment duration. Clin Trials 15(5):477–488. https://doi.
org/10.1177/1740774518778027
Quartagno M, Walker AS, Babiker AG, Turner RM, Parmar MKB, Copas A et al (2020) Handling
an uncertain control group event risk in non-inferiority trials: non-inferiority frontiers and the
power-stabilising transformation. Trials 21(1):145
Rehal S, Morris TP, Fielding K, Carpenter JR, Phillips PP (2016) Non-inferiority trials: are they
inferior? A systematic review of reporting in major medical journals. BMJ Open 6(10):
e012594
Rothmann MD, Tsou HH (2003) On non-inferiority analysis based on delta-method confidence
intervals. J Biopharm Stat 13(3):565–583
Rothmann MD, Wiens BL, Chan ISF (2012) Design and analysis of non-inferiority trials. Chapman
& Hall/CRC, Boca Raton, xvi, 438 p
Sankoh AJ (2008) A note on the conservativeness of the confidence interval approach for the selection
of non-inferiority margin in the two-arm active-control trial. Stat Med 27(19):3732–3742
Simon R (1999) Bayesian design and analysis of active control clinical trials. Biometrics 55(2):
484–487
Snapinn S, Jiang Q (2008) Preservation of effect and the regulatory approval of new treatments on
the basis of non-inferiority trials. Stat Med 27(3):382–391
Spiegelhalter DJ, Freedman LS, Parmar MK (1994) Bayesian approaches to randomized trials. J R
Stat Soc A Stat Soc 157(3):357–387
Tiemersma EW, van der Werf MJ, Borgdorff MW, Williams BG, Nagelkerke NJ (2011) Natural
history of tuberculosis: duration and fatality of untreated pulmonary tuberculosis in HIV
negative patients: a systematic review. PLoS One 6(4):e17601
Treadwell JR, Uhl S, Tipton K, Shamliyan T, Viswanathan M, Berkman ND et al (2012) Assessing
equivalence and noninferiority. J Clin Epidemiol 65(11):1144–1149
Tsui M, Rehal S, Jairath V, Kahan BC (2019) Most noninferiority trials were not designed to
preserve active comparator treatment effects. J Clin Epidemiol 110:82–89
Tweed CD, Wills GH, Crook AM, Amukoye E, Balanag V, Ban AYL et al (2021) A partially
randomised trial of pretomanid, moxifloxacin and pyrazinamide for pulmonary TB. Int J Tuberc
Lung Dis 25(4):305–314
Van Deun A, Maug AKJ, Salim MAH, Das PK, Sarker MR, Daru P et al (2010) Short, highly
effective, and inexpensive standardized treatment of multidrug-resistant tuberculosis. Am J
Respir Crit Care Med 182(5):684–692
Wellek S (2010) Testing statistical hypotheses of equivalence and noninferiority, 2nd edn. CRC
Press, Boca Raton, xvi, 415 p
White IR, Carpenter J, Horton NJ (2012) Including all individuals is not enough: lessons for
intention-to-treat analysis. Clin Trials 9(4):396–407
Wiens BL, Zhao W (2007) The role of intention to treat in analysis of noninferiority studies. Clin
Trials 4(3):286–291
Williams HC, Wojnarowska F, Kirtschig G, Mason J, Godec TR, Schmidt E et al (2017) Doxycy-
cline versus prednisolone as an initial treatment strategy for bullous pemphigoid: a pragmatic,
non-inferiority, randomised controlled trial. Lancet 389(10079):1630–1638
World Health Organization (2011) Guidelines for the programmatic management of drug-resistant
tuberculosis - 2011 update. World Health Organization
World Health Organization (2019) WHO consolidated guidelines on drug-resistant tuberculosis
treatment. World Health Organization, Geneva
World Health Organization (2020) Global tuberculosis report 2020. World Health Organization,
Geneva
Cross-over Trials
70
Byron Jones
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1326
Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1326
Challenges When Designing a Cross-over Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1329
Efficiency of a Cross-over Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1329
Example 1: 2 × 2 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1330
Plotting the Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1332
Two-Sample t-Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1334
Fitting a Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1335
Checking Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1336
Testing for a Difference in Carry-Over Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1337
Additional Use of the Random-Effects Linear Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1339
Williams Cross-over Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1340
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1340
Example of a Cross-over Trial with Five Treatments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1342
Example 3: Incomplete Block Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1345
Use of Baseline Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1348
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1349
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1350
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1350
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1350
Abstract
Cross-over trials have the potential to provide large reductions in sample size
compared to their parallel groups counterparts. In this chapter, three different
types of cross-over design and their analysis will be described. In the linear
models used to analyze cross-over data, the variability due to differences between
the subjects in the trial may be modeled as either fixed or random effects, and both
will be illustrated. In designs of the incomplete block type, where there are more
treatments than periods, the use of random subject effects enables between-
subject information on the treatment comparisons to be recovered, and how this
may be done will also be illustrated. Finally, if the so-called baseline measure-
ments have been taken at the beginning of each treatment period, these can be
used as covariates to reduce the variability of the estimated treatment compari-
sons. The use of baselines will be illustrated using the incomplete block design.

Keywords
Crossover trials · Fixed effects · Random effects · Incomplete block · Between-
subject information · Baseline information

B. Jones (*)
Novartis Pharma AG, Basel, Switzerland
e-mail: [email protected]
Introduction
Notation
When discussing trials in a generic sense, those taking part will be referred to as
subjects. Depending on the actual context, these may be either patients or healthy
volunteers.
In a cross-over trial, the time the trial takes to complete is divided into a sequence of
time periods. Within each period, a subject receives either one of the treatments to be
compared or no treatment. The periods where no treatments are received are referred to
as either run-in or wash-out periods. The reason for these will be explained later.
Typically, a subject receives a sequence of different treatments spaced out over all time
periods. In the trial as a whole, a particular (usually small) number of different
treatment sequences will be used, and an example of such a trial is given in Table 1,
below. The subjects that are available for the trial are randomly assigned to the
different sequences, and the group of subjects assigned to a particular sequence is
referred to as a sequence group. Often, the random assignment is done in a way that
ensures all sequence groups are the same (or approximately the same) size.
In the cross-over trial illustrated in Table 1, there are two treatments to be
compared (A and B), over four time periods. The first time period is the “run-in”
period where baseline measurements are taken and subjects get acclimatized to the
trial procedures; in the second period, each subject receives one of the two treat-
ments; the third period is a “wash-out” period, and in the fourth period each subject
gets the treatment that was not administered in the second period. The trial has two
sequence groups, defined by the order in which the treatments are administered: AB
and BA.
In general, the number of treatments, periods, and sequence groups in the design
will be denoted as t, p, and s, respectively. The number of subjects in sequence group
i will be denoted as n_i, i = 1, 2, . . ., s. Within sequence group i, the n_i subjects receive all
the treatments in the specified treatment sequence for that group. Note that the total
number of subjects in the trial is the sum of the n_i, i.e., $\sum_{i=1}^{s} n_i$.
If y_ijk (i = 1, 2, . . ., s; j = 1, 2, . . ., p; k = 1, 2, . . ., n_i) denotes the response observed
on the kth subject in period j of sequence group i, then a typical statistical model used
to explain a continuous response is:

$$y_{ijk} = \mu + \pi_j + \tau_{d[i,j]} + s_{ik} + e_{ijk}, \qquad (1)$$

where μ is a general intercept, π_j is an effect associated with period j ( j = 1, . . ., p),
τ_d[i,j] is the direct treatment effect associated with the treatment d[i, j], applied in period j
of sequence group i (d[i, j] = 1, . . ., t), s_ik is the effect associated with the kth subject in
sequence group i (i = 1, . . ., s, k = 1, . . ., n_i), and e_ijk is an independent random error
term, with zero mean and variance σ². We refer to this model as Model (1).
Note that the sik represent characteristics of the subjects, not the treatments. For
example, some subjects may have baseline values of the response of interest that are
higher or lower than others. Including the sik in the model takes account of some of
the variability in the response that is just the result of variation in baseline subject
characteristics. The sik do not account for variation in the treatment effects over the
periods: other parameters defined above account for this.
In a cross-over trial, there is the possibility that the effect observed in one period
may persist into the next period. This effect is known as a carry-over effect. If there is a
need to allow for carry-over effects, and these are additive to the treatment effects, then
the above model can be extended to include the carry-over effect term, λ_d[i,j−1]:

$$y_{ijk} = \mu + \pi_j + \tau_{d[i,j]} + \lambda_{d[i,j-1]} + s_{ik} + e_{ijk}. \qquad (2)$$

Obviously, there can be no carry-over effect in the first period, i.e., λ_d[i,0] = 0. We
refer to this model as Model (2).
Models (1) and (2) are examples of fixed-effects models: In such models, the
effects (for periods, treatments, carry-overs and subjects) are constant, but unknown
values, to be estimated.
An alternative model, referred to here as the random-effects model, assumes that
the s_ik are independent random effects with mean 0 and variance σ_s².
The s_ik have mean zero, because they represent the random relative effect of the
subject characteristics on the response (some values are higher and some are lower).
The amount the subject effects vary is measured by the variance parameter σ_s²
(larger σ_s² implies greater variability).
In addition, the sik are assumed to be independent of the eijk. Of course, it could be
argued, quite rightly, that what is being referred to here as a random-effects model is
actually a mixed-effects model because it contains both random and fixed effects.
However, to maintain consistency with the literature, the term random-effects model
will be used in this chapter.
If not stated otherwise, it is assumed in the following that the sik in Models (1) and
(2) are random variables, as defined above.
Then the responses from periods j and j′ ( j ≠ j′), on the same subject, have
variances Var(y_ijk) = Var(y_ij′k) = σ² + σ_s² and covariance Cov(y_ijk, y_ij′k) = σ_s². This
means that the correlation between any two responses on the same subject is
ρ = σ_s²/(σ_s² + σ²). As can be seen, the correlation increases as σ_s² increases.
More complex correlation structures can be defined by making particular assump-
tions regarding the correlation structure of the sik or the eijk, but we do not do this
here. See Chi and Reinsel (1989) for how this may be done to produce an auto-
regressive correlation structure.
When any two measurements on the same subject are positively correlated, the
estimate of any treatment comparison has a smaller variance than would be obtained
from a parallel groups design with the same number of subjects (see Piantadosi
(1997), Sect. 16.2.1). So when used appropriately, the cross-over design requires
fewer subjects than a parallel groups design to achieve a desired level of power to
reject the null hypothesis of no treatment difference.
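To make this precision advantage concrete, here is a minimal Python sketch that compares the variance of the estimated treatment difference in a 2 × 2 cross-over with that of a parallel groups design using the same 2n subjects. The values of σ², σ_s², and n are illustrative assumptions, not quantities taken from any trial in this chapter.

```python
# Variance of the estimated treatment difference (A - B):
# 2 x 2 cross-over (n subjects per sequence) vs. parallel groups
# (same 2n subjects, n per arm). Values below are assumed for illustration.
sigma2 = 21000.0    # within-subject variance (assumed)
sigma2_s = 25000.0  # between-subject variance (assumed)
n = 30              # subjects per sequence group / per arm

var_crossover = sigma2 / n                  # Var[(d1bar - d2bar)/2]
var_parallel = 2 * (sigma2 + sigma2_s) / n  # Var[ybar_A - ybar_B]

rho = sigma2_s / (sigma2_s + sigma2)
print(f"cross-over: {var_crossover:.1f}, parallel: {var_parallel:.1f}")
print(f"ratio = {var_crossover / var_parallel:.3f}"
      f" (= (1 - rho)/2 = {(1 - rho) / 2:.3f})")
```

The ratio of the two variances is (1 − ρ)/2, so the larger the within-subject correlation ρ, the greater the saving from the cross-over design.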
When the subject effects are assumed to be random, the model parameters can be
estimated using restricted maximum likelihood (REML), as introduced by Patterson
and Thompson (1971).
Challenges When Designing a Cross-over Design

When planning a cross-over trial, the following questions should be considered:
• Is the condition being treated a chronic condition for which a cross-over trial is
appropriate?
• Is the effect of the treatment likely to persist into a following period, i.e., are
carry-over effects expected?
• Can the carry-over effects be removed by including sufficiently long wash-out
periods?
• How many treatments are to be compared?
• How many periods can be used?
• Which treatment comparisons are important?
• What is the maximum sample size, i.e., the total number of subjects required?
Efficiency of a Cross-over Design

An important criterion when choosing between designs, especially when all pairwise
comparisons between the treatments are of equal importance, is the efficiency of the
design. The efficiency of a particular comparison is the ratio of a theoretical lower
bound on the variance of an estimated pairwise difference of the treatment effects to
the variance actually achieved in the design of interest (Patterson and Lucas (1962)).
For the calculation of these variances, it is assumed that the sik are fixed effects.
This is because, when comparing designs, the main interest is in maximizing the
within-subject information.
The lower bound is a technical construct that has proved very useful for calibrat-
ing how good a design is relative to the (statistically) best it could be. In this
technically best design, statistical theory states that the estimator of the difference
between two treatment effects that has minimum variance is the difference between
the simple (unadjusted) means of those two treatments. The variance of this estima-
tor is then the variance of the difference of these two means, which can easily be
calculated for any design. For example, in most designs, the planned number of
responses, r, to be recorded on each treatment (A, B, and so on) is the same. If each
sequence group has size n, the planned total number of responses is N = s × n × p
and r = N/t. The variance of the difference of two means in this case is

$$V_{bound} = \frac{2\sigma^2}{r} \qquad (3)$$

and this is the technical lower bound for the difference of two treatments in the
design under consideration. For some designs, the number of responses on a
particular pair of treatments, i and j, say, may not be the same, and equal r_i and r_j,
respectively. In this case, the technical lower bound is

$$V_{bound} = \sigma^2/r_i + \sigma^2/r_j. \qquad (4)$$

If V_d denotes the variance of the estimated difference of two treatment effects in the
design under consideration, the efficiency of that comparison is defined as

$$E = \frac{V_{bound}}{V_d}. \qquad (5)$$
Basically, E measures how large V_d is when compared to the lower bound: the
larger V_d is, the smaller E is.
The value of Vd depends on the design and the model assumed for the response. It
can be calculated using the formulae for the least squares estimators of the treatment
comparisons in the design under consideration (see Jones and Kenward (2014),
Sect. 3.5, for some examples). Although it involves the within-subject variance σ² as
a constant multiplier, this cancels out in the ratio in Eq. (5), as V_bound also includes σ²
as a constant multiplier. Therefore, E can be calculated prior to the collection of any
data. As already noted, E measures how large the variance of the difference of two
estimated treatment effects is in the design under consideration relative to the lower
bound. In an ideal design, E = 1, as then V_d = V_bound. The reason why V_d may be
greater than Vbound, and E < 1, is that the structure of the design is such that, even
after adjusting for the period and subject effects (and possibly, carry-over effects), in
the statistical analysis (i.e., after fitting Models (1) or (2)), the lower bound on the
variance is still exceeded for some or all estimated pairwise comparisons.
Examples of the efficiency values for two types of design will be given in sections
“Williams Cross-over Design” and “Example 3: Incomplete Block Design,”
respectively.
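As a small worked illustration of Eqs. (3), (4), and (5), the following Python sketch computes the lower bound and the efficiency E. The numbers in the usage line anticipate the Williams design example given later in the chapter (r = 10 and V_d = 0.2111σ²); since σ² cancels in the ratio, it is set to 1.

```python
# Efficiency E = V_bound / V_d of a pairwise treatment comparison,
# following Eqs. (3)-(5) of the text.
def v_bound(sigma2, r_i, r_j):
    """Technical lower bound, Eq. (4); reduces to Eq. (3) when r_i == r_j."""
    return sigma2 / r_i + sigma2 / r_j

def efficiency(v_d, sigma2, r_i, r_j):
    """Eq. (5): ratio of the lower bound to the design's actual variance."""
    return v_bound(sigma2, r_i, r_j) / v_d

# Williams design example from later in the chapter: r = 10, V_d = 0.2111*sigma2
print(efficiency(v_d=0.2111, sigma2=1.0, r_i=10, r_j=10))  # ~0.9474
```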
Example 1: 2 × 2 Design
The simplest cross-over design, known as the 2 × 2 design, compares two treatments
using two active treatment periods. The basic structure of this design is given, as
previously, in Table 1. Here s = 2, p = 2, and t = 2. In this design, as illustrated, there
are four periods: two active treatment periods (labeled as 1 and 2), a run-in period, and a
wash-out period. The purpose of the run-in period, as already noted, may be to
familiarize the subjects with the clinical trial procedures, collect baseline measure-
ments, remove the effects of a previous treatment, or confirm that a subject is able to
continue into the first period, for example. The purpose of the wash-out period is to
remove the effects of the drug given in Period 1 before the second drug is given in
Period 2. This should ensure that subjects are in the same clinical state at the start of
Period 2 as they were at the start of Period 1. It should be noted that in some trials the
wash-out period is not included. This may be because its inclusion will extend the time
the trial will take to complete beyond that which is reasonable or because there is
confidence that carry-over effects will not exist. Typically, to remove a pharmacological
carry-over effect of a single oral dose, the wash-out period should be at least five half-
lives of the drug (FDA (2013)). The half-life of a drug (see Clark and Smith (1981)) is
the time it takes for the concentration of the drug in the blood (or plasma) to reduce to
half of what it was at equilibrium. After one half-life, the concentration should drop by
(100 × 1/2)% = 50%, after two half-lives by 100(1/2 + 1/4)% = 75%, and by five
half-lives by 100(1/2 + 1/4 + 1/8 + 1/16 + 1/32)% ≈ 97%.
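The wash-out arithmetic above is easy to reproduce; a minimal Python sketch:

```python
# Fraction of a single oral dose remaining after k half-lives: (1/2)**k.
for k in (1, 2, 5):
    remaining = 0.5 ** k
    print(f"{k} half-lives: {100 * (1 - remaining):.1f}% eliminated")
# 1 -> 50.0%, 2 -> 75.0%, 5 -> 96.9% (~97%)
```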
For Models (1) and (2), the fixed effects are displayed in Table 2 and Table 3,
respectively.
The data that will be used to illustrate the analysis of the 2 × 2 design are based on
actual data from a completed Phase III randomized, double-blind, placebo controlled
trial to assess the effect of drug A on exercise endurance in subjects with moderate to
severe chronic obstructive pulmonary disease (COPD). Drug B is a placebo treat-
ment, i.e., a drug with no active pharmacological ingredients. Exercise endurance
time in seconds was measured using a constant-load cycle ergometry test after 3
weeks of treatment.
After a 1-week run-in period, subjects were randomized to receive either A or B.
After 3 weeks of treatment, the drugs were withdrawn, and there was a 3-week
wash-out period. Patients who began on A then crossed over to B, and those who
began on B crossed over to A. Due to the presence of the wash-out periods, it is
assumed that there are no carry-over effects present and Model (1) applies.
Patients who received A first belong to the AB sequence group (Group 1), and
subjects who received B first belong to the BA sequence group (Group 2).
The data in Tables 4 and 5 give examples of observations similar to those
collected in the actual trial. The actual data have been modified to preserve their
confidentiality and to enable key ideas to be presented without too much
Table 4 Example of observations from Group 1(AB) subjects (endurance time in seconds)
Subject Period 1(A) Period 2(B) Difference Sum
1 496 397 99 893
3 405 228 177 633
. . . . .
. . . . .
58 575 548 27 1123
59 330 303 27 633
Table 5 Example of observations from Group 2(BA) subjects (endurance time in seconds)
Subject number Period 1(B) Period 2(A) Difference Sum
2 270 332 −62 602
5 465 700 −235 1165
. . . . .
. . . . .
54 308 236 72 544
57 177 293 −116 470
complication. To save space, only the data for a few subjects in each group are
shown. Table 4 gives the data for the subjects in the AB group (Group 1). It can be
seen that there are two measurements for each subject, one for Period 1 when A was
received and one for Period 2 when B was received. Table 5 gives the corresponding
data for the subjects in the BA group (Group 2), where it is noted that in this group
the subjects received B first and then crossed over to A.
Also given in these two tables are the within-subject differences (Period 1 –
Period 2) and the sum of the two responses of each subject. These will be used to
estimate the treatment difference and the carry-over difference, respectively (see
sections “Two-Sample t-Test” and “Testing for a Difference in Carry-Over Effects”).
By taking the difference between the Period 1 measurement and the Period 2
measurement within each subject, a within-subject comparison of A versus B is
obtained in Group 1 and a within-subject comparison of B versus A is obtained in
Group 2.
In the following, the complete data from the 2 × 2 trial are analyzed to decide if
treatment A is superior to treatment B. Before that, however, we describe two useful
plots that focus on illustrating the strength of the difference (if any) between the
effects of A and B.
Plotting the Data

Before conducting formal hypothesis testing, it is always useful to plot the data that
are to be analyzed. For the 2 × 2 trial, a useful plot is the subject profiles plot, which
for this example is displayed in Fig. 1.

[Fig. 1 Subject profiles plot: exercise score by period, one line per subject, shown
separately for the sequence groups AB and BA]

The left-hand plot is for the AB group, and the right-hand plot is for the BA group.
Looking at the plot for the AB group, for
example, it can be seen that there is a pair of points and a line connecting them for
each subject. The left-hand point is the response in Period 1, and the right-hand point
is the response in Period 2. It can be seen that some subjects have a much longer
endurance time on A compared to B (see the second line from the top for the AB
group) whereas others actually have a shorter endurance time on A compared to B
(some substantially so). It is clear that the between-subject variability in the
responses is large, and this is one of the situations where a cross-over trial is likely
to be preferred to a parallel groups design.
However, although it might be possible to get the impression from Fig. 1 that A
increases endurance time compared to B, it is not very clear and a significance test
must be performed to get a definitive answer. As a preliminary to this, the mean of
the responses of the subjects per group and period is calculated, as displayed in
Table 6.
A very useful way to display these means is the groups-by-periods plot, as given
in Fig. 2.

[Fig. 2 Groups-by-periods plot: mean exercise time in each period, for each
treatment and sequence group]

The means have been joined in different ways. In the left-hand panel, the
lines connect the same treatment, and in the right-hand panel the lines connect the
same group. The left-hand panel emphasizes the treatment difference in each period:
It can be seen that in each period there is a consistent pattern where A is higher than
B. The right-hand plot emphasizes the within-subject mean changes: In the AB
group, there is a definite decline in endurance from the first period to the second, and
in the BA group there is a definite increase in endurance time from the first period to
the second. This displays the clearest evidence so far, that A is superior to B.
Two-Sample t-Test
Before immediately fitting Model (1), it is instructive to first see how the null
hypothesis of no treatment effect can be tested using the familiar two-sample t-test.
This requires the additional assumption that the within-subject differences in
response are normally distributed. In the absence of an assumption of normality, the
nonparametric Wilcoxon rank-sum test may be used (Jones and Kenward (2014),
Sect. 2.12 and Hollander and Wolfe (1999)), although the t-test is quite robust
against violations of this and other assumptions (Havliceck and Peterson (1974)).
The column headed “Difference” in both Tables 4 and 5 gives the within-subject
difference for each subject in each group. Each within-subject difference in Group 1
is an unbiased estimator of (τ1 − τ2) + (π1 − π2), where it is noted that the subject
effects and the general intercept have been canceled out. Similar reasoning leads to
the conclusion that the mean of the differences in Group 2 is an unbiased estimate of
(τ2 − τ1) + (π1 − π2). Consequently, the difference between the mean of the within-
subject differences in Group 1 and the mean of the within-subject differences in
Group 2 has expectation 2(τ1 − τ2). Therefore, a two-sample t-test comparing the
within-subject differences of the two groups is a test of the null hypothesis of no
difference between the treatments, H0: τ1 = τ2.
If d_ik = y_i1k − y_i2k denotes the within-subject difference of subject k in group i, and
$\bar{d}_{i\cdot}$ the mean of these differences in Group i, then a pooled estimator of the variance
of the differences, σ_d², is:

$$\hat{\sigma}_d^2 = 2\hat{\sigma}^2 = \sum_{i=1}^{2} \sum_{k=1}^{n_i} \left(d_{ik} - \bar{d}_{i\cdot}\right)^2 / (n_1 + n_2 - 2). \qquad (6)$$
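A minimal Python sketch of this analysis using scipy follows. The difference and sum values are only the few rows of Tables 4 and 5 reproduced above, so the output is purely illustrative and not the trial's result.

```python
# Two-sample t-tests on the within-subject quantities of the 2 x 2 design.
# d1, d2 are the "Difference" columns; s1, s2 the "Sum" columns. Only the
# rows shown in Tables 4 and 5 are used, so results are illustrative.
from scipy import stats

d1 = [99, 177, 27, 27]       # Group 1 (AB): Period 1 - Period 2
d2 = [-62, -235, 72, -116]   # Group 2 (BA): Period 1 - Period 2

# Treatment test, H0: tau1 = tau2
# (mean(d1) - mean(d2) estimates 2*(tau1 - tau2))
t_trt, p_trt = stats.ttest_ind(d1, d2, equal_var=True)

# Period test, H0: pi1 = pi2 (negate one group's differences first)
t_per, p_per = stats.ttest_ind(d1, [-d for d in d2], equal_var=True)

# Carry-over test, H0: lambda1 = lambda2
# (same t-test applied to the subject totals)
s1 = [893, 633, 1123, 633]
s2 = [602, 1165, 544, 470]
t_co, p_co = stats.ttest_ind(s1, s2, equal_var=True)

print(p_trt, p_per, p_co)
```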
Fitting a Linear Model

The analysis of variance table obtained from fitting Model (1) is given in Table 7,
and the corresponding estimated treatment and period effects are given in Table 8.
Although Table 7 does not give any new information regarding the treatment
effect, it does make clear that a large proportion of the total variability in the data is
accounted for by the between-subject variability. By this is meant that the between-
subjects SS are a large proportion of the Total SS (4,090,211/5,448,153 = 0.75). Also,
there is no evidence of a significant period difference.
In addition, from output not shown,
$\hat{\rho} = \hat{\sigma}_s^2/(\hat{\sigma}_s^2 + \hat{\sigma}^2) = 25262/(25262 + 21197) = 0.54$.
The period difference could also have been tested using the two-sample t-test. To
test for a period effect, the within-subject differences in one of the groups (e.g., Group 2)
Table 7 Analysis of variance for the endurance data [df = degrees of freedom, SS = sums of
squares, MS = mean square, and F = F-ratio]
Source df SS MS F P-value
Between-subjects 58 4,090,211 70,521
Periods 1 7136 7136 0.34 0.564
Treatments 1 143,639 143,639 6.78 0.012
Residual 57 1,208,209 21,197
Total 117 5,448,153
are multiplied by −1 before the test is applied. This will change the expected value of
the difference in means to (τ1 − τ2) + (π1 − π2) − [(τ1 − τ2) + (π2 − π1)] = 2(π1 − π2), and
hence a test of the null hypothesis H0: π1 = π2 is obtained. Then the approach used in
section “Two-Sample t-Test” can be followed, giving a t-test statistic of 0.580 and
a two-sided p-value of 0.564, in agreement (as expected) with the corresponding
values in Table 7.
Checking Assumptions
One advantage of fitting a linear model is that it conveniently allows the checking of
assumptions made about the model.
For example, it is easy to check if the residuals from the fitted model are
approximately normally distributed or if there are any outliers. The raw residual is
the difference between the actual response and its prediction from the model. The
standardized residual is the raw residual divided by its standard error.
Figure 3 is a quantile-quantile plot (or Q-Q plot) of the ordered standardized
residuals taken from Period 1, which is a visual check for normality. In this plot, the
quantiles of the sample are plotted against the theoretical quantiles of the normal
distribution. The quantile for a particular value in the dataset is the percentage of data
points below that value. For example, the median is the 50th quantile. It is not
necessary to plot the residuals from both periods because within a subject they add
up to zero and so it is only necessary to take one per subject. If the standardized
residuals are normally distributed, the points in the Q-Q plot should lie on, or close
to, the diagonal straight line. In Fig. 3, it can be seen that there is some deviation of
the points from the line at the extremes, indicating some skewness in the distribution
of the residuals. However, when some of the standard tests for normality are applied
to this sample of standardized residuals, there is no evidence to reject the null
hypothesis that the residuals are normally distributed. For example, the p-value for
the Anderson-Darling normality test (Anderson and Darling (1952)) is 0.436, and
the p-value for the Shapiro-Wilk test (Shapiro and Wilk (1965)) is 0.531.

[Fig. 3 Normal Q-Q plot of the standardized residuals from Period 1, with the points
lying close to the diagonal line apart from some deviation at the extremes]
There are five standardized residuals that are larger than 1.964 in absolute value,
with values of 2.13, 2.17, 2.44, 2.25, and 2.32, respectively. That is, 5/
59 ≈ 8.5% of the residuals are “large,” compared to the expected percentage of 5%
if they were truly realizations from the standard normal distribution. However, none
of the standardized residuals are excessively large (i.e., > 3), so there should be no
serious concerns regarding the normality of the residuals.
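The normality tests quoted above can be reproduced with scipy; the residuals array below is a placeholder, since the full set of standardized residuals is not listed in the text. Note that scipy's Anderson-Darling routine reports the test statistic and critical values rather than a p-value.

```python
# Normality checks on the Period 1 standardized residuals (placeholder data).
from scipy import stats

residuals = [0.3, -1.2, 0.8, 2.1, -0.4, 1.5, -2.3, 0.1, 0.6, -0.9]

ad = stats.anderson(residuals, dist="norm")  # statistic + critical values
w_stat, p_shapiro = stats.shapiro(residuals)

print("A-D statistic:", ad.statistic, "critical values:", ad.critical_values)
print("Shapiro-Wilk p-value:", p_shapiro)
```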
As noted earlier, if the assumption of normality is not fulfilled, a nonparametric
comparison of the treatment effects can be performed using the Wilcoxon rank-sum test
applied to the within-subject differences. For details, see Chapter 4 of Hollander and
Wolfe (1999), who also note that the loss of asymptotic relative efficiency when using
the Wilcoxon rank-sum test instead of the t-test is no greater than 13.6%. When the data
are actually independent and normally distributed, the loss in efficiency is 4.5%.
Testing for a Difference in Carry-Over Effects

An important assumption regarding the data in this example is that carry-over effects
are not present (or that they are equal and therefore do not enter into the expectation
of $\hat{\tau}_d$) and effectively Model (1) applies. Although testing for carry-over effects in
the 2 × 2 design is not recommended, for reasons that will be given shortly, it can be
done. For completeness, an explanation of how to do it is given below.
Suppose it is assumed that Model (2) applies. It will be recalled that the carry-over
parameters for treatments A and B are denoted by λ1 and λ2, respectively. The
difference between the carry-over effects is denoted by λ_d = λ1 − λ2.
In order to derive a test of the null hypothesis that λ_d = 0, it is noted that the
subject totals

$$t_{1k} = y_{11k} + y_{12k}$$

and

$$t_{2k} = y_{21k} + y_{22k}$$

have expectations

$$E[t_{1k}] = 2\mu + \pi_1 + \pi_2 + \tau_1 + \tau_2 + \lambda_1 + 2E[s_{1k}] + E[e_{11k}] + E[e_{12k}]$$

and

$$E[t_{2k}] = 2\mu + \pi_1 + \pi_2 + \tau_1 + \tau_2 + \lambda_2 + 2E[s_{2k}] + E[e_{21k}] + E[e_{22k}].$$

As the expectations of s_ik and e_ijk are both zero, the expectations of t_1k and t_2k
reduce to E[t_1k] = 2μ + π1 + π2 + τ1 + τ2 + λ1 and E[t_2k] = 2μ + π1 + π2 + τ1 + τ2 + λ2,
respectively.
If λ1 = λ2, then these two expectations are equal. Consequently, to test if λ_d = 0,
the familiar two-sample t-test can be applied to the subject totals.
The estimate of the difference in carry-over effects is $\hat{\lambda}_d = \bar{t}_{1\cdot} - \bar{t}_{2\cdot}$ and has
variance

$$\mathrm{Var}\left[\hat{\lambda}_d\right] = \sigma_T^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right),$$

where

$$\sigma_T^2 = 2\left(2\sigma_s^2 + \sigma^2\right).$$

An estimate of σ_T² is

$$\hat{\sigma}_T^2 = \sum_{i=1}^{2} \sum_{k=1}^{n_i} \left(t_{ik} - \bar{t}_{i\cdot}\right)^2 / (n_1 + n_2 - 2),$$

and the t-statistic for testing λ_d = 0 is $\hat{\lambda}_d$ divided by its estimated standard error,
with n_1 + n_2 − 2 degrees of freedom.
Additional Use of the Random-Effects Linear Model

If the results of the trial are such that every subject provides responses in both
periods, the estimates of the treatment effects and the corresponding significance
tests are identical for both the fixed- and random-effects models. This is because the
subject effect cancels out in any within-subject comparison. When this is not the
case, the random-effects model permits the data from the single responses, provided
by some subjects, to be included in the analysis.
To illustrate this, a new dataset is constructed by adding the data given in Table 9,
to the previously used data that were partly shown in Tables 4 and 5. These data are
for subjects who dropped out of the trial after the first period for various (treatment
unrelated) reasons.
The random-effects model is fitted using the Kenward-Roger adjustment
(Kenward and Roger (1997)). This adjustment provides a small-sample bias
correction for the estimated variances of the fixed effects, together with an
appropriately adjusted number of degrees of freedom for the test statistics.
Table 9 Period 1 endurance time (seconds) for subjects without a second period
Group Subject Treatment Time
1 60 1 130
1 61 1 302
1 62 1 653
1 63 1 467
2 64 2 364
2 65 2 155
2 66 2 226
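A random subject effects model of this kind can be fitted by REML with, for example, statsmodels in Python. The sketch below assumes a long-format data frame with hypothetical column names; note that statsmodels does not implement the Kenward-Roger adjustment (in R it is available through packages such as pbkrtest).

```python
# REML fit of the random subject effects model; subjects contributing only a
# Period 1 response (as in Table 9) are still used. Column names hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("endurance_long.csv")  # columns: subject, period, treatment, time

model = smf.mixedlm("time ~ C(period) + C(treatment)", data=df,
                    groups=df["subject"])
result = model.fit(reml=True)
print(result.summary())
```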
Williams Cross-over Design

Introduction
A limitation of the 2 × 2 cross-over design is that it does not permit any carry-over
effects to be estimated using within-subject information. The use of adequate
washout periods, of course, can remove any pharmacological carry-over effects
and make redundant the need to estimate them. However, when more than two
treatments are to be compared over several periods, the use of long wash-out periods
can be impractical as the longer a trial takes, the greater the chance that subjects will
drop out before completing all the periods. Longer trials also mean that the time a
successful new drug will take to reach the patients who need it will be extended. If
wash-out periods are removed, or if there is doubt that any carry-over effects can be
removed using the maximum permissible length of wash-out period, then cross-over
designs that allow the estimated treatment effects to be adjusted for the presence of
any carry-over effects will be needed. In other words, if there are any additive carry-
over effects that cannot be removed by the design of the study, suitable alternative
designs will have to be used.
Fortunately, there are many such designs for two or more treatments in two or
more periods. Tables of suitable designs are given in Chapter 4 of Jones and
Kenward (2014), and these should be referred to for examples of designs not
illustrated in this chapter. One class of designs is the Williams design. These designs
fall into two types, depending on whether t, the number of treatments, is even or odd.
If t is even, then the basic design requires t subjects and t periods. If t is odd, then the
basic design requires 2t subjects and t periods. Examples of these basic designs for
t ¼ 3 and t ¼ 4 are given in Tables 10 and 11, respectively. The basic designs,
obtained from published tables, or by computer search, can be thought of as designs
with one subject allocated to each sequence group. The rows of these tables give the
sequences (which define the sequence groups) to be used in the trial. In the actual
trial, the available subjects are allocated at random to the sequences to form the
sequence groups. Usually, the sequence groups are of the same size. Whereas the
2 × 2 design has two sequence groups, the Williams design has t or 2t sequence
groups. An algorithm to determine the basic sequences in a Williams design for all
values of t is given in Chapter 4 of Jones and Kenward (2014); a sketch of one such
construction is given below.
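For readers who want to generate the basic sequences themselves, the following Python sketch implements one standard construction (base row 0, 1, t−1, 2, t−2, . . . followed by cyclic shifts, with the mirror-image square appended when t is odd). It is a sketch of the general idea, not a transcription of the algorithm in Jones and Kenward (2014).

```python
# One standard construction of the basic Williams design sequences.
def williams(t):
    # Base row: 0, 1, t-1, 2, t-2, 3, ...
    base = [0]
    lo, hi = 1, t - 1
    for i in range(1, t):
        if i % 2 == 1:
            base.append(lo)
            lo += 1
        else:
            base.append(hi)
            hi -= 1
    # Cyclic shifts of the base row give a Latin square
    square = [[(x + i) % t for x in base] for i in range(t)]
    if t % 2 == 1:
        # Odd t: append the mirror-image sequences (2t sequences in total)
        square += [row[::-1] for row in square]
    return square

# t = 4: each treatment is preceded exactly once by every other treatment
for row in williams(4):
    print("".join("ABCDE"[x] for x in row))
```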
As with all clinical trials, at the planning stage, it is necessary to determine how
many subjects in total are needed to achieve a given power, e.g., 90%, to detect a
given treatment difference of interest at a specified significance level (e.g., 0.05). In
the example that follows for t = 5, 80 subjects were needed, requiring, therefore,
that 8 subjects be assigned at random to each of the 10 sequence groups.
To calculate the sample size required to ensure a given power for a particular
comparison, it is necessary to derive, for the particular design under consideration,
the standard error of the estimated comparison as a function of n, the size of each
sequence group (assuming equal group sizes are required). Once this has been
obtained, standard formulae for sample sizes to compare two groups can then be
modified to include this standard error, rather than the usual standard error for the
difference of two means. How to do this is beyond the scope of this chapter and
typically requires the use of purpose-written software.
The general property of a Williams design is that, if each pair of consecutive periods
is considered, and the counting is over all the sequence groups in the basic design, then
each treatment occurs an equal number of times before each other treatment, except
itself. This ensures that each of the t possible carry-over effects occurs with each of the
other t – 1 treatments an equal number of times. For example, in Table 10, each
treatment occurs twice before each of the other two treatments. In Table 11, each
treatment occurs once before each of the other three treatments. This type of design
balances out the carry-over effects, and as a consequence, the estimates of the
treatment comparisons adjusted for the presence of carry-over effects can be made
using within-subject information. In addition, these designs typically have variances of
the estimated treatment effects that are not excessively large compared to the situation
where carry-over effects are assumed to be absent and are not adjusted for.
Example of a Cross-over Trial with Five Treatments

The data for this example are a simulated version of those taken from a Phase III,
randomized, double-blind, placebo-controlled trial. The trial compared a novel drug
at two doses (labeled here as B and C, where B is the lower dose) with a placebo drug
(labeled here as A) in subjects with moderate to severe Chronic Obstructive Pulmo-
nary Disease (COPD). Also included in the trial were two other drugs which acted as
positive controls: Salbutamol and a combination of Salmeterol and Fluticasone
(labeled here as D and E, respectively). The trial population consisted of 80 adult
males and females (age 40 years and over) with a clinical diagnosis of moderate-to-
severe COPD. The efficacy measurement of interest was the Forced Expired Volume
in 1 second (FEV1) measured in liters and taken at 5 minutes postdose. The objective
of the trial was to determine if either B or C or both were superior to A (Placebo).
Other pairwise comparisons were considered to be of secondary importance.
As there are five different treatments, this study was designed as a Williams
design with five periods and ten sequence groups as given in Table 12. A wash-out
period of 7 days was used between the treatments periods. Although, the length of
the wash-out periods was considered adequate to remove any carry-over effects,
there was some uncertainty about this, and so a Williams design was used to ensure
that the treatment effects could be adjusted for carry-over effects. The primary
analysis model is therefore Model (2) and contains terms for subjects, periods,
treatments, and carry-over effects.
To calculate the efficiency of this Williams design, the lower bound is calculated
using r, the number of times each treatment occurs in the basic design, i.e., r =
N/t = 50/5 = 10. Hence, V_bound = 2σ²/10 = 0.2σ². The variance of a pairwise
comparison in this design, assuming a model without carry-over effects, is also
0.2σ². Therefore, the design efficiency, in the absence of carry-over effects, is
100 × 0.2/0.2 = 100%. If, as in this example, carry-over effects are included in
the model, the variance of a pairwise comparison increases to 0.2111σ², giving an
efficiency of 100 × 0.2/0.2111 = 94.74%.
There is clearly a price to be paid for allowing for the presence of carry-over
effects, but in this case it is not high: The relative increase in sample size is less than
6% (100/94.74 = 1.055). That is, allowing for differing carry-over effects requires
about 6% more subjects.
Eight subjects were randomized to each of the sequence groups. As an illustra-
tion, the data for the subjects in the first sequence group are given in Table 13.
As a first step in the analysis of these data, the raw treatment means obtained from
the complete dataset are plotted in Fig. 4, where the treatments have been labeled as
A = 1, B = 2, . . ., and E = 5.
It can be seen that there is evidence that treatment C (high dose of the novel drug)
gives the highest mean FEV1 response, and treatment A (Placebo) has the lowest
mean.
This will be explored further by fitting Model (2) to the responses. As every
subject gets every treatment, and there are no missing values, it does not matter if the
subject effects are considered to be fixed or random. An example of a design where it
does matter will be given in the next section.
The analysis of variance table obtained by fitting this model is given in Table 14.
From Table 14, it can be seen that the p-value for carry-over is significant at the
0.05 level, indicating that there is evidence of differences between the carry-over
effects. In retrospect, therefore, it was wise that the Williams design had been used.

[Fig. 4 Raw treatment means of FEV1 (liters) for treatments A, B, C, D, and E]
Table 14 Analysis of variance for Williams design [df = degrees of freedom, SS = sums of
squares, MS = mean square, and F = F-ratio]
Source df SS MS F P-value
Between-subjects 79 68.3215 0.8648 60.34 <0.0001
Period 4 0.0178 0.0044 0.31 0.8711
Treatment 4 0.6688 0.1672 11.67 <0.0001
Carry-over 4 0.1505 0.0376 2.63 0.0348
Residual 308 4.4145 0.0143
Total 399 73.9643
Table 15 Pairwise differences in the fitted means for the Williams design
Parameter Estimate Standard error One-sided p-value
B-A 0.0470 0.0194 0.0082
C-A 0.1275 0.0194 <0.0001
Example 3: Incomplete Block Design

Quite often it is not possible to use as many periods in a cross-over trial as there are
treatments. This could be because there is a concern that subjects may not want to
stay in the trial long enough to complete all t periods and will drop out before the end
of the trial. Another reason could be that, with the inclusion of lengthy wash-out
periods, the trial may take too long if all t periods have to be completed. In any case,
trials with p < t are not uncommon, and the third illustrative example has t = 4 and
p ¼ 3. Jones and Kenward (2014) give many examples of such incomplete block
designs and their sequence groups. Their associated efficiencies can be obtained by
referring to that book or by using the R package Crossover (Rohmeyer (2014)).
This example is also a multicenter, Phase III, cross-over trial where, for confiden-
tiality reasons, the actual data have been replaced with simulated values. The aim of the
trial was to assess the efficacy of a drug B in subjects with moderate to severe COPD.
The trial also included an active control, C, another treatment of interest (labeled as A)
and a placebo drug D. The main aim of the trial was to compare B with D, with the
other comparisons being of secondary interest. To limit the length of the trial, the four
treatments were given over three periods in an incomplete block design. The study
population consisted of a representative group of adult males and females aged 40 years
and over with a clinical diagnosis of moderate to severe COPD. The efficacy variable of
interest was the FEV1. The structure of the trial included a 14-day run-in period used to
assess eligibility of subjects for the study. Each treatment period lasted 14 days
followed by a 14-day wash-out period (after Periods 1 and 2 only).
The sequences used in this design are given in Table 16, and in the trial four
subjects were allocated to each sequence group. This design is fully balanced in the
sense that each estimated pairwise comparison has the same variance.
To calculate the efficiency of this incomplete block design, the lower bound is
calculated using the number of times each treatment occurs in the basic design, i.e.,
r = 9. Hence, V_bound = 2σ²/9 = 0.2222σ². The variance of a pairwise comparison in
the basic design, assuming a model without carry-over effects, is 0.2500σ². There-
fore, the design efficiency, in the absence of carry-over effects, is 100 × 0.2222/
0.2500 = 88.89%. If carry-over effects are included in the model, the variance of a
pairwise comparison increases to 0.3088σ², giving a relatively low efficiency of
100 × 0.2222/0.3088 = 71.96%. Fortunately, given the relatively long wash-out
periods, there was no expectation in this trial that carry-over effects would be
present.
As an illustration, the data from the first sequence group are given in Table 17.
The FEV1 values obtained at the end of each period are given for each subject,
along with the baseline measurement taken at the end of the run-in period and at the
end of each wash-out period before the start of the following period. These baseline
measurements will be ignored for now, and only the responses given in the columns
that have headings Period 1, Period 2, and Period 3 will be analyzed. The analysis
using baselines will be discussed in section “Use of Baseline Measurements.”
Table 16 Incomplete block cross-over design to compare four treatments in three periods
Group Period 1 Period 2 Period 3
1 A B C
2 B A D
3 C D A
4 D C B
5 A D B
6 B C A
7 C B D
8 D A C
9 A C D
10 B D C
11 C A B
12 D B A
As carry-over effects were not anticipated, they will not be included in the fitted
model. A new feature of the analysis of an incomplete block design is that the
question as to whether the subject effects sik are assumed to be fixed or random
effects is now highly relevant. If they are assumed to be fixed effects, the comparison
of treatments uses only the information that is available from within-subject com-
parisons between the responses on a subject. If the subject effects are assumed to be
random, then some additional information can be obtained from the between-subject
comparisons (often referred to as the inter-block information). This recovery of
information is typically most advantageous when the cross-over design has low
efficiency for the treatment comparisons or a low to moderate correlation between
the repeated measurements on each subject and a large number of subjects. As
already noted, this final example design has quite high efficiency (88.89%), so the
recovery of inter-block information may not make much of a difference to the
analysis of the data. Nevertheless, for the purposes of providing an illustration, the
inter-block information will be recovered. The parameter estimates obtained from
the fixed-subject effects model are given in Table 18. Note that the period effects are
the differences compared to Period 3, and the treatment effects are the differences
compared to treatment D.
When the subject effects are assumed to be random, the REML estimates of the
treatment and period effects are as given in Table 19. It can be seen that the standard
errors of the treatment estimates are slightly smaller when REML is used. It should be
noted that the Kenward-Roger adjustment (Kenward and Roger (1997)) has been used.
As part of the output from fitting this model using standard statistical software
(not shown), estimates of the variance components may also be obtained:
$\hat{\sigma}^2 = 361.35$ and $\hat{\sigma}_s^2 = 1062.78$. The within-subject correlation between any pair of
repeated measurements on a subject is ρ = σ_s²/(σ_s² + σ²) and is estimated as
1062.78/(1062.78 + 361.35) = 0.75. Therefore, not only does this design have a
Table 18 Parameter estimates obtained from model with fixed subject effects
Effect Estimate Standard error Two-sided p-value
Period 1 0.4381 3.8798 0.9103
Period 2 0.7450 3.8798 0.8482
Treatment A 31.5485 4.7518 < 0.0001
Treatment B 29.8806 4.7518 < 0.0001
Treatment C 29.1396 4.7518 < 0.0001
high efficiency, but the estimated within-subject correlation is also large, implying a
large between-subject variance. For well-designed cross-over trials, this is often the
case, and in this situation, as has been already mentioned, the recovery of inter-block
information is unlikely to make much difference to the estimates of the treatment
comparisons and their estimated standard errors.
Use of Baseline Measurements

When a measurement of the response is taken prior to the start of each period, these
baseline measurements may be useful in increasing the precision of the estimated
treatment effects. Whether they are useful or not depends on the degree and type of
correlation structure between the response and the baseline measurements. See
Chapter 5 of Jones and Kenward (2014) for more details.
Two approaches that may be considered when making use of baseline measure-
ments are to (1) analyze the change from baseline measurements, i.e., for each
subject and period replace the response by the difference between the response
and the baseline value for that period or (2) include the baseline measurements as
covariates. The inclusion of covariates is the recommended approach. See Senn
(2006) for further discussion on this.
In this section, the data from section “Example 3: Incomplete Block Design” are
reanalyzed, but now making use of the baseline measurements that were taken at the
start of each treatment period. Table 17 gives these baseline values for the subjects in
the first sequence group.
Typically, the analysis of the changes from baseline is only worth considering if
the response and its associated baseline are close together in time, compared to the
gap between periods and if the variability of the baselines is considerably less than
that of the response, which may happen if the baseline is, in fact, the average over
several baseline measurements, for example.
In a fixed subject effects model, it is sufficient to include the baseline (from each
period) as a single covariate.
However, if a random subject effects model is used to recover between-subject
information, two separate covariates must be included for each subject: Covariate (a)
is the average over the p baselines, and Covariate (b) is the difference from this
average of each of the p baseline measurements. See Chapter 5 of Jones and
Kenward (2014) for more details.
To illustrate the construction of the covariates, Table 20 shows their values for the
first subject in Table 17. For example, the value of Covariate (a) is
(256.25 + 278.53 + 284.06)/3 = 272.9467 ≈ 272.95, and the value of Covariate (b)
for Period 1 is 256.25 − 272.95 = −16.70.
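The two covariates are straightforward to construct in, for example, pandas; the sketch below reproduces the Table 20 values for the first subject (column names are hypothetical):

```python
# Covariate (a): subject mean of the period baselines.
# Covariate (b): deviation of each period's baseline from that mean.
import pandas as pd

df = pd.DataFrame({
    "subject":  [1, 1, 1],
    "period":   [1, 2, 3],
    "baseline": [256.25, 278.53, 284.06],
})

df["cov_a"] = df.groupby("subject")["baseline"].transform("mean")
df["cov_b"] = df["baseline"] - df["cov_a"]
print(df.round(2))  # cov_a = 272.95; cov_b = -16.70, 5.58, 11.11
```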
Assuming that there are no carry-over effects due to the inclusion of the washout
periods (i.e., Model (1) is assumed), the extension of the previous analysis to include
baselines is now illustrated.
Table 21 shows the estimate and its standard error for the comparison of B versus
D, for a selection of models and where the subject effects are either fitted as fixed or
Table 20 Factors, response, baselines, and covariates for the first two subjects
Subject Period Drug Baseline Response Covariate (a) Covariate (b)
1 1 1 256.25 265.16 272.95 −16.70
1 2 2 278.53 311.23 272.95 5.58
1 3 3 284.06 284.20 272.95 11.11
48 1 4 185.10 203.00 177.07 8.03
48 2 2 190.20 213.08 177.07 13.13
48 3 1 155.90 206.93 177.07 −21.17
Table 21 Parameter estimates and standard errors from analyses with and without baselines (B
versus D)
Fixed subject effects Random subject effects
Model Estimate Standard error Estimate Standard error
(A) Response only 29.881 4.752 30.553 4.726
(B) Change from baseline 26.196 3.701 26.820 3.668
(C) Single period baseline 27.392 3.500 27.871 3.324
(D) Both baseline covariates 27.391 3.500 28.034 3.323
random effects. All models contain factors for the subjects, periods, and treatments.
The fitted models are: (A) response only and no covariates; (B) the change from
baseline in each period is used as the response, and no covariates are added; (C) the
response is fitted with each period baseline used as the covariate; and (D) the
response is fitted with Covariates (a) and (b).
It is immediately clear that making use of the baselines does increase the precision
of estimation. For example, the standard error drops from 4.75 to 3.50 when one or
both covariates are added to the fixed effects model. In this example, the use of the
change from baseline is almost as good as using the covariates, although that may not
always be the case. Because it is already known from the analysis reported in the
previous section that there is little advantage in recovering the between-block infor-
mation, there is not a great difference in the respective results for the fixed and random
effects models. Indeed, as explained in Chapter 5 of Jones and Kenward (2014), the
use of both covariates is only needed when there is some incompatibility between the
within-subject and between-subject estimates. If there is little to no between-subject
information on a treatment comparison, as here, then it is not necessary to fit both
covariates. In fact, in such a situation, using the fixed effects model is recommended.
Summary and Conclusion

This chapter has concentrated on the design and analysis of continuous data from
cross-over trials. Examples where p = t and p < t have been given, and the use of
period baseline covariates has been illustrated. Designs also exist for p > t, and Jones
and Kenward (2014) should be consulted for examples of these.
As a thorough treatment of the analysis of binary and categorical data from cross-
over trials is beyond the scope of this chapter, the reader is referred to Chapter 6 of
Jones and Kenward (2014) for a detailed coverage. However, it should be noted that
for binary data from the 2 × 2 cross-over design, simple analyses based on 2 × 2
contingency tables are available. Chapter 2 of Jones and Kenward (2014) gives the
details.
Key Facts
In a cross-over trial, each subject receives a series of treatments over a fixed number
of periods. When used appropriately, cross-over designs require fewer subjects to
achieve a given level of precision or power compared to their parallel groups
counterparts. When there are more than two treatments or periods, choices between
designs can be made using their efficiencies. Models used to fit data from cross-over
trials may include subject effects as fixed or random variables. For designs of the
incomplete block type, the use of random subject effects permits the recovery of
between-subject information on treatment comparisons. The use of baseline mea-
surements, taken before the start of each period, may be useful to increase the
precision of the treatment comparisons.
Cross-References
References
Altman DG (1991) Practical statistics for medical research. Chapman and Hall, London
Anderson TW, Darling DA (1952) Asymptotic theory of certain “goodness-of-fit” criteria based on
stochastic processes. Ann Math Stat 23:193–212
Bretz F, Hothorn T, Westfall P (2016) Multiple comparisons using R. CRC Press, Boca Raton
Chi EM, Reinsel GC (1989) Models for longitudinal data with random effects and AR(1) errors. J
Am Stat Assoc 84:452–459
Clark B, Smith D (1981) Introduction to pharmacokinetics. Blackwell Scientific Publications,
Oxford
FDA (2013) Guidance for industry: bioequivalence studies and pharmacokinetic endpoints for
drugs submitted under an ANDA. Food and Drug Administration
Freeman P (1989) The performance of the two-stage analysis of two treatment, two period crossover
trials. Stat Med 8:1421–1432
Grizzle JE (1965) The two-period change-over design and its use in clinical trials. Biometrics
21:467–480
Havliceck LL, Peterson NL (1974) Robustness of the t-test: a guide for researchers on the effect of
violations of assumptions. Psychol Rep 34:1095–1114
Hollander M, Wolfe D (1999) Nonparametric statistical methods, 2nd edn. Wiley, New York
Jones B, Kenward MG (2014) Design and analysis of cross-over trials, 3rd edn. CRC Press,
Boca Raton
Kenward MG, Roger JH (1997) Small sample inference for fixed effects estimators from restricted
maximum likelihood. Biometrics 53:983–997
Patterson HD, Lucas HL (1962) Change-over designs. North Carolina Agricultural Station, Tech
Bull 147
Patterson HD, Thompson R (1971) Recovery of inter-block information when block sizes are
unequal. Biometrika 58:545–554
Piantadosi S (1997) Clinical trials: a methodological perspective. Wiley, New York
Rohmeyer K (2014) Crossover. R package, version 0.1-16
Senn S (2006) Change from baseline and analysis of covariance revised. Stat Med 25:4334–4344
Shapiro SS, Wilk MB (1965) An analysis of variance test for normality (complete samples).
Biometrika 52:591–611
Factorial Trials
71
Steven Piantadosi and Susan Halabi
Contents
Introduction 1354
Characteristics of Factorial Designs 1355
  Interactions or Efficiency, But Not Both Simultaneously 1355
  Factorial Designs Are Defined by Their Structure 1355
  Factorial Designs Can Be More Efficient 1357
Design and Analysis of Factorial Trials 1358
  Design Without Interaction 1358
  Design with Interaction 1360
  Designs with Biomarkers 1361
  Analysis of Factorial Trials 1363
Treatment Interactions 1364
  Factorial Designs Are the Only Way to Study Interactions 1364
  Interactions Depend on the Scale of Measurement 1366
  The Interpretation of Main Effects Depends on Interactions 1366
  Analyses Can Employ Linear Models 1368
Examples of Factorial Designs 1370
Partial, Fractional, and Incomplete Factorials 1372
  Use Partial Factorial Designs When Interactions Are Absent 1372
  Incomplete Designs Present Special Problems 1373
Summary 1373
Summary and Conclusions 1374
Key Facts 1374
Cross-References 1374
References 1374
S. Piantadosi (*)
Department of Surgery, Division of Surgical Oncology, Brigham and Women’s Hospital, Harvard
Medical School, Boston, MA, USA
e-mail: [email protected]
S. Halabi
Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, NC, USA
e-mail: [email protected]
Abstract
Factorial clinical trials test the effects of two or more therapies using a design that can estimate interactions between the therapies (Piantadosi 2017); this chapter revises, updates, and expands upon that reference. A factorial structure is the only design that can assess treatment interactions, so this type of trial is required for such therapeutic questions. When interactions between treatments are absent, which is not a trivial requirement, a factorial design can estimate each of several treatment effects from the same data. For example, two treatments can sometimes be evaluated using the same number of subjects ordinarily used to test a single therapy. When possible, this represents a striking efficiency. For these reasons, factorial designs have an important place in clinical trial methodology and have been applied in a variety of settings, particularly in disease prevention.
Keywords
Factorial clinical trials · Treatment interactions · Factorial designs
Introduction
Factorial clinical trials test the effects of two or more therapies using a design that
can estimate interaction between therapies (Piantadosi 2017). A factorial structure is
the only design that can assess treatment interactions. When interactions between
treatments are absent, a factorial design can estimate each of several treatment effects
from the same data. For example, two treatments can sometimes be evaluated using
the same number of subjects ordinarily used to test a single therapy. For these
reasons, factorial designs have an important place in clinical trial methodology,
and have been applied in a variety of settings, particularly in disease prevention.
Historically, control variables in experiments were called factors. For example, a factor can be defined by the presence or absence of a single drug. A factor can have more than one level, as with different doses of the same drug, so a factor is not strictly qualitative. The choice between two distinct treatments, A and B, is not a factor (assuming that neither is a placebo), because the levels of a factor represent graded amounts of the same treatment. Many factors have only two levels (present or absent) and are therefore both ordinal and qualitative. In a factorial design, all factors are varied systematically, with some groups receiving more than one treatment, and the experimental groups are arranged in a way that may permit testing whether a combination of treatments is better or worse than the individual treatments, although the power for such comparisons is often limited.
The method of varying more than one factor or treatment in a single study was
used in agricultural experiments before 1900. It was developed and popularized by
R. A. Fisher (1935, 1960) and Yates (1935), and used to great advantage in both
agricultural and industrial experiments. In medicine, factorial designs have been
used more in prevention trials than therapeutic studies.
Characteristics of Factorial Designs
The simplest factorial design has treatments A and B, and four treatment groups (Table 1). Assume n subjects are entered into each of the four treatment groups for a total sample size of 4n and a balanced (equal allocation) design. One group receives neither A nor B, a second receives both A and B, and the other two groups receive only one of A or B. This is called a 2 × 2 (two-by-two) factorial design. Although basic, this design illustrates many of the general features of factorial experiments. The design generates enough information to test the effects of A alone, B alone, and A plus B. The efficiencies in doing so will be presented below.
In prevention factorial trials, the treatments tested may target different diseases in the same cohort.
Factorial Designs Can Be More Efficient
Although their scope is limited, factorial designs offer important efficiencies or advantages when they are applicable. To illustrate how this occurs, consider the 2 × 2 design and the estimates of treatment effects that would result using an additive model for analysis (Table 3). Assume that the responses are group averages of some normally distributed response denoted by Y. The subscripts on Y indicate which treatment group it represents. Note that half the subjects receive each of the treatments; this is also true in higher order designs. For the moment, further assume that the effect of A is not influenced by the presence of B.
There are two estimates of the effect of treatment A compared with placebo in the design, $Y_A - Y_0$ and $Y_{AB} - Y_B$. If B does not modify the effect of A, it is sensible to combine or average them to estimate the overall, or main, effect of A, denoted here by $\beta_A$,

$$\beta_A = \frac{(Y_A - Y_0) + (Y_{AB} - Y_B)}{2} \qquad (1)$$

Similarly,

$$\beta_B = \frac{(Y_B - Y_0) + (Y_{AB} - Y_A)}{2} \qquad (2)$$
Thus, in the absence of interactions, which means the effect of A is the same with
or without B, and vice versa, the design permits the full sample size to be used to
estimate two treatment effects.
Now suppose that each subject's response has a variance $\sigma^2$ and that it is the same in all treatment groups. We can calculate the variance of $\beta_A$ to be

$$\mathrm{var}(\beta_A) = \frac{1}{4}\cdot\frac{4\sigma^2}{n} = \frac{\sigma^2}{n}$$
This is exactly the same variance that would result if A were tested against placebo
in a single two-armed comparative trial with 2n subjects in each treatment group.
Similarly,
$$\mathrm{var}(\beta_B) = \frac{\sigma^2}{n}$$
However, if we tested A and B separately, we would require 4n subjects in each
trial or a total of 8n subjects to have the same precision obtained from half as many
subjects in the factorial design. Thus, in the absence of interactions, these designs
allow great efficiency in estimating main effects. In fact, in the absence of interac-
tion, we get two trials for the price of one. Tests of both A and B can be conducted in
a single factorial trial with the same precision as two single-factor trials using twice
the sample size.
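This two-for-one efficiency is easy to check numerically. The following minimal sketch (not from the original text) simulates the 2 × 2 design under the stated assumptions of normal responses and no interaction; the effect sizes, σ, and n are arbitrary illustrative values. The main-effect estimator of Eq. 1 should show the same sampling variance σ²/n as a stand-alone two-arm trial with 2n subjects per arm.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Illustrative (hypothetical) values: baseline mean, additive effects of A
# and B, common standard deviation, subjects per factorial cell.
mu0, effA, effB, sigma, n, n_sim = 10.0, 2.0, 1.5, 4.0, 100, 20000

est_factorial, est_two_arm = [], []
for _ in range(n_sim):
    # 2 x 2 factorial, n per cell, additive model (no A-by-B interaction).
    y0 = rng.normal(mu0, sigma, n).mean()
    yA = rng.normal(mu0 + effA, sigma, n).mean()
    yB = rng.normal(mu0 + effB, sigma, n).mean()
    yAB = rng.normal(mu0 + effA + effB, sigma, n).mean()
    # Main effect of A as in Eq. 1: average of its two estimates.
    est_factorial.append(((yA - y0) + (yAB - yB)) / 2)

    # Stand-alone two-arm trial of A with 2n subjects per arm.
    ctrl = rng.normal(mu0, sigma, 2 * n).mean()
    trt = rng.normal(mu0 + effA, sigma, 2 * n).mean()
    est_two_arm.append(trt - ctrl)

print("theoretical variance sigma^2/n :", sigma**2 / n)
print("factorial main-effect variance :", np.var(est_factorial))
print("two-arm trial (2n/arm) variance:", np.var(est_two_arm))
```

All three printed variances agree (about 0.16 here), illustrating that the factorial trial delivers two main-effect tests at the cost of one.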
Design and Analysis of Factorial Trials
Design Without Interaction
Computing the sample size for a 2 × 2 factorial trial assuming no interaction between the arms is straightforward. One can use the sample size formula for designing a trial comparing two treatment arms (Moser and Halabi 2015; Rubinstein et al. 1981). Suppose we are interested in testing the effect of two regimens in men with advanced prostate cancer. Patients will be randomized with equal allocation to four treatment groups: standard of care, experimental arm A, experimental arm B, or experimental arms A + B (Table 4). Let $\lambda_{ij}$ be the hazard rate for the $i$th level ($i = 1, 2$) of treatment A and the $j$th level ($j = 1, 2$) of treatment B. Overall survival (OS) is the primary endpoint. Similar to Rubinstein et al. (1981), we make the following assumptions:
(a) There is an accrual period [0, T], where T is measured in years. During this accrual period, patients enter the clinical trial according to a Poisson process with n patients per year. Patients are randomized to k treatment groups, with probability $P_j$ ($0 < P_j < 1$) of assignment to treatment j, where $j = 1, 2, \ldots, k$ and $\sum_{j=1}^{k} P_j = 1$.
(b) Patients are followed for a period of τ years, known as the follow-up period. The total length of the study is T + τ years.
(c) In treatment group j, the failure or death times (the times from entry into the trial to failure or death) are i.i.d. exponentials with hazard $\lambda_j$. Moreover, the failure times across the treatment groups are assumed to be independent.
(d) The censoring times (the times from entry into the trial to loss to follow-up) are i.i.d. exponentials with common hazard $\Phi_c$.
(e) The failure times and censoring times are independent.
(f) The censoring mechanism is random censoring.
(g) The treatment effects of both treatments A and B are constant over time, i.e., the proportional hazards assumption holds.
For the comparison of experimental arm B with the standard of care, the required sample size is 750 patients. With 434 deaths, the log-rank test has 85% power to detect a hazard ratio of 0.75 (assuming that the median OS is 20 months and 26.7 months in the standard of care and treatment B arms, respectively) with a one-sided type I error rate of 0.025. In designing the above trial, we base the sample size on testing the hypothesis with the smallest effect size (comparing experimental arm A to the standard of care), since its sample size is larger than what is required for testing the second hypothesis (comparing experimental arm B to the standard of care). Thus, the target sample size for this trial is 900 prostate cancer patients.
In the prostate cancer example, we could adjust the type I error rate using the Bonferroni procedure (α/2) because we are testing two hypotheses. The required number of events is then 863 deaths for testing the first hypothesis, and the log-rank test has 85% power to detect a 20% decrease in the hazard rate (HR = 0.8) with a one-sided type I error rate of 0.0125. As expected, the number of events increases substantially, from 722 to 863 deaths (approximately a 20% increase). If we assume the same sample size (900 patients) and the same accrual rate of 30 patients/month, then the trial duration will be doubled, from 44 months to 88 months.
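The event counts quoted above can be reproduced, at least to rounding, with Schoenfeld's approximation for the number of events required by the log-rank test, $d = (z_{1-\alpha} + z_\beta)^2 / [p(1-p)(\log \mathrm{HR})^2]$, with allocation fraction p. The sketch below is offered only as a plausibility check; it is not necessarily the exact method of Rubinstein et al. used in the calculations above.

```python
import math
from scipy.stats import norm

def schoenfeld_events(hr, alpha_one_sided, power, p=0.5):
    """Approximate events required by the log-rank test (Schoenfeld);
    p is the allocation fraction for the experimental arm."""
    z_a = norm.ppf(1 - alpha_one_sided)
    z_b = norm.ppf(power)
    return (z_a + z_b) ** 2 / (p * (1 - p) * math.log(hr) ** 2)

# Arm B vs. standard of care: HR = 0.75, 85% power, one-sided alpha 0.025.
print(schoenfeld_events(0.75, 0.025, 0.85))   # ~433.9 -> 434 events
# Arm A vs. standard of care: HR = 0.8, unadjusted and Bonferroni-adjusted.
print(schoenfeld_events(0.80, 0.025, 0.85))   # ~721.3 -> about 722 events
print(schoenfeld_events(0.80, 0.0125, 0.85))  # ~863.1 -> about 863 events
```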
Several authors have argued that there is no need to adjust the type I error rate in designing a factorial trial when several experimental arms are compared to a control or standard of care group (Freidlin et al. 2008; Wason et al. 2014; Proschan and Waclawiw 2000). Their rationale is that such trials are designed to answer the efficacy question for each experimental drug separately, and as such the results of one comparison should not influence the results of the other hypothesis.
Design with Interaction
Peterson and George developed the sample size required in a 2 × 2 factorial trial in the presence of interaction when the endpoint is time-to-event (Peterson and George 1993).
Suppose we are interested in testing the interaction of the two treatments in the prostate cancer example (Table 4), and our objective is to compare the hazard ratios $\lambda_{21}/\lambda_{11}$ and $\lambda_{22}/\lambda_{12}$ (on a log scale). Let $\Delta_1 = \lambda_{11}/\lambda_{21}$, $\Delta_2 = \lambda_{12}/\lambda_{22}$, and $\gamma = \Delta_2/\Delta_1$ be, respectively, the hazard ratio for patients receiving treatment A versus standard of care, the hazard ratio for patients receiving treatments A and B versus treatment B, and the ratio of the hazard ratios (the interaction between the two treatments). The null hypothesis is that there is no interaction between the two treatments ($\gamma = 1$) versus the alternative hypothesis that there is an interaction between the treatment groups ($\gamma \neq 1$). We assume the median OS to be $M_{11} = 19$ months in patients randomized to standard of care, $M_{21} = 20$ months in patients randomized to experimental arm A, $M_{12} = 20$ months in patients randomized to experimental arm B, and $M_{22} = 30$ months in patients randomized to experimental arms A and B (Table 4). In order to test this hypothesis, we need to extend both the accrual period, from 30 months to 48 months, and the follow-up period, from 14 months to 30 months. The total sample size is now 1,440 patients, and
the expected number of events is 1,151 at the end of the trial. The power to detect an interaction of γ = 1.5 between the treatment arms is 85%, assuming a one-sided type I error of 0.025.
Another strategy for handling an interaction between the treatments is Simon's approach (Simon and Freedman 1997), which proposes inflating the sample size by 30%. Thus, the sample size for the prostate cancer trial becomes 1,170 patients (900 × 1.3), with 934 expected deaths. For the power computation, we assume an accrual of 30 patients/month over a 48-month period and a follow-up period of 30 months. With 1,170 patients, the power to detect an interaction of γ = 1.5 between the two factors is 77%. Applying Simon's reasoning to the computation performed above indicates, perhaps surprisingly, that the power for testing an interaction term (γ = 1.5) is around 80%, assuming a one-sided type I error of 0.025.
The main drawback of factorial trials is that they are often not designed to test for an interaction between the treatment groups, and as a result such tests are usually underpowered. While the examples above were based on testing superiority hypotheses, a factorial trial can be designed to test both a superiority and a non-inferiority (or equivalence) hypothesis. For example, CALGB 80203 (NCT00077233) was originally designed as a phase III 2 × 2 factorial trial to test two hypotheses. The first hypothesis was that the addition of C225 to FOLFOX or FOLFIRI chemotherapy would improve OS in untreated metastatic colon cancer patients. The second hypothesis was that FOLFOX and FOLFIRI are equivalent with respect to OS in untreated metastatic colon cancer patients. The trial was closed due to poor accrual.
Recently, Freidlin and Korn (2017) argued that in designing factorial trials in oncology, one needs to consider an interaction between the drugs, as it is very likely that the "no interaction" assumption is not valid. Moreover, the authors advocate matching the analysis with the trial design to achieve the objectives of the trial.
Designs with Biomarkers
A stratified biomarker design can assess a treatment–biomarker interaction with a proportional hazards model of the form

$$\lambda(t) = \lambda_0(t)\exp(\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2),$$

where $\lambda_0(t)$ is the baseline hazard, $x_1 = 0$ or 1 represents the treatment arm, $x_2$ is the biomarker level (usually measured as a continuous variable), and $x_1 x_2$ is the treatment arm–biomarker interaction term. The null hypothesis is $\beta_3 = 0$, indicating no interaction between treatment and the biomarker.
In the MARVEL trial, about 1,200 non-small cell lung cancer patients were to be randomized to either erlotinib or pemetrexed. The primary objective was to evaluate whether there are differences in progression-free survival between erlotinib and pemetrexed within the FISH-positive and FISH-negative subgroups. Unfortunately, the trial was closed due to slow accrual. A stratified biomarker trial often requires a large sample size, a validated cutoff point for the biomarker, and a biomarker prevalence high enough to allow testing of the treatment–biomarker interaction. When biomarkers are based on tumor tissue, the true status of the biomarker can be classified with error. A sample size formula for the stratified biomarker design that accounts for misclassification error has been provided, and this remains an active research area (Liu et al. 2014).
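As a sketch of how the interaction test might be carried out in practice, the following code simulates a stratified biomarker trial and fits the proportional hazards model above with the lifelines package, testing H0: β3 = 0 with a Wald test. All names, effect sizes, and distributions here are hypothetical illustrations, not values from MARVEL or any other trial.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(seed=7)
n = 400

# Hypothetical data: x1 = treatment arm, x2 = continuous biomarker.
x1 = rng.integers(0, 2, n)
x2 = rng.normal(0.0, 1.0, n)
b1, b2, b3 = -0.3, 0.2, 0.4                   # assumed true log-hazard effects
hazard = 0.1 * np.exp(b1 * x1 + b2 * x2 + b3 * x1 * x2)
event_time = rng.exponential(1.0 / hazard)    # exponential failure times
censor_time = rng.exponential(10.0, n)        # independent random censoring

df = pd.DataFrame({
    "T": np.minimum(event_time, censor_time),
    "E": (event_time <= censor_time).astype(int),
    "x1": x1, "x2": x2, "x1x2": x1 * x2,
})

cph = CoxPHFitter().fit(df, duration_col="T", event_col="E")
# Wald test of the interaction coefficient beta_3 (H0: beta_3 = 0).
print(cph.summary.loc["x1x2", ["coef", "p"]])
```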
Analysis of Factorial Trials
Factorial trials have been analyzed inconsistently in the literature (Freidlin and Korn 2017). Some authors maintain the view that an interaction test should be provided even if the trial was not designed to test for an interaction between the treatments (Montgomery et al. 2003; Korn and Freidlin 2016). As an example, consider the ECOG E1199 trial, a 2 × 2 factorial trial in the adjuvant treatment of breast cancer, in which patients were randomized to paclitaxel every 3 weeks, paclitaxel weekly, docetaxel every 3 weeks, or docetaxel weekly (Sparano 2008). The study was designed with 86% power, using a two-sided significance level of 0.05 for testing each of the primary factors (treatment: paclitaxel vs. docetaxel; schedule: weekly vs. every 3 weeks). No statistically significant differences in DFS were observed between patients randomized to paclitaxel versus docetaxel (p = 0.61), nor between weekly and every-3-weeks administration (p = 0.33). The authors performed a test of interaction between treatment and schedule (p = 0.003). Furthermore, the authors compared individual arms and demonstrated that patients receiving weekly paclitaxel had superior disease-free survival compared with patients who received paclitaxel every 3 weeks. The results persisted in long-term follow-up of these patients (Sparano et al. 2015).
Treatment Interactions
Factorial Designs Are the Only Way to Study Interactions
One of the most consequential features of factorial designs is that they are the only
type of trial design that permits study of treatment interactions. This is because the
factorial structure has groups with all possible combinations of treatments, allowing
the responses to be compared directly. Consider, again, the two estimates of the
effect of A in the 2 × 2 design, one in the presence of B and the other in the absence of B. The definition of an interaction is that the effect of A in the absence of B is different from the effect of A in the presence of B. This difference can be estimated by comparing

$$\beta_{AB} = (Y_A - Y_0) - (Y_{AB} - Y_B) \qquad (3)$$

with zero. If $\beta_{AB}$ is near zero, we would conclude that no interaction is present. It is straightforward to verify that $\beta_{AB} = \beta_{BA}$.
An important principle of factorial trials is evident by examining the variance of $\beta_{AB}$. Under the same assumptions as in section "Factorial Designs Can Be More Efficient,"

$$\mathrm{var}(\beta_{AB}) = \frac{4\sigma^2}{n},$$
which is four times larger than the variance for either main effect when an interaction
is known to be absent. Therefore, to have the same precision for an estimate of an
interaction effect as for a main effect, the sample size has to be four times larger. This
illustrates again why both the efficiency and interaction objectives cannot be simul-
taneously met in the same factorial study.
When there is an AB interaction, we cannot use the estimators given above for the
main effects of A and B (Eqs. 1 and 2), because they assume that no interaction is
present. In fact, it is not sensible to talk about an overall main effect in the presence
of an interaction because Eqs. 1 or 2 would have us average over two quantities that
71 Factorial Trials 1365
are not expected to be equal. Instead, we could talk about the effect of A in the absence of B,

$$\beta'_A = Y_A - Y_0 \qquad (4)$$

and, similarly, the effect of B in the absence of A,

$$\beta'_B = Y_B - Y_0 \qquad (5)$$

These are logically and statistically equivalent to what would be obtained from stand-alone trials.
In the 2 × 2 × 2 design, there are three main effects and four interactions possible, all of which can be estimated by the design. Following the notation above, the effects are

$$\beta_A = \frac{1}{4}\left[(Y_A - Y_0) + (Y_{AB} - Y_B) + (Y_{AC} - Y_C) + (Y_{ABC} - Y_{BC})\right] \qquad (6)$$

for treatment A,

$$\beta_{AB} = \frac{1}{2}\left[(Y_A - Y_0) - (Y_{AB} - Y_B) + (Y_{AC} - Y_C) - (Y_{ABC} - Y_{BC})\right] \qquad (7)$$

for the AB interaction, and

$$\beta_{ABC} = (Y_A - Y_0) - (Y_{AB} - Y_B) - (Y_{AC} - Y_C) + (Y_{ABC} - Y_{BC}) \qquad (8)$$

for the ABC interaction. The respective variances are $\sigma^2/2n$, $2\sigma^2/n$, and $8\sigma^2/n$. Thus the precision of the two-way interactions relative to the main effect is 1/4, and that of the three-way interaction is 1/16.
When certain interactions are present, here again it will not be sensible to think of the straightforward main effects. But the design can yield an alternative estimator for $\beta_A$, for $\beta_B$, or for other effects.
Suppose that there is an ABC interaction. Then, instead of $\beta_A$, an estimator of the effect of A in the absence of C would be

$$\beta'_A = \frac{1}{2}\left[(Y_A - Y_0) + (Y_{AB} - Y_B)\right]$$
which does not involve the ABC interaction and implicitly assumes that there is no AB interaction. Similarly, the AB interaction would be estimated by

$$\beta'_{AB} = (Y_A - Y_0) - (Y_{AB} - Y_B)$$

for the same reason. Thus, when high-order interactions are present, we must modify our estimates of lower order effects, losing some efficiency. However, factorial designs are the only ones that permit treatment interactions to be studied.
Interactions Depend on the Scale of Measurement
In the examples just given, the treatment effects and interactions have been assumed
to exist on an additive scale. This is reflected in the use of sums and differences in the
formulas for estimation.
In practice, other scales of measurement, particularly a multiplicative one, may be
useful. As an example, consider the response data in Table 6 where the effect of
treatment A is to increase the baseline response by 5 units. The same is true of B, and
there is no interaction between the treatments on this scale because the joint effect of
A and B is to increase the response by 5 + 5 = 10 units.
In contrast, Table 7 shows data in which the effects of both treatments are to
multiply the baseline response by 2.0. Hence, the combined effect of A and B is a
fourfold increase, which is greater than the joint treatment effect for the additive
case. If the analysis model were multiplicative, Table 6 would show an interaction,
whereas if the analysis model were additive, Table 7 would show an interaction.
Thus, to discuss interactions, we must establish the scale of measurement.
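A quick numerical check makes the point. The sketch below assumes a baseline response of 10 (the tables themselves are not reproduced here): one data pattern adds 5 units per treatment, the other doubles the response. Each pattern shows no interaction on its own scale but an apparent interaction on the other, and a log transform converts the multiplicative pattern into an additive one.

```python
import numpy as np

# Cell responses patterned after Tables 6 and 7, assuming baseline 10.
additive = {"0": 10.0, "A": 15.0, "B": 15.0, "AB": 20.0}        # +5 each
multiplicative = {"0": 10.0, "A": 20.0, "B": 20.0, "AB": 40.0}  # x2 each

def interaction(y, transform=lambda v: v):
    """Difference-of-differences interaction (Eq. 3) on a chosen scale."""
    t = {k: transform(v) for k, v in y.items()}
    return (t["A"] - t["0"]) - (t["AB"] - t["B"])

print(interaction(additive))                # 0: additive scale, no interaction
print(interaction(multiplicative))          # -10: interaction on additive scale
print(interaction(multiplicative, np.log))  # 0: none on the log scale
print(interaction(additive, np.log))        # ~0.12: interaction on log scale
```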
The Interpretation of Main Effects Depends on Interactions
In the presence of an interaction in the 2 × 2 design, there is not an overall, or main, effect of either treatment, because the effect of A differs depending on the presence or absence of B. In the presence of a small interaction, where all subjects benefit regardless of the use of B, we might observe that the magnitude of the overall effect of A is of some size and that therapeutic decisions are unaffected by the presence of an interaction (Fig. 2a). This is known as a quantitative interaction because it does not affect the direction of the treatment effect. For large quantitative interactions, it may not be sensible to talk about overall effects (Kahan 2013).
[Fig. 2 Hypothetical examples of (a, left) a quantitative interaction and (b, right) a qualitative interaction. Each panel plots curves on a 0.0–1.0 scale against time since random assignment (0–24 months) for the four treatment groups: No-No, No-Yes, Yes-No, and Yes-Yes.]
Analyses Can Employ Linear Models
Motivation for the estimators given above can be obtained using linear models. There has been little theoretical work on analyses using other models; one exception is the work by Slud (1994) describing approaches to factorial trials with survival outcomes. Suppose we have conducted a 2 × 2 factorial experiment with group sizes given by Table 1. We can estimate the AB interaction effect using a linear model of the form

$$E(Y) = \beta_0 + \beta_A X_A + \beta_B X_B + \beta_{AB} X_A X_B,$$

where the X's are indicator variables for the treatment groups and $\beta_{AB}$ is the interaction effect. For example,

$$X_A = \begin{cases} 1 & \text{for groups receiving treatment A,} \\ 0 & \text{otherwise.} \end{cases}$$
Stacking the observations gives the design matrix, in which there are four blocks of n identical rows representing each treatment group, and the columns represent effects for the intercept, treatment A, treatment B, and both treatments, respectively. The vector of responses has dimension $4n \times 1$ and is

$$Y' = [Y_{01}, \ldots, Y_{A1}, \ldots, Y_{B1}, \ldots, Y_{AB1}, \ldots],$$

and the least squares estimates are

$$\widehat{\beta} = (X'X)^{-1}X'Y.$$

When the interaction effect is omitted, the estimates will be denoted by $\widetilde{\beta}$. The covariance matrix of the estimates is $(X'X)^{-1}\sigma^2$, where the variance of each observation is $\sigma^2$.
We have

$$X'X = n\begin{bmatrix} 4 & 2 & 2 & 1 \\ 2 & 2 & 1 & 1 \\ 2 & 1 & 2 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix}, \qquad (X'X)^{-1} = \frac{1}{n}\begin{bmatrix} 1 & -1 & -1 & 1 \\ -1 & 2 & 1 & -2 \\ -1 & 1 & 2 & -2 \\ 1 & -2 & -2 & 4 \end{bmatrix},$$

and

$$X'Y = n\begin{bmatrix} Y_0 + Y_A + Y_B + Y_{AB} \\ Y_A + Y_{AB} \\ Y_B + Y_{AB} \\ Y_{AB} \end{bmatrix},$$

so that $\widehat{\beta} = (X'X)^{-1}X'Y$ corresponds to the estimators given in Eqs. 3, 4, and 5. However, if the test for interaction fails to reject and the $\widehat{\beta}_{AB}$ effect is removed from the model, then

$$\widetilde{\beta} = \begin{bmatrix} \tfrac{3}{4}Y_0 + \tfrac{1}{4}Y_A + \tfrac{1}{4}Y_B - \tfrac{1}{4}Y_{AB} \\[4pt] -\tfrac{1}{2}Y_0 + \tfrac{1}{2}Y_A - \tfrac{1}{2}Y_B + \tfrac{1}{2}Y_{AB} \\[4pt] -\tfrac{1}{2}Y_0 - \tfrac{1}{2}Y_A + \tfrac{1}{2}Y_B + \tfrac{1}{2}Y_{AB} \end{bmatrix}.$$
The main effects for A and B are given above in Eqs. 1 and 2.
The covariance matrices for these estimators are

$$\mathrm{cov}(\widehat{\beta}) = \frac{\sigma^2}{n}\begin{bmatrix} 1 & -1 & -1 & 1 \\ -1 & 2 & 1 & -2 \\ -1 & 1 & 2 & -2 \\ 1 & -2 & -2 & 4 \end{bmatrix} \qquad (11)$$

and

$$\mathrm{cov}(\widetilde{\beta}) = \frac{\sigma^2}{n}\begin{bmatrix} \tfrac{3}{4} & -\tfrac{1}{2} & -\tfrac{1}{2} \\[2pt] -\tfrac{1}{2} & 1 & 0 \\[2pt] -\tfrac{1}{2} & 0 & 1 \end{bmatrix} \qquad (12)$$
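The closed forms above are easy to verify numerically. This minimal sketch builds the 2 × 2 design matrix, solves the normal equations, and compares the coefficients with the cell-mean contrasts they should equal (intercept = Y0, A coefficient = YA − Y0, and so on); the cell means, σ, and n are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
n = 50

# Hypothetical cell means for groups 0, A, B, AB (with some interaction).
means = {"0": 10.0, "A": 13.0, "B": 12.0, "AB": 18.0}
y = np.concatenate([rng.normal(m, 2.0, n) for m in means.values()])

# Design matrix: intercept, X_A, X_B, X_A * X_B, in the block order above.
xa = np.repeat([0, 1, 0, 1], n)
xb = np.repeat([0, 0, 1, 1], n)
X = np.column_stack([np.ones(4 * n), xa, xb, xa * xb])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'Y
print(beta_hat)

# Closed-form counterparts from the observed cell means.
ybar = {k: y[i * n:(i + 1) * n].mean() for i, k in enumerate(means)}
print(ybar["0"], ybar["A"] - ybar["0"], ybar["B"] - ybar["0"],
      ybar["AB"] - ybar["A"] - ybar["B"] + ybar["0"])
```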
Examples of Factorial Designs
Factorial trials have received considerable attention in clinical research (Sikov et al. 2015). We list interesting examples of factorial trials in Table 8. Factorial designs are well suited to prevention trials for reasons outlined above, but many therapeutic trials have also utilized factorial designs because of the questions being addressed. One important classic study using a 2 × 2 factorial design is the Physicians' Health Study (Hennekens and Eberlein 1985). This trial was conducted in 22,000 physicians in the USA and was designed to test the effects of (1) aspirin on reducing cardiovascular mortality and (2) β-carotene on reducing cancer incidence. The trial is noteworthy in several ways, including its test of two interventions in unrelated diseases, use of physicians as subjects to report outcomes reliably, relatively low cost, and an all-male, high-risk study population. This last characteristic led to some criticism, which was probably unwarranted.
In January 1988, the aspirin component of the Physicians’ Health Study was
discontinued because evidence demonstrated convincingly that it was associated
with lower rates of myocardial infarction. The question concerning the effect of
β-carotene on cancer was addressed by continuation of the trial. In the absence of an
interaction, the second major question of the trial was unaffected by the closure of
the aspirin component and showed no benefit for β-carotene.
Another interesting example of a 2 × 2 factorial design is the Alpha-Tocopherol, Beta-Carotene (ATBC) Lung Cancer Prevention Trial, conducted in 29,133 male smokers in Finland between 1987 and 1994 (The ATBC Cancer Prevention Study Group 1994). In this study, lung cancer incidence was the sole outcome. It was thought possible that lung cancer incidence could be reduced by either or both interventions. When this trial was stopped in 1994, there had been 876 new cases of lung cancer in the study population during the trial. Alpha-tocopherol was not associated with a reduction in the risk of cancer. Surprisingly, β-carotene was associated with a statistically significant increased incidence of lung cancer. There was no evidence of a treatment interaction. The unexpected findings of this study have been supported by the results of another large trial of carotene and retinol.
The Fourth International Study of Infarct Survival (ISIS-4) was a 2 × 2 × 2 factorial trial assessing the efficacy of oral captopril, oral mononitrate, and intravenous magnesium sulfate in 58,050 subjects with suspected myocardial infarction (McAlister et al. 2003). No significant interactions among the treatments were found, and each main effect comparison was based on approximately 29,000 treated versus 29,000 control subjects. Among the findings was the demonstration that captopril was associated with a small but statistically significant reduction in 5-week mortality. The difference in mortality was 7.19% versus 7.69% (a difference of 143 deaths out of 4,319 total).
[Table 8 (continued) Examples of factorial trials, listing for each trial its design, cohort, treatments, and outcomes. Adapted from Piantadosi 2017.]
Partial, Fractional, and Incomplete Factorials
Partial, or fractional, factorial designs are those that omit certain treatment groups by design. A careful analysis of the objectives of an experiment, its efficiency, and the effects it can estimate may justify not using some groups. Because many cells contribute to the estimate of any effect, a design may achieve its intended purpose without some of the cells.
In the 2 × 2 design, all treatment groups must be present to permit estimating the interaction between A and B. However, for higher order designs, if some interactions are known biologically not to exist, certain treatment combinations can be omitted from the design while still permitting estimates of other effects of interest. For example, in the 2 × 2 × 2 design, if the interaction among A, B, and C is known not to exist, the ABC treatment cell could be omitted from the design and all the main effects could still be estimated, although with somewhat reduced efficiency. Similarly, the two-way interactions could still be estimated without $Y_{ABC}$. This can be verified from the formulas above, or numerically as in the sketch below.
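The following hypothetical check drops the ABC cell from a 2 × 2 × 2 design under a model whose ABC interaction is assumed to be zero, and confirms that the remaining seven effects stay estimable (the reduced design matrix has full rank).

```python
import numpy as np
from itertools import product

# All eight cells of the 2 x 2 x 2 design as (a, b, c) indicators.
cells = list(product([0, 1], repeat=3))
cells.remove((1, 1, 1))           # omit the ABC treatment cell by design

# Columns: intercept, three main effects, three two-way interactions.
# The ABC interaction is assumed to be zero, as in the text.
def row(a, b, c):
    return [1, a, b, c, a * b, a * c, b * c]

X = np.array([row(*cell) for cell in cells], dtype=float)
print(X.shape)                    # (7, 7): seven cells, seven parameters
print(np.linalg.matrix_rank(X))   # 7: all remaining effects are estimable
```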
Generally, partial high-order designs will produce a situation termed “aliasing” in
which the estimates of certain effects are algebraically identical to completely
different effects. If both are biologically possible, the design will not be able to
reveal which effect is being estimated. Naturally this is undesirable unless additional
information is available to the investigator to indicate that some aliased effects are
zero. This can be used to advantage in improving efficiency, and one must be careful
in deciding which cells to exclude. The reader is referred to Cox (1958) or Mason
et al. (1989) for a discussion of this topic.
The Women's Health Initiative (WHI) clinical trial was a 2 × 2 × 2 partial factorial design studying the effects of hormone replacement, dietary fat reduction, and calcium and vitamin D on coronary disease, breast cancer, and osteoporosis (Assaf and Carleton 1994; The Women's Health Initiative Study Group 1998; Shumaker et al. 1998). The study accrued 162,000 subjects into multiple clinical trials and finished the initial study period in 2005 (Rossouw et al. 2002). The hormone therapy trials randomized 27,347 women in an estrogen plus progestin study and an estrogen alone study. The
dietary component of the study randomized 48,835 women, using a 3:2 allocation
ratio in favor of the control arm and 9 years of follow-up. The calcium and vitamin D
component randomized 36,282 women. Such a large and complex trial was not
without controversy early on (Marshall 1993), and presented logistical difficulties,
questions about adherence, and sensitivity to assumptions that could only roughly be
validated during design.
Incomplete Designs Present Special Problems
Treatment groups can also be dropped from factorial plans without yielding a fractional replication. The resulting trials have been called incomplete factorial designs (Byar et al. 1993). In incomplete designs, cells are missing not by design intent but because some treatment combinations may be infeasible. For example, in a 2 × 2 design it may not be ethically possible to use a placebo group, in which case one would not be able to estimate the AB interaction. In other circumstances, unwanted aliasing may occur, or the efficiency of the design for estimating main effects may be greatly reduced. In some cases, estimators of treatment and interaction effects are biased, but there may still be reasons to use a design that retains as much of the factorial structure as possible; for example, it may be the only way to estimate certain interactions.
Summary
Factorial trials are efficient under the assumption of no interaction between the treatments, and this should be considered at the design stage. Factorial designs may also be used for the purpose of detecting an interaction between the factors if the trial is powered accordingly. Therefore, factorial trial designs are useful in two circumstances. When two or more treatments do not interact, factorial designs can test the main effects of each using smaller sample sizes, and with greater precision, than separate parallel group designs. When it is essential to study treatment interactions, factorial designs are the only effective way to do so. The precision with which interaction effects are estimated, however, is lower than that for main effects in the absence of interactions. A factorial trial designed to detect an interaction has no advantage in terms of the required sample size compared to a multi-arm parallel trial for assessing more than one intervention.
When there are many treatments or factors, these designs require a relatively large
number of treatment groups. In complex designs, if some interactions are known not
to exist or are unimportant, it may be possible to omit some treatment groups, reduce
the size and complexity of the experiment, and still estimate all of the effects of
biological interest. Extra attention to the design properties is necessary to be certain
that fractional designs will meet the intended objectives. Such fractional or partial
factorial designs are of considerable use in agricultural and industrial experiments
but have not been applied frequently to clinical trials.
Ethical and toxicity constraints may make it impossible to apply either a full
factorial or a fractional factorial design, yielding an incomplete design. The proper-
ties of incomplete factorial designs have not been studied extensively, but they may
be the best design in some circumstances.
Summary and Conclusions
A number of important, complex, and recent clinical trials have used factorial designs. Because of the low potential for toxicity, these designs have been applied most frequently in studies of disease prevention; examples include the Physicians' Health Study and the Women's Health Trial. In medical studies, the design is usually employed to achieve greater efficiency, since the treatments are unlikely to interact. Factorial trials are efficient only when there is no interaction between the treatments, and this should be considered at the design stage. Factorial designs must be used when the intent is to study interactions, in which case the trial must be powered accordingly. Interaction effects have roughly four times the variance of main effects and so require much larger sample sizes. If many treatments and interactions are possible, factorial designs may be impractical for therapeutic questions due to their large sample size and complexity. Other constraints, such as the need to omit treatments or to administer all therapies at full dose in some groups, may also make these designs unsuitable.
Key Facts
Factorial trials represent a structure that can test treatment-by-treatment interactions. In
the narrow circumstance that interactions are known to be absent, the factorial structure
can test the effects of two treatments using a sample size ordinarily used for a single
treatment. When interactions are the focus, sample size must be increased substantially
because they are estimated with less precision than “main effects.” Factorial designs are
often well suited to prevention questions but have been applied widely.
Cross-References
▶ Biomarker-Guided Trials
▶ Prevention Trials: Challenges in Design, Analysis, and Interpretation of Preven-
tion Trials
References
Apfel CC, Korttila K, Abdalla M, Kerger H, Turan A, Vedder I, . . . IMPACT Investigators (2004) A
factorial trial of six interventions for the prevention of postoperative nausea and vomiting. N
Engl J Med 350(24):2441–2451
Assaf AR, Carleton RA (1994) The Women’s Health Initiative Clinical Trial and Observational
Study: history and overview. R I Med 77(12):424–427
Byar DP, Piantadosi S (1985) Factorial designs for randomized clinical trials. Cancer Treat Rep 69
(10):1055–1063
Byar DP, Herzberg AM, Tan WY (1993) Incomplete factorial designs for randomized clinical trials.
Stat Med 12(17):1629–1641
Cook NR, Albert CM, Gaziano JM, Zaharris E, MacFadyen J, Danielson E, . . . Manson JE (2007)
A randomized factorial trial of vitamins C and E and beta carotene in the secondary prevention
of cardiovascular events in women: results from the Women’s Antioxidant Cardiovascular
Study. Arch Intern Med 167(15):13–27
Cox DR (1958) Planning of experiments. Wiley, New York
Design of the Women’s Health Initiative clinical trial and observational study. The Women’s Health
Initiative Study Group. (1998) Control Clin Trials 19(1):61–109
Fisher RA (1935) The design of experiments. Oliver and Boyd, Edinburgh/London
Fisher RA (1960) The design of experiments, 8th edn. Hafner, New York
Flather M, Pipilis A, Collins R, Budaj A, Hargreaves A, Kolettis T, . . . et al (1994) Randomized
controlled trial of oral captopril, of oral isosorbide mononitrate and of intravenous magnesium
sulphate started early in acute myocardial infarction: safety and haemodynamic effects. ISIS-4
(Fourth international study of infarct survival) Pilot Study Investigators. Eur Heart J 15(5):608–619
Freidlin B, Korn EL (2014) Biomarker enrichment strategies: matching trial design to biomarker
credentials. Nat Rev Clin Oncol 11(2):81–90
Freidlin B, Korn EL (2017) Two-by-two factorial cancer treatment trials: is sufficient attention being
paid to possible interactions? J Natl Cancer Inst 109(9). https://doi.org/10.1093/jnci/djx146
Freidlin B, Korn EL, Gray R, Martin A (2008) Multi-arm clinical trials of new agents: some design
considerations. Clin Cancer Res 14(14):4368–4371
Gönen M (2003) Planning for subgroup analysis: a case study of treatment-marker interaction in
metastatic colorectal cancer. Control Clin Trials 24(4):355–363
Grant A, Gordon B, Mackrodat C, Fern E, Truesdale A, Ayers S (2001) The Ipswich childbirth
study: one year follow up of alternative methods used in perineal repair. BJOG 108(1):34–40
Green S (2005) Factorial designs with time to event endpoints, pp 181–189
Henderson IC, Berry DA, Demetri GD, Cirrincione CT, Goldstein LJ, Martino S, . . . Norton L
(2003) Improved outcomes from adding sequential paclitaxel but not from escalating doxoru-
bicin dose in an adjuvant chemotherapy regimen for patients with node-positive primary breast
cancer. J Clin Oncol 21(6):976–983
Hennekens CH, Eberlein K (1985) A randomized trial of aspirin and beta-carotene among
U.S. physicians. Prev Med 14(2):165–168
Kahan BC (2013) Bias in randomised factorial trials. Stat Med 32(26):4540–4549
Korn EL, Freidlin B (2016) Non-factorial analyses of two-by-two factorial trial designs. Clin Trials
13(6):651–659
Lamas GA, Boineau R, Goertz C, Mark DB, Rosenberg Y, Stylianou M, . . . Lee KL (2014) EDTA
chelation therapy alone and in combination with oral high-dose multivitamins and minerals for
coronary disease: the factorial group results of the Trial to Assess Chelation Therapy. Am Heart
J 168(1):37.e5–44.e5
Li B, Taylor PR, Li J-Y, Dawsey SM, Wang W, Tangrea JA, . . . Blot WJ (1993) Linxian nutrition
intervention trials design, methods, participant characteristics, and compliance. Ann Epidemiol
3(6):577–585
Liu C, Liu A, Hu J, Yuan V, Halabi S (2014) Adjusting for misclassification in a stratified biomarker
clinical trial. SIM Stat Med 33(18):3100–3113
Lubsen J, Pocock SJ (1994) Factorial trials in cardiology: pros and cons. Eur Heart J 15(5):585–588
Marshall E (1993) Women’s health initiative draws flak. Science 262(5135):838
Mason RL, Gunst RF, Hess JL (1989) Statistical design and analysis of experiments: with
applications to engineering and science. Wiley, New York
McAlister FA, Straus SE, Sackett DL, Altman DG (2003) Analysis and reporting of factorial trials: a systematic review. JAMA 289(19):2545
Montgomery AA, Peters TJ, Little P (2003) Design, analysis and presentation of factorial
randomised controlled trials. BMC Med Res Methodol 3(26)
Moser BK, Halabi S (2015) Sample size requirements and study duration for testing main effects
and interactions in completely randomized factorial designs when time to event is the outcome.
Commun Stat Theory Methods 44(2):275–285
Peterson B, George SL (1993) Sample size requirements and length of study for testing interaction
in a 2 x k factorial design when time-to-failure is the outcome [corrected]. Control Clin Trials
14(6):511–522
Piantadosi S (2017) Factorial designs. In: Piantadosi S (ed) Clinical trials: a methodologic perspec-
tive. Wiley, Hoboken, pp 672–687
Proschan MA, Waclawiw MA (2000) Practical guidelines for multiplicity adjustment in clinical
trials. Control Clin Trials 21(6):527–539
Rosenberg J, Ballman KV, Halabi S, Watt C, Hahn O, Steen P, . . . Morris M (2019) CALGB 90601
(Alliance): randomized, double-blind, placebo-controlled phase III trial comparing gemcitabine
and cisplatin with bevacizumab or placebo in patients with metastatic urothelial carcinoma. J
Clin Oncol 37(15_suppl):4503–4503
Rossouw JE, Anderson GL, Prentice RL, LaCroix AZ, Kooperberg C, Stefanick ML, . . . Writing
Group for the Women’s Health Initiative (2002) Risks and benefits of estrogen plus progestin in
healthy postmenopausal women: principal results from the Women’s Health Initiative random-
ized controlled trial. JAMA 288(3):321–333
Rubinstein LV, Gail MH, Santner TJ (1981) Planning the duration of a comparative clinical trial
with loss to follow-up and a period of continued observation. J Chronic Dis 34(9):469–479
Shumaker SA, Reboussin BA, Espeland MA, Rapp SR, McBee WL, Dailey M, . . . Jones BN
(1998) The Women’s Health Initiative Memory Study (WHIMS): a trial of the effect of estrogen
therapy in preventing and slowing the progression of dementia. Control Clin Trials 19(6):604–
621
Sikov WM, Berry DA, Perou CM, Singh B, Cirrincione CT, Tolaney SM, . . . Winer EP (2015)
Impact of the addition of carboplatin and/or bevacizumab to neoadjuvant once-per-week
paclitaxel followed by dose-dense doxorubicin and cyclophosphamide on pathologic complete
response rates in stage II to III triple-negative breast cancer: CALGB 40603 (Alliance). J Clin
Oncol 33(1):13–21
Simon R, Freedman LS (1997) Bayesian design and analysis of two x two factorial clinical trials.
Biometrics 53(2):456–464
Slud EV (1994) Analysis of factorial survival experiments. Biometrics 50(1):25–38
Sparano JA (2008) Weekly paclitaxel in the adjuvant treatment of breast cancer. New Engl J Med
358(16):1663
Sparano JA, Zhao F, Martino S, Ligibel JA, Perez EA, Saphner T et al (2015) Long-term follow-up
of the E1199 phase III trial evaluating the role of Taxane and schedule in operable breast cancer.
J Clin Oncol 33(21):2353–2360
The ATBC Cancer Prevention Study Group (1994) The Alpha-Tocopherol, Beta-Carotene lung cancer prevention study: design, methods, participant characteristics, and compliance. Ann Epidemiol 4(1):1–10
Thrombosis prevention trial: randomised trial of low-intensity oral anticoagulation with warfarin and low-dose aspirin in the primary prevention of ischaemic heart disease in men at increased risk (1998) Lancet 351(9098):233
Wason J, Mander A, Stecher L (2014) Correcting for multiple-testing in multi-arm trials: is it
necessary and is it done? Trials 15(1):1–7
Yates F (1935) Complex experiments. Suppl J R Stat Soc B2(2):181–247
72 Within Person Randomized Trials
Gui-Shuang Ying
Contents
Introduction 1378
Rationale for Using Within Person Design 1379
The Requirements for Within Person Design 1380
  No Carry Across Effect 1380
  Within Person Correlation 1381
Trial Design Considerations 1381
  Bias 1382
  Recruitment 1382
  Efficiency 1383
  Generalizability 1383
  Other Considerations 1384
  Concurrent Treatment Versus Sequential Treatment 1385
  Alternatives to the Within Subject Control Design 1386
Power and Sample Size 1388
  Sample Size for Continuous Outcome 1388
  Sample Size for Binary Outcome 1389
Statistical Analysis 1391
  Analysis of Continuous Outcome Measures 1392
  Statistical Comparison of Binary Outcome 1393
Summary and Conclusion 1395
Key Facts 1395
References 1396
Abstract
Within person randomized trials (e.g., trials using within subject controls) are often employed for conditions that affect paired organs or two or more body sites of a person. In within person trials, the paired organs or body sites of a person are randomized to different study treatments, so that each person receives all treatments and serves as his or her own control.
Keywords
Within person trials · Within subject controls · Within person correlation ·
Inter-eye correlation · Paired design · Split-mouth design · Carry across effect
Introduction
Some diseases affect paired organs, body parts, or body sites of a subject (such as eyes,
ears, arms, or breasts) or two sites of a single organ, body part, or body site (such as
teeth or sides of the mouth). This feature provides a unique opportunity for designing
efficient clinical trial by using within subject controls. Different from conventional
parallel group trials where eligible persons are randomized to receive only one of the
study treatments (i.e., randomization unit is per person), within person trials randomize
each organ or body site to treatment (i.e., the unit of randomization is per organ or
body site), and each person receives all study treatments (Paré 1575).
The within person design is efficient in that it enables the comparison of two interventions within a person, eliminating between-person variation and hence improving the precision with which the treatment effect is estimated. Trials using within person controls do not have a generally accepted name, although some medical specialties have their specific terms, such as "contralateral design" or "paired design" in ophthalmology, "split-mouth design" in dentistry, and "split-face" or "split-body" design in dermatology (Machin and Fayers 2010). To encompass all possible medical specialties and to align with the terminology used in the published guidelines for Consolidated Standards of Reporting Trials (CONSORT) (Pandis et al. 2017), trials using within subject controls are called within person trials in this chapter. In ophthalmology, within person trials randomly assign one treatment to one eye and another treatment (or control) to the fellow eye of the same person (CAPT Research Group 2004). In dentistry, within person trials apply one treatment to some teeth and another treatment to other teeth of the same person (Pandis et al. 2013).
Within person trials, in which each person receives all study treatments, should not be confused with trials in which randomization and treatment are at the person level and all the organs or body sites of a person receive the same treatment and belong to the same comparison group. For example, in the Age-Related Eye Disease Study (AREDS), participants were randomized to one of four treatment groups – (1) zinc alone; (2) antioxidants alone; (3) a combination of antioxidants and zinc; or (4) placebo (The AREDS Research Group 1999) – to evaluate the effect of high doses of vitamin C, vitamin E, beta-carotene, and zinc on the progression of age-related macular degeneration (AMD) and cataract. Because the two eyes of each participant received the same systemic treatment (i.e., dietary supplements) and are in the same comparison group, the AREDS is not a within person trial; instead it can be viewed as a type of cluster randomized trial, which is not discussed here. Although within person trials have some similarities to cross-over trials, which are also not discussed here, they differ from cross-over trials in that treatment and outcome measures are at the organ or body site level rather than at the person level.
Within person trials have been used to evaluate a variety of preventive and therapeutic treatments (Pandis et al. 2017). Pandis et al. reported that approximately 2% of published randomized clinical trials employed a within person design (Pandis et al. 2017). Within person trials are most common in ophthalmology, dentistry, and dermatology. In dentistry, a review of 413 clinical trials published in 8 high-impact oral health journals from 1992 to 2012 found that 43 (10%) used a split-mouth design (Koletsi et al. 2014). Another study found that 67 (24%) of 276 trials published in implant dentistry journals between 1989 and 2011 used the split-mouth design (Cairo et al. 2012). In ophthalmology, Lee et al. found that a within person design was used in 9 (13%) of 69 ophthalmic trials published in the top four general clinical ophthalmology journals (American Journal of Ophthalmology, Archives of Ophthalmology, the British Journal of Ophthalmology, and Ophthalmology) between January and December of 2009 (Lee et al. 2012).
This chapter describes the rationale for and requirements of the within person design, considerations in designing within person trials, sample size and power determination, and appropriate statistical approaches for analyzing correlated data from within person trials. Examples of real within person clinical trials are used to demonstrate the design, sample size calculation, and statistical analysis for within person trials.
Rationale for Using Within Person Design
In parallel group trials that randomize persons to one of the treatments, the treatment effect is determined by comparing outcome measures between persons randomized to one treatment and persons randomized to another treatment (i.e., through a between-person comparison).
The Requirements for Within Person Design
No Carry Across Effect
The most important assumption underlying the use of the within person design is that the treatment effect is localized, i.e., that there is no spill-over (also called carry across) effect from therapy in one organ or body site to another. For example, the treatment of one tooth has no effect on another tooth, and the treatment of one eye has no effect on the fellow eye. In designing a within person trial to compare surgical versus nonsurgical treatment for periodontal disease, it is desirable to demonstrate that the sections of the mouth receiving surgical treatment are not affected by the sections receiving nonsurgical therapy, and vice versa. Unless this independence can be demonstrated, the estimated effect may not be that of surgical compared to nonsurgical therapy but rather the effect of surgical treatment in one section in conjunction with nonsurgical treatment in another section, and it is not possible to obtain an unbiased, independent estimate of either treatment.
The assumption of no carry across effect may not be met for some within person trials, even when the treatment is localized to an organ or body site. For example, in the initial One-eyed Trials of the Ocular Hypertension Treatment Study, the topical β-blocker was given to the eye with higher intraocular pressure (IOP), or to a randomly selected eye if both eyes had the same IOP. After 2–6 weeks of topical medication in the treated eye, it was found that the contralateral fellow eye had a mean (± standard deviation) IOP reduction of 1.5 ± 3.0 mm Hg, compared to a mean reduction of 5.9 ± 3.4 mm Hg in the treated eye, suggesting that the topical β-blocker has a contralateral effect (Piltz et al. 2000). This carry across effect is likely due to systemic absorption of the β-blocker, primarily through the nasolacrimal mucosa, resulting in transport of the β-blocker to the contralateral eye through the blood stream (Piltz et al. 2000).
Within Person Correlation
Measures taken from paired organs or body sites of the same person are usually correlated. The within person design takes advantage of high within person correlation, which makes the within person trial more efficient than the parallel group design. Reported correlation coefficients in ophthalmology (Katz 1988), dermatology (Van et al. 2015), and orthodontics (Pandis et al. 2014) were 0.80, 0.80, and 0.50, respectively. Balk et al. (2012) calculated 811 within person correlation coefficients from 123 studies. The median within person correlation across all studies was 0.59 (interquartile range 0.40–0.81). No heterogeneity of correlation values across outcome types and clinical domains was observed (Balk et al. 2012). In ophthalmology, a wide variety of inter-eye correlation coefficients has been reported for various eye diseases and outcome measures (Ying et al. 2017a, 2018; Maguire 2020). The inter-eye correlation in refractive error can be as high as 0.90 in preschoolers but is only 0.43 in patients with neovascular age-related macular degeneration (Ying 2017). The inter-eye agreement in referral-warranted retinopathy of prematurity was reported to be 0.80 (Ying 2017). The gain in efficiency from the within person design is positively related to the magnitude of the within person correlation: the higher the within person correlation, the greater the gain in efficiency and the greater the reduction in sample size compared to the parallel group design, as the sketch below illustrates.
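As an illustration of that gain, the sketch below applies the standard paired-difference sample size formula, n = (z₁₋α/₂ + z_β)² · 2σ²(1 − ρ)/δ², where ρ is the within person correlation. The formula and the numbers chosen (δ, σ) are a generic textbook calculation, not values taken from this chapter.

```python
import math
from scipy.stats import norm

def pairs_needed(delta, sigma, rho, alpha=0.05, power=0.90):
    """Approximate number of persons (pairs of organs/sites) needed to
    detect a mean difference `delta` with within person correlation `rho`."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    var_diff = 2 * sigma**2 * (1 - rho)   # variance of within-person difference
    return math.ceil(z**2 * var_diff / delta**2)

# The required number of pairs shrinks by the factor (1 - rho).
for rho in [0.0, 0.5, 0.8, 0.9]:
    print(rho, pairs_needed(delta=1.0, sigma=2.0, rho=rho))
```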
Trial Design Considerations
In the simplest within person trials, two interventions (one of which may be a control or standard treatment) are applied to the two paired organs or body sites of a person through randomization, either concurrently or sequentially, and the outcome measures are assessed at each organ or body site. For example, in the Complications of Age-Related Macular Degeneration Prevention Trial (CAPT), designed to evaluate whether prophylactic laser treatment to the retina can prevent the incidence of advanced-stage AMD, 1,052 participants with at least 10 large drusen (>125 μm) in both eyes were enrolled, with one eye randomized to laser treatment for large drusen and the contralateral eye as control (i.e., without treatment) (The CAPT Research Group 2004). Each participant was followed annually for at least 5 years, to compare the incidence rates of advanced-stage AMD between the treated eye and the contralateral observed eye of the same participant.
Within person randomized trials present some particular challenges. When contemplating a within person design for a clinical trial, careful consideration should be given to issues of bias and efficiency and to the consequences for recruitment and statistical analysis.
Bias
One potential problem of using within subject controls is the possibility of a carry
across effect. For example, an intervention applied to one eye can affect the other eye
systemically (Piltz et al. 2000); treatment in an area of the mouth can affect other
areas of the mouth locally (Lesaffre et al. 2009; Pandis et al. 2013); success or failure
of the first replacement hip in a patient requiring bilateral hip replacement can affect
the success or failure of the second hip (Lie et al. 2004).
The carry across effect has been the main concern in within person trials (Piltz et al. 2000; Lesaffre et al. 2009; Lie et al. 2004). A carry across effect can bias the estimates of treatment efficacy and tends to dilute the treatment effect. However, the exact magnitude of bias due to a carry across effect is difficult to estimate (Hujoel 1998); thus, the true treatment effect of the intervention cannot be accurately estimated. What can be estimated is the treatment effect contaminated with the carry across effect. If the intervention is thought to have a carry across effect, randomizing individual patients to treatment groups (instead of using within subject controls) is preferred.
The carry across effect is similar to the temporal carry over effect in cross-over trials, in which lingering effects of the first intervention may require adjustment for different baselines before the second intervention or the use of washout periods. A within person design is not appropriate if a substantial carry across effect or contamination is expected. For example, in a study of oral lichen planus (Poon et al. 2006), the topical treatments applied to each side of the mouth can have a serious carry across effect, so the split-mouth design should not be used. Similarly, in ophthalmology, the intravitreal injection of anti-vascular endothelial growth factor (anti-VEGF) in the study eye can carry across to the contralateral eye (Acharya et al. 2011); a paired design that randomizes one eye to an anti-VEGF agent and the contralateral eye to control, or that compares the efficacy of two anti-VEGF agents within the two eyes of the same subject, is therefore not ideal.
Recruitment
Within person trials require recruitment of individuals with similar disease conditions affecting paired organs or body sites. However, identifying such participants can sometimes be difficult, endangering recruitment. In ophthalmology, many eye diseases are symmetric, such as refractive error, age-related macular degeneration, and retinopathy of prematurity (Katz 1988; Quinn et al. 1995). This may not be the case in dentistry. It may be easy to find an individual with one tooth having a cavity, but it can be challenging to find an individual with two teeth having cavities of similar size, particularly on opposite sides of the mouth. For example, in one periodontal disease trial, over 1500 patients were screened to find only 12 patients with symmetric periodontal lesions eligible for the study (Smith et al. 1980). This difficulty in identifying subjects with similar disease conditions can be a major obstacle to achieving the sample size required by a within person trial, even though the required sample size is smaller than for a parallel group trial. The stricter the criterion for similarity of disease in paired organs or body sites, the more difficult recruitment will be. Such highly selective recruitment of participants also hurts the generalizability of the trial findings.
In addition, the requirement that each participant receive all interventions could make some patients unwilling to participate in the trial. Bunce et al. reported that in ophthalmic trials, some patients had very strong opinions against enrolling both eyes into within person trials because it made them feel like experimental units rather than people. These patients were most comfortable with enrolling only one eye into the study even though both eyes were eligible (Bunce and Wormald 2015).
Efficiency
The within person design takes advantage of the within person correlation (ρ) in outcome measures to gain efficiency and reduce the sample size compared to the parallel group design. Assuming all the parameters for the sample size calculation are the same in the within person trial and the parallel group trial, the ratio between the sample size (in terms of number of subjects) for the within person trial (N_paired) and for the parallel group trial (N_parallel) can be calculated using the following formula (Wang and Bakhai 2006):

N_paired / N_parallel = (1 − ρ) / 2    (1)

From Eq. (1), it is clear that the higher the within person correlation, the smaller the sample size ratio, and the greater the gain in efficiency. If the within person correlation is low, the gain in efficiency can be minimal, and the within person design may not be very appropriate.
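As a quick numerical check (this snippet and its names are ours, not the chapter's), Eq. (1) can be evaluated in Python for a range of correlations:

def sample_size_ratio(rho):
    """N_paired / N_parallel for within person correlation rho, per Eq. (1)."""
    return (1.0 - rho) / 2.0

for rho in (0.0, 0.25, 0.50, 0.75, 0.90):
    print(f"rho = {rho:.2f}: N_paired/N_parallel = {sample_size_ratio(rho):.3f}")

At ρ = 0.50 the ratio is 0.25; that is, the within person trial needs only a quarter as many subjects as the parallel group trial, consistent with Table 1 below.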
Generalizability
Within person trials require each participant to have similar disease in paired organs or body sites; it is uncertain whether within person trial results from patients with similar bilateral disease can be generalized to patients with unilateral or asymmetric disease.
Other Considerations
As outlined in the CONSORT guidelines for within person trials (Pandis et al. 2017),
the design of within person trial also has to consider the following questions:
• What are the eligibility criteria for enrollment? The within person design needs to consider two sets of eligibility criteria: the eligibility of the individual participants and the eligibility of the organs (eyes) or body sites. For example, to be eligible for the CAPT study, participants had to be at least 50 years of age and free of conditions likely to preclude 5 years of follow-up (person level eligibility); in addition, each eye had to have 10 or more drusen at least 125 μm in diameter within 2 disc diameters of the fovea, and the standardized visual acuity had to be 20/40 or better in each eye (eye level eligibility) (The CAPT Research Group 2004).
• What is the outcome of the within person trial? The outcome of a within person trial should be specific to the organ or body site. The within person design is not appropriate for a trial whose outcome is assessed at the person level. For example, in the Dry Eye Assessment and Management (DREAM) Study (The DREAM Investigator Group 2018), although dry eye disease is mostly bilateral (>90% of participants had both eyes meet the enrollment criteria), the treatment is systemic and the outcome measure of dry eye symptoms, the Ocular Surface Disease Index (OSDI), is measured at the person level, so it was not appropriate to use the within person design for the DREAM Study.
• Can the assessment of outcomes for efficacy and safety be adversely impacted by
the decision to treat the same patients with two different treatments?
• Are the paired organs/sites for each participant similar in terms of baseline
characteristics such as location, anatomy (e.g., tooth type), and severity of
disease?
• Will the treatments be administered concurrently or sequentially to the same
participant? If treatments are given sequentially, will baseline information be
recorded at the time of randomization or at the time of treatment administration?
Similarly, if the treatments are given sequentially, the outcome of the first intervention could affect the outcome of the second, and hence the applicability of the within person trial findings to other settings can be questionable. For example, early versus late loaded implants, or one hip replacement at a time, can potentially influence the outcome. In some cases, however, the sequential approach is standard clinical practice, as in cataract surgery (Vasavada et al. 2012).
• How will the order of treatments and allocation to paired organs/body sites be
determined (e.g., right versus left)? In within person trials, randomization is
needed not only to determine which intervention is applied to which organ or
body site but also to determine which organ or body site is treated first (partic-
ularly if paired organs or body sites are not treated concurrently).
• Will there be any provision to monitor whether the assigned treatment is actually
applied to the correct organ or body site?
• Will the outcome evaluator be masked to the treatment assignment of each organ
or body site, and if so how?
• How will the blinding of treatment assignment to organs/body sites of the same subject be implemented, and could accidental unblinding of the treatment of one organ affect the other organ?
When a subject is assigned to receive two treatments in a within person trial, a decision needs to be made on whether the two treatments given to paired organs or body sites are concurrent or sequential. With concurrent treatment, the two treatments are delivered at the same time, or within a trivial interval following a specific or random treatment order, whereas with sequential treatment there is a non-trivial time lag between the two interventions. With concurrent treatment, loss to follow-up will automatically be the same across treatment groups, but side effects (particularly systemic adverse events) may be difficult to attribute to a specific treatment. Another concern with concurrent treatment is possible confusion as to which organ or body site receives which treatment, particularly over a long treatment period. Traditional methods for monitoring treatment compliance may be insufficient in within person trials when participants are responsible for administering the treatment themselves (e.g., topical eye drops).
Clinical trials for conditions that occur in multiple organs or body sites require careful consideration of study design, because the design has strong implications for patient enrollment, statistical analysis, and the presentation of results. Besides the within person design, possible alternative designs include:
• Include only one organ/site per subject, chosen through random selection, by using the organ/site with the most severe disease, or at the discretion of the clinician or patient. For example, in the Comparison of Age-related Macular Degeneration Treatments Trials (CATT), 1185 participants with neovascular AMD were randomized to treatment with intravitreal injection of ranibizumab or bevacizumab on a monthly or PRN schedule. The trial required each study eye to have active subfoveal choroidal neovascularization (CNV) and visual acuity between 20/25 and 20/320. The CATT enrolled only one eye per patient. When both eyes of a participant met the enrollment criteria, the ophthalmologist and the patient decided which eye would be enrolled (The CATT Research Group 2011).
The advantage of this design is the simplicity of the design and of the statistical analysis of the trial data, but it may forgo the opportunity to collect more information efficiently.
• Randomize patients to a treatment, and treat paired organs/body sites with the same treatment. This is a cluster randomized trial in which the clusters are individual patients. For example, in a multi-center randomized clinical trial to evaluate the efficacy of intravitreal injection of bevacizumab for stage 3+ ROP, 150 infants with stage 3+ ROP were randomized to receive intravitreal bevacizumab or conventional laser therapy in both eyes (Mintz-Hittner et al. 2011).
• Enroll a mixture of participants with unilateral and bilateral disease. Although many diseases occur in paired organs or multiple body sites, the extent and severity of disease may not be the same. A within person design that requires similar disease conditions in paired organs or body sites may significantly limit recruitment potential and also make the results difficult to generalize to patients with unilateral disease. In ophthalmology, some trials use a hybrid design, as sketched below, which allows both eyes to be randomized if both eyes are eligible and one eye to be randomized if only one eye is eligible (Lee et al. 2012; The ETROP Cooperative Group 2006; Elman et al. 2010). For example, the ETROP study enrolled infants with prethreshold ROP in both eyes as well as infants with prethreshold ROP in one eye only. For infants with bilateral prethreshold ROP, one eye was randomized to treatment, and the other (the control eye) was managed conventionally. For infants with unilateral prethreshold ROP, a separate randomization scheme assigned such infants to either treatment or conventional management (The ETROP Cooperative Group 2006). The Diabetic Retinopathy Clinical Research Network also used this hybrid design in several large clinical trials, as the hybrid approach can lead to faster recruitment and reduced costs considering the overall number of participants (Glassman and Melia 2015). In Diabetic Retinopathy Clinical Research Network Protocol I (Diabetic Retinopathy Clinical Research Network 2010), patients with one or two eligible eyes were enrolled to compare four treatments for diabetic macular edema: (A) prompt laser (N = 293 eyes); (B) 0.5 mg ranibizumab + prompt laser (N = 187 eyes); (C) 0.5 mg ranibizumab + deferred laser (N = 188 eyes); and (D) 4 mg triamcinolone + prompt laser (N = 186 eyes). The 528 patients with only one eligible eye were randomized to the four treatment groups with equal probability. For the 163 patients with both eyes eligible, the right eye was randomized to one of the four treatment groups; the left eye was assigned to group A if the right eye was not in group A, and otherwise was randomized to one of the three remaining treatments with equal probability. Such a hybrid design can complicate the statistical analysis of the trial data, because the analysis of bilateral cases needs to adjust for the inter-eye correlation. Consistency of the treatment effect between bilateral and unilateral patients should be checked before assessing the overall treatment effect by combining the results from the unilateral and bilateral cases (The ETROP Cooperative Group 2006).
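To make the Protocol I allocation rule concrete, here is a minimal Python sketch (our own illustration; it omits the stratification, blocking, and masking machinery an actual trial would use):

import random

def protocol_i_assign(bilateral, rng=random):
    """Hypothetical sketch of the hybrid allocation described above.
    'A' = prompt laser; 'B', 'C', 'D' = the other three arms."""
    arms = ["A", "B", "C", "D"]
    if not bilateral:
        # One eligible eye: randomize to one of the four arms
        return {"study_eye": rng.choice(arms)}
    right = rng.choice(arms)  # right eye: one of the four arms
    # Left eye: group A unless the right eye is already in A,
    # otherwise one of the three remaining arms
    left = "A" if right != "A" else rng.choice(["B", "C", "D"])
    return {"right": right, "left": left}

print(protocol_i_assign(bilateral=True))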
The sample size calculation for within person trials requires an estimate of the within person correlation for the primary outcome measure. This correlation estimate can be obtained from previous studies. If such data are not available, the sample size/power can be calculated assuming various degrees of within person correlation (e.g., moderate correlation of 0.50 or high correlation of 0.75) to see how sensitive the sample size calculation is to the assumed within person correlation coefficient. The within person correlation should not be ignored in the sample size calculation, because ignoring it will result in an over-estimation of the sample size. The degree of over-estimation depends on the within person correlation, as demonstrated in Eq. (1): the higher the within person correlation, the greater the over-estimation of the sample size. It is a common mistake for within person trials not to take the within person correlation into account in the calculation of sample size or statistical power (Lee et al. 2012; Lesaffre et al. 2007; Lai et al. 2007; Bryant et al. 2006).
Besides the assumption for the within person correlation, assumptions need to be made for other parameters, including the expected mean and variance (σ²) or standard deviation (SD) of a continuous outcome measure, or the expected event rate of a binary outcome, for each treatment group, as well as the type I error rate (α) and the desired statistical power (1 − β).
Assume for a two-arm within person trial that the expected mean is μ_A for treatment A and μ_B for treatment B, with common variance σ². The mean difference in the outcome measure is d = μ_A − μ_B.
If the SD of the within person difference between treatment groups in the primary continuous outcome measure is available, the sample size can be calculated without needing to assume a within person correlation coefficient (ρ). However, if such data are not available, the within person correlation has to be assumed in order to calculate the SD of the within person difference d. For a given within person correlation ρ, the variance of the within person difference d can be calculated as

σ_d² = 2(1 − ρ)σ²    (2)
The required sample size N (i.e., the total number of subjects needed) for detecting a mean difference of d in the primary outcome measure with statistical power 1 − β and type I error rate α is

N = σ_d²(Z_(1−α/2) + Z_(1−β))² / d² + Z²_(1−α/2) / 2    (3)
Example: A within person trial is designed with one eye randomly assigned to treatment A and the contralateral fellow eye assigned to treatment B. The primary outcome is the visual acuity score (the number of letters read correctly from the visual acuity chart) at one year after treatment. We wish to know how many patients need to be enrolled to provide 90% power for detecting a five-letter difference in visual acuity score between the two treatment groups at a 5% type I error rate. Previous studies suggest the SD of the visual acuity score is approximately 14 letters in each treatment group.
Assuming a moderate inter-eye correlation of ρ = 0.50, the variance of the mean visual acuity difference between the two treatment groups, calculated using Eq. (2), is 2 × (1 − 0.50) × 14² = 196. The sample size can then be calculated using Eq. (3):

N = 196 × (1.96 + 1.28)² / 5² + 1.96² / 2 = 84    (4)
So a total of 84 patients (168 eyes) needs to be enrolled. If a parallel group design were used instead, with one eye per patient randomized to either treatment A or treatment B, a total of 332 patients (332 eyes) would be needed to achieve the same statistical power.
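Eqs. (2) and (3) are straightforward to code. The following sketch (the function name and defaults are ours) reproduces the worked example using scipy's normal quantiles:

from scipy.stats import norm

def n_within_person(sd, delta, rho, alpha=0.05, power=0.90):
    """Subjects needed for a within person trial, per Eqs. (2) and (3)."""
    z_a = norm.ppf(1 - alpha / 2)        # 1.96 for alpha = 0.05
    z_b = norm.ppf(power)                # 1.28 for 90% power
    var_d = 2 * (1 - rho) * sd ** 2      # Eq. (2)
    return round(var_d * (z_a + z_b) ** 2 / delta ** 2 + z_a ** 2 / 2)  # Eq. (3)

print(n_within_person(sd=14, delta=5, rho=0.50))  # 84, matching Eq. (4)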
To demonstrate the impact of the inter-eye correlation on sample size, Table 1 provides the sample size at inter-eye correlations ranging from 0 to 0.9, calculated using Eqs. (2) and (3) (Balk et al. 2012; Bryant et al. 2006). Compared with the 332 patients (332 eyes) required by the parallel group design to achieve the same statistical power, Table 1 clearly demonstrates that the gain in efficiency from the within person trial leads to a reduced sample size, and that the magnitude of the gain depends on the within person correlation: the higher the inter-eye correlation, the smaller the sample size. For example, when the inter-eye correlation is 0.5, the reduction in sample size is 75% in terms of number of patients and 50% in terms of number of eyes.
For a two-arm within person trial with a binary outcome (e.g., treatment success or failure), a 2 × 2 table (Table 2) can be laid out to estimate the parameters needed for the sample size calculation. The odds ratio for success of treatment B relative to treatment A is calculated as

Ψ = b/c    (5)
Table 1 Comparison of the sample size using the within person design and the parallel group designᵃ under various inter-eye correlations for an ophthalmology trial

Within person      Sample size from within person     % Reduction in number of     % Reduction in number of
correlation (ρ)    trial: number of subjects          subjects compared to the     eyes compared to the
                   (number of eyes)ᵇ                  parallel group designᵃ       parallel group designᵃ
0.0                166 (332)                          50%                           0%
0.1                150 (300)                          55%                          10%
0.2                134 (268)                          60%                          20%
0.3                117 (234)                          65%                          30%
0.4                101 (202)                          70%                          40%
0.5                 84 (168)                          75%                          50%
0.6                 68 (136)                          80%                          60%
0.7                 51 (102)                          85%                          70%
0.8                 35 (70)                           90%                          80%
0.9                 18 (36)                           95%                          90%

ᵃ The parallel group design requires a total of 332 patients (332 eyes), assuming a standard deviation of 14 letters and 90% power to detect a mean difference of 5 letters in visual acuity between treatments A and B at a type I error rate of 0.05
ᵇ Assuming a standard deviation of 14 letters and 90% power to detect a mean difference of 5 letters in visual acuity between treatments A and B at a type I error rate of 0.05
Table 2 The 2 × 2 table for the comparison of paired binary outcomes from a within person trial with N participants

                                       Treatment B
Treatment A                Failure      Success      Total                Anticipated proportions
Failure                    a            b            a + b                1 − π_A
Success                    c            d            c + d                π_A
Total                      a + c        b + d        N = a + b + c + d
Anticipated proportions    1 − π_B      π_B
For the within person design, the number of subjects needed (N) for a two-sided type I error rate α and power 1 − β can be calculated using the following formula:

N = [Z_(1−α/2)(ψ + 1) + Z_(1−β)√((ψ + 1)² − (ψ − 1)²·π_discordant)]² / [(ψ − 1)²·π_discordant]    (7)
In order to calculate the sample size N using Eq. (7), assumptions about the expected odds ratio Ψ and the discordant percentage π_discordant need to be made. If information on Ψ and π_discordant is not available, they can be estimated from the anticipated treatment response rates π_A and π_B as follows:
ψ = π_A(1 − π_B) / [π_B(1 − π_A)]    (8)

π_discordant = π_A(1 − π_B) + π_B(1 − π_A)    (9)
Example: Suppose a large within person trial, similar to the Early Treatment for Retinopathy of Prematurity (ETROP) study (Good and Hardy 2001), is designed to test the hypothesis that earlier treatment in selected high-risk cases of acute ROP results in better visual outcomes than conventional ROP management. A pilot study in 30 infants with bilateral ROP provided the following data (Table 3):

Based on these pilot data, the large multi-center within person trial is designed to provide 90% power for detecting an odds ratio of Ψ = 9/6 = 1.5 at a type I error rate of 0.05. From the pilot data, π_discordant = (9 + 6)/30 = 0.5.
Using Eq. (7), the sample size can be determined:

N = [1.96 × (1.5 + 1) + 1.28 × √((1.5 + 1)² − (1.5 − 1)² × 0.5)]² / [(1.5 − 1)² × 0.5] = 521    (10)
So 521 infants with bilateral ROP (1042 eyes) need to be enrolled, with one eye receiving the new treatment and the fellow eye receiving conventional treatment.
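The binary-outcome calculation is just as easy to script; this sketch (ours) implements Eqs. (7)–(9) and reproduces the ETROP-style example:

import math
from scipy.stats import norm

def psi_and_discordant(pi_a, pi_b):
    """Eqs. (8) and (9): odds ratio and discordant proportion from response rates."""
    psi = pi_a * (1 - pi_b) / (pi_b * (1 - pi_a))
    pi_disc = pi_a * (1 - pi_b) + pi_b * (1 - pi_a)
    return psi, pi_disc

def n_paired_binary(psi, pi_disc, alpha=0.05, power=0.90):
    """Subjects needed for a paired binary outcome, per Eq. (7)."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    num = (z_a * (psi + 1)
           + z_b * math.sqrt((psi + 1) ** 2 - (psi - 1) ** 2 * pi_disc)) ** 2
    return round(num / ((psi - 1) ** 2 * pi_disc))

print(n_paired_binary(psi=1.5, pi_disc=0.5))  # 521, matching Eq. (10)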
Statistical Analysis
One of the major features of within person trials is that comparisons of outcomes are made within each person. Because outcome measures from paired organs or body sites of the same person are correlated, the statistical analysis should account for the correlation among measures from the same person. The common mistake in the analysis of data from within person trials is the lack of adjustment for the within person correlation in statistical comparisons of trial outcomes, leading to invalid conclusions (Murdoch et al. 1997; Pandis et al. 2017; Zhang and Ying 2018).
For a within person trial involving two treatments in paired organs/body sites of each participant, the statistical methods for analyzing trial outcomes measured at the end of the trial are standard, such as the paired t-test for comparison of a continuous outcome and the McNemar test for comparison of a binary outcome. As in other trials, the statistical analyses of data from a within person trial need to consider loss to follow-up and missing data, which can occur in both organs/sites of a participant (e.g., due to a missed follow-up visit) or at a single organ/site (e.g., due to poor image quality in one eye). In within person trials with concurrent interventions, losses to follow-up are usually equal between treatment groups and thus unlikely to bias the estimate of treatment effect, but they will decrease the statistical power and limit the generalizability of the trial results.
One advantage of within person trials is the elimination of confounding from person-level baseline covariates, because these person-level baseline characteristics are balanced across treatment groups; statistical analysis and interpretation of the treatment effect need not be concerned with person-level confounders. However, imbalance in organ-specific or site-specific variables can still occur; thus, statistical analysis comparing outcomes between treatment groups needs to account for imbalance in baseline variables at the organ/site level by using mixed effects models or marginal models (Laird and Ware 1982; Liang and Zeger 1986; Ying et al. 2017, 2018).
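As an illustration of such an adjusted analysis, the sketch below fits a marginal model with an exchangeable working correlation using statsmodels GEE; the data, column names, and model formula are all hypothetical:

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical eye-level data in long format: two rows per patient
df = pd.DataFrame({
    "id":        [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "treatment": ["A", "B", "B", "A", "A", "B", "B", "A", "A", "B", "B", "A"],
    "severity":  [2.0, 3.0, 1.0, 1.5, 2.5, 2.0, 3.0, 2.5, 1.5, 1.0, 2.0, 2.0],
    "outcome":   [62, 60, 71, 69, 55, 58, 64, 66, 70, 72, 61, 59],
})

# Marginal model accounting for the correlation between eyes of the same patient
model = smf.gee("outcome ~ treatment + severity", groups="id", data=df,
                cov_struct=sm.cov_struct.Exchangeable(),
                family=sm.families.Gaussian())
print(model.fit().summary())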
For a within person trial with n participants in which each participant receives two treatments (A and B), let y_iA and y_iB denote the outcome measures of individual i at the organ/site receiving treatment A and treatment B, respectively. The within person difference in the continuous outcome measure between treatments A and B is calculated as

d_i = y_iA − y_iB    (11)
The mean, standard deviation (SD), and standard error (SE) of the difference between treatments A and B are

d̄ = (Σ_{i=1}^{n} d_i) / n = ȳ_A − ȳ_B    (12)

SD(d) = √[ Σ_{i=1}^{n} (d_i − d̄)² / (n − 1) ]    (13)

SE(d̄) = SD(d) / √n    (14)

and the paired t-test statistic, with n − 1 degrees of freedom, is

t = d̄ / SE(d̄)    (15)
Example: The data in Table 4 are from the 11 participants of the Choroidal Neovascularization Prevention Trial (CNVPT) (The CNVPT Research Group 1998) who had equal visual acuity in their paired eyes at baseline. The primary outcome of the CNVPT is the visual acuity score measured at the end of 4 years of follow-up. In the CNVPT, one eye of each participant was randomized to laser treatment for drusen, and the fellow eye was observed without treatment as the control. Calculation using Eqs. (11)–(14) gives a mean difference of 0.90 letters, with an SD of 12.3 letters and an SE of 3.7 letters. The paired t-test gives t = 0.90/3.7 = 0.24 with 10 degrees of freedom, and a two-sided p-value of 0.81. If a nonparametric test is applied instead, the Wilcoxon signed rank test gives a p-value of 0.85.
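Because the individual Table 4 values are not reproduced here, the sketch below applies Eqs. (11)–(15) to hypothetical paired visual acuity scores; run on the actual CNVPT data, the same code returns t = 0.24 and p = 0.81:

import numpy as np
from scipy import stats

# Hypothetical paired visual acuity scores in letters (NOT the CNVPT data)
treated = np.array([65., 70., 80., 55., 62., 75., 68., 72., 58., 63., 66.])
control = np.array([68., 66., 78., 57., 59., 73., 70., 71., 60., 61., 65.])

d = treated - control                                   # Eq. (11)
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))   # Eqs. (12)-(15)
p_value = 2 * stats.t.sf(abs(t_stat), df=len(d) - 1)

# The same test in a single call
t_check, p_check = stats.ttest_rel(treated, control)
print(round(t_stat, 2), round(p_value, 2))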
When the outcome measure of a within person trial is binary (yes/no), it is not appropriate to use the standard chi-square test, because it ignores the within person correlation. Instead, the McNemar test should be applied for the comparison of proportions. A presentation in the 2 × 2 paired tabulation format of Table 2 is desirable, as it provides the counts of concordant and discordant pairs.
With a continuity correction, the McNemar test statistic is

χ² = (|b − c| − 1)² / (b + c)    (17)

In large samples, the McNemar test statistic is

χ² = (b − c)² / (b + c)    (18)

The McNemar test has 1 degree of freedom.
Example: The CAPT Study (The CAPT Research Group 2004) was designed to evaluate whether prophylactic laser treatment to the retina can prevent the development of advanced-stage AMD. One of the analyses for the primary outcome compares the incidence rate of geographic atrophy (GA) between the treated eye and the control eye of the same participant at 4 years of follow-up. The cross-tabulation of the incidence of GA among the 997 participants who completed the 4-year follow-up is as follows:
                       Control eye
Laser-treated eye      No GA      GA      Total
No GA                    892      32      924 (92.7%)
GA                        29      44       73 (7.3%)
Total                    921      76      997
The McNemar test for comparing the GA incidence rates between the treated eye and the untreated control eye is

χ² = (29 − 32)² / (29 + 32) = 0.1475

with 1 degree of freedom; the corresponding two-sided p-value is 0.70.
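The McNemar calculation is easy to script; this sketch (ours) reproduces the CAPT result from the discordant counts:

from scipy.stats import chi2

def mcnemar(b, c, continuity=False):
    """McNemar chi-square from discordant counts, per Eqs. (17) and (18)."""
    stat = ((abs(b - c) - 1) ** 2 if continuity else (b - c) ** 2) / (b + c)
    return stat, chi2.sf(stat, df=1)   # statistic and two-sided p-value

# CAPT: 32 control-eye-only vs 29 treated-eye-only GA events
stat, p = mcnemar(b=32, c=29)
print(round(stat, 4), round(p, 2))     # 0.1475 0.7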
For diseases that affect paired organs or body sites, the treatment can be either systemic or organ-specific. When both the treatment and the outcome measure are specific to the organ or body site, a within person design may be applied to improve the efficiency of the trial. Such designs have been commonly used in ophthalmology, dentistry, and dermatology. When the within person correlation in the outcome measure is high, the design can substantially reduce the sample size. However, careful consideration needs to be given to the possibility of a carry across effect, the feasibility of recruiting sufficient patients with bilateral disease, and the limited generalizability of trial results to patients with unilateral disease. The sample size calculation and statistical analysis also have to account for the within person correlation in outcome measures.
Key Facts
• The within-person design is efficient because comparisons are made within the
same person.
• When within-person correlation for outcome measure is high, within-person
design can substantially reduce the sample size.
• Within-person clinical trials are often used for conditions that affect paired organs
or multiple body sites of a person, such as in ophthalmology, dentistry, and
dermatology.
• Within-person trials pose some challenges, including possible bias from the carry across effect and the difficulty of recruiting subjects with conditions affecting paired organs or multiple body sites.
• Within-person correlation should be taken into consideration in the sample size
determination and statistical analyses.
References
Acharya NR, Sittivarakul W, Qian Y et al (2011) Bilateral effect of unilateral ranibizumab in
patients with uveitis-related macular edema. Retina 31:1871–1876
Balk EM, Earley A, Patel K, Trikalinos TA, Dahabreh IJ (2012) Empirical assessment of within-
arm correlation imputation in trials of continuous outcomes. Methods Research Report.
(Prepared by the Tufts Evidence-based Practice Center under Contract No. 290-2007-
10055-I.) AHRQ Publication No. 12(13)-EHC141-EF. Agency for Healthcare Research and
Quality, Rockville
Bryant D, Havey TC, Roberts R, Guyatt G (2006) How many patients? How many limbs? Analysis
of patients or limbs in the orthopaedic literature: a systematic review. J Bone Joint Surg Am 88:
41–45
Bunce C, Wormald R (2015) Considerations for randomizing 1 eye or 2 eyes. JAMA Ophthalmol
133:1221
Cairo F, Sanz I, Matesanz P, Nieri M, Pagliaro U (2012) Quality of reporting of randomized clinical
trials in implant dentistry. A systematic review on critical aspects in design, outcome assessment
and clinical relevance. J Clin Periodontol 39:81–107
Diabetic Retinopathy Clinical Research Network (2010) Randomized trial evaluating Ranibizumab
plus prompt or deferred laser or triamcinolone plus prompt laser for diabetic macular edema.
Ophthalmology 117:1064–1077
Elman MJ, Aiello LP, Beck RW et al (2010) Diabetic retinopathy clinical research network.
Randomized trial evaluating ranibizumab plus prompt or deferred laser or triamcinolone plus
prompt laser for diabetic macular edema. Ophthalmology 117:1064–1077, e35
Glassman AR, Melia M (2015) Randomizing 1 eye or 2 eyes: a missed opportunity. JAMA
Ophthalmol 133:9–10
Good WV, Hardy RJ (2001) The multicenter study of early treatment for retinopathy of prematurity
(ETROP). Ophthalmology 108:1013–1014
Hujoel PP (1998) Design and analysis issues in split mouth clinical trials. Community Dent Oral
Epidemiol 26:85–86
Katz J (1988) Two eyes or one? The data analyst’s dilemma. Ophthalmic Surg 19:585–589
Koletsi D, Fleming PS, Seehra J, Bagos PG, Pandis N (2014) Are sample sizes clear and justified in
RCTs published in dental journals? PLoS One 9:e85949
Lai TYY, Wong VWY, Lam RF, Cheng AC, Lam DS, Leung GM (2007) Quality of reporting of key
methodological items of randomized controlled trials in clinical ophthalmic journals. Ophthal-
mic Epidemiol 14:390–398
Laird NM, Ware JH (1982) Random-effects models for longitudinal data. Biometrics 38:963–974
Lee CF, Cheng AC, Fong DY (2012) Eyes or subjects: are ophthalmic randomized controlled trials
properly designed and analyzed? Ophthalmology 119:869–872
Lesaffre E, Garcia Zattera M-J, Redmond C, Huber H, Needleman I, ISCB Subcommittee on
Dentistry (2007) Reported methodological quality of split-mouth studies. J Clin Periodontol 34:
756–761
Lesaffre E, Philstrom B, Needleman I, Worthington H (2009) The design and analysis of split-
mouth studies: what statisticians and clinicians should know. Stat Med 28:3470–3482
Liang KY, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika
73:13–22
Lie SA, Engesaeter LB, Havelin LI, Gjessing HK, Vollset SE (2004) Dependency issues in survival
analyses of 55,782 primary hip replacements from 47,355 patients. Stat Med 23:3227–3240.
https://doi.org/10.1002/sim.1905
Machin D, Fayers PM (eds) (2010) Randomized clinical trials. Wiley-Blackwell, West Sussex
Maguire MG (2020) Assessing Intereye symmetry and its implications for study design. Invest
Ophthalmol Vis Sci 61:27
Mintz-Hittner HA, Kennedy KA, Chuang AZ for BEAT-ROP Cooperative Group (2011) Efficacy
of intravitreal bevacizumab for stage 3+ retinopathy of prematurity. N Engl J Med 364:603–615
Murdoch IE, Morris SS, Cousens SN (1997) People and eyes: statistical approaches in ophthal-
mology. Br J Ophthalmol 82:971–973
Contents
Introduction
Drugs Versus Devices
  Mechanism of Action
  Safety and Efficacy/Effectiveness Assessment
  Skill of the User
  Implants
  Placebo Effect and Sham Control
  Blinding or Masking
Design Considerations for Therapeutic Device Trials
  Control Group
  Blinding
  Randomization
  Clinical Endpoints
  Sample Size
  Error Rate Control
Special Considerations for Diagnostics
  Imaging Devices
  Companion Diagnostic Devices
  Complementary Diagnostic Devices
  Next-Generation Sequencing
Bayesian Design for Device Trials
  Prior Information
  Bayesian Adaptive Design
  Operating Characteristics
Observational (Nonrandomized) Clinical Studies
  Comparative Observational Study
H. Li (*) · L. Q. Yue
Center for Devices and Radiological Health, U.S. Food and Drug Administration, Silver Spring,
MD, USA
e-mail: [email protected]; [email protected]
P. E. Scott
Office of the Commissioner, U.S. Food and Drug Administration, Silver Spring, MD, USA
Abstract
This section provides an overview of clinical studies for medical devices. Impor-
tant differences between drugs and devices are highlighted. Specific topics
covered include Bayesian design and observational (nonrandomized) studies.
Special considerations are given to diagnostic devices.
Keywords
Medical device · FDA · Diagnostic device · Bayesian · Nonrandomized ·
Propensity score
Introduction
Drugs Versus Devices
Mechanism of Action
Medical devices do not function in the same manner as drugs. The mechanism of
action of a medical device is physical and localized, whereas the mechanism of
action for drugs is often chemical or biological and the effect can be localized or
systemic. The process of drug development is one of discovery. Drug discovery
involves the screening of candidate compounds to identify those promising enough
for development, further examination, and testing in animals and humans. Once a
drug is discovered, the chemical composition remains unchanged during the entire
development process. Although the dose, indications for use, dosing regimen,
preparation, and release of the drug can change, the chemical entity remains the
same. In contrast to drugs, devices are invented and often evolve through a series of
changes and improvements (Campbell and Yue 2016) during the course of develop-
ment and evaluation. As a result, the device that is marketed may be different from
the one that underwent testing. While any change to drug formulation may impact
the safety and effectiveness profile of the drug, minor device changes may have little
effect on the clinical performance or safety and effectiveness profile of the product.
Safety and Efficacy/Effectiveness Assessment
Clinical studies for medical devices are often preceded by bench/mechanical and
animal testing for reliability and biocompatibility. The premarket clinical testing of
investigational drugs is generally conducted in three stages or phases (I, II, and III).
Phase I trials are early development trials conducted in a small number of people
(e.g., 20–100) to evaluate safety issues such as adverse events and maximum
tolerated dose of the drug. If this early development study demonstrates that the
drug is not toxic, then clinical testing progresses to Phase II. During this middle
development stage, the drug is given to up to several hundred people with the
indicated condition to evaluate short-term safety and efficacy. After demonstration
of short-term safety and efficacy, the comparative treatment efficacy trial (Phase III)
is conducted in large numbers of people to demonstrate safety and efficacy. Typi-
cally, two Phase III trials are needed in order to market a drug.
The clinical testing of devices is generally characterized by the conduct of a
feasibility (pilot) study and a pivotal study (an analog of a Phase III drug trial). Pilot
studies for medical devices may include only one investigator at one investigational
site with a small number of patients. The main focus of those studies is on safety
issues. Pilot studies are also used to obtain preliminary data to assess the learning
curve for device use, to generate estimates of effect sizes and variances for sample
size calculation, and to develop and refine study procedures for the pivotal trial.
Generally, one pivotal study is conducted to obtain data to evaluate safety and
effectiveness of devices prior to entry into the consumer marketplace. Both drugs
and medical devices have post-market studies referred to as Phase IV for drugs and
post-market studies for devices.
The standard for pharmaceutical drug regulation is one of substantial evidence
of safety and effectiveness obtained from well-controlled investigations (21 CFR
314.126), which historically have been well-controlled Phase III trials. The statu-
tory standard for approval and level of evidence required for devices is one of
reasonable assurance of safety and effectiveness based on valid scientific evidence.
Valid scientific evidence (21 CFR 860.7(c)(2)) is defined as “well-controlled
investigations, partially controlled studies, studies and objective trials without
matched controls, well-documented case histories conducted by qualified experts,
and reports of significant human experience with a marketed device” (21 CFR
860.7). The type of study required is very device specific. It may vary depending
on the indication for use and the degree of experience with, and knowledge of, the device. A
reasonable assurance of safety is obtained when “it can be determined, based upon
valid scientific evidence, that the probable benefits outweigh any probable risks,”
and can be demonstrated by establishing “the absence of unreasonable risk of
illness or injury associated with the use of the device for its intended uses and
conditions of use” (21 CFR 860.7(d)(1)). Similarly, a reasonable assurance of
effectiveness is obtained when “it can be determined, based upon valid scientific
evidence the use of the device for its intended uses will provide clinically signif-
icant results” (21 CFR 860.7(e)(1)). These criteria differ from the regulatory
standards set by law for drugs and biological products in the USA, and these
differences lead to a more varied approach in the studies and data required to
support market approval/clearance for devices.
Skill of the User
For drugs, physician technique or skill has little influence on the treatment outcome. Drugs are dispensed with instructions to the patient for medical use, and it is the responsibility of the patient or caregiver to comply with these instructions. As such, training for drug trials focuses on the protocol requirements, the mechanism of action of the drug, and potential adverse effects. In contrast, for medical devices, the skill of the physician or device user may play a significant role in the treatment outcome.
For example, the success of the device for implants and other devices that rely on
surgical technique may depend on the skill of the surgeon. The potential learning
curve on the part of the surgeon may have a major impact on the performance of the
product during the clinical study. The impact of the skill of the user can also be seen
in diagnostic trials such as diagnostic imaging devices that rely on skilled radiolo-
gists to read and interpret the images. Consequently, training requirements for device
clinical trials should include hands-on device training in addition to the protocol
requirements. Studies in which patients are treated with a medical device often assess
the contribution of the device user in addition to assessing contributions from the
device, disease, and patient.
Implants
Since a drug is metabolized, stopping the medication often addresses many adverse effects. Devices that are implanted into the body, in contrast, pose a number of unique challenges. Many implantable devices cannot be easily removed once implanted. Consequently, the risk of removing the implant needs to be weighed against the risk of leaving the device in the body.
Placebo Effect and Sham Control
The use of a placebo control arm is common in drug clinical trials. However, in device clinical trials it is often impractical or unethical to use a placebo (or sham) control and withhold treatment, especially when considerable risk may be associated with a sham surgery.
Blinding or Masking
Often it is not possible to blind (or mask) patients to the treatment or implant they are receiving, or health-care providers to the treatment they are delivering, in medical device clinical trials. Although patients may correctly guess their treatment in drug trials, correct guessing is more likely in device trials, especially when the procedures reveal treatment information.
Design Considerations for Therapeutic Device Trials
Like drugs, therapeutic devices are used to treat diseases. However, due to the distinctions between drugs and devices highlighted in the previous section, certain design considerations may figure more prominently in device trials. In this section we discuss some of these considerations. The focus is on pivotal device trials, which are intended as the primary clinical support for a marketing application, just as Phase III drug trials are.
Control Group
Medical devices may be invented for a range of purposes, from providing a therapy
that is superior to those currently available to offering a less invasive treatment
option for surgery. The choice of control group should reflect this purpose. Some-
times a device is meant to do both: serve as a more effective therapy than drugs for
patients too frail to undergo surgery and as a less invasive option for surgery for
patients who are less frail. If that is the case, then two separate trials are needed, one with medical therapy as the control and the other with surgery as the control (Svensson et al. 2013). When a trial compares two different treatment modalities, it is impossible to blind patients and physicians/surgeons, and therefore artifacts such as the placebo effect are difficult to rule out. In such a trial the use of objective endpoints (e.g., death, stroke) as primary would be advisable.
For a trial conducted to evaluate subjective endpoints such as pain or function,
one may consider using a so-called sham control when it is ethical and practical to do
so. A sham control has been broadly defined as a treatment or procedure that is similar to the treatment or procedure under investigation but omits a key therapeutic element. The riskier the sham control, the less likely it will be considered
ethical. A very risky sham control that is completely without benefit is seldom
justifiable. Of course, such judgment depends on the context. A recent example of
a sham-controlled device trial is the ORBITA study (Al-Lamee et al. 2018). The
therapy under investigation is percutaneous coronary intervention (PCI), the target
population is patients with stable chronic angina, and the endpoint is the exercise
time. Previously, a randomized trial of PCI (a device therapy) versus medical
management had found benefit on exercise tolerance (Parisi et al. 1992). The
ORBITA trial suggests that most of this apparent benefit is placebo effect, which
prompts the medical community to seriously question an accepted practice.
While clinical trials for first-of-a-kind devices often use medical therapy or
surgery as controls, later devices with similar indications may use other devices as
control. Such trials are often noninferiority trials. Sometimes the control can even be
designated as any commercially available device with the same indication, allowing a noninferiority claim to be made against an entire class of devices.
In certain device areas randomized trials with patients as their own control are
conducted. Here we are not talking about crossover trials: the treatments being
compared are not separated in time. Instead, they are administered simultaneously
to different parts of a patient. For example, to test an ophthalmic device, one may
randomly assign one eye to treatment and the other eye to control for each patient.
For such a design the experimental unit is an eye. In other device areas these within-
subject designs may use a limb, a blood vessel, or some other parts of body as
experimental units. The use of the subject as his/her own concurrent control allows
for the advantageous use of the correlation within the subject. This design is only
possible when the experimental device and control intervention effects are local and
do not overlap.
Blinding
Blinding refers to keeping key persons, such as patients, health-care providers, and
outcome assessors, unaware of the treatment administered. The purpose of blinding
is to minimize artifacts and biases coming from various sources, such as placebo
effect, performance bias (sometimes known as Hawthorne effects), and detection
bias (i.e., observer, ascertainment, and assessment bias) (Mansournia et al. 2017).
Randomization
Clinical Endpoints
Pivotal device trials evaluate the safety and effectiveness of the device in the
population expected to be indicated. Accordingly, primary endpoints are divided
into one or more safety endpoints and one or more effectiveness endpoints. The
study would be considered successful if both the safety and effectiveness endpoints
are met. Occasionally, a single endpoint may play the dual role of a primary safety
and effectiveness endpoint.
The specification of clinical endpoints for a device trial often involves the concept
of device-related adverse events. These are events directly attributable to the device
itself. Therefore, it is imperative that the investigational device is precisely defined in
the protocol. The classification of whether an adverse event is device related requires
careful adjudication. Distinction is made between device-related and procedure-
related events. The latter are events that occur from the procedure, irrespective of
the device (Ouriel et al. 2013).
Besides primary endpoints, secondary endpoints that are not part of the study
success criteria are usually specified for a device trial. They may serve as the bases of
additional meaningful claims or provide further insight into the device effect or
mechanism of action. Sometimes it is possible to submit the primary endpoint results
to the regulatory agency for device approval while data collection for a secondary
endpoint is still ongoing.
Sample Size
Sample size is usually driven by study power, which is the probability of rejecting the primary endpoint null hypothesis. Sometimes a powered secondary endpoint may drive the sample size. While most device trials still adopt a fixed design, adaptive designs, in which the sample size depends on accumulating outcome data, are increasingly common. In an adaptive design the minimum sample size needs to be larger than that required from a clinical perspective.
A recent example of a device trial using a Bayesian adaptive design is the SURTAVI trial (Reardon et al. 2017). The objective of the trial was to compare the safety and efficacy of transcatheter aortic-valve replacement (TAVR) with surgical aortic-valve replacement in patients deemed to be at intermediate risk for surgery. It is a noninferiority trial with a primary endpoint of a composite of death from any cause or disabling stroke at 24 months and a planned sample size of 1600. A Bayesian interim analysis was prespecified for when 1400 patients had reached 12-month follow-up. Through Bayesian modeling, it was possible to calculate the posterior probability of noninferiority from the interim data.
Error Rate Control
The type I error rate and the type II error rate (one minus power), together called the operating characteristics, must be controlled in all hypothesis-driven pivotal device trials. Any valid statistical approach to error rate control applies to device trials. For complex adaptive designs, controlling the error rates is usually an iterative process. First, a tentative decision rule is set up so that the operating characteristics can be obtained via simulation. If the error rates are not satisfactory, the decision rule is adjusted, and the simulation is carried out again. This process continues until one arrives at a decision rule that provides adequate error rate control; the success threshold for the posterior probability of noninferiority in the SURTAVI trial was determined in this fashion. In general, the simulation should be reasonably extensive, covering a wide range of scenarios.
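The iterative tuning just described can be mimicked with a simple Monte Carlo loop. Everything in this sketch (the binary endpoint, the naive decision rule, and the thresholds) is a made-up illustration of the process, not a real trial design:

import numpy as np

rng = np.random.default_rng(2024)

def type1_error(n=200, p_null=0.30, threshold=0.37, n_sims=100_000):
    """Simulated type I error of the rule 'declare success if the observed
    success rate exceeds threshold' when the true rate equals the null rate."""
    successes = rng.binomial(n, p_null, size=n_sims)
    return np.mean(successes / n > threshold)

# Adjust `threshold` and re-run until the simulated error rate is acceptable
print(type1_error())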
Special Considerations for Diagnostics

Imaging Devices
A particular type of in vivo test is diagnostic imaging. Diagnostic imaging tests often
involve readings and/or interpretations by persons who may be referred to as readers
(or operators, evaluators, etc.), and present unique study design problems. In the case
of readers, it is often a question of what information is available and when. In a
so-called sequential design, a reader is provided more information gradually to
observe how their ratings change. This design is typically used for diagnostic
devices that are intended to be adjunctive to the standard imaging evaluation.
Alternatively, a crossover design is typically used to compare a new imaging modality with a standard modality on reader diagnostic accuracy. In a basic version of the fully crossed, multi-reader multi-case (MRMC) crossover design, cases are divided randomly into groups A and B, which are read in both modalities by all readers in two reading sessions separated by a washout period.
Companion Diagnostic Devices
A common development approach for a companion diagnostic device is the enrichment strategy (US FDA 2012) of enrolling just the test positive patients into a randomized trial of the drug. In an enrichment trial, a significance test for qualitative interaction is not possible; the diagnostic is evaluated only for whether it has selected a population in whom the drug's benefits outweigh its risks. For an example of a clinical trial for a companion diagnostic device, see Rosell et al. (2012).
Complementary Diagnostic Devices
In contrast, a complementary diagnostic device is not required for the use of the drug
but provides information about a population who may derive greater benefit. It can
help inform the discussion between prescriber and patient. A complementary diag-
nostic can be described as having a quantitative interaction with the drug effect
(Beaver et al. 2017). The treatment is beneficial in both test negative and test positive
patients, but the benefit is smaller in test negatives. Note that quantitative interac-
tions can be an artifact of the scale of measurement of the treatment effect. Trial
designs and practical considerations for clinical evaluation of predictive biomarker
tests have received extensive review, including Polley et al. (2013). A general
statistical framework for deciding if a treatment should be given to everyone or to
just a biomarker-defined subpopulation has been proposed (Millen et al. 2012).
Largely because of advances in precision medicine, subgroup analysis and its various purposes have received renewed interest (Alosh et al. 2015).
Next-Generation Sequencing
Most diagnostic in vitro medical devices are designed to test for a single analyte
associated with disease. In contrast, microarray and next-generation sequencing
(NGS) technologies can be used to measure large numbers of genetic analytes
simultaneously, e.g., gene expressions and single-nucleotide polymorphisms
(SNPs), that may confer useful diagnostic information. This creates a huge simulta-
neous testing problem in need of a multiplicity adjustment. The adjustment need not
be as severe as the Bonferroni correction when correlation between the tests is
considered. Permutation-based methods can be used to take into account the corre-
lation (Dudoit et al. 2002). Alternatively, Bayesian approaches have also been
proposed (Newton et al. 2001; Efron et al. 2001). It is worth noting that the false
discovery rate (FDR) (Benjamini and Hochberg 1995) is increasingly being con-
trolled in such large multiplicity problems, as opposed to the more traditional and
more conservative familywise Type I error rate.
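For illustration, the Benjamini-Hochberg step-up procedure takes only a few lines; the p-values below are made up, and in practice one would typically rely on a library implementation:

import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean array marking discoveries with FDR controlled at q."""
    p = np.asarray(pvals, dtype=float)
    order = np.argsort(p)
    m = len(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True     # reject the k smallest p-values
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.27, 0.75]))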
Bayesian Design for Device Trials
Bayesian statistical methodology has been used for well over 10 years in medical
device clinical trials for premarket submissions. The Center for Devices and Radio-
logical Health (CDRH) has published a guidance document “Guidance for the Use of
Bayesian Statistics in Medical Device Clinical Trials” (US FDA 2010). The Bayes-
ian guidance covers many topics on study design and is an essential reference in
designing Bayesian medical device clinical trials that will be reviewed by FDA.
Prior Information
The incremental steps by which improvements are made in device development make the Bayesian approach particularly suitable. Good prior information is often available from, for example, trials in other countries, earlier trials on previous device versions, or possibly bench tests or animal studies. In such situations, the natural mode of statistical inference is Bayesian. A Bayesian clinical trial for a
medical device may include prior information for the investigational device, for the
control therapy, or for both the investigational device and control therapy. Previous
device studies used as sources of prior information should be recent and similar to
current studies in terms of devices used, objectives, endpoints studied, protocol,
patient population, investigational sites, physician training, and patient management.
Covariates such as demographics and prognostic variables can be used to calibrate
previous studies to the current study. The use of prior information often leads to more
precise estimates enabling decision-makers to reach a decision on a device with
smaller and shorter trials.
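As a toy illustration of such borrowing, the sketch below uses a conjugate beta-binomial model with a power-prior-style discount on hypothetical historical data; the weight a0, all counts, and the threshold are invented for illustration and are not taken from the cited guidance:

```python
from scipy import stats

# Hypothetical historical study of a previous device version: 42/60 successes.
# Discount the borrowing with a power-prior-style weight a0 in [0, 1]
# (a0 = 0 ignores the history; a0 = 1 pools it fully).
a0 = 0.5
a_prior = 1 + a0 * 42          # prior successes (on top of a Beta(1, 1) base)
b_prior = 1 + a0 * (60 - 42)   # prior failures

# Hypothetical current trial: 55/80 successes.
posterior = stats.beta(a_prior + 55, b_prior + (80 - 55))
print("P(success rate > 0.60 | data) =", round(1 - posterior.cdf(0.60), 3))
```

Setting a0 between 0 and 1 formalizes the idea that prior data should be down-weighted according to how recent and how similar the earlier study is to the current one.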
Bayesian inference is used in medical device trials not only where there is prior
information that can be incorporated into the current trial, but also where a flexible
adaptive clinical trial is being considered (Berry et al. 2011; Campbell 2011, 2013).
When there is no good prior information, the prior distributions used in a Bayesian
adaptive design are usually relatively noninformative. One of the most prominent
advantages of a Bayesian approach to adaptive design over a frequentist one is that
the Bayesian approach allows for the construction of likelihood models that can use
information obtained at the current time to calculate predictive distributions of
observations at later time points. Predictive probabilities are widely used in medical
device clinical trials and they serve many purposes. For example, at each interim
analysis for sample size adaptation, one could decide whether to stop accrual, to
continue enrollment, or to declare futility based on predictive probability of trial
success. A possible decision rule could be: If the predictive probability of trial
success given the data on the enrolled subjects exceeds a prespecified value, then
stop accrual; if the probability of trial success under the maximum sample size is
below a certain value, then declare futility; otherwise, enrollment will continue to the
next stage unless the maximum sample size is reached. Predictions can be made only
if the patients yet to be observed are exchangeable with the patients already
observed. In device trials, patients enrolled later in the study may not be
exchangeable with those enrolled earlier.
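A minimal Monte Carlo sketch of such a rule, for a single-arm binary endpoint under an assumed beta-binomial model, is given below; the prior, the thresholds, and the interim data are hypothetical, and the function name is ours:

```python
import numpy as np
from scipy import stats

def predictive_prob_success(x, n, n_max, p0=0.5, post_thresh=0.975,
                            a=1.0, b=1.0, n_sim=100_000, seed=1):
    """Predictive probability that, once n_max subjects are enrolled, the
    posterior P(p > p0 | all data) will exceed post_thresh, given x
    successes in n subjects so far and a Beta(a, b) prior."""
    rng = np.random.default_rng(seed)
    p_draws = rng.beta(a + x, b + n - x, size=n_sim)   # posterior draws
    x_future = rng.binomial(n_max - n, p_draws)        # predicted future successes
    post_prob = 1 - stats.beta.cdf(p0, a + x + x_future,
                                   b + n_max - x - x_future)
    return float((post_prob > post_thresh).mean())

# Interim look: 34/50 successes, maximum sample size 100
ppos = predictive_prob_success(34, 50, 100)
print(f"predictive probability of trial success: {ppos:.3f}")
# e.g., stop accrual early if ppos > 0.90; declare futility if ppos < 0.05
```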
Operating Characteristics
For medical device trial designs submitted to FDA for review, including Bayesian
adaptive designs, it is of paramount importance to thoroughly evaluate their operating
characteristics, including the type I error rate, power, the distribution of sample size, and
probability of stopping at each interim look. In a regulatory environment, it is
necessary to control type I error rate and to maintain power at appropriate levels,
just as for a frequentist design. In general, when no prior data are used, the type I
error rate is controlled at the customary frequentist level. When prior data are used,
the type I error rate is often controlled at a higher level, with consideration given to
the credibility of the prior data and the knowledge of potential benefit-risk profile of
the investigational device. Due to the inherent complexity in the design of a
Bayesian clinical trial, specifically when prior data are used or a wide variety of
adaptations are planned, an adequate characterization of the operating characteristics
of any particular trial design usually needs extensive simulations. Simulations are
performed under a range of plausible scenarios for the parameters of interest,
evaluating the desirability of the operating characteristics. The process may be
iterative, in the sense that sample size, any interim decision rule, study success
criteria, and priors may need to be adjusted many times to achieve acceptable
operating characteristics.
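The flavor of such simulations can be conveyed with a small sketch for a hypothetical single-arm design with one futility look; the design parameters and thresholds below are invented for illustration, and the code is a sketch rather than a regulatory template:

```python
import numpy as np
from scipy import stats

def operating_characteristics(p_true, n1=50, n_max=100, p0=0.5,
                              post_thresh=0.975, fut_thresh=0.05,
                              n_trials=5_000, seed=2):
    """Simulate a single-arm design with a futility look at n1 and a final
    Bayesian success criterion P(p > p0 | data) > post_thresh, under a
    uniform prior. Returns (probability of trial success, mean sample size)."""
    rng = np.random.default_rng(seed)
    success = np.zeros(n_trials, dtype=bool)
    n_used = np.full(n_trials, n_max)
    for t in range(n_trials):
        x1 = rng.binomial(n1, p_true)
        # stop for futility if the interim posterior already looks hopeless
        if 1 - stats.beta.cdf(p0, 1 + x1, 1 + n1 - x1) < fut_thresh:
            n_used[t] = n1
            continue
        x = x1 + rng.binomial(n_max - n1, p_true)
        success[t] = (1 - stats.beta.cdf(p0, 1 + x, 1 + n_max - x)) > post_thresh
    return success.mean(), n_used.mean()

print("type I error, E[N] at p = 0.50:", operating_characteristics(0.50))
print("power,        E[N] at p = 0.65:", operating_characteristics(0.65))
```

Running the simulation at the null value gives the type I error rate, running it at clinically plausible alternatives gives power, and in practice thresholds are tuned iteratively until all operating characteristics are acceptable.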
The advantages of Bayesian methodology have resulted in its increasing use in
the medical device arena (Campbell and Yue 2016). It is the methodology of choice
in situations where there is prior information that can be used to augment the current
trial, or where a flexible adaptive clinical trial is desirable.
There is no standard study design or approach that is applicable to the clinical testing
of all devices. As has been stated earlier, the statutory standard for approval and level
of evidence required for devices is one of reasonable assurance of safety and
effectiveness based on valid scientific evidence, and valid scientific evidence can
be drawn from partially controlled studies, or from studies and objective trials without matched
controls. This is different from drugs, which require substantial evidence of safety
and effectiveness obtained from well-controlled investigations. Depending on the
specific device, randomized controlled trials (RCT) may be either unnecessary or
infeasible. As a result, observational (nonrandomized) studies play a substantial role
in the evaluation of investigational medical devices.
Observational studies could be comparative through the explicit use of a control
group or may be carried out without a control group (US FDA 2013).
While observational studies can offer benefits, such as savings in the cost
or time of conducting clinical studies, they also raise statistical and regulatory challenges
regarding the validity of the study design and the interpretability of study results. For
instance, the lack of randomization in observational studies often leads to a system-
atic difference in the distribution of baseline covariates between investigational
device group and control group, resulting in bias in treatment effect estimation.
Differences in baseline covariate distributions between the
treatment groups may be even larger in studies using external controls. Such differences lead to doubts
about treatment group comparability, and hence the interpretability of study results
(Li and Yue 2008). Fortunately, there exist some statistical methods that could be
used to reduce bias, including traditional matching and stratification on baseline
covariates, regression (covariate) analysis, and the propensity score methodology
developed by Rosenbaum and Rubin (1983, 1984). However, it is important to note
that all of the aforementioned statistical methods can adjust only for confounding
covariates that are observed and incorporated in the statistical model, not for
unobserved ones. Also, when there are large differences in baseline covariates
between two treatment groups, these statistical methods may not be able to mitigate
the bias. And, of course, none of these statistical methods can adjust for bias caused
by the separation in time between a treated group and its historical control (temporal
bias), or by differences in medical practice across regions. Therefore, it is
critical that minimizing bias starts from the design stage of an investigational study.
Outcome-Free Design
In an RCT, study design and the act of outcome data analysis are clearly separated:
outcome data are not available at the design phase in which their analysis is
prespecified. However, traditionally this is often not the case with observational
studies. It has been recognized that designing an observational study with outcome
data in sight could compromise the objectivity of the study design and make study
results difficult to interpret (Yue 2007; Yue et al. 2014; Li et al. 2016; Yue et al.
2016). Rubin advocates objective design of observational studies, i.e., prospective
study design without access to any outcome data (Rubin 2001, 2007, 2008). This
outcome-free principle can be realized for propensity score design (Yue et al. 2014;
Li et al. 2016) – in building a propensity score model to balance covariates between
treatment groups, only baseline covariates and the treatment indicator are needed;
outcome data do not need to be accessed. Propensity score design is an iterative
process (Austin 2011). The aim is to derive a proper propensity score estimation
model and grouping or weighting method(s) such that adequate balance in covar-
iate distributions is reached. Not accessing the outcome data eliminates the bias
caused by selecting a propensity score model that favors one of the treatments. For
confirmatory investigational studies, outcome-free propensity score design can be
implemented via a two-stage design process (Yue et al. 2014; Li et al. 2016). Stage
I occurs when the clinical protocol is being developed. Key elements of stage I
include (1) selection of appropriate control group or data source for control group,
(2) preliminary estimation of sample size, and (3) specification of covariates to be
collected in the study and used in the second design stage. An independent
statistician to perform the study design is identified at this stage. Stage II of the
design starts ideally as soon as all patients are enrolled and information on all
baseline covariates is available. In this stage, the independent statistician identified
in the first design stage estimates the propensity scores, matches all patients in the
investigational device group with patients in the control group according to the
estimated propensity score, assesses balance in covariate distributions, and
finalizes control group selection and sample size estimation as well as the statistical
analysis plan for future outcome analysis. All these need to be performed without
access to any outcome data (Yue et al. 2014; Li et al. 2016). The two-stage
framework has been successfully applied to medical device clinical studies
(Thourani et al. 2016).
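The stage II balance assessment can be sketched in code; note that the propensity model sees only baseline covariates and the treatment indicator, never the outcome. The greedy nearest-neighbor matching and the 0.2-standard-deviation caliper below are common choices assumed here for illustration, not the specific method of the cited papers:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def outcome_free_match(X, treated, caliper_sd=0.2):
    """Fit a propensity model on baseline covariates X and the treatment
    indicator only (no outcome data), then greedily match each treated
    subject to the nearest unmatched control on the logit scale."""
    logit = LogisticRegression(max_iter=1000).fit(X, treated).decision_function(X)
    caliper = caliper_sd * logit.std()
    controls = list(np.flatnonzero(treated == 0))
    pairs = []
    for i in np.flatnonzero(treated == 1):
        if not controls:
            break
        j = min(controls, key=lambda c: abs(logit[i] - logit[c]))
        if abs(logit[i] - logit[j]) <= caliper:
            pairs.append((i, j))
            controls.remove(j)
    return pairs

def standardized_mean_difference(x, g):
    """Balance diagnostic used to decide whether the propensity model
    needs re-specification (still without touching outcomes)."""
    m1, m0 = x[g == 1].mean(), x[g == 0].mean()
    pooled_sd = np.sqrt((x[g == 1].var() + x[g == 0].var()) / 2)
    return (m1 - m0) / pooled_sd
```

If covariate balance (e.g., standardized mean differences near zero) is not achieved, the propensity model is re-specified and matching repeated, all before any outcome data are accessed.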
While there are many commonalities between clinical trials for medical devices and
for drugs, device trials do have some unique challenges. It is relatively straightfor-
ward to implement placebo control in a drug trial, but in many cases it is not ethical
to give patients the device equivalent of a placebo, namely a sham device. Blinding
or masking is often impossible to do in a device trial, especially when the treated and
control arms involve different treatment modalities, such as when the comparison is
between device and drug therapies. Due to the rapid pace at which innovations are
made, the product life cycle of a medical device is relatively short. It is not
uncommon for a newly marketed device to become obsolete and replaced by next-
generation technology in a couple of years. This means that large and lengthy
randomized clinical trials are often impractical in the medical device arena. A wide
variety of clinical study designs and statistical methodologies, such as those
reviewed in this chapter, have been utilized in medical device clinical trials. We believe
that opportunities for clinical trial and statistical innovation will continue to expand
in the future.
Key Facts
• A medical device is a medical product that does not achieve its primary intended
purpose through chemical, metabolic, or biological action. Medical device regulations are different from
drug regulations.
• Due to the rapid pace at which innovations are made, the product life cycle of a
medical device tends to be shorter than that of a drug.
• Bayesian design is more common in device trials than in drug trials.
• Non-randomized studies are more common in the medical device world than in
the drug world.
Cross-References
References
Al-Lamee R, Thompson D, Hakim-Moulay D et al (2018) Percutaneous coronary intervention in
stable angina (ORBITA): a double-blind, randomised controlled trial. Lancet 391:331–340
Alosh M, Fritsch K, Huque M, Mahjoob K, Pennello G, Rothmann M, Russek-Cohen E, Smith F,
Wilson S, Yue LQ (2015) Statistical considerations on subgroup analysis in clinical trials. Stat
Biopharm Res 7:286–304
Austin P (2011) An introduction to propensity score methods for reducing the effects of
confounding in observational studies. Multivariate Behav Res 46:399–424
Beaver JA, Tzou A, Blumenthal GM, McKee AE, Kim G, Pazdur R, Philip R (2017) An FDA
perspective on the regulatory implications of complex signatures to predict response to targeted
therapies. Clin Cancer Res 23:1368–1372
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful
approach to multiple testing. J R Stat Soc B 57:289–300
Berry SM, Carlin BP, Lee JJ, Müller P (2011) Bayesian adaptive methods for clinical trials. CRC
Press, Boca Raton
Campbell G (2011) Bayesian statistics in medical devices: innovation sparked by the FDA. J
Biopharm Stat 21:871–887
Campbell G (2013) Similarities and differences of Bayesian designs and adaptive designs for
medical devices: a regulatory view. Stat Biopharm Res 5:356–368
Campbell G, Yue LQ (2016) Statistical innovations in the medical device world sparked by the
FDA. J Biopharm Stat 26:3–16
Campbell G, Li H, Pennello G, Yue LQ (2018) Medical devices. In: Armitage P, Colton T (eds)
Encyclopedia of biostatistics. Wiley, New York
Dudoit S, Yang YH, Callow MJ, Speed TP (2002) Statistical methods for identifying differentially
expressed genes in replicated cDNA microarray experiments. Stat Sinica 12:111–139
Efron B, Tibshirani R, Storey JD, Tusher V (2001) Empirical Bayes analysis of a microarray
experiment. J Am Stat Assoc 96:1151–1160
Fleming TR (2015) Protecting the confidentiality of interim data: addressing current challenges.
Clin Trials 12(1):5–11
Fleming TR, Sharples K, McCall J (2008) Maintaining confidentiality of interim data to enhance
trial integrity and credibility. Clin Trials 5(2):157–167
Li H, Yue LQ (2008) Statistical and regulatory issues in non-randomized medical device clinical
studies. J Biopharm Stat 18:20–30
Li H, Mukhi V, Lu N, Xu Y, Yue LQ (2016) A note on good practice of objective propensity score
design for premarket nonrandomized medical device studies with an example. Stat Biopharm
Res 8:282–286
Mansournia MA, Higgins JP, Sterne JA, Hernán MA (2017) Biases in randomized trials: a
conversation between trialists and epidemiologists. Epidemiology 28(1):54
Millen BA, Dmitrienko A, Ruberg S, Shen L (2012) A statistical framework for decision making in
confirmatory multipopulation tailoring clinical trials. Drug Info J 46(6):647–656
Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW (2001) On differential
variability of expression ratios: improving statistical inference about gene expression changes
from microarray data. J Comput Biol 8:37–52
Ouriel K, Fowl RJ, Davies MG et al (2013) Reporting standards for adverse events after medical
device use in the peripheral vascular system. J Vasc Surg 58:776–786
Parisi AF, Folland ED, Hartigan P et al (1992) A comparison of angioplasty with medical therapy in
the treatment of single-vessel coronary artery disease. N Engl J Med 326(1):10–16
Pepe MS (2003) The evaluation of diagnostic tests and biomarkers. Oxford Press, London
Polley MY, Freidlin B, Korn EL, Conley BA, Abrams JS, McShane LM (2013) Statistical and
practical considerations for clinical evaluation of predictive biomarkers. J Natl Cancer Inst 105:
1677–1683
Reardon MJ, van Mieghem NM, Popma JJ et al (2017) Surgical or transcatheter aortic-valve
replacement in intermediate-risk patients. N Engl J Med 376(14):1321–1331
Rosell R, Carcereny E, Gervais R et al (2012) Erlotinib versus standard chemotherapy as first-line
treatment for European patients with advanced EGFR mutation-positive non-small-cell lung cancer
(EURTAC): a multicenter, open-label, randomised phase 3 trial. Lancet Oncol 13(3):239–246
Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies
for causal effects. Biometrika 70:41–55
Rosenbaum PR, Rubin DB (1984) Reducing bias in observational studies using subclassification on
the propensity score. J Am Stat Assoc 79:516–524
Rubin DB (2001) Using propensity scores to help design observational studies: application to the
tobacco litigation. Health Serv Outcomes Res Methodol 2:169–188
Rubin DB (2007) The design versus the analysis of observational studies for causal effects: parallel
with the design of randomized trials. Stat Med 26:20–36
Rubin DB (2008) For objective causal inference, design trumps analysis. Ann Appl Stat 2:808–840
Ruschitzka F, Abraham WT, Singh JP et al (2013) Cardiac-resynchronization therapy in heart
failure with a narrow QRS complex. N Engl J Med 369(15):1395–1405
Stone GW, Ellis SG, Cox DA et al (2004) A polymer-based, paclitaxel-eluting stent in patients with
coronary artery disease. N Engl J Med 350(3):221–231
Svensson LG, Tuzcu M, Kapadia S et al (2013) A comprehensive review of the PARTNER trial. J
Thorac Cardiovasc Surg 145(3S):S11–S16
Thourani VH, Kodali S, Makkar RR et al (2016) Transcatheter aortic valve replacement versus
surgical valve replacement in intermediate-risk patients: a propensity score analysis. Lancet 387:
2218–2225
U.S. Food and Drug Administration (2010) Guidance for industry and FDA staff: guidance for the
use of Bayesian statistics in medical device clinical trials. Available at https://www.fda.gov/
downloads/medicaldevices/deviceregulationandguidance/guidancedocuments/ucm071121.pdf.
Accessed 9 Feb 2018
U.S. Food and Drug Administration (2012) Draft guidance on enrichment strategies for clinical
trials to support approval of human drugs and biological products. Available at https://www.fda.
gov/downloads/drugs/guidancecomplianceregulatoryinformation/guidances/ucm332181.pdf.
Accessed 9 Feb 2018
U.S. Food and Drug Administration (2013) Design considerations for pivotal clinical investigations
for medical devices: guidance for industry, clinical investigators, institutional review boards and
Food and Drug Administration Staff. Available at: https://www.fda.gov/downloads/
medicaldevices/deviceregulationandguidance/guidancedocuments/ucm373766.pdf. Accessed
9 Feb 2018
U.S. Food and Drug Administration (2014) In vitro companion diagnostic devices: guidance for
industry and Food and Drug Administration Staff. Available at: https://www.fda.gov/
downloads/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/
UCM262327.pdf. Accessed 9 Feb 2018
U.S. Food and Drug Administration (2016) Draft guidance: Software as a Medical Device (SAMD):
clinical evaluation. Available at: https://www.fda.gov/ucm/groups/fdagov-public/@fdagov-
meddev-gen/documents/document/ucm524904.pdf. Accessed 9 Feb 2018
Yu T, Li Q, Gray G, Yue LQ (2016) Statistical innovations in diagnostic device evaluation. J
Biopharm Stat 26:1067–1077
Yue LQ (2007) Statistical and regulatory issues with the application of propensity score analysis to
non-randomized medical device clinical studies. J Biopharm Stat 17:1–13
Yue LQ, Lu N, Xu Y (2014) Designing pre-market observational comparative studies using existing
data as controls: challenges and opportunities. J Biopharm Stat 24:994–1010
Yue LQ, Campbell G, Lu N, Xu Y, Zuckerman B (2016) Utilizing national and international
registries to enhance pre-market medical device regulatory evaluation. J Biopharm Stat 26:
1136–1145
Zhou X-H, Obuchowski NA, McClish DK (2009) Statistical methods in diagnostic medicine, 2nd
edn. Wiley, New York
Complex Intervention Trials
74
Linda Sharples and Olympia Papachristofi
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1418
Developing the Intervention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1420
Defining the Intervention Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1420
Development of the Package . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1422
Timing of Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1423
Feasibility/Early Phase Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1424
Evaluation/Statistical Methods for Trial Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1425
Individually Randomized Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1425
Cross-Classified Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1428
Cluster Randomized Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1429
Stepped-Wedge Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1430
Sample Size Estimation for Trials with Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1432
Model Fitting and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1433
Reporting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1434
Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1435
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1435
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1435
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1436
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1436
L. Sharples (*)
London School of Hygiene and Tropical Medicine, London, UK
e-mail: [email protected]
O. Papachristofi
London School of Hygiene and Tropical Medicine, London, UK
Clinical Development and Analytics, Novartis Pharma AG, Basel, Switzerland
e-mail: olympia.papachristofi@novartis.com
Abstract
Clinical trial methodology was developed for pharmaceutical drug development
and evaluation. In recent years, trials have expanded to an increasingly diverse
range of interventions.
The term complex intervention describes treatments that are multicomponent
and include clustering due to specific components, such as the healthcare
provider, which cannot be separated from the package of treatment and influence
treatment outcomes. This chapter provides an overview of the main consider-
ations in the design and analysis of complex interventions trials.
Initial development of complex interventions is a multidisciplinary endeavor
and requires rigorous qualitative and quantitative methods. Understanding both
the intervention components and how they interact is crucial for successful
development and evaluation of the intervention.
Published guidance on methods for feasibility, piloting, or early phase trials of
complex interventions is scarce. However, there are well-established methods for
phase III trials of multicomponent interventions that involve clustering. The most
commonly used methods, including individually randomized trials with
random effects for clusters, cluster randomized trials, and stepped-wedge cluster
randomized trials, are described. Analysis focuses on generalized linear (mixed)
models; methods for sample size estimation that accommodate the extra variance
related to clustering are also provided for a range of designs in this setting.
With careful attention to the correlation structure induced by the chosen
design, results can be analyzed in standard statistical software, although small
numbers of clusters, and/or small within-cluster sizes, can cause convergence
problems.
Statistical analysis results of complex interventions trials, including
those relating to components of the intervention, need to be considered
alongside economic, qualitative, and behavioral research to ensure that complex
interventions can be successfully implemented into routine practice.
Keywords
Randomized · Cluster randomized · Clinical trial · Complex intervention ·
Multicomponent · Clustering · Healthcare provider · Stepped-wedge
Introduction
In recent years, clinical trial methodology has been applied in diverse healthcare
settings and to a wide range of interventions beyond fixed-dose, single-drug
treatments. For example, novel surgical procedures (Sharples et al. 2018),
multidisciplinary packages of care for chronic diseases (education, physiotherapy,
medicine) (Dickens et al. 2014), mental health coping strategies (Mohr et al. 2011),
and public health interventions (Emery et al. 2017) have all been the subject of
randomized controlled trials (RCTs). The interventions in these trials are generally
termed complex interventions.
Although complex interventions have no single, universally accepted definition,
there are two main issues that characterize them: (i) the multicomponent nature of the
intervention itself and (ii) a level of dependency between trial participants treated by
the same healthcare provider, or in a group setting, that induces clustering of
outcome measurements. Note that, in this chapter, we use the general term provider
for any person who delivers all or part of a complex intervention, including
surgeons, physicians, therapists, nurses, physiotherapists, and so on.
In (i), what distinguishes complex from non-complex interventions is the fact that
they are made up of a number of components, which may be interdependent (e.g.,
use of rescue treatment after failure of initial therapy) or independent (e.g., treatment
packages comprising psychotherapy, physiotherapy, and health education). Some
components of the intervention package may be of interest themselves and thought
of as fixed effects (e.g., physiotherapy or not). Other components may not be
of interest themselves (nuisance parameters), but may introduce some level of
dependency between trial participants, or clustering. For example, two patients
treated by the same surgeon may have more similar outcomes than two patients
treated by different surgeons, due to differences between surgeons in experience and
expertise, as well as local (to the hospital) population attributes. In general, specific
surgeons are not of interest themselves, but they represent a population of surgeons
who might ultimately provide the intervention of interest (see later on for an example
of (ii)). In this case, providers are better defined and analyzed as random effects (e.g.,
surgeon performing a procedure, therapist providing group cognitive behavioral
therapy). What links these (fixed and random effects) components is that they can
all have an influence on the effectiveness of the package that makes up the complex
intervention, and differences (between patients) in the specific treatment package
received may manifest as heterogeneity in the outcome of interest. The multi-
component nature of the intervention may also mean that multiple outcomes are
used to assess the success of its different components, which may complicate the
choice of primary outcome (Richardson and Redden 2014).
Complex intervention trials may be further complicated due to other factors in
addition to the intervention itself. The context in which interventions are delivered
may be complex, an obvious example being the operating rooms in which surgical
procedures are performed; these contain many instruments, machines, and devices,
operated by multiple interacting clinical disciplines (Blencowe et al. 2015). Public
health trials are clearly dependent on the setting in which they are conducted and on
how that affects the delivery of the intervention (Emery et al. 2017). In particular,
trials in hard-to-reach populations may require designs such as snowballing (index
cases identifying other cases for study participation) that introduce dependency
between participants (Yuen et al. 2013). Low- and middle-income countries may
have diverse healthcare infrastructure, staffing, cultural, and economic factors that
add to the complexity of an intervention and its evaluation (Cisse et al. 2017).
All these factors affect the way a novel intervention is developed, the amount of
standardization required for each component, the assessment of treatment adherence,
the fidelity to the intervention as defined, and the statistical design and analysis of the
trial. In particular, any random effects or clustering inherent in the delivery of the
complex intervention will increase the required sample size compared to a design in which a
simple intervention is applied at the patient level, as independent and identically
distributed patient outcomes cannot be assumed.
General characteristics of complex interventions have been described, with the
Medical Research Council (MRC) in the UK providing an early general framework
for design of such trials in 2000 (Medical Research Council 2000), which
was updated in 2008 (Medical Research Council 2008). Although useful, these
guidelines do not provide detail on specific methods of assessment. It is generally
appreciated that the design and analysis of complex intervention trials require
rigorous quantitative and qualitative research methods, with collaborative work
between, for example, statisticians, behavioral scientists, and clinical experts.
While this chapter provides some general discussion on the use of mixed methods
in this context, the focus is primarily on statistical methods in a broad sense.
Pharmaceutical interventions are manufactured under strict quality control so that the
dose of active drug is known exactly (see, e.g., Pocock 1983). Moreover, delivery
of the drug rarely depends on the context, as it is typically identical across physicians
and settings. In contrast, complex intervention trials are often embedded in the
clinical or public health setting that they aim to address, which influences the trial
design. Moreover, there may be components of the intervention that can be left to the
discretion of the provider, say the type of sutures used in a surgical intervention or
the number of sessions required in a psychotherapy trial. For these reasons, including
differences in the way the intervention is delivered between patients, complex
interventions are naturally evaluated using pragmatic trials (Loudon et al. 2013).
Although there are exceptions (such as evaluations of new diagnostic tests), in many
cases, the evaluation aims to reflect how the intervention would perform in the
setting for which it is intended, rather than in a tightly controlled setting, with highly
selected patients, as is the case in trials assessing drug efficacy.
Nevertheless, to maintain scientific quality and ensure that the intervention can be
reproduced by other providers, a clear definition of exactly what constitutes the
complex intervention, and how it will be delivered in the planned setting, is
paramount. The Template for Intervention Description and Replication (TIDieR)
checklist and guide provides general advice on the reporting of interventions
(Hoffmann et al. 2014); this section provides guidance specific to complex
interventions.
Focusing on surgical trial design, Blencowe et al. (2016) highlighted the impor-
tance of describing each component of a surgical procedure, as well as the level of
standardization and flexibility permitted within the surgical intervention package.
For example, consider the Amaze trial of ablation to treat abnormal heart rhythm in
patients already scheduled for cardiac surgery; the control group received the
scheduled cardiac surgery alone (Sharples et al. 2018). The complete procedure
involved treating 12 different sections of the heart, but because Amaze was designed
to reflect the treatment as it was currently used in the UK (pragmatic trial), the
protocol allowed surgeons to use their judgment and treat as many sections as
considered necessary. In addition, surgeons were also able to use their expertise to
decide whether to apply co-interventions, such as electrical stimulation (cardiover-
sion), if the initial operation was not fully effective. All other procedure components
were fixed according to protocols. The flexibility inherent in Amaze is common in
surgical trials where procedure success depends heavily on the training, skill, and
expertise of the surgeon; as a result, the intervention is a combination of both the
operation delivered and the surgeon who performs it. Similarly, psychotherapy
typically involves a set of protocols and techniques, which must be well-defined,
together with a psychotherapist delivering them. The therapist will have an influence
both on the content and delivery of the package, as well as on patient adherence
(Walwyn and Roberts 2017).
The level of standardization to be considered when defining the intervention has
been discussed in surgical (Blencowe et al. 2016), behavioral (Mars et al. 2013), and
public health (Perez et al. 2018) contexts. The main factors to consider can be
summarized as follows:
(i) Which intervention components are necessary and which are optional?
(ii) Under what circumstances is each component mandatory, prohibited, or
optional?
(iii) For each component, what delivery methods are mandatory, prohibited, or
optional?
(iv) For each component, which delivery methods have a strict definition and which
can be applied flexibly?
(v) What training or competency is required for providers of each component?
In the Amaze trial, for example, the procedure actually delivered reflected
differences between cardiac surgeons. Thus, no two patients can be assumed to have
had an identical procedure, and the surgeon can be considered integral to the
intervention package; therefore, consideration must be given to the criteria for
including a surgeon in the trial. This principle is important for healthcare providers
delivering complex interventions in general.
However, establishing when a provider is sufficiently experienced to be
randomized in an RCT is not straightforward (discussed in detail in the next
section). For instance, learning assessments are complicated because as surgeons
gain experience, they are likely to undertake more high-risk cases. Their perfor-
mance might seem not to improve or even to deteriorate with time, but this may
be due to an increase in the risk in the cases they handle, so that the duration of
their learning period (and hence when they can be randomized in an RCT) may
be overestimated.
Complex interventions cover such diverse areas that detailed discussion of the initial
derivation of the intervention itself is beyond the scope of this chapter. General
guidance on the design and evaluation of complex interventions can be found in the
MRC complex intervention guidance (Medical Research Council 2008);
the IDEAL (Idea, Development, Exploration, Assessment, and Long-term Study)
publications provide a phased development of the intervention, from conception to
final evaluation, for surgical trials (McCulloch et al. 2009), and MOST (Multiphase
Optimization Strategy) describes a framework for behavioral interventions (Collins
et al. 2007).
Development of the intervention package requires input from a range of clinical
and research disciplines. As with all interventions, it is also important to have
a comprehensive knowledge of the state of the art. This may involve some or all
of the following activities:
The following section focuses on how statistical methods may contribute to the
development of complex interventions; details of the use of all above methods in this
context are provided in a review by Richards and Hallberg and references therein
(Richards and Hallberg 2015).
Timing of Evaluation
[Figure: learning-curve panels plotting a measure of performance against a measure of experience.]
Feasibility/Early Phase Studies

The focus is on both hypothesis testing and Bayesian methods to inform the decision to
continue to the phase III trial. Challenges that have been identified include:
(i) The small number of clusters (often defined by the few innovators who
developed the intervention) at early stages
(ii) The small number of patients in each cluster
(iii) The lack of information on cluster sizes and the ICC
(iv) The number of endpoints that may be under investigation, with no clear
decision about the appropriate phase III primary outcome
(v) The lack of data to inform (compound) hypothesis tests and/or Bayesian
utilities when assessing multiple outcomes (e.g., both efficacy and safety)
Although this work and references therein provide some useful guidance, early
phase trial statistical methodology is not yet established in the field.
Evaluation/Statistical Methods for Trial Design

The pivotal phase of evaluation is the phase III RCT assessing clinically important
outcomes, usually in a large sample of patients: see, for example, Pocock (1983). As
the phase III trial aims to estimate treatment efficacy or effectiveness, all aspects of
the intervention package, including treatment protocols, treatment duration, and
safety profiles, should have already been established. In what follows, the most
commonly used designs for phase III complex intervention trials are described.
Individually Randomized Designs

Consider first trials in which randomization, treatment, and outcomes assessment are
all conducted at the individual patient level. In situations where the intervention is to
be evaluated as a package of care, ignoring any random effects, standard trial design,
and analysis methods can be used. In order to introduce some notation, assume a
generalized linear model for patient i, i = 1, . . ., m, in cluster j, with x_{ij} a categorical
covariate representing treatment allocation:
g^{-1}[E(y_{ij})] = \beta_0 + \beta_1 x_{ij} + u_j \qquad (2)
where u_j ~ N(0, \sigma_u^2) are the cluster-specific random intercepts for providers. For the
case of a linear model, the residual error terms for patient i in cluster j, e_{ij} | u_j ~ N(0, \sigma_e^2),
are now independent conditional on cluster occupancy. Such a model is termed a hierar-
chical or nested design (see Fig. 2a). Normally distributed random effects (on some
scale) are almost universally used in the trials’ literature for the primary analysis, with
other distributional assumptions explored in sensitivity analyses.
For the simple nested model (2) the ICC is given by
\rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2} \qquad (3)
We refer to this as the simple ICC. The ICC can be interpreted as the proportion of
the total variation that is attributable to between-cluster variance and is an important
parameter for sample size estimation for phase III trials (see section "Sample Size
Estimation for Trials with Clustering" below). High values for ρ indicate that
intervention delivery is quite heterogeneous between clusters relative to the
within-cluster variation and vice versa.

Fig. 2 Illustration of individually randomized trials with clustering in (a) both trial
arms, (b) the experimental arm only, (c) cross-classified, and (d) multiple membership
multiple classification
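As a concrete illustration of model (2) and the ICC in Eq. (3), the following sketch (with invented variance components) simulates an individually randomized trial with surgeon clustering and recovers the ICC from a linear mixed model:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate 20 surgeons (clusters) with 15 patients each:
# true sigma_u^2 = 0.5 (between-surgeon), sigma_e^2 = 2.0 (residual)
rng = np.random.default_rng(4)
n_clusters, m = 20, 15
u = rng.normal(0.0, np.sqrt(0.5), n_clusters)
df = pd.DataFrame({
    "surgeon": np.repeat(np.arange(n_clusters), m),
    "treat": rng.integers(0, 2, n_clusters * m),  # individual randomization
})
df["y"] = (1.0 + 0.8 * df["treat"] + u[df["surgeon"]]
           + rng.normal(0.0, np.sqrt(2.0), len(df)))

fit = smf.mixedlm("y ~ treat", df, groups=df["surgeon"]).fit()
sigma_u2 = float(fit.cov_re.iloc[0, 0])  # between-cluster variance
sigma_e2 = float(fit.scale)              # residual variance
print("ICC =", sigma_u2 / (sigma_u2 + sigma_e2))  # Eq. (3); ~0.2 here
```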
There are two main alternative scenarios to this simple design: first, the random
components affect outcomes differently in the intervention and control arms, and,
second, the treatment effect varies between clusters within the same arm (i.e.,
random coefficient for treatment).
The first scenario might arise in trials with very different treatment arms, for
example, a trial of a new technically demanding surgery (high variation) compared
with standard surgery (low variation) and can be modeled by
g^{-1}[E(y_{ij})] = \beta_0 + \beta_1 x_{ij} + u_j\,\delta(x_{ij} = 1) + u'_j\,\delta(x_{ij} = 0) \qquad (4)
where u_j ~ N(0, \sigma_u^2) and u'_j ~ N(0, \sigma_{u'}^2) are the cluster-specific random effects in the
treatment and control arms, respectively, and δ is the treatment arm indicator function;
for a linear model, residual error terms are again expressed as e_{ij} | u_j ~ N(0, \sigma_e^2). Such
trials are described as partially nested if the control arm random effects are all zero
(e.g., surgery versus medical management trial; see Fig. 2b). Assuming equal numbers
of patients per cluster and equal numbers of clusters per arm, the ICC can be written as
\rho = \frac{0.5\,(\sigma_u^2 + \sigma_{u'}^2)}{0.5\,(\sigma_u^2 + \sigma_{u'}^2) + \sigma_e^2} \qquad (5)
and comprises three variance terms; the two random effects variances are considered
independent since they are estimated in different clusters.
The second scenario might arise in a novel surgery versus standard surgery trial,
where the between-surgeon variation in outcomes manifests through heterogeneity
in the treatment effect; this can be modeled by
g^{-1}[E(y_{ij})] = \beta_0 + \beta_1 x_{ij} + u_j + u'_j x_{ij}, \qquad (6)

where u_j ~ N(0, \sigma_u^2) and u'_j ~ N(0, \sigma_{u'}^2) are the cluster-specific random effects on the
intercept and treatment coefficient, respectively; for a linear model, e_{ij} | u_j, u'_j ~ N(0, \sigma_e^2)
are residual error terms. In the random coefficient model, correlation between the two
random effects parameters is possible (\sigma_{uu'} = r\,\sigma_u \sigma_{u'}), and the ICC can be written as

\rho = \frac{\sigma_u^2 + \sigma_{u'}^2 + 2\sigma_{uu'}}{\sigma_u^2 + \sigma_{u'}^2 + 2\sigma_{uu'} + \sigma_e^2} \qquad (7)
Cross-Classified Designs
In a cross-classified design (see Fig. 2c), outcomes are influenced by two crossed,
rather than nested, random components, u and v, for example, two different types of
provider who each contribute to a patient's care. When the two providers are
independent, with u_j ~ N(0, \sigma_u^2) and v_k ~ N(0, \sigma_v^2), the ICC is

\rho = \frac{\sigma_u^2 + \sigma_v^2}{\sigma_u^2 + \sigma_v^2 + \sigma_e^2} \qquad (9)
When the two providers are not independent, resulting in correlated random
effects (\sigma_{uv} = r\,\sigma_u \sigma_v), the ICC can be written as

\rho = \frac{\sigma_u^2 + \sigma_v^2 + 2\sigma_{uv}}{\sigma_u^2 + \sigma_v^2 + 2\sigma_{uv} + \sigma_e^2} \qquad (10)
In this case of two correlated random components, the ICC requires three
variance components and one covariance component; if further random effects components
were to be accommodated, the number of terms in the models and the complexity
of the interrelationships between components would increase. Therefore, investiga-
tion of crossed components should focus on a small number of components that have
been identified as most important during trial design.
When designing complex intervention trials, it is essential to have a clear
understanding of the different components of variation and how these affect the
treatment effect estimates. For interventions with several components, high-level
interactions between fixed and random effects are difficult to estimate robustly. Our
recommendation is to rank components according to the level of clustering in the
primary outcome(s) and investigate them in a stepwise manner (Papachristofi et al.
2016b).
A generalization of the cross-classified model is the multiple membership multi-
ple classification (MMMC) model, which considers a second random component
that operates across several elements of the first (see Fig. 2d). An example of this
would be a surgical intervention that requires more than one surgeon to be involved,
or a psychological intervention provided by more than one therapist, say in a group
session. Each provider might work with a number of other providers during the trial.
Details of these models can be found in Browne et al. (2001).
Cluster Randomized Trials

The parallel cluster randomized controlled trial (CRCT) is an established design for
evaluating interventions that are either randomized to all patients within a predefined
group (cluster) or where the intervention is delivered at a group level (Eldridge and
Kerry 2012). Examples include trials involving care of dementia patients (Surr et al.
2016), primary care (Emery et al. 2017), and hospitals where interventions are
applied to individual wards or clinics (Erasmus et al. 2011). CRCTs are also an
attractive option when individuals within the same geographical or clinical area can
access information from other trial participants, so that individual randomization
may lead to contamination of treatment effects. For example, patients attending the
same clinic may share educational resources or coping strategies from a multi-
component intervention, despite being assigned to different treatment arms.
In the simplest CRCT, assuming common cluster size m, the statistical model for
patient i, i = 1, . . ., m, in cluster j, j = 1, . . ., c, with x_j the categorical covariate
representing treatment allocation for cluster j, can be specified as follows:

g^{-1}[E(y_{ij})] = \beta_0 + \beta_1 x_j + u_j \qquad (11)
where y_{ij} is a measure of outcome, g is an appropriate link function, and u_j ~ N(0, \sigma_u^2)
are the cluster-specific random effects; for a linear model, e_{ij} | u_j ~ N(0, \sigma_e^2) are residual
error terms. In this model, all patients in a cluster receive the same treatment, and the
treatment effect is common across clusters. Note that further patient- or cluster-
specific covariates have been omitted for clarity, although it is straightforward to
include them in these models; generalized linear models are again used for illustra-
tion, but methods for clustered time-to-event outcomes are also available.
Model (11) reflects a trial in which the outcome is assessed at the individual
participant level, but the same framework (with a suitable link function and error
structure) can accommodate cluster-level outcomes, which are common in CRCTs.
For example, the model in (11) is appropriate if hospitals as a whole are randomized
to active treatment or control, but outcomes are assessed for each patient, perhaps so
that patient-level covariates can be adjusted for. Alternatively, analysis could be
based on hospital-level outcomes, such as the proportion of patients per cluster that
have a successful outcome. These aggregated outcomes can be considered indepen-
dent and analyzed using standard statistical methods, with the outcomes weighted by
their precision or the number of cases within each cluster. While this approach results
in a simple analysis, it does not allow for inclusion of patient-level covariates and
thus may be less efficient than an individual patient data analysis. The associated
ICC is identical to the simple ICC for the two-level hierarchical model in Eq. (3).
Stepped-Wedge Designs
Fig. 3 Illustration of a stepped-wedge cluster randomized trial. Each cell represents a cluster, and
each time period after the first represents a step. Darker cells indicate clusters having the
experimental treatment, and lighter cells represent control clusters
The most commonly used SWD is cross-sectional, in that new patients are recruited in each time period, and even though they may be
followed up over time for an event, the outcome is analyzed according to the period
of recruitment (e.g., see the FIT trial of hand hygiene compliance (Fuller et al.
2012)). An alternative, but less frequently used, design is the cohort SWD, in which
all patients are recruited in the first period and followed throughout the trial, with a
proportion of the patient clusters randomized to switch to the active intervention at
the end of each period. For example, Jordan et al. (2015) conducted a trial in which
dementia patients in care homes were switched to nurse-led prescribing of medica-
tions in a series of steps, until all patients were in the intervention arm.
The choice between the SWD and the traditional CRCT depends on the context
and nature of the intervention. The SWD is particularly suited to service delivery or
healthcare policy interventions, for which traditional CRCTs are logistically difficult
to implement. However, as in the SWD all clusters are exposed to the active
intervention by the end of the trial, it is more difficult to revert to the control
treatment if the experimental intervention proves to be ineffective. Thus, the SWD
is only appropriate if the intervention has negligible side effects and if it is consid-
ered very unlikely to be less effective than the standard of care. For example, it is
difficult to imagine how provision of sterile hand wash could introduce substantial
risk of side effects or an increase in infectious episodes.
Because SWD trials are conducted over a series of periods, they are not appro-
priate for interventions requiring long-term follow-up in order to establish an effect
or when treatment effects vary over time.
Analysis for a SWD uses a model with random effects for clusters, and fixed
effects for time periods, i.e., for patient i in cluster j at time period k:
g^{-1}[E(y_{ijk})] = \beta_0 + \beta_1 x_{ijk} + \omega_k + u_j \qquad (12)

where \omega_k are the fixed effects for time periods and u_j ~ N(0, \sigma_u^2) are the
cluster-specific random effects.
Analysts may consider additional exploratory analyses that relax these assump-
tions using strata (groups of clusters)-by-period interactions or strata-by-treatment
interactions.
Sample Size Estimation for Trials with Clustering
The sample size for trial designs with clustering is affected both by the cluster size
and the number of clusters. Methods for sample size estimation have been published
for a wide range of individually randomized trials with clustering (Walwyn and
Roberts 2010), CRCTs (Eldridge and Kerry 2012), and SWDs (Hemming and
Taljaard 2016). The general principles and some simple calculations based on
normally distributed random effects are provided in this section.
Typically, sample size estimation for trials with clustering involves calculation of
a design effect (DE), used to inflate the sample size estimate from the corresponding
trial design for independent, identically distributed outcomes; the simplest version is
given below.
The standard sample size estimate n for each arm of an individually randomized
parallel two-group trial, with 1:1 allocation ratio and a normally distributed outcome,
is given by
n = \frac{(Z_{\alpha/2} + Z_{1-\beta})^2\, 2\sigma_e^2}{\Delta^2}, \qquad (13)
where Δ is the target mean difference in the outcome between the two arms, Zp is the
pth percentage point of the standard Normal(0, 1) distribution, and α and β are type I
and II error probabilities, respectively.
In a similar CRCT, where all patients in a cluster receive the same treatment, and
the cluster size is fixed and known, the number of patients per arm n can be
calculated using
n = \frac{(Z_{\alpha/2} + Z_{1-\beta})^2\, 2\sigma_e^2}{\Delta^2}\,\left(1 + (m - 1)\rho\right), \qquad (14)
where m is the fixed cluster size and ρ the ICC; the DE term (1 + (m − 1)ρ) is the
inflation factor due to clustering (Donner et al. 1981). In this simple case, the number
of clusters required per arm would be c ¼ n/m. Note that the DE increases with the
number of patients per cluster and the size of the ICC. For clusters of size 1 or ICC of
zero (i.e., no within-cluster correlation), the DE reduces to one, and the sample size
formula reverts to that for the independent, identically distributed outcomes design.
In general, keeping the number of clusters fixed and increasing the within-cluster
sample size is not efficient, in that such a strategy will require more trial partici-
pants overall than if the number of clusters were to be increased.
When the cluster sizes are unequal, as is the case for most trials, the DE relies on
knowing the mean and coefficient of variation of the cluster sizes and becomes

DE = 1 + \left(\bar{m}\,(1 + cv^2) - 1\right)\rho, \qquad (15)

where \bar{m} is the average cluster size and cv is the coefficient of variation for the cluster
sizes (see, e.g., Eldridge et al. (2006)). The number of patients required per arm is
larger than for the equal cluster size case and increases as the variability between the
cluster sizes increases. The DE for unequal cluster sizes given in Eq. (15) is the most
commonly used; Eldridge et al. (2006) provide alternative formulations and further
discussion.
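Equations (13), (14), and (15) translate directly into code; the sketch below (with illustrative inputs) computes the number of patients per arm for equal and unequal cluster sizes:

```python
from math import ceil
from scipy.stats import norm

def n_per_arm_crct(delta, sigma_e, m_bar, rho, cv=0.0, alpha=0.05, power=0.80):
    """Patients per arm for a parallel CRCT: Eq. (13) inflated by the
    design effect of Eq. (14)/(15); cv = 0 recovers equal cluster sizes."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    n_iid = 2 * z**2 * sigma_e**2 / delta**2      # Eq. (13)
    de = 1 + (m_bar * (1 + cv**2) - 1) * rho      # Eq. (15)
    return ceil(n_iid * de)

# Detect a difference of 0.5 SD (sigma_e = 1), clusters of ~20, ICC = 0.05
print(n_per_arm_crct(0.5, 1.0, 20, 0.05))          # equal cluster sizes
print(n_per_arm_crct(0.5, 1.0, 20, 0.05, cv=0.6))  # unequal cluster sizes
```

With these inputs the unclustered requirement of about 63 patients per arm roughly doubles once the design effect is applied, and grows further as the cluster sizes become more variable.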
Note that the above inflation factors can be applied to any sample size calcula-
tions for which the treatment effect is estimated from a generalized linear model; for
example, Δ in Eq. (14) may represent the log odds ratio or log rate ratio for the
treatment effect, with appropriate values for σ 2e .
Advocates of SWD have argued that the additional time periods involved render
them more efficient than parallel CRCTs, requiring fewer patients (Woertman et al.
2013). However, this does not appear to be true in all cases, with the relative efficiency
of the two approaches dependent on design parameters such as the ICC, the size and
number of clusters, and the number of periods (Hemming and Taljaard 2016).
Assuming that the number of time periods (steps) s, and the cluster size m have
been fixed, the DE for a simple SWD has been derived as
DE = (s + 1)\,\frac{1 + \rho(sm + m - 1)}{1 + \rho(sm/2 + m - 1)} \cdot \frac{3s(1 - \rho)}{2(s^2 - 1)} \qquad (16)
The sample size needed per time period can be obtained by equating this with the
total sample size for a SWD (m(s + 1)c) and solving a quadratic equation (Hemming
and Taljaard 2016). Close inspection of the DE shows that this will be most efficient
with large numbers of time periods, or steps, and small numbers of clusters at each
step.
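Read literally, Eq. (16) can be coded as follows to explore how the design effect varies with the number of steps; the inputs are illustrative, and results should be checked against Hemming and Taljaard (2016) before use:

```python
def de_stepped_wedge(s, m, rho):
    """Design effect for a simple cross-sectional SWD with s steps and
    m patients per cluster per period, following Eq. (16) as printed."""
    num = 1 + rho * (s * m + m - 1)
    den = 1 + rho * (s * m / 2 + m - 1)
    return (s + 1) * (num / den) * (3 * s * (1 - rho)) / (2 * (s**2 - 1))

for s in (3, 6, 12):
    print(f"{s:2d} steps: DE = {de_stepped_wedge(s, m=20, rho=0.05):.2f}")
```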
Reporting
Implementation
Summary and Conclusion

Interest in complex interventions has grown substantially in recent years. They are
characterized by multicomponent treatment packages and clustering due to specific
components, such as the provider and implementation setting, which cannot be
separated from the package of treatment and have an influence on the outcome of
treatment. Development of complex interventions is a multidisciplinary endeavor and
requires a mixture of rigorous qualitative and quantitative methods. Although research
on feasibility, piloting, and early phase trials is sparse, there are well-established
methods for phase III trials of multicomponent interventions that involve clustering.
The most commonly used methods are individually randomized trials with random
effects for clusters, cluster randomized trials, and, more recently, stepped-wedge
cluster randomized trials. Sample size estimation methods exist for a range of designs.
With careful attention to the correlation structure induced by the chosen design, results
can be analyzed in standard statistical software. Post-trial implementation in routine
practice will depend on statistical, economic, qualitative, and behavioral analysis.
Key Facts
• Both qualitative and quantitative methods are required in the development and
evaluation of complex interventions.
• Timing of the definitive evaluation of a complex intervention needs careful
consideration, taking into account its stage of development and treatment
equipoise of both patients and healthcare providers.
• Complex interventions are usually evaluated in pragmatic trials since they are
often embedded in the healthcare context they are intended for.
• A range of well-established trial designs for complex interventions are available,
including clustered individually randomized trials, cluster randomized trials, and
stepped-wedge designs.
• With careful definition of the correlation structure resulting from their design,
trials can typically be analyzed using standard statistical software packages.
Randomized Discontinuation Trials
75
V. V. Fedorov
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1440
Example: AD Design Versus Two-Arm RCT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1442
Notations and Major Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1442
Conventional Randomized Clinical Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1444
Amery-Dony Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1445
Symmetric AD-Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1446
Sample Size Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1447
Clinical Applicability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1448
Generalization of Results to the Source Population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1448
Ethical Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1449
Classification Hurdles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1450
Place of RDT Designs in the Family of Enrichment Trial Designs . . . . . . . . . . . . . . . . . . . . . . . 1450
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1451
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1451
Abstract
Randomized discontinuation trials are usually considered a special case of
population enrichment trials. Most often they consist of a few stages, with the
early ones targeting the selection of a subpopulation that responds or may respond
to an experimental treatment and allowing an early escape of patients doing
poorly. The succeeding stages validate the treatment’s superiority to placebo or
more generally to a comparator for the selected (enriched with potential
responders) subpopulation. The approach increases the trial feasibility and the
chances of correct response-to-treatment detection, often at the expense of
decreased ability to extrapolate results over the initially targeted population.
The first randomized discontinuation trials were used exclusively to test long-
term, non-curative therapies for chronic or slowly evolving diseases. The develop-
ment of stochastic longitudinal models combined with Bayesian ideas and
V. V. Fedorov (*)
ICON, North Wales, PA, USA
Keywords
Enrichment trials · Population enrichment · Randomized discontinuation trials ·
Screening · Subpopulation selection
Introduction
The basic idea of randomized discontinuation trials (RDTs) can be traced back to
the 1970s when Amery and Dony (1975) proposed the trial design (see Fig. 1) that
consists of an open-label stage with all patients exposed to the experimental treat-
ment, while at the second stage, all responders are randomized to placebo or an
experimental treatment arm. They showed that under rather mild assumptions, such
designs reduce patients’ exposure to placebo and mitigate the impact of placebo
responders on drug efficacy evaluation. Initially, RDTs were intended for chronic
and slowly progressing diseases such as various types of angina, psychiatric disorders,
early stages of Parkinson’s and Alzheimer’s diseases, pain mitigation, and
• The experimental treatment will not cure the condition during the open-label
stage.
• The treatment effect(s) of the open-label stage will not be carried over to the
second stage. This assumption can be satisfied by adding a washout period
between the two stages, if that is ethically admissible.
• Following Amery and Dony (1975), most publications assume that the treatment
effect is binary: either there is or there is not a response to treatment. Accordingly,
patients are labeled as either responders or nonresponders.
Later versions of RDT designs use slightly modified assumptions. For instance, instead
of "responders and non-responders," one may consider patients with "positive
response, stable disease, and negative response," or instead of a stable tumor size, one
may consider a stationary tumor growth rate. The introduction of longitudinal models,
as in Trippa et al. (2012), is a crucial part of the successful design of such RDTs. The list
of settings could be continued, but in all of them there are two major steps:
population enrichment and treatment validation for the enriched subpopulation.
The observed responses to treatment may be continuous (e.g., blood
pressure, duration of anginal pain), discrete (e.g., frequency of angina), or ordinal
(e.g., various scores in pain studies). Their dichotomization may lead to a significant
loss of information (see Fedorov, Mannino, and Zhang 2008; Uryniak et al. 2011)
and may be questionable when it is based solely on the "investigator's judgment of
success" (see Temple 1994). To minimize the impact of dichotomization, one can
use the dichotomized data only to guide patient assignment to the different treatment
arms and perform the final statistical analysis on the original, non-dichotomized data.
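To get a feel for the size of this information loss, the following simulation sketch compares estimating an exceedance probability from dichotomized responses with estimating it from the raw continuous responses; the normal response model, cutoff, and sample size are illustrative assumptions, not from this chapter.

```python
# Sketch: information loss from dichotomizing a continuous response.
# We estimate p = P(X > c) for X ~ N(mu, sigma^2) in two ways:
# (a) the proportion of dichotomized "responders",
# (b) a plug-in estimate from the raw data, Phi((xbar - c)/s).
# Model, cutoff, and sample size are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma, c, n, reps = 0.0, 1.0, 0.5, 100, 20000

est_dichotomized, est_continuous = [], []
for _ in range(reps):
    x = rng.normal(mu, sigma, n)
    est_dichotomized.append(np.mean(x > c))                               # (a)
    est_continuous.append(norm.sf(c, loc=x.mean(), scale=x.std(ddof=1)))  # (b)

print("Var, dichotomized:", np.var(est_dichotomized))
print("Var, raw data    :", np.var(est_continuous))
# The raw-data estimator is markedly more precise, illustrating why the
# final analysis should use the original, non-dichotomized responses.
```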
Capra (2004) compared the power of RDT with that of RCT when the primary
endpoint is time-to-disease progression. Kopec et al. (1993) evaluated the utility and
efficiency of RDT when the endpoints are binary. They compared the relative sample
size required for a fixed power of RDT versus RCT under different scenarios and
parameter settings. Both approaches are based solely on outcomes from the
second stage, treating the open-label stage as a screening process. This simplifies
the statistical analysis, but the information contained in the open-label stage is
mostly wasted. Examples where the statistical analysis includes the information
from both the open-label and the treatment validation stages can be found in Fedorov
and Liu (2005, 2014), and Ivanova, Qaqish, and Schoenfeld (2010).
The next section presents the statistical aspects of the Amery-Dony design
(AD design) and its symmetric version, complemented by a comparison with
conventional randomized trial designs. Both types of AD designs are very basic
versions of RDT designs; however, their consideration allows illustration and
discussion of the major properties that are common across all RDT designs published
so far and/or currently used in clinical trials. Section "Clinical Applicability" will
address the major concerns associated with RDT implementation in medical
practice.
If the assumption of random sampling holds, then the following working model is
used:
• There exist infinitely many eligible patients of three mutually exclusive types:
treatment-only responders, placebo responders, and nonresponders.
• At each draw, the probabilities that a sampled patient belongs to one of these
categories are πt, πp, and π− = 1 − πt − πp, respectively.
• An experiment consists of n random drawings with n = nt + np + n−, where nt, np,
and n− = n − nt − np are the numbers of treatment-only responders, placebo
responders, and nonresponders, respectively. In what follows, the triplet {n, nt, np}
will be called the "complete data set."
• The sampled nt, np, n− have a trinomial distribution (see Johnson, Kotz, and
Balakrishnan 1997, Chap. 35) with parameters (n, πt, πp, π−).
If nt and np are known exactly, then the maximum likelihood estimators (MLE)
of πt, πp, and π− are very simple and readily available (cf. Johnson, Kotz, and
Balakrishnan 1997, Chap. 35.6):

$$\hat{\pi}_t=\frac{n_t}{n},\qquad \hat{\pi}_p=\frac{n_p}{n},\qquad \hat{\pi}_-=1-\frac{n_t+n_p}{n}. \tag{1}$$
The variance-covariance matrix of the first two (the third one, $\hat{\pi}_-$, is their linear
combination) is

$$\operatorname{Var}\!\left(\hat{\pi}_t,\hat{\pi}_p\right)=\frac{1}{n}\begin{pmatrix}\pi_t(1-\pi_t) & -\pi_t\pi_p\\ -\pi_t\pi_p & \pi_p(1-\pi_p)\end{pmatrix}. \tag{2}$$
Given n, this matrix provides the lower bounds for the variances of any estimators
of πt, πp, or any function of them. The latter is often called the “estimand” (cf. FDA
Guidance for Industry 2019a).
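As a quick numerical companion to Eqs. (1) and (2), the sketch below computes the complete-data MLEs and their covariance matrix; the counts are illustrative assumptions.

```python
# Sketch: complete-data MLEs (Eq. 1) and their covariance matrix (Eq. 2).
# The counts below are illustrative assumptions.
import numpy as np

n, n_t, n_p = 200, 50, 30           # total, treatment-only responders, placebo responders
pi_t_hat = n_t / n                   # Eq. (1)
pi_p_hat = n_p / n
pi_minus_hat = 1 - (n_t + n_p) / n

# Eq. (2): trinomial covariance of (pi_t_hat, pi_p_hat), evaluated at the MLEs
cov = (1 / n) * np.array([
    [pi_t_hat * (1 - pi_t_hat), -pi_t_hat * pi_p_hat],
    [-pi_t_hat * pi_p_hat,      pi_p_hat * (1 - pi_p_hat)],
])
print(pi_t_hat, pi_p_hat, pi_minus_hat)
print(cov)
```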
If the function of interest ψ(π), with π = (πt, πp)T, is smooth enough, then the
variance of its MLE $\hat{\psi}=\psi(\hat{\pi}_t,\hat{\pi}_p)$ is

$$\operatorname{Var}[\hat{\psi}]\simeq \frac{\partial\psi}{\partial\pi^{T}}\,\operatorname{Var}(\hat{\pi})\,\frac{\partial\psi}{\partial\pi}. \tag{3}$$
For a linear function ψ, formula (3) is exact; otherwise, it is valid asymptotically
as n → ∞. For a reasonably large sample size n and moderate πt and πp, formula (3)
serves as a good approximation and is often dubbed the "delta rule" or "delta method"
(cf. Oehlert 1992). For the two popular cases ψ(π) = πt + πp = π+ and
ψ(π) = πt/(πt + πp) = Rt, the delta method readily provides

$$\operatorname{Var}[\hat{\pi}_+]=\frac{\pi_+(1-\pi_+)}{n} \quad\text{and}\quad \operatorname{Var}\big[\hat{R}_t\big]=\frac{R_t(1-R_t)}{n\,(\pi_t+\pi_p)}. \tag{4}$$
Formulae (2), (3), and (4) will be used as benchmarks for the statistical efficiency of
various trial designs when that efficiency is expressed in terms of variances or, more
generally, covariance matrices of parameter estimators. It should be emphasized that
these benchmarks are attainable only when it is possible to extract the complete data
set {n, nt, np} from the original data. In most cases, it is not.
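The delta method of Eqs. (3) and (4) is easy to verify numerically. The sketch below checks the closed-form variance of R̂t against Eq. (3) with a numerically differentiated gradient; all parameter values are illustrative assumptions.

```python
# Sketch: verifying the delta-method variance of R_t = pi_t/(pi_t + pi_p), Eq. (4),
# against Eq. (3) with a numerically differentiated gradient. Values are illustrative.
import numpy as np

n, pi_t, pi_p = 200, 0.25, 0.15
cov = (1 / n) * np.array([
    [pi_t * (1 - pi_t), -pi_t * pi_p],
    [-pi_t * pi_p,      pi_p * (1 - pi_p)],
])

def psi(p):                       # psi(pi) = R_t
    return p[0] / (p[0] + p[1])

# Numerical gradient of psi at (pi_t, pi_p)
eps = 1e-6
p0 = np.array([pi_t, pi_p])
grad = np.array([(psi(p0 + eps * e) - psi(p0 - eps * e)) / (2 * eps)
                 for e in np.eye(2)])

var_delta = grad @ cov @ grad                        # Eq. (3)
R_t = pi_t / (pi_t + pi_p)
var_closed = R_t * (1 - R_t) / (n * (pi_t + pi_p))   # Eq. (4)
print(var_delta, var_closed)     # the two values agree
```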
Conventional Randomized Clinical Trial
The conventional two-arm RCT (see Fig. 1) is set up as follows. Suppose that
n patients are randomly sampled from a legitimate population; out of them, n1
patients are randomized to the treatment arm and the remaining n2 = n − n1 patients
to the placebo arm.
Let n1+ and n2p be the numbers of responders observed on the treatment arm and
the placebo arm, respectively. Note that from these two outcomes, the exact values
of nt or np are not available and estimator (1) cannot be used. However, the MLE of
π+, πp, and πt together with their variances can be built using the observed n1+ and
n2p:
$$\hat{\hat{\pi}}_+=\frac{n_{1+}}{n_1} \quad\text{and}\quad \operatorname{Var}\big[\hat{\hat{\pi}}_+\big]=\frac{\pi_+(1-\pi_+)}{n_1}, \tag{5}$$

$$\hat{\hat{\pi}}_p=\frac{n_{2p}}{n_2} \quad\text{and}\quad \operatorname{Var}\big[\hat{\hat{\pi}}_p\big]=\frac{\pi_p(1-\pi_p)}{n_2}, \tag{6}$$

$$\hat{\hat{\pi}}_t=\hat{\hat{\pi}}_+-\hat{\hat{\pi}}_p \quad\text{and}\quad \operatorname{Var}\big[\hat{\hat{\pi}}_t\big]=\frac{\pi_+(1-\pi_+)}{n_1}+\frac{\pi_p(1-\pi_p)}{n_2}, \tag{7}$$
see notations in Fig. 1. As expected, the estimators presented in Eqs. (5), (6), and (7)
have variances greater than the respective lower bounds (2) and (4) for any choice
of n1 and n2 with n1 + n2 = n. The validity of this statement for the first two is obvious.
To verify the same statement for the treatment effect estimator $\hat{\hat{\pi}}_t$, one may use the
following chain of inequalities:
$$\min_{n_1,n_2;\,n_1+n_2=n}\operatorname{Var}\big[\hat{\hat{\pi}}_t\big]=\frac{1}{n}\left[\sqrt{\pi_+(1-\pi_+)}+\sqrt{\pi_p(1-\pi_p)}\right]^{2}\geq\frac{1}{n}\left[\pi_+(1-\pi_+)+\pi_p(1-\pi_p)\right]\geq\frac{1}{n}\,\pi_t(1-\pi_t), \tag{8}$$

where the optimal allocation ratio is $n_1/n_2=\sqrt{\pi_+(1-\pi_+)}\,\big/\sqrt{\pi_p(1-\pi_p)}$
(cf. Piantadosi 1997, Chap. 9.6). Note that the lower bound πt(1 − πt)/n is reached
only when πp = 0 and n1 = n, i.e., for a trial with all patients allocated to the
treatment arm.
In a practical setting, the variance of the estimated fraction πt of treatment
responders significantly exceeds its lower bound πt(1 − πt)/n, especially in the
presence of numerous placebo responders. This fact motivated the development
of trial designs that mitigate the negative impact of placebo responders on the
statistical properties of trials. Usually, placebo responders and treatment
nonresponders are withdrawn from the next trial stage to avoid their exposure
to useless or harmful treatment(s).
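The chain in Eq. (8) can be evaluated numerically. The sketch below computes the optimized RCT variance of the treatment effect estimator and compares it with the complete-data lower bound; the parameter values are illustrative assumptions.

```python
# Sketch: optimized two-arm RCT variance of the treatment effect (Eq. 8)
# versus the complete-data lower bound. Parameter values are illustrative.
import numpy as np

n, pi_t, pi_p = 200, 0.2, 0.2
pi_plus = pi_t + pi_p

a = pi_plus * (1 - pi_plus)         # binomial variance factor, treatment arm
b = pi_p * (1 - pi_p)               # binomial variance factor, placebo arm

var_opt_rct = (np.sqrt(a) + np.sqrt(b)) ** 2 / n    # minimized over n1 + n2 = n
lower_bound = pi_t * (1 - pi_t) / n                 # complete-data benchmark
ratio_opt = np.sqrt(a) / np.sqrt(b)                 # optimal n1/n2

print(f"optimal n1/n2 = {ratio_opt:.3f}")
print(f"Var (optimized RCT) = {var_opt_rct:.5f}  >=  bound {lower_bound:.5f}")
```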
Amery-Dony Design
In the Amery and Dony design (AD design), at the open-label stage all qualified
n patients are assigned to the treatment arm; see Fig. 2. After completion of this
stage, all n− nonresponders leave the trial, while all n+ responders, including the
placebo responders, are randomized between the placebo and treatment arms. Let
n1+ and n2+ = n+ − n1+ be the numbers of patients assigned to the treatment and
placebo arms, respectively. It is assumed that one of the restricted randomization
methods (for instance, randomization in blocks or an urn method; cf. Piantadosi
1997, Chap. 9.3) is applied to keep the ratio n2+/n1+ as close as possible to the
targeted allocation ratio γ/(1 − γ). Most statements and derivations that follow will
use this allocation ratio.
The results of the second stage are the number of placebo responders n2p out of
the n2+ responders assigned to the placebo arm, the number of treatment-only
responders n2t = n2+ − n2p, and the number of responders n1+ on the treatment
arm; the latter is identical to the number of randomized patients n1+ and does not
provide any new information. Hence, the most informative, albeit not very ethical,
AD design would be executed with γ = 1.
The straightforward application of the maximum likelihood method leads
(cf. Fedorov and Liu 2014) to the following MLE of πt:

$$\hat{\pi}_t=\frac{n_{2t}}{\gamma n} \quad\text{and}\quad \operatorname{Var}(\hat{\pi}_t)=\frac{1}{n}\left[\pi_t(1-\pi_t)+\frac{1-\gamma}{\gamma}\cdot\frac{\pi_t\pi_p}{\pi_t+\pi_p}\right]. \tag{9}$$
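A short sketch (with illustrative parameter values) shows how the AD-design variance in Eq. (9) shrinks toward the complete-data bound πt(1 − πt)/n as the second-stage allocation fraction γ grows:

```python
# Sketch: Var(pi_t_hat) for the AD design (Eq. 9) as a function of the
# second-stage placebo allocation fraction gamma. Values are illustrative.
n, pi_t, pi_p = 200, 0.2, 0.2
for gamma in (0.25, 0.5, 0.75, 1.0):
    var = (pi_t * (1 - pi_t)
           + (1 - gamma) / gamma * pi_t * pi_p / (pi_t + pi_p)) / n
    print(f"gamma = {gamma:.2f}:  Var = {var:.5f}")
# At gamma = 1 the variance equals the complete-data bound pi_t(1 - pi_t)/n.
```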
Symmetric AD-Design
The analogy between RDT and cross-over designs opens the way to various
modifications of RDT. For instance, similar to the conventional RCT, at the onset of
the trial patients can be randomized to two compound arms. The first one, the
treatment-placebo arm (TP arm), starts with the experimental treatment, followed by
discontinuation of nonresponders and assignment of placebo to the responders. The
second one, the placebo-treatment arm (PT arm), starts with a placebo run-in,
followed by discontinuation of placebo responders and assignment of the experimental
treatment to the rest. See Fig. 3 for details and notations.
Unlike the traditional cross-over design (cf. Piantadosi 1997, Chap. 16; Senn 1997,
Chap. 17), the number of patients at the second stage is smaller than at the first stage:
nonresponders leave the TP arm and placebo responders leave the PT arm. Under the
assumptions made in section "Notations and Major Assumptions," the withdrawal of
patients (n1− from the TP arm and n2p from the PT arm) does not lead to any loss of
information but improves the ethical profile of the respective clinical trials.
Let n1 and n2 patients be randomized to the TP and PT arms, respectively. As in
the previous section, n = np + nt + n− and only n is known before the trial. After the
completion of the first stage, the number of nonresponders n1− for the TP arm and
the number of placebo responders n2p for the PT arm will be known. These two
groups leave the trial. All n1+ = n1p + n1t responders to treatment are assigned to
placebo, and all n2− + n2t placebo nonresponders are assigned to the experimental
treatment. One can observe that nt = n1t + n2t, np = n1p + n2p, and n− = n1− + n2−;
i.e., n, nt, np, and n− are known, and the MLE (Eq. 1) can be calculated.
Sample Size Calculation

Required Variance
Given the sample size, the variances of estimated parameters (πt, πp, Rt, etc.) can be
found using formulae (2), (3), (4), (5), (6), (7), (8), (9), and (10). The same equations
allow calculation of sample sizes that are needed to reach a required variance V of
the estimated parameters. For instance, for estimating πt, the required sample size is

$$n'=\frac{v'}{V}=\frac{1}{V}\left[\sqrt{\pi_+(1-\pi_+)}+\sqrt{\pi_p(1-\pi_p)}\right]^{2} \tag{11}$$

for the optimized RCT design and

$$n''=\frac{v''}{V}=\frac{1}{V}\,\pi_t(1-\pi_t) \tag{12}$$

for the symmetric AD design; see Eq. (2) and the comments to Eq. (10).
For an illustration, consider a toy scenario with the prior guesses πt = 0.2 and
πp = 0.2 and the required V = 0.0044, i.e., √V ≈ πt/3. From Eqs. (11) and (12) it
follows that v' = 0.79 and v'' = 0.16, and respectively n' = 180 and n'' = 36 (after
rounding to the next integer).
The inverses of v' and v'' can be interpreted as the (Fisher) information gained
per subject/observation (cf. Atkinson et al. 2014; Fedorov 1972). Formulae (2), (3),
(4), (8), and (9) provide v' and v'' for various designs and estimands by setting n = 1.
Note that the ratio n'/n'' depends only on v'/v'':

$$\frac{n'}{n''}=\frac{v'}{v''}=\frac{\left[\sqrt{\pi_+(1-\pi_+)}+\sqrt{\pi_p(1-\pi_p)}\right]^{2}}{\pi_t(1-\pi_t)}, \tag{13}$$

and the design efficiency ordering is the same whether it is expressed in terms of
variances given the sample size or in terms of sample sizes given the variance.
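The toy calculation above is straightforward to reproduce. The sketch below implements Eqs. (11), (12), and (13) with the prior guesses used in the text (V is taken as (πt/3)²):

```python
# Sketch: sample sizes for a required variance V (Eqs. 11-12) and their
# ratio (Eq. 13), with the toy values pi_t = pi_p = 0.2 from the text.
import math

pi_t, pi_p = 0.2, 0.2
pi_plus = pi_t + pi_p
V = (pi_t / 3) ** 2                                   # required variance, ~0.0044

v1 = (math.sqrt(pi_plus * (1 - pi_plus))
      + math.sqrt(pi_p * (1 - pi_p))) ** 2            # v',  optimized RCT
v2 = pi_t * (1 - pi_t)                                # v'', symmetric AD design

print(f"v'  = {v1:.2f}, n'  = {math.ceil(v1 / V)}")   # compare with n'  = 180
print(f"v'' = {v2:.2f}, n'' = {math.ceil(v2 / V)}")   # compare with n'' = 36
print(f"n'/n'' = v'/v'' = {v1 / v2:.2f}")             # Eq. (13)
```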
Hypotheses Testing
Let us consider testing the hypotheses

$$H_0:\ \psi=\psi_0 \quad\text{versus}\quad H_A:\ \psi\geq\psi_A,$$

where ψ is a parameter/estimand of interest. Let α and β be the targeted type I and
type II error rates, respectively. Under the assumption of asymptotic normality of
the estimator $\hat{\psi}$, the required sample size for a one-sided test is

$$n\simeq v\left(\frac{Z_{1-\alpha}+Z_{1-\beta}}{\psi_A-\psi_0}\right)^{2}, \tag{14}$$
where, as before, v is the variance of a single observation (cf. Fedorov and Liu
2014). Let us continue with the previous example, i.e., with the prior guesses πt = 0.2
and πp = 0.2 and, respectively, v' = 0.79 and v'' = 0.16. Let ψ0 = 0, ψA = πt = 0.2,
and let α = 0.025 and β = 0.1, i.e., Z1−α = 1.96 and Z1−β = 1.28. Substituting these
numbers into Eq. (14) gives n' = 208 (the sample size for the optimized RCT
design) and n'' = 42 (the sample size for the symmetric AD design). As in the
previous subsection, n'/n'' = v'/v''.
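A sketch reproducing this sample-size calculation from Eq. (14), with the rounded normal quantiles used in the text:

```python
# Sketch: one-sided sample size from Eq. (14), reproducing the chapter's example.
from math import ceil

z_alpha, z_beta = 1.96, 1.28          # Z_{1-0.025}, Z_{1-0.10}, as in the text
psi_0, psi_A = 0.0, 0.2

for label, v in (("optimized RCT (v')", 0.79), ("symmetric AD (v'')", 0.16)):
    n = ceil(v * ((z_alpha + z_beta) / (psi_A - psi_0)) ** 2)
    print(f"{label}: n = {n}")        # prints 208 and 42
```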
Clinical Applicability
The clinical applicability of RDT designs cannot be meaningfully discussed
until the estimands of prime interest and their estimators are carefully defined
(cf. FDA Guidance for Industry 2019a).
For instance, if the fraction πt of the total source population is the estimand of
major interest, then the enrichment trial based on the symmetric AD design allows
estimation of this quantity, see either Eq. (10) or (11). Both formulae use data from
stages 1 and 2. The data from stage 2 are not sufficient to estimate πt but the
estimation of the subpopulation fraction of treatment responders ϕt ¼ πt/π+ is still
possible. In general, the data from all stages of an RDT provide information that allows
scientific inference about the source population, often more precise than that from
an RCT with the same sample size. Evidently, the same is not true for data collected
from the enriched subpopulation only. The statement that an RDT, like a conventional
RCT, may generate valid statistical inferences for the same estimands related to the
source population is not equivalent to the statement that the respective estimates will
be quantitatively and qualitatively close. RDT and RCT have different statistical and
operational profiles, and the respective estimators may have different statistical
properties and different operational biases. The RCT designs were specifically
developed to avoid disclosure of any information that may influence the behavior
of patients, trial staff, or both; careful randomization, double blinding, etc., minimize
operational bias (see Piantadosi 1997, Chap. 5.3; Pocock 1983, Chap. 4). At the
same time, the RDT designs, which use open-label run-in stages and interim
analyses, are more prone to such biases; see examples in Freidlin, Korn, and Abrams
(2018) and Kopec et al. (1993).
Ethical Aspects
The initial motivation for developing the RDT methodology was the very
ethically sound intention to find “a clinical trial design avoiding undue placebo
treatment,” see Amery and Dony (1975). They managed to reduce the prolonged
exposure of the patient(s) to placebo treatment, which is a typical ethical problem
in conventional randomized clinical trials. Returning to Figs. 1 and 2, one can
observe that with equal randomization rates for the placebo and treatment arms, the
total placebo exposure times for the RCT and the RDT are TRCT = (n/2) × [treatment
duration]RCT and TRDT = (n+/2) × [treatment duration at stage 2]RDT. Usually
n > n+, and therefore TRCT > TRDT.
Another ethically attractive feature of RDT is an opportunity to withdraw patients
who do not benefit from the experimental treatment. However, some critics
have argued that reassigning the confirmed treatment responders to the placebo
arm, as in RDT, is an unethical move and unacceptable, for instance, in oncology
trials. Thus, there are many pro and con aspects that need a thorough professional
discussion before and during crafting a trial protocol. In general, balancing the pros
and cons is and should be very specific for each therapeutic area. Interesting
examples can be found, for instance, in Daugherty et al. (2008), Fava et al. (2003),
FDA Guidance for Industry (2019b), Ratain et al. (2006), Sonpavde et al. (2006),
Stadler (2007), Stadler et al. (2005), and Temple (1994).
Classification Hurdles
The concept of “responder” is essential for RDT, and is based on the dichotomiza-
tion of continuous or discrete responses (cf. Uryniak et al. 2011; Fedorov et al.
2008). The poor selection of cutoff levels may lead to nonzero probabilities p1 of
false-negative (a “true” responder is assigned to the group of nonresponders) and p2
of false-positive classifications (a “true” nonresponder is assigned to the group of
responders). These probabilities can be appreciable for some response types: blood
pressure in the cardiovascular area and scores in psychiatry are typical examples. In
other therapeutic areas, it is natural to assume that the open-label stage of RDT takes
a shorter time than a single stage of a conventional RCT and therefore practitioners
resort to surrogate endpoints (cf. Burzykowski et al. 2005), which do not provide
results identical to measurements at the end of RDT. For instance, in oncology, the
tumor size change at the end of stage 1 might be used to separate responders and
nonresponders, while the actual endpoint could be the tumor size change at the end
of stage 2 that can be of the opposite sign. In general, false classification should be
taken into account whenever within-patient variability is expected to be comparable
with the population variability.
The statistical superiority of RDT over RCT (lower variances of estimated
parameters, smaller sample sizes in hypothesis testing) diminishes as the probabilities
of false classification increase: the "enriched" subpopulation will miss some true
responders falsely classified as nonresponders and will include some true
nonresponders falsely classified as responders. As a result, erroneously classified
patients will be assigned to the wrong treatment arms. For the relatively simple AD
designs, it was shown in Fedorov and Liu (2005, 2014) that the set of scenarios in
which RDTs dominate RCTs shrinks as p1 and p2 increase.
The negative role of false classification could be mitigated if the inferential
part of the data analysis were performed on the original (non-dichotomized) data.
The reporting component may include a post-analysis dichotomization to make the
conclusions more comprehensible and transparent for a larger audience.
Place of RDT Designs in the Family of Enrichment Trial Designs
As was previously pointed out, randomized discontinuation trial designs are
commonly viewed as special cases of enrichment strategies; see, for instance, Fedorov
and Liu (2007), FDA Guidance for Industry (2019b), Hallstrom and Friedman
(1991), Pablos-Méndez et al. (1998), and Temple (1994). The difference
between RDT and other types of population enrichment trials is that their design
and analysis do not rely on any disease-related labels (e.g., biomarkers and social
markers). All others rely on partitioning source populations into subpopulations with
specific labels. This partitioning is based on prior/historic data or some intelligent
guesses. The major goal of the respective enrichment strategies is the identification
of the treatment responsive subpopulations and thorough validation of the respective
efficacy-toxicity profiles.
RDTs pursue a seemingly easier goal: the proof of the existence of such
a subpopulation. However, this goal has to be reached from a more remote starting
point, where a disease-informative population partitioning does not exist. The proof
that some patients benefit from the experimental treatment is very encouraging but
not sufficient for the prediction of outcomes for future patients. Further, to move
closer to precision medicine, the RDT methodology has to be complemented with
statistical methods that allow the posttrial selection of disease-informative markers. The
situation is similar to that of unsupervised and supervised learning in artificial
intelligence or, more specifically, in the machine learning paradigm. In the
unsupervised case, given unlabeled/unmarked data, the algorithms try to make
sense by extracting patterns (responsive subpopulations) on their own. In the
supervised case, the algorithms learn on a labeled dataset (i.e., build “knowledge”)
and provide a prediction of what may happen to subsets with specific labels (sub-
populations with specific biomarkers), followed by field verification of the validity/
accuracy of this prediction on newly accrued data.
Key Facts
References
Amery W, Dony J (1975) Clinical trial design avoiding undue placebo treatment. J Clin Pharmacol
15:674–679
Atkinson AC, Fedorov VV, Herzberg AM, Zhang R (2014) Elemental information matrices and
optimal experimental design for generalized regression models. J Stat Plan Inference 144:
81–91
Burzykowski T, Molenberghs G, Buyse M (2005) The evaluation of surrogate endpoints. Springer,
New York
Capra WB (2004) Comparing the power of the discontinuation design to that of the classic
randomized design on time-to-event endpoints. Control Clin Trials 25:168–177
Chiron C, Dulac O, Gram L (1996) Vigabatrin withdrawal randomized study in children. Epilepsy
Res 25:209–215
Daugherty CK, Ratain MJ, Emanuel EJ, Farrell AT, Schilsky RL (2008) Ethical, scientific, and
regulatory perspectives regarding the use of placebos in cancer clinical trials. J Clin Oncol 26:
1371–1378
Fava M, Evins AE, Dorer DJ, Schoenfeld DA (2003) The problem of the placebo response in clinical
trials for psychiatric disorders: culprits, possible remedies, and a novel study design approach.
Psychother Psychosom 72:115–127
FDA Guidance for Industry (2019a) ICH E9 (R1) addendum on estimands and sensitivity analysis
in clinical trials to the guideline on statistical principles for clinical trials. https://www.fda.gov/
media/108698/download
FDA Guidance for Industry (2019b) Enrichment strategies for clinical trials to support determina-
tion of effectiveness of human drugs and biological products. https://www.fda.gov/ucm/groups/
fdagov-public/@fdagov-drugs-gen/documents/document/ucm332181.pdf
Fedorov VV (1972) Theory of optimal experiments. Academic, New York
Fedorov VV, Liu T (2005) Randomized discontinuation trials: design and efficiency.
GlaxoSmithKline biomedical data science technical report, 2005–3
Fedorov VV, Liu T (2007) Enrichment design. In: Wiley encyclopedia of clinical trials. Wiley,
Hoboken, pp 1–8
Fedorov VV, Liu T (2014) Randomized discontinuation trials with binary outcomes. J Stat Theory
Pract 8:30–45
Fedorov VV, Mannino F, Zhang R (2008) Consequences of dichotomization. Pharm Stat 8:50–61
Freidlin B, Simon R (2005) Evaluation of randomized discontinuation design. J Clin Oncol 23(22):
5094–5098
Freidlin B, Korn EL, Abrams JS (2018) Bias, operational bias, and generalizability in phase II/III
trials. J Clin Oncol 36(19):1902–1904
Grieve AP (2012) Discussion: Bayesian enrichment strategies for randomized discontinuation
trials. Biometrics 68:219–224
Hallstrom AP, Friedman L (1991) Randomizing responders. Control Clin Trials 12:486–503
Ivanova A, Qaqish B, Schoenfeld A (2010) Optimality, sample size, and power calculations for the
sequential parallel comparison design. Stat Med 30:2793–2803
Johnson NL, Kotz S, Balakrishnan N (1997) Discrete multivariate distributions. Wiley, New York
Kopec J, Abrahamowicz M, Esdaile J (1993) Randomized discontinuation trials: utility and
efficiency. J Clin Epidemiol 46:959–971
Korn EL, Arbuck SG, Pulda JM, Simon R, Kaplan RS, Christian MC (2001) Clinical trial designs
for cytostatic agents: are new approaches needed? J Clin Oncol 19:265–272
Oehlert GW (1992) A note on the delta method. Am Stat 46:27–29
Pablos-Méndez A, Barr RG, Shea S (1998) Run-in periods in randomized trials. J Am Med Assoc
279:222–225
Piantadosi S (1997) Clinical trials: a methodologic perspective. Wiley, New York
Pocock SJ (1983) Clinical trials: a practical approach. Wiley, New York
Ratain MJ, Eisen T, Stadler WM, Flaherty KT, Kaye SB, Rosner GL, Gore M, Desai AA, Patnaik A,
Xiong HQ, Rowinsky E, Abbruzzese JL, Xia C, Simantov R, Schwartz B, O'Dwyer PJ (2006)
Phase II placebo-controlled randomized discontinuation trial of sorafenib in patients with
metastatic renal cell carcinoma. J Clin Oncol 24:2505–2512
Rosner GL, Stadler WM, Ratain MJ (2002) Randomized discontinuation design: application to
cytostatic antineoplastic agents. J Clin Oncol 20:4478–4484
Senn SJ (1997) Statistical issues in drug development. Wiley, New York
Sonpavde G, Hutson TE, Galsky MD, Berry WR (2006) Problems with the randomized discontin-
uation design. J Clin Oncol 24:4669–4670
Stadler WM (2007) The randomized discontinuation trial: a phase II design to assess growth-
inhibitory agents. Mol Cancer Ther 6:1180–1185
Stadler WM, Rosner G, Small E, Hollis D, Rini B, Zaentz SD, Mahoney J (2005) Successful
implementation of the randomized discontinuation trial design: an application to the study of the
putative antiangiogenic agent carboxyaminoimidazole in renal cell carcinoma – CALGB 69901.
J Clin Oncol 23:3726–3732
Temple RJ (1994) Special study designs: early escape, enrichment, study in non-responders.
Commun Stat Theory Methods 23:499–531
Trippa L, Rosner GL, Müller P (2012) Bayesian enrichment strategies for randomized discontin-
uation trials. Biometrics 68:203–225
Uryniak T, Chan ISF, Fedorov V, Jiang Q, Oppenheimer L, Snapinn SM, Teng CH, Zhang J (2011)
Responder analyses – a PhRMA position paper. Stat Biopharm Res 3:476–487
Platform Trial Designs
76
Oleksandr Sverdlov, Ekkehard Glimm, and Peter Mesenbrink
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1456
Background on Platform Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1457
General Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1457
Single-Sponsor Platform Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1460
Multisponsor Platform Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1461
Statistical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1462
Choice of a Control Arm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1463
Randomization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1467
Data Monitoring and Interim Decision Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1470
Sample Size and Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1472
Data Analysis Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1474
Examples of Platform Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1476
EPAD-PoC Study in Alzheimer’s Disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1476
I-SPY COVID-19 Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1477
GBM AGILE Study in Glioblastoma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1478
FOCUS4 Study in Metastatic Colorectal Cancer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1479
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1480
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1480
Abstract
Modern drug development is increasingly complex and requires novel
approaches to the design and analysis of clinical trials. With the precision
medicine paradigm, there is a strong need to evaluate multiple experimental
therapies across a spectrum of indications, in different subgroups of patients,
while controlling the chance of false positive and false negative findings. The
O. Sverdlov (*) · P. Mesenbrink
Novartis Pharmaceuticals Corporation, East Hanover, NJ, USA
e-mail: [email protected]; [email protected]
E. Glimm
Novartis Pharma AG, Basel, Switzerland
e-mail: [email protected]
concept of master protocols provides a new approach to clinical trial design that
can help drug developers to enhance efficiency of clinical trials by addressing
multiple research questions within the same overall trial infrastructure. There are
three general types of trials requiring a master protocol: basket trials, umbrella
trials, and platform trials. The present chapter provides an overview of platform
trial designs. We discuss operating models for implementing platform trials in
practice, as well as some important statistical considerations for design and
analysis of such trials. We also discuss four real-life examples of platform trials:
the EPAD-PoC study in Alzheimer’s disease; the I-SPY COVID-19 study for
rapid screening of re-purposed and novel treatments for COVID-19; the GBM
AGILE study in glioblastoma; and the FOCUS4 study in metastatic colorectal
cancer.
Keywords
Master protocols · Multi-arm randomized controlled trials · Multiple comparisons
Introduction
General Definitions
on platform trials. However, many of the considerations given here also apply to
basket and umbrella trials.
Several recent systematic literature searches have revealed the growing popularity
and use of master protocol trials, and platform trials in particular (Siden et al. 2019;
Park et al. 2019; Meyer et al. 2020). More specifically, the number of identified
platform trials/total number of identified master protocols were 25/99 (Siden et al.
2019), 16/83 (Park et al. 2019), and 12/50 (Meyer et al. 2020). All of these references
highlight a rapid increase in the number of master protocols over the past 5 years,
and this trend is expected to continue.
Figure 1 presents an example of an open platform master protocol.
We consider a randomized, placebo-controlled, open platform trial evaluating
therapeutic effects of various investigational treatments (agents) in a selected indi-
cation. Figure 1a shows the structure of the master protocol. The core part (Sections
1 to 16) describes key design elements that remain the same across all agents in the
study. Section 17 details any information or procedures that are specific to a
particular agent. The platform trial will enroll patients in cohorts/substudies. In our
example, the study starts with Cohort 1, in which eligible patients will be random-
ized to TRT1 or Control (Fig. 1b). Interventions for future cohorts will become
available over time, and subsequent Cohorts 2, 3, and 4 are planned to be added to
the master protocol. In each cohort, there is a control group (not displayed in
Fig. 1a), that is assumed to be the current standard-of-care (SOC) treatment. We
assume that each subsequent cohort may include up to three investigational agents;
for example, in Cohort 2, we provision for TRT2, TRT3, and TRT4. Within each
cohort, eligible patients will be randomized among the available active treatment
arms or control.
At some prespecified time points in the study, interim analyses (IAs) will be
performed (Fig. 1b). At each interim analysis (IA1, IA2, IA3, ...), accrued data will
be analyzed and a predetermined statistical decision rule will be applied. The nature
of decisions will generally depend on the trial design and the study objectives.
Fundamentally, both the analyses and the types of decisions should be prespecified
in the protocol (not made on an ad hoc basis) to maintain the integrity and the
validity of the results.
Our considered example in Fig. 1 is typical for a phase II platform trial. In this
case, the following decisions can be considered for any given investigational
treatment arm:
(i) Advance the arm for further development (outside of the current master proto-
col), if it exhibits sufficient evidence of activity. Or
(ii) Drop the arm from the study, if there is sufficient evidence of lack of activity. Or
(iii) Continue the arm in the study to the next decision point, if the results are
indeterminate and the maximum sample size for this treatment arm has not been
reached.
A more complex scenario is a phase II/III platform trial, which may include IAs
during both phase II and phase III parts of the study. The IA decisions during a
phase II part of a phase II/III trial often would be based on a surrogate outcome
measure, such as some biomarker predictive for clinical efficacy, and these deci-
sions can be described using items (i)–(iii) above. However, additional IAs can be
also considered during a confirmatory (phase III) part of the study. In this case, the
interim decisions will be made based on accrued clinical outcome data, which may
be the primary efficacy endpoint. For any investigational treatment arm, the
decisions may be:
(iv) Declare superior efficacy of the treatment arm over the control and stop the trial
early, if clinical efficacy results are outstanding for this arm. Or
(v) Terminate the treatment arm for futility, if there is sufficient evidence of lack of
efficacy for this arm. Or
(vi) Continue the treatment arm to the next decision point (IA or end of study), if
evidence for efficacy and futility is inconclusive and the maximum sample size
for this arm has not been reached.
The efficacy decision would typically require a formal decision rule such as a
statistical test with type I error control. In contrast, the futility rule, even in phase III,
would typically not require the same level of formality. In addition to efficacy
assessments, the usual safety monitoring rules would commonly be applied, such
that interventions or the entire study may be stopped due to safety findings.
The open-endedness of the platform trial allows interventions to be added and
removed while the study is ongoing. If it is decided to introduce a new arm, an
additional ISA (intervention-specific appendix)
would be added to the master protocol (Fig. 1a). The randomization weights must be
updated accordingly, and new eligible subjects will be randomized to a specific ISA
and then to a specific treatment within the ISA (Fig. 1b). (Other approaches for
implementing randomization can be considered; e.g. subjects may be randomized
among all available study arms, as done in a parallel multi-arm trial.) In our example,
the initial randomization in Cohort 1 is 1:1 (TRT1 or Control), and thereafter it can be
modified, with potentially increased allocation ratio to novel experimental agents. The
choice of a randomization algorithm (e.g., if response-adaptive randomization is
utilized) should be discussed with the health authorities and it must be carefully
justified in the master protocol. We shall discuss response-adaptive randomization
(RAR) in more detail in section “Response-Adaptive Randomization”; for now, we
just make an important remark that RAR has both merits and limitations, and it may
potentially be utilized in a phase II platform trial or during the phase II screening part
of a phase II/III platform trial, but not during its confirmatory stage.
Single-Sponsor Platform Trials
Platform trials can provide a valuable framework for the development of novel
therapies within a biopharmaceutical company. While platform trials can be
designed in both early and late clinical development, the concept may be more
appealing in the early stages (e.g., phase II), where the objective is to perform a fast
screening of many investigational agents, many of which are potentially not effica-
cious. Suppose we have an indication with an unmet medical need for treatment and
assume that there are multiple potential candidate compounds for this indication
within the drug development portfolio of Company X. Once the safety of a particular
compound has been established (e.g., with first-in-human data and sufficient toxi-
cology data), it is ready to be further assessed in the clinical proof-of-mechanism and
clinical proof-of-concept (PoC) trials. A traditional 1:1 randomized controlled PoC
trial explores whether the investigational drug is likely to achieve the desired
therapeutic effect and whether it merits testing in a large-scale confirmatory trial in
patients. However, given a large variety of candidate compounds and their combi-
nations, running multiple PoC trials may be infeasible. A phase II platform trial may
then be an attractive option.
There are several options with respect to the structure of a clinical trial team
(CTT) for a platform trial within a single pharmaceutical Sponsor. One approach
would be to build upon some existing clinical teams in the given disease area. This
will ensure that the clinical lead and key team members are the same across
compounds, which helps establish consistency and efficient communication; how-
ever, it also requires considerable commitment and continuous support from the
team members. Another approach is to designate an “independent” platform trial
team that would collaborate with different compound teams within the company,
thus providing integrated efforts to develop the master protocol and ensure that the
design properly accommodates each compound. A third approach is to have an
external group run the trial on behalf of the company and tap into their disease area
knowledge. All of these approaches require substantial upfront planning and invest-
ment, greater than one would expect for a standard clinical development path
(Schiavone et al. 2019; Hague et al. 2019; Morrell et al. 2019).
As an example, consider a Novartis-sponsored phase II open-entry platform trial
evaluating efficacy and safety of novel spartalizumab combinations in previously
treated unresectable or metastatic melanoma (ClinicalTrials.gov Identifier:
NCT03484923). The study design is described in detail by Racine-Poon et al.
(2020). The design consists of two parts: the exploratory Part 1 in which candidate
treatments are evaluated for activity in a randomized manner, and the confirmatory
Part 2 in which the “winner” treatment arms from Part 1 are expanded to achieve the
desired level of predictive power for confirmatory statistical hypothesis testing on the
objective response rate (ORR). The study core team was formed on the basis of the
Novartis clinical oncology group, capitalizing on internal knowledge and relevant
subject matter expertise.
Multisponsor Platform Trials
Taking a broad perspective, the overall success rates of new drug development have
been disappointingly low (Scannell et al. 2012; Wong et al. 2019), despite rising
developmental costs (DiMasi et al. 2016). The need for innovation and moderniza-
tion of drug development through collaboration among multiple public and private
entities has become apparent over the years (Woodcock and Woosley 2008).
Crowdsourcing or multisponsor models, where different biopharmaceutical compa-
nies are working in a coordinated manner to develop new medicines for high unmet
medical needs, may provide a very useful framework for modern drug development
(Bentzien et al. 2015). One sensible model is when an academic research unit (or a
network of several academic centers) acts as the coordinator of the platform trial
activities. In this case, the academic research unit may: (i) secure funding for this
research through grants, (ii) develop the master protocol, (iii) build the trial infra-
structure, (iv) attract different pharma/biotech companies to participate and contrib-
ute their investigational compounds for the trial, etc.
Multisponsor platform trials are increasingly common in clinical research. Some
notable examples include the I-SPY 2 trial of novel neoadjuvant therapies in breast
cancer (Barker et al. 2009; Esserman et al. 2019), the I-SPY COVID-19 study of
promising therapeutic agents in critically ill COVID-19 patients (https://clinicaltrials.
gov/ct2/show/NCT04488081), the Systemic Therapy for Advancing or Metastatic
Prostate Cancer (STAMPEDE) study (James et al. 2009), just to name a few.
Multisponsor platform trials require more upfront planning than single-sponsor
trials, because of the need to build the operational platform infrastructure, obtain
alignment across the stakeholders, and get all necessary authorizations from health
authorities. In fact, the FDA guidance for industry “Adaptive designs for clinical trials
of drugs and biologics” (Food and Drug Administration 2019) states this explicitly:
...Because these (adaptive platform) trials may involve investigational agents from more
than one sponsor, may be conducted for an unstated length of time, and often involve
complex adaptations, they should generally involve extensive discussion with FDA...
Statistical Considerations
The design of a platform trial poses scientific, statistical, operational, and regulatory
challenges. In addition, the choice of a study design will depend on the disease area,
the competitive landscape, the established industry practices, and the development
strategy.

Choice of a Control Arm
The use of a control group is a fundamental principle of the design of any compar-
ative clinical trial. The main purpose of a control group is to minimize confounding
of the treatment effect with other factors (such as the natural history of the disease),
thereby improving the quality of statistical inference on the treatment effect. The
importance of the choice of a control group in clinical trials is well acknowledged
and is documented in the ICH E10 guideline (International Conference on
Harmonisation E10 2001). In platform trials, many of which are designed to evaluate
the effects of various experimental treatments, considerations on the control group
are particularly important. The FDA guidance on master protocols has the following
statement in this regard (Food and Drug Administration 2018):
...FDA recommends that a sponsor use a common control arm to improve efficiency in
master protocols where multiple drugs are evaluated simultaneously in a single disease (e.g.,
umbrella trials). FDA recommends that the control arm be the current SOC so that the trial
results will be interpretable in the context of U.S. medical practice...
In a recent literature review, Meyer et al. (2020) found that among 50 identified
master protocol trials, the majority (28 out of 50, 56%) had no control group. More
specifically, among the 12 identified platform trials, five trials were designed using
concurrent control, six trials included nonconcurrent control, and one trial had no
control group. The similar numbers for nine identified umbrella trials were four
(concurrent control), one (nonconcurrent control), and four (no control). Let us
discuss different possibilities for the control group in more detail.
Historical Controls
Historical data (e.g., from previous clinical trials in the same indication) provides
valuable information that may potentially supplement evidence from a new RCT
(Pocock 1976). However, one cannot simply rely on historical controls as a basis for
comparison, because there might be differences in the populations, for example, due
to change in medical care over time (Byar 1980).
There are different methods to utilize historical control data both in the design and
the analysis of clinical trials (Viele et al. 2014; Chen et al. 2018). Many phase II trials
in oncology simply use a historical reference value of the objective response rate
(ORR). For instance, a common approach to evaluate the activity of a new com-
pound is through Simon’s two-stage optimal design (Simon 1989) to test the
hypotheses H0: ORR ¼p0 vs. H1: ORR ¼p1, where p0 is the historical reference
value of the ORR, and p1 > p0 is some threshold representing promising activity.
In a platform trial, one is interested in evaluating multiple candidate treatments,
and so the study design may involve randomization to one of the treatment arms, but
the analysis for each arm is standalone (i.e., involves no comparison against control).
This approach was implemented in the platform trial in metastatic melanoma
(NCT03484923; Racine-Poon et al. 2020), where no adequate SOC is currently
available. In that study, the primary analysis for each “winner” arm that has been
promoted from Part 1 to Part 2 involved testing H0: ORR = 0.10 vs. H1: ORR = 0.30.
The lower bound of the 95% confidence interval using Clopper-Pearson’s exact
method for ORR was used as a criterion to decide whether a treatment warranted
further investigation in pivotal studies. Alternatively, the analysis could incorporate
relevant historical control data using some Bayesian borrowing technique, such as
hierarchical modeling (Viele et al. 2014). Such analysis would account for uncer-
tainty in the historical ORR, but it would also require careful assessment of assump-
tions necessary for a valid treatment comparison.
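As a sketch of such a decision rule, the code below computes the Clopper-Pearson exact lower confidence bound for an observed ORR and compares it with a historical reference value; the response counts and the helper function are illustrative assumptions, and only the use of the exact lower bound follows the study description.

```python
# Sketch: decision rule based on the lower bound of the Clopper-Pearson
# exact 95% CI for the ORR. Counts and threshold are illustrative.
from scipy.stats import beta

def clopper_pearson_lower(successes: int, n: int, conf: float = 0.95) -> float:
    """Lower limit of the two-sided Clopper-Pearson interval for a proportion."""
    if successes == 0:
        return 0.0
    alpha = 1 - conf
    return beta.ppf(alpha / 2, successes, n - successes + 1)

responders, n = 12, 40                     # hypothetical Part 2 results
lower = clopper_pearson_lower(responders, n)
print(f"ORR = {responders / n:.2f}, 95% CI lower bound = {lower:.3f}")
# Promote for pivotal studies only if the lower bound exceeds the
# historical reference ORR (0.10 under H0 in this example).
print("warrants further investigation:", lower > 0.10)
```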
Concurrent Controls
In clinical settings where it is not feasible to run a series of standard adequately
powered two-arm RCTs (e.g., in rare diseases), a multiarm randomized platform trial
with a shared control group may be an appealing and efficient approach (Saville and
Berry 2016). For instance, platform trials evaluating multiple treatments from
different sponsors can benefit from borrowing of data from the pooled placebo
group for individual treatment comparisons. Since platform trials evaluate novel
treatments perpetually, some special considerations on the shared control are
required.
For illustrative purposes, consider a hypothetical platform trial with five
experimental treatment arms and Control (Fig. 2).
Suppose that for each comparison of experimental vs. control, 100 patients per arm
provide a sufficient sample size to test the treatment difference. The trial starts by
randomizing the initial 100 patients equally between TRT1 and Control. After that,
two new arms are added, and an additional 200 patients are randomized among TRT1,
TRT2, TRT3, and Control (50 per arm). At that point, TRT1 achieves its target
sample size, and the randomization is shifted to TRT2, TRT3, or Control such that
additional 150 patients are randomized (50 per arm). Thereafter, a new arm TRT4 is
added (ISA2) and the next 100 patients are randomized between TRT4 and Control
(50 per arm). Finally, after TRT5 is added (ISA3), the trial continues with random-
izing additional 150 patients among TRT4, TRT5, or Control, and the last
Fig. 2 A hypothetical platform trial with five experimental arms and a shared control arm
100 patients between TRT5 and Control. Overall, in this hypothetical study, each
treatment arm has 100 patients, and the Control arm has 300 patients.
Assume the primary outcome is available soon after randomization, and the data
analysis for each arm takes place after the target number of subjects have been
randomized and treated. In the analysis, different strategies for utilizing control data
are possible.
1. All accrued data in the Control arm at the time of analysis is utilized, treating all
observations as if they had been concurrently obtained. In our example, the size of
the control group for treatment comparison is 100 for TRT1, 150 for each of the
TRT2 and TRT3, 250 for TRT4, and 300 for TRT5. A larger size for the control
arm would enable more robust inference. A major assumption is that there are no
hidden confounders such as a time trend.
2. Only data from the Control arm that was part of the randomization sequence
concurrent with the given experimental treatment arm is utilized. The argument
here is that it is difficult to justify pooling of control observations that are
separated by some time interval. In our example, first 100 allocations to control
are concurrent with TRT1, allocations 51–150 to control are concurrent with
TRT2 and TRT3, allocations 151–250 to control are concurrent with TRT4, and
allocations 201–300 to control are concurrent with TRT5. Therefore, in this case,
the size of the control group for each treatment comparison is 100.
3. Pooling of data from the Control arm in the study is performed using some
statistical methodology. Several recent papers discuss approaches that may be
relevant in this context (Yuan et al. 2016; Galwey 2017; Hobbs et al. 2018; Jiao
et al. 2019; Tang et al. 2019; Normington et al. 2020). The methods include “test-
then-pool” strategy, dynamic pooling, Bayesian hierarchical modeling, to name a
few. These methods can be applied not only to the shared internal control arm, but
also to some historical control data or some relevant concurrent external data that
may become available as the platform trial is ongoing. They provide a compro-
mise between approaches #1 and #2 in that either historical information is down-
weighted, but not entirely discarded, or is included in the analysis only if
sufficiently similar to the concurrent data (which could also be interpreted as a
form of down-weighting, since it is included in the analysis with a probability less
than 1).
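The bookkeeping behind strategies #1 and #2 can be made concrete with a short sketch; the enrollment windows below encode the hypothetical trial of Fig. 2 as described above, and the code itself is only an illustration.

```python
# Sketch: control-group sizes for each experimental arm under strategy #1
# (all controls accrued by the analysis) and strategy #2 (concurrent controls
# only), for the hypothetical trial of Fig. 2. Window values are from the text.
windows = {                 # (first, last) concurrent control allocation, 1-based
    "TRT1": (1, 100),
    "TRT2": (51, 150),
    "TRT3": (51, 150),
    "TRT4": (151, 250),
    "TRT5": (201, 300),
}

for arm, (first, last) in windows.items():
    concurrent = last - first + 1   # strategy #2: always 100 in this example
    all_accrued = last              # strategy #1: controls accrued by the analysis
    print(f"{arm}: concurrent = {concurrent}, all accrued = {all_accrued}")
```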
Some additional important notes should be made here. First, it may be difficult to
justify upfront the analytic strategy for handling control data, and several approaches
may have to be designated to ensure robust analysis. This argument applies to both
phase II and phase II/III platform trials. In fact, it is increasingly recognized (even in
phase III trials) that a single, albeit carefully prespecified, primary analysis may be
insufficient and that it is prudent to have several sensitivity analyses. For instance, the
estimand framework (ICH E9(R1) 2020; Jin and Liu 2020) suggests designating a
main estimator to serve as the primary analysis and several sensitivity estimators
targeting the same estimand under different assumptions about missing data
and/or censoring. Second, the investigational treatments may have different
mechanisms of action and/or different routes of administration (e.g., oral vs. injections),
in which case the concurrent placebo group may differ across experimental arms;
this should be accounted for in the analysis. Third, in some platform trials, if a
current active treatment shows evidence of superiority over SOC, then this treatment
may become the SOC and the control group would have to be changed for subse-
quent cohorts. This was the case, for instance, in the PREVAIL II trial in Ebola (Dodd
et al. 2016; PREVAIL II Writing Group 2016) and in the currently ongoing I-SPY
COVID-19 trial (NCT04488081).
Randomization
Note that this example provides only one possibility for modifying the control
allocation ratio over time. A major assumption was that the randomization ratios
were fixed in advance (e.g., 1:1 up to patient 100, 1:1:2:2 for the next 300 patients, etc.)
so that there is no selection bias issue. If, however, these decisions are made
“pragmatically,” whenever a new treatment arm is added or dropped, then it is
important to ensure that the selected new randomization ratios are not dependent
on the observed response data; otherwise the procedure can no longer be regarded as
“fixed,” but it rather becomes response-adaptive, for which special considerations
are required; see section “Response-Adaptive Randomization.”
To implement the chosen equal or unequal allocation ratio, the simplest and most
common approach is the permuted block randomization which sequentially random-
izes cohorts of study participants in the desired ratio until the target sample size is
reached. Other randomization procedures with enhanced statistical properties can be
considered (Kuznetsova and Tymofyeyev, 2011; Kuznetsova and Tymofyeyev 2014;
Ryeznik and Sverdlov 2018).
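As a minimal sketch of permuted block randomization for an unequal ratio, the following generates a 1:1:2:2 sequence in independently shuffled blocks; the arm labels and seed are hypothetical.

```python
import random

def permuted_block_sequence(ratio, n_blocks, seed=2024):
    """Randomization sequence from independently permuted blocks.

    ratio: dict mapping arm label to slots per block, e.g.,
           {"CTRL": 1, "TRT1": 1, "TRT2": 2, "TRT3": 2} for 1:1:2:2.
    """
    rng = random.Random(seed)
    block_template = [arm for arm, slots in ratio.items()
                      for _ in range(slots)]
    sequence = []
    for _ in range(n_blocks):
        block = block_template[:]  # fresh copy of the template
        rng.shuffle(block)         # permute within the block
        sequence.extend(block)
    return sequence

# Hypothetical 1:1:2:2 allocation, five blocks of size six
print(permuted_block_sequence({"CTRL": 1, "TRT1": 1, "TRT2": 2, "TRT3": 2}, 5))
```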
Response-Adaptive Randomization
Response-adaptive randomization (RAR) can be applied in platform trials to
increase the chance that trial participants receive an empirically better treatment
while maintaining important statistical properties of the trial design. Here, “empir-
ically better treatment” refers to a treatment that has been more successful in a
nonstochastic sense (e.g., simply has a greater observed proportion of responders) in
view of the data accrued in the trial thus far. RAR has a long history in the
biostatistics literature, and it has been used occasionally in clinical trials (Hu and
Rosenberger 2006). In platform trials, RAR can potentially increase trial efficiency
in the sense that efficacious treatment arms can be identified more quickly and quite
reliably (Saville and Berry 2016). There are both advantages and disadvantages to
RAR, and its implementation always requires careful consideration (Robertson
et al. 2020). For instance, one motivation for using RAR is to maximize the expected
number of successes in the trial, which may be particularly important in trials of rare
and life-threatening diseases with limited patient horizon (Palmer and Rosenberger
1999). Another possibility for application of RAR is trials of highly contagious
diseases such as Ebola where the hope is that the disease may be eradicated by the
investigational treatment or vaccine (Berger 2015). In all, various stakeholders’
perspectives should be taken into account when assessing the possibility of incor-
porating RAR in the design. A general consensus is that RAR may be useful in phase
II exploratory settings but less so in phase III confirmatory settings. It is also
instructive to quote the following recent perspective on RAR from the FDA (Food
and Drug Administration 2019):
...Response-adaptive randomization alone does not generally increase the Type I error
probability of a trial when used with appropriate statistical analysis techniques. It is
important to ensure that the analysis methods appropriately take the design of the trial into
account. Finally, as with many other adaptive techniques based on outcome data, response-
adaptive randomization works best in trials with relatively short-term ascertainment of
outcomes...
It should be noted that RAR designs rely on certain assumptions about responses (e.g., a
statistical model linking responses with the effects of treatments and biomarkers, and fast
availability of individual outcome data to facilitate model updates and modifications
of randomization probabilities) and require calibration through comprehensive simu-
lations before they are implemented in practice. It is also important to acknowledge
that RAR designs may perform poorly if outcome data are
affected by time trends (Thall et al. 2015), and special statistical techniques are
required to obtain robust results in the analysis (Villar et al. 2018). However, different
RAR procedures vary in their statistical properties, and issues pertinent to
particular RAR procedures, for example, the high variability and potential loss of statis-
tical power of the randomized play-the-winner rule (Wei and Durham 1978), should
not be overgeneralized to all RAR procedures (Villar et al. 2020).
Several recent papers provide simulation reports on RAR for multiarm trials with
and without control arm (Wathen and Thall 2017; Viele et al. 2020a, b). One sensible
RAR approach is to skew allocation to the empirically best arm (if it exists) while
maintaining some allocation to the control (Trippa et al. 2012; Wason and Trippa
2014; Yuan et al. 2016). This would provide sufficient power to formally compare
the effects of the most successful experimental treatment against the control. One
extra challenge, however, is that new experimental arms are added over time and
RAR requires some burn-in period to ascertain estimates of treatment effects to
facilitate adaptations. Some efficient RAR designs for multiarm controlled platform
trials where experimental arms can be added/dropped during the course of the study
are available; see papers by Ventz et al. (2018), Hobbs et al. (2018), Kaizer et al.
(2018), Normington et al. (2020), to name a few.
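A minimal sketch of this kind of rule for a binary endpoint is shown below: posterior probabilities that each experimental arm is best (under independent beta-binomial models) are used to skew allocation, while the control allocation is held fixed. The prior, the protected control share, and the tempering exponent are hypothetical tuning choices; real designs (e.g., Trippa et al. 2012) use more refined rules.

```python
import numpy as np

def rar_probabilities(successes, totals, control_share=0.2,
                      gamma=0.5, n_draws=10_000, seed=1):
    """Allocation probabilities for one control plus K experimental arms.

    successes/totals: interim responder counts and sample sizes for the
    K experimental arms. The control allocation is held at control_share;
    gamma tempers how aggressively allocation is skewed. Independent
    Beta(1, 1)-Binomial posteriors are an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    successes, totals = np.asarray(successes), np.asarray(totals)
    # Posterior draws of each experimental arm's response rate
    draws = rng.beta(1 + successes, 1 + totals - successes,
                     size=(n_draws, len(totals)))
    # Posterior probability that each arm is the best experimental arm
    p_best = np.bincount(draws.argmax(axis=1), minlength=len(totals)) / n_draws
    # Temper and renormalize over the non-control share
    weights = p_best ** gamma
    return control_share, (1 - control_share) * weights / weights.sum()

# Hypothetical interim data for three experimental arms
ctrl_prob, exp_probs = rar_probabilities([12, 8, 15], [40, 40, 40])
print(ctrl_prob, exp_probs.round(3))
```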
An increasingly useful idea in RAR platform trials is the inclusion of stratification using
genetic signatures or other predictive biomarkers. In this case, RAR probabilities
for an individual participant are adjusted so that the participant has an increased
probability of being assigned to the treatment that is putatively most efficacious given
their baseline biomarker profile. The research question is: which compound/biomarker
pairs are the most promising to take forward into more focused confirmatory-phase
trials? This approach was applied, for instance, in the I-SPY 2 trial in
breast cancer (Barker et al. 2009) and in the BATTLE trial in nonsmall cell lung cancer
(Zhou et al. 2008; Kim et al. 2011). Both the I-SPY 2 and BATTLE trials carried
hypothesis-generating value and aimed at identifying targeted therapies, not at for-
mally testing their clinical efficacy. A more elaborate design is GBM AGILE, an
ongoing seamless phase II/III platform trial in glioblastoma, which combines data from
promising treatments identified during a phase II multiarm Bayesian RAR part with the
data for these treatments during a phase III part to formally test clinical efficacy with
respect to overall survival and enable regulatory submissions (Alexander et al. 2018).
Platform trials involve data monitoring and various interim decisions. A key princi-
ple of any adaptive design is that adaptations must be carefully preplanned to ensure
statistically valid results. The FDA guidance on master protocols states (Food and
Drug Administration 2018):
...Master protocols evaluating multiple investigational drugs can add, expand, or discontinue
treatment arms based on findings from prespecified interim analyses or external new data.
Before initiating the trial, the sponsor should ensure that the master protocol and its
associated SAP describe conditions that would result in adaptations such as the addition of
a new experimental arm or arms to the trial, reestimation of the sample size based on the
results of an interim analysis, or discontinuation of an experimental arm based on futility
rules.
randomizes patients in a 1:1:1 ratio. The first IA is planned after about 10 patients per
arm contribute data for evaluation of ORR. Subsequent IAs are planned approxi-
mately every five months thereafter. The maximum number of patients per arm in
Part 1 is capped at 30. To facilitate decision-making in Part 1, the ORR for each
treatment arm is modeled using a standard Bayesian beta-binomial model with
a uniform prior. At a given IA, an arm can be: (i) expanded into Part 2, if
Pr(ORR > 0.20 | data) > 0.70; (ii) stopped for futility, if Pr(ORR < 0.15 | data) > 0.70;
or (iii) continued in Part 1, if neither (i) nor (ii) is met. If an arm has reached its cap of
30 patients and neither (i) nor (ii) is met, the arm is stopped and not pursued further.
If a decision to expand an arm is made, the sample size for Part 2 is determined
adaptively, using Bayesian shrinkage estimation to mitigate treatment selection bias
and to ensure >70% Bayesian predictive power to obtain significant final results.
The final analysis for each treatment arm in Part 2 is done using standard frequentist
methodology (an exact binomial test), based on cumulative data from Parts 1 and 2 for
that arm.
All decision rules and criteria in the OPTIM-ARTS design are calibrated through Monte
Carlo simulation under various true values of the ORR to achieve desirable statistical
characteristics, such as reasonably high correct-decision probabilities in Part 1 and
high power with control of the type I error rate in Part 2. The combination of Bayesian
monitoring in Part 1 with formal hypothesis testing for the selected treatment arms in
Part 2 allows a flexible and statistically rigorous design.
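The Part 1 decision quantities above are posterior tail probabilities under a beta-binomial model: with a uniform Beta(1, 1) prior, x responders among n evaluable patients give a Beta(1 + x, 1 + n - x) posterior for the ORR. A minimal sketch, with hypothetical interim data:

```python
from scipy.stats import beta

def part1_decision(x, n, p_go=0.20, p_futile=0.15, threshold=0.70):
    """Interim decision for one arm under a Beta(1,1)-Binomial model.

    x responders out of n evaluable patients; the thresholds follow the
    rules quoted in the text (Pr(ORR > 0.20 | data) > 0.70 to expand,
    Pr(ORR < 0.15 | data) > 0.70 to stop for futility).
    """
    a, b = 1 + x, 1 + n - x               # posterior parameters
    pr_go = beta.sf(p_go, a, b)           # Pr(ORR > 0.20 | data)
    pr_futile = beta.cdf(p_futile, a, b)  # Pr(ORR < 0.15 | data)
    if pr_go > threshold:
        return "expand to Part 2", pr_go, pr_futile
    if pr_futile > threshold:
        return "stop for futility", pr_go, pr_futile
    return "continue in Part 1", pr_go, pr_futile

# Hypothetical interim data: 4 responders among 10 patients
print(part1_decision(4, 10))
```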
Sample size determination is an integral part of any clinical trial design, and the
platform trial is no exception. Some important considerations for the sample size
planning include the study objectives, the choice of research hypothesis, the primary
endpoint, the study population, the control and experimental treatment groups, and the
statistical methodology for data analysis. The common statistical criteria for sample size
planning are statistical power and the significance level (probability of a type I error);
however, additional criteria, such as estimation precision and probabilities of correct go/
no-go decisions, may be considered as well.
At the design stage, the sample size planning will likely be an iterative process
that may involve a combination of standard calculations and simulations. Suppose
we have $K \geq 1$ experimental treatment arms and a control arm, and we decide to use
equal randomization with $m$ patients per arm. There are different ways to character-
ize power in a multiarm setting (Marschner 2007). One way is to consider null
hypotheses on individual treatment contrasts (experimental vs. control) as follows:
$$H_0^{(j)}: \Delta_j = 0 \quad \text{vs.} \quad H_1^{(j)}: \Delta_j > 0,$$
where $\Delta_j = \mu_j - \mu_0$ and $j = 1, \ldots, K$. Assuming
individual responses on the $k$th treatment are normally distributed with mean $\mu_k$ and
variance $\sigma^2$, the sample size per arm
$$m = \frac{2\sigma^2\left(z_{1-\alpha} + z_{1-\beta}\right)^2}{\Delta^2}$$
(where $z_u$ is the $100u$th percentile of the standard normal distribution and $\Delta > 0$ is some clinically
relevant value of the mean treatment difference) provides power of $(1-\beta)$ for each
individual comparison at one-sided significance level $\alpha$.
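As a quick numerical illustration of this formula (with hypothetical values of the standard deviation, effect size, and error rates):

```python
import math
from scipy.stats import norm

def per_arm_sample_size(sigma, delta, alpha=0.025, beta=0.20):
    """m = 2*sigma^2*(z_{1-alpha} + z_{1-beta})^2 / delta^2 per arm,
    for a one-sided comparison of one experimental arm against control."""
    z_a, z_b = norm.ppf(1 - alpha), norm.ppf(1 - beta)
    return math.ceil(2 * sigma**2 * (z_a + z_b) ** 2 / delta**2)

# Hypothetical inputs: sigma = 1, clinically relevant difference delta = 0.5
print(per_arm_sample_size(1.0, 0.5))  # 63 patients per arm
```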
The theory of adaptive designs (see e.g., Wassmer and Brannath 2016) allows for
many modifications (such as dropping or adding treatment arms, restricting recruit-
ment to subpopulations, changing sample size or randomization ratios, in theory
even changing endpoints) while maintaining the family-wise error rate. However,
these methods were originally not developed for very frequent adaptations; hence,
power loss can be severe when applying them in platform trials with many design
adaptations.
The uncertainty about the final sample size for the chosen platform trial
design should always be quantified; ideally, not only the expected
sample size, but the entire distribution of the sample size, per arm and overall in
the study, should be obtained and presented via simulations. The choice of the
experimental scenarios for simulations should be comprehensive but it will never
be exhaustive. There are some good industry practices on simulation of adaptive
trials in drug development (Mayer et al. 2019) that can be useful for sample size
planning for platform trial designs.
The analysis of any clinical trial should be reflective of the trial design. The statistical
analysis plan for a platform trial should include details of the planned analyses, both
interim and final. Since many platform trials have adaptive elements, some important
principles for adaptive designs naturally apply for platform trials. The FDA guidance
for industry “Adaptive designs for clinical trials of drugs and biologics” (Food and
Drug Administration 2019) makes the following statement that applies to all clinical
trials intended to provide substantial evidence of effectiveness:
. . .In general, the design, conduct, and analysis of an adaptive clinical trial intended to
provide substantial evidence of effectiveness should satisfy four key principles: the chance
of erroneous conclusions should be adequately controlled, estimation of treatment effects
should be sufficiently reliable, details of the design should be completely prespecified, and
trial integrity should be appropriately maintained...
The strong control of the type I error rate is a major requirement for any clinical
trial with a confirmatory component. Various interim decisions can inflate the type I
error rate. Thus, special statistical techniques are required to ensure the overall type I
error is maintained at a prespecified level. Some design methodologies, such as
group sequential designs (Jennison and Turnbull 2000) and adaptive designs
(Wassmer and Brannath 2016), specifically address the issue of the type I error
control by properly selecting interim stopping boundaries. Adaptive designs with
treatment (or subgroup) selection at interim, known as seamless phase II/III designs
(Bretz et al. 2009; Wassmer and Brannath 2016), provide ways to properly combine
data from the exploratory and confirmatory parts of the trial in the analysis (i.e.,
inferentially seamless designs) while controlling the type I error. For other designs
and analysis techniques, simulations can be used to evaluate the probability of false
positive findings. For a platform trial with a confirmatory component, the control of
the type I error rate is more complex due to uncertainty about the number of experi-
mental treatment arms that will be tested in the study and possibly multiple regis-
trations that may follow. Industry best practices on type I error considerations in
master protocols with shared control are still emerging (Sridhara et al. 2021).
Another important aspect is estimation of treatment effects. Design adaptations
such as selection of an arm that exhibits the best interim results introduce positive
bias in the final estimation of treatment effect. In order to quantify this bias, bias-
corrected estimates accounting for design adaptations (Bowden and Glimm 2008;
Stallard and Kimani 2018) can be reported. These methods correct for the selection
bias generated by specific types of selections such as “pick-the-winner” or “drop-
the-loser” in multiarm situations. The insistence on unbiasedness inflates the mean
squared error (MSE), which, for several of these methods, is larger than that of the
corresponding maximum likelihood estimator (MLE). However, shrinkage estima-
tion techniques (which reduce, but do not entirely eliminate, bias) have been shown to
have lower MSE than the MLE (Carreras and Brannath 2013; Bowden et al. 2014).
The magnitude of the bias is situation dependent. It generally depends on the
“severity” of the selection (e.g., the number of treatment arms from which a winner
is picked), the size of the study, and the similarity of the underlying true but
unknown treatment effects. In a well-planned, large study with limited selection
options, it will often be small. However, in studies with a wide range of potential
selection decisions, it can be substantial.
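Selection bias of this kind is easy to demonstrate by simulation. The sketch below repeatedly runs a hypothetical “pick-the-winner” selection among K arms with identical true effects and shows that the naive estimate for the selected arm is biased upward; all parameters are illustrative.

```python
import numpy as np

def selection_bias(K=5, m=30, true_effect=0.0, sd=1.0,
                   n_sims=20_000, seed=7):
    """Average naive estimate for the selected (best-looking) arm when all
    K arms share the same true effect; the excess over true_effect is the
    selection bias introduced by picking the winner."""
    rng = np.random.default_rng(seed)
    # Observed arm means: n_sims trials x K arms, each the mean of m patients
    means = rng.normal(true_effect, sd / np.sqrt(m), size=(n_sims, K))
    return means.max(axis=1).mean()

# With 5 identical arms of 30 patients each, the winner's estimate is
# biased upward by roughly 1.16 * sd / sqrt(m) (the expected maximum of
# five standard normals), even though the true effect is zero.
print(selection_bias())
```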
In addition to point estimates, confidence intervals are of interest. Construction of
confidence intervals accounting for multiple interim looks at the data and design
adaptations has been discussed in the literature (Neal et al. 2011; Kimani et al. 2014;
Kimani et al. 2020); however, no fully satisfactory construction method exists and
applications are very diverse. Overall, it may be prudent to report both stage-wise
unadjusted estimates and confidence intervals and adjusted quantities based on
combined data. The assessment of data homogeneity from different stages (both
baseline characteristics and the outcome data) is very important for interpretation of
the study results (Gallo and Chuang-Stein 2009; Friede and Henderson 2009). This
is explicitly documented in the EMA “Reflection paper on methodological issues in
confirmatory clinical trials planned with an adaptive design” (European Medicines
Agency 2007):
...Using an adaptive design implies that the statistical methods control the pre-specified type
I error, that correct estimates and confidence intervals for the treatment effect are available,
and that methods for the assessment of homogeneity of results from different stages are
pre-planned. A thorough discussion will be required to ensure that results from different
stages can be justifiably combined...
Some special considerations are required for reporting of the results of a platform
trial. For instance, how should the results of completed treatment arms be reported
while the main master protocol is still ongoing? In this regard, a good example is the
STAMPEDE (Systemic Therapy for Advancing or Metastatic Prostate Cancer) study
(James et al. 2009), which has been ongoing since 2005 while providing periodic
updates on the investigational treatments (comparisons) that have been completed in
due course.
By the end of 2019, EPAD-LCS had recruited and deeply phenotyped more than 2000
participants; however, the PoC study (EPAD-PoC) to test new interventions did not
take place due to a lack of drug sponsors to run trials. The EPAD initiative finished in
late 2020. Overall, this case study illustrates the complexity of clinical research in
challenging indications such as Alzheimer’s disease and reinforces the importance of
the lessons learned in this context.
The COVID-19 pandemic has been a major public health emergency since February 2020.
Efforts are underway worldwide to develop effective vaccines and treatments
against COVID-19. Clinical development of COVID-19 treatments poses
several challenges:
Several platform trials for rapid testing of various repurposed and novel treat-
ments for COVID-19 were initiated in 2020 and are now ongoing. Here we discuss
just one of them, the I-SPY COVID-19 trial (ClinicalTrials.gov Identifier: NCT04488081). This is an
open-label, randomized, multiarm, active-controlled, Bayesian adaptive phase II
platform trial designed to rapidly screen promising agents for the treatment of critically ill
COVID-19 patients. Eligible patients are stratified by their status at entry
(ventilation vs. high-flow oxygen) before randomization. The primary endpoint is
time to recovery to a durable (at least 48 h) level of 4 or less on the
WHO-recommended COVID-19 ordinal scale (WHO 2020) (time frame: up to
28 days).
The trial design (NCT04488081) describes four experimental arms (combinations
of novel agents with remdesivir) and an active comparator (remdesivir plus SOC),
and there is a provision to add more experimental agents to the study, depending on
the recruitment and the time course of COVID-19 in the USA. The sample size per
experimental arm is capped at 125 patients. The arms can be dropped early for
futility after enrollment of 50 patients. The arms exhibiting strong efficacy signals
can qualify for further development, in which case the enrollment to these arms will
cease, and new investigational arms can be added.
The I-SPY COVID-19 trial is a massive collaborative effort that involves several
university medical centers in the USA, pharma/biotech industry, and the FDA, with
the estimated enrollment of up to 1500 participants and estimated primary comple-
tion date of July 2022.
Alexander et al. (2018) described the design of the Glioblastoma (GBM) Adaptive
Global Innovative Learning Environment (AGILE) – an international, multiarm,
randomized, open platform, inferentially seamless study to identify effective thera-
pies for newly diagnosed and recurrent GBM within different biomarker-defined
patient subtypes (ClinicalTrials.gov Identifier: NCT03970447).
The trial employs a master protocol that allows multiple novel experimental
therapies and their combinations to be evaluated within the same trial infrastructure.
The design consists of two parts:
The GBM AGILE design has several innovative features that are worth
elaborating upon:
• Study participants are stratified into three subtypes of GBM: newly diagnosed
methylated (NDM), newly diagnosed unmethylated (NDU), or recurrent disease
(RD). Each experimental arm can have one enrichment biomarker, thought to be
predictive of the outcome for the given arm. A combination of stratification and
enrichment biomarkers creates up to six different subtypes for patient randomi-
zation. Within each subtype, different SOC control arms and different experi-
mental drug combinations are considered.
• Bayesian adaptive randomization is applied such that within each stratum, 20% of
participants are randomized to the control, and allocation to experimental arms is
skewed such that greater proportions are assigned to arms with evidence of pro-
longed overall survival compared to the control group given the patient’s subtype.
• A longitudinal model linking the effects of treatments, covariates, biomarkers,
and overall survival is developed to predict the individual survival time. The
model can potentially be used to “speed up” the Bayesian response-adaptive ran-
domization algorithm, which otherwise relies on survival times that are observed
with natural delay.
• During the phase II screening part, treatment efficacy is assessed within predefined
biomarker “signatures” (which may differ from the stratification subtypes),
such that an experimental arm exhibiting very promising results for a particular
signature is expanded into the phase III confirmatory part within this signature.
In summary, the GBM AGILE study provides an open platform for clinical investi-
gation with both exploratory and confirmatory components. It can potentially enable
faster, more efficient, and more ethically appealing development of therapies for
glioblastoma.
It was announced on the FOCUS4 website (www.focus4trial.org) that recruitment
into the study was suspended in March 2020 due to the COVID-19 pandemic and
that the study closed follow-up of all patients on October 31, 2020.
References
Adams R, Brown E, Brown L, Butler R, Falk S, Fisher D, Kaplan R, Quirke P, Richman S,
Samuel L, Seligmann J, Seymour M, Shiu KK, Wasan H, Wilson R, Maughan T, FOCUS4 Trial
Investigators (2018) Inhibition of EGFR, HER2, and HER3 signalling in patients with colorectal
cancer wild-type for BRAF, PIK3CA, KRAS, and NRAS (FOCUS4-D): a phase 2-3 randomised
trial. Lancet Gastroenterol Hepatol 3(3):162–171
Alexander BM, Ba S, Berger MS, Berry DA, Cavenee WK, Chang SM, Cloughesy TF, Jiang T,
Khasraw M, Li W, Mittman R, Poste GH, Wen PY, Yung WKA, Barker AD, GBM AGILE
Network (2018) Adaptive global innovative learning environment for glioblastoma: GBM
AGILE. Clin Cancer Res 24(4):737–743
Antonijevic Z, Beckman RA (2019) Platform trials in drug development: umbrella trials and basket
trials. CRC Press, Boca Raton
Barker AD, Sigman CC, Kelloff GJ, Hylton NM, Berry DA, Esserman LJ (2009) I-SPY 2: an
adaptive breast cancer trial design in the setting of neoadjuvant chemotherapy. Clin Pharmacol
Ther 86(1):97–100
Bentzien J, Bharadwaj R, Thompson DC (2015) Crowdsourcing in pharma: a strategic framework.
Drug Discov Today 20(7):874–883
Berger VW (2015) Letter to the editor: a note on response-adaptive randomization. Contemp Clin
Trials 40:240
Berry SM (2020) Potential statistical issues between designers and regulators in confirmatory
basket, umbrella, and platform trials. Clin Pharmacol Ther 108(3):444–446
Berry SM, Connor JT, Lewis RJ (2015) The platform trial: an efficient strategy for evaluating
multiple treatments. JAMA 313(16):1619–1620
Bowden J, Brannath W, Glimm E (2014) Empirical Bayes estimation of the selected treatment mean
for two-stage drop-the-loser trials: a meta-analytic approach. Stat Med 33:388–400
Bowden J, Glimm E (2008) Unbiased estimation of selected treatment means in two-stage trials.
Biom J 50(4):515–527
Bretz F, Koenig F (2020) Commentary on Parker and Weir. Clin Trials 17(5):567–569
Bretz F, Koenig F, Brannath W, Glimm E, Posch M (2009) Adaptive designs for confirmatory
clinical trials. Stat Med 28:1181–1217
Byar DP (1980) Why data bases should not replace randomized clinical trials. Biometrics 36:
337–342
Carreras M, Brannath W (2013) Shrinkage estimation in two-stage adaptive designs with midtrial
treatment selection. Stat Med 32:1677–1690
Chen N, Carlin BP, Hobbs BP (2018) Web-based statistical tools for the analysis and design of
clinical trials that incorporate historical controls. Comput Stat Data Anal 127:50–68
Choodari-Oskooei B, Bratton DJ, Gannon MR, Meade AM, Sydes MR, Parmar MK (2020) Adding
new experimental arms to randomised clinical trials: impact on error rates. Clin Trials 17(3):
273–284
Cohen DR, Todd S, Gregory WM, Brown JM (2015) Adding a treatment arm to an ongoing clinical
trial: a review of methodology and practice. Trials 16:179
Collignon O, Gartner C, Haidich AB, Hemmings RJ, Hofner B, Pétavy F, Posch M, Rantell K,
Roes K, Schiel A (2020) Current statistical considerations and regulatory perspectives on the
planning of confirmatory basket, umbrella, and platform trials. Clin Pharmacol Ther 107(5):
1059–1067
DiMasi JA, Grabowski HG, Hansen RW (2016) Innovation in the pharmaceutical industry: new
estimates of R&D costs. J Health Econ 47:20–33
Dodd LE, Freidlin B, Korn EL (2021) Platform trials – beware the noncomparable control group. N
Engl J Med 384(16):1572–1573
Dodd LE, Proschan MA, Neuhaus J, Koopmeiners JS, Neaton J, Beigel JD, Barrett K, Lane
HC, Davey RT (2016) Design of a randomized controlled trial for ebola virus disease
medical countermeasures: PREVAIL II, the Ebola MCM study. J Infect Dis 213(12):
1906–1913
Dunnett CW (1955) A multiple comparison procedure for comparing several treatments with a
control. J Am Stat Assoc 50:1096–1121
Elm JJ, Palesch YY, Koch GG, Hinson V, Ravina B, Zhao W (2012) Flexible analytical methods for
adding a treatment arm mid-study to an ongoing clinical trial. J Biopharm Stat 22:758–772
European Medicines Agency (2007) Reflection paper on methodological issues in confirmatory clinical
trials planned with an adaptive design. London, 18 October 2007. Available from https://www.ema.europa.eu/en/documents/scientific-guideline/reflection-paper-methodological-issues-confirmatory-clinical-trials-planned-adaptive-design_en.pdf
Kaplan R, Maughan T, Crook A, Fisher D, Wilson R, Brown L, Parmar M (2013) Evaluating many
treatments and biomarkers in oncology: a new design. J Clin Oncol 31(36):4562–4568
Kim ES, Herbst RS, Wistuba II et al (2011) The BATTLE trial: personalizing therapy for lung
cancer. Cancer Discov 1:44–53
Kimani PK, Todd S, Renfro LA, Glimm E, Khan JN, Kairalla JA, Stallard N (2020) Point and
interval estimation in two-stage adaptive designs with time to event data and biomarker-driven
subpopulation selection. Stat Med 39(19):2568–2586
Kimani PK, Todd S, Stallard N (2014) A comparison of methods for constructing confidence
intervals after phase II/III clinical trials. Biom J 56(1):107–128
Kopp-Schneider A, Calderazzo S, Wiesenfarth M (2020) Power gains by using external information
in clinical trials are typically not possible when requiring strict type I error control. Biom J 62(2):
361–374
Kuznetsova OM, Tymofyeyev Y (2011) Brick tunnel randomization for unequal allocation to two or
more treatment groups. Stat Med 30(8):812–824
Kuznetsova OM, Tymofyeyev Y (2014) Wide brick tunnel randomization – an unequal allocation
procedure that limits the imbalance in treatment totals. Stat Med 33(9):1514–1530
Lee KM, Wason J, Stallard N (2019) To add or not to add a new treatment arm to a multi-arm study:
a decision-theoretic framework. Stat Med 38:3305–3321
Marschner IC (2007) Optimal design of clinical trials comparing several treatments with a control.
Pharm Stat 6:23–33
Mayer C, Perevozskaya I, Leonov S, Dragalin V, Pritchett Y, Bedding A, Hartford A, Fardipour P,
Cicconetti G (2019) Simulation practices for adaptive trial designs in drug and device develop-
ment. Stat Biopharm Res 11(4):325–335
Meyer EL, Mesenbrink P, Dunger-Baldauf C, Fülle HJ, Glimm E, Li Y, Posch M, König F (2020)
The evolution of master protocol clinical trial designs: a systematic literature review. Clin Ther
42(7):1330–1360
Meyer EL, Mesenbrink P, Mielke T, Parke T, Evans D, König F on behalf of EU-PEARL
(EU Patient-cEntric clinicAl tRial pLatforms) Consortium (2021) Systematic review of avail-
able software for multi-arm multi-stage and platform clinical trial design. Trials 22:183
Morrell L, Hordern J, Brown L, Sydes MR, Amos CL, Kaplan RS, Parmar MK, Maughan TS
(2019) Mind the gap? The platform trial as a working environment. Trials 20(1):297
Neal D, Casella G, Yang MCK, Wu SS (2011) Interval estimation in two-stage, drop-the-losers
clinical trials with flexible treatment selection. Stat Med 30:2804–2814
Normington J, Zhu J, Mattiello F, Sarkar S, Carlin B (2020) An efficient Bayesian platform trial
design for borrowing adaptively from historical control data in lymphoma. Contemp Clin Trials
89:105890
Palmer CR, Rosenberger WF (1999) Ethics and practice: alternative designs for phase III random-
ized clinical trials. Control Clin Trials 20:172–186
Park JJH, Harari O, Dron L, Lester RT, Thorlund K, Mills EJ (2020) An overview of platform trials
with a checklist for clinical readers. J Clin Epidemiol 125:1–8
Park JJH, Siden E, Zoratti MJ, Dron L, Harari O, Singer J, Lester RT, Thorlund K, Mills EJ (2019)
Systematic review of basket trials, umbrella trials, and platform trials: a landscape analysis of
master protocols. Trials 20:572
Parker RA, Weir CJ (2020) Non-adjustment for multiple testing in multi-arm trials of distinct
treatments: rationale and justification. Clin Trials 17(5):562–566
Pocock SJ (1976) The combination of randomized and historical controls in clinical trials. J Chronic
Dis 29:175–188
PREVAIL II Writing Group (2016) A randomized, controlled trial of Zmapp for ebola virus
infection. N Engl J Med 375:1448–1456
Proschan MA, Follmann DA (1995) Multiple comparisons with control in a single experiment
versus separate experiments: why do we feel differently? Am Stat 49(2):144–149
Quan H, Zhang B, Lan Y, Luo X, Chen X (2019) Bayesian hypothesis testing with frequentist
characteristics in clinical trials. Contemp Clin Trials 87:105858
Contents
Definition 1488
Introduction 1488
Basic Characteristics of CRTs 1489
Variability Across Clusters 1490
Parameters to Be Estimated: Analysis Populations and Effects 1491
Analysis Populations 1491
Effects 1492
Cluster Specification 1493
Matching and Stratification 1494
Randomization 1494
Randomization Criteria 1495
Highly Constrained Randomization 1496
Alternative Designs 1497
Sample Size and Power 1497
Minimum Number of Clusters 1497
Sample Size Methods 1498
Statistical Analysis 1500
Individual-Level Regression Methods 1500
Cluster-Level Methods 1500
Effects of Correlation Structure on Analyses 1501
Reporting Results 1501
Ethics and Data Monitoring 1502
Discussion 1502
Cross-References 1503
References 1503

L. H. Moulton (*)
Departments of International Health and Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
e-mail: [email protected]; [email protected]

R. J. Hayes
Faculty of Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, London, UK
e-mail: [email protected]
Abstract
In a randomized clinical or field trial, when randomization units are composed of
groups of individuals, many aspects of design and analysis differ greatly from
those of an individually randomized trial. In this chapter, we highlight those
features which differ the most, explaining the nature of the differences and
delineating approaches to accommodate them. The focus is on design, as many
readers will be familiar with the correlated data analysis techniques that are
appropriate for many (although not all) cluster randomized trials (CRTs). Thus,
the chapter begins by covering motivations for using a CRT design, basic
correlation parameters, the variety of potential estimands, delineation and ran-
domization of clusters, and sample size calculation. This is followed by sections
on the analysis and reporting of results, which highlight ways to handle the
multilevel nature of the data. Finally, ethical and monitoring considerations
unique to CRTs are discussed.
Keywords
Cluster randomized trial · Group allocation · Correlated data
Definition
Introduction
The vast majority of RCTs employ a randomization scheme wherein trial partici-
pants are individually randomized. A typical arrangement in a therapeutic trial is to
identify eligible patients as they arrive, one by one, at a clinic or hospital, and assign
them to a study arm according to a fixed randomization list. The therapy to be tested,
or a control version, is then administered to each individual accordingly, and later the
individual’s response is recorded. In a cluster randomized trial, however, entire
groups of potential participants are defined or identified, with those in a given
group assigned the same experimental condition, which they may even experience
simultaneously. Group membership may be determined by geography, e.g., place of
residence or catchment area of a hospital, or by location where services are received:
all the children in a given classroom, say, or all the patients in a hospital ward. It can
also be determined by timing, with all the patients presenting on randomly selected
days receiving the experimental treatment and the patients on other days receiving
the standard-of-care treatment, with each day’s patients constituting a group.
Perhaps the first trial that was designed and appropriately analyzed as a CRT was
one of isoniazid administration to prevent tuberculosis, where randomization was
performed, for administrative ease, by groupings of wards in mental institutions
(Ferebee et al. 1963, as reported by Donner and Klar 2000). Still, it was not until
statistical and computing advances made in the 1980s, and uptake of these methods
in the 1990s, that CRTs became a common tool in the trialist’s design repertoire.
There are now a number of English-language books devoted to the subject (Murray
1998; Donner and Klar 2000; Hayes and Moulton 2017; Eldridge and Kerry 2012;
Campbell and Walters 2014), and hundreds of related methodological articles have
been published in the biostatistics and epidemiology literature.
The reasons for carrying out randomization at the group or cluster level are usually a
combination of (with examples):
The principal drawback to CRTs is that, in general, they require greater sample
size in terms of numbers of participants than do individually randomized trials.
There is almost always positive within-cluster correlation, which can reduce the
effective sample size to a large degree, as will be seen in the sample size section.
More participants translate into larger costs for trials that perform maneuvers at the
individual level, and there can be cluster-level costs as well, due to increased
transportation and communications with community leaders or clinic directors.
A related feature of CRTs is that they often comprise relatively small
numbers of clusters, say 8–50, although they may have thousands of participants.
This is primarily due to the logistics and costs associated with adding each additional
cluster. Small numbers of clusters can engender inferential difficulties, both in terms
of small sample properties of statistical estimators and greater risks associated with
clusters that in one way or another become outliers.
Variability Across Clusters

The intracluster correlation coefficient (ICC), denoted $\rho$, expresses the proportion of the total variance $\sigma^2$ that is attributable to the between-cluster variance $\sigma_B^2$, with $\sigma_w^2$ denoting the within-cluster variance:
$$\rho = \frac{\sigma_B^2}{\sigma^2} = \frac{\sigma_B^2}{\sigma_B^2 + \sigma_w^2}.$$
For a binary outcome with overall prevalence $\pi$, the total variance is $\pi(1-\pi)$, so that
$$\rho = \frac{\sigma_B^2}{\pi(1-\pi)}.$$
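For intuition, the following sketch computes the standard one-way ANOVA estimate of $\rho$ from simulated, hypothetical cluster data; the estimator shown is the classical moment estimator for equal cluster sizes.

```python
import numpy as np

def anova_icc(groups):
    """ANOVA moment estimator of the ICC for equal-sized clusters.

    groups: list of 1-D arrays, one per cluster.
    rho_hat = (MSB - MSW) / (MSB + (n - 1) * MSW), n = cluster size.
    """
    n = len(groups[0])  # assumes equal cluster sizes
    k = len(groups)
    grand = np.mean(np.concatenate(groups))
    msb = n * sum((np.mean(g) - grand) ** 2 for g in groups) / (k - 1)
    msw = sum(np.sum((g - np.mean(g)) ** 2) for g in groups) / (k * (n - 1))
    return (msb - msw) / (msb + (n - 1) * msw)

# Hypothetical data: 6 clusters of 5 observations each
rng = np.random.default_rng(0)
clusters = [rng.normal(loc=mu, scale=1.0, size=5)
            for mu in rng.normal(0.0, 0.5, size=6)]
print(anova_icc(clusters))
```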
A third measure, the design effect (DEff for short), is not a measure of cluster
variability per se but rather a descriptive measure of the effect of within-cluster
correlation in the context of a given study design. It can be designated as the ratio
of the variance of the effect estimator under the clustered design to the variance that
would be obtained with simple random sampling of the same total number of individuals:
$$\mathrm{DEff} = \frac{\operatorname{Var}\left(\hat{\theta} \mid \text{cluster design}\right)}{\operatorname{Var}\left(\hat{\theta} \mid \text{simple random sampling}\right)}.$$
The DEff depends not only on the correlation but also on cluster size, so it is
more specific to the actual design being considered but, hence, less generalizable or
applicable to other studies.
Parameters to Be Estimated: Analysis Populations and Effects

Analysis Populations
When designing a study, it is important to think carefully about what one wishes to
estimate and among whom. In individually randomized trials, several analytic
populations may be specified, including intent-to-treat, as-treated, and per-protocol
populations (see ▶ Chap. 82, “Intention to Treat and Alternative Approaches”). The
same is true for CRTs, but there is an extra layer of complication due to randomizing
clusters of people.
In a typical individually randomized trial, participants are enrolled (consented and
given an identification code) and then assigned a randomized study treatment – a
strict intent-to-treat approach would then analyze all data collected from that point
on as if the participant then actually received the assigned treatment. In a CRT,
however, there can be two levels of treatment: the treatment condition the cluster
receives and the treatment received by participants. If clinics are randomized to have
or not have certain educational materials about stroke in their waiting rooms, the
intent-to-treat time at the clinic level may begin months before an individual patient
shows up at the clinic – the individual’s intent-to-treat time-at-risk for having a
stroke would begin when they enter the waiting room. That would mimic the long-
term effectiveness of the intervention were it ever to be adopted, as the individual
would be exposed to the materials from the first time they go to the clinic.
More complications may arise depending on who is considered as having been
randomized. If a cluster is defined by geographic residence, at the time of random-
ization, a strict approach might only follow up individuals who were resident at that
moment. However, that results in a closed cohort that ages over time and might not
be of as much interest as a dynamic cohort with people entering and leaving clusters.
On the other hand, people may move into a cluster because they know it is receiving
a treatment they want to receive. The particular nature of the intervention (e.g.,
applied at the cluster or at the individual level), whether it is masked, its general
availability, and potential biases all need to be considered in order to determine
exactly what parameters the study should try to estimate.
Effects
There are further questions regarding what events should be counted for which
analyses. Halloran and Struchiner (1991) introduced a nomenclature that helps
clarify how different parameters of interest may be estimated, as a function of
which individuals’ events are counted and compared, taking into consideration the
possibility of indirect or herd effects occurring within clusters. Figure 1 indicates
four possible effects and their estimation: direct, indirect, total, and overall effects.
Halloran et al. (1997) may be consulted for further details.
In Fig. 1, comparing the attack rates among those enrolled in each trial arm
provides an estimate of the total effect of the intervention. The total effect is a
combination of the direct effect (the protection afforded to an individual enrolled in
an intervention cluster) and the indirect effect (the protection due to decreased
exposure, as a result of lower secondary transmission) of an intervention. If out-
comes can be measured among individuals who are not specifically enrolled in a
trial, say if there are population disease registries or health-care system data avail-
able, then indirect and overall effects can be measured. In such a situation, the
indirect effect can be directly estimated by comparing attack rates among those who
have not been enrolled in the trial (the rates of the purple cases in Fig. 1). Note that
people who enroll in a study tend to differ from those who do not; thus, it is best not
to compare the non-enrolled in the intervention clusters to everyone in the control
clusters. The overall effect is a combination not only of direct and indirect effects but
also the degree of coverage that has been attained, i.e., the uptake of the study
intervention. It compares attack rates among all those in the clusters who would have
been eligible for the trial, regardless of whether they were enrolled. Finally, the direct
effect can be estimated by comparing those enrolled to those not enrolled within
intervention clusters but may be too biased to be of interest, as there often will be
differences in the kinds of people who enter into a trial and those who do not.
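To make these comparisons concrete, the sketch below computes the four effects (as one minus rate ratios) from hypothetical attack rates; the simple pooling used for the overall effect assumes equal numbers of enrolled and non-enrolled individuals, which is an illustrative simplification.

```python
def herd_effects(enr_int, enr_ctl, non_int, non_ctl):
    """Effect estimates (as 1 - rate ratios) from attack rates.

    enr_*/non_*: attack rates among enrolled / non-enrolled individuals
    in intervention (int) and control (ctl) clusters. Hypothetical inputs.
    """
    total = 1 - enr_int / enr_ctl     # enrolled vs. enrolled
    indirect = 1 - non_int / non_ctl  # non-enrolled vs. non-enrolled
    direct = 1 - enr_int / non_int    # within intervention clusters
    # Overall: everyone in intervention vs. everyone in control clusters
    # (simple pooling, assuming equal numbers enrolled and non-enrolled)
    overall = 1 - (enr_int + non_int) / (enr_ctl + non_ctl)
    return dict(total=total, indirect=indirect,
                direct=direct, overall=overall)

# Hypothetical attack rates per 100 person-years
print(herd_effects(enr_int=2.0, enr_ctl=5.0, non_int=3.0, non_ctl=5.0))
```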
Cluster Specification
It may be clear how to define the clusters in a CRT: if the intervention is the
introduction of a new triage system in an emergency room, the unit of randomization
would be the hospital in which the ER is located, with patients (or patients with a
particular condition) arriving at its ER forming the cluster. With geographically
defined clusters, however, there may be many different options, including postal
codes, census tracts, towns, counties, states, or districts. In a given entire study area,
in general, the more clusters there are, the greater will be the study’s power, given
diminishing returns with respect to cluster size (more on this below in the sample
size section). However, the fewer the clusters, the less the potential for contamina-
tion occurring from control participants adopting or accessing intervention practices
or from control participants introducing pathogens into intervention communities,
both of which would tend to reduce observed effectiveness of an intervention. Also,
with fewer clusters, there can be reduced costs due to logistics, transportation, and
dealing with cluster-level communications and assent from gatekeepers. As an
example, in a pneumococcal conjugate vaccine study in infants on an American
Indian reservation, the randomization could have been performed at the level of the
Indian Health Service administrative units, of which there were eight. Although
mixing of infants in intervention areas with those in the control areas would have
been minimized, there would not have been much power. There were 110 smaller,
tribal organization units, but consultation with local staff indicated there might be
substantial contamination across them. In the end, the 110 areas were grouped into
38 randomization units according to a number of factors including location of
shopping areas and Head Start preschool program centers (Moulton et al. 2001).
If cross-cluster contamination is a risk, another strategy is to designate buffer
zones around the clusters from which outcome data are not collected, although
intervention or control activities might still be carried out in these zones. Note that
this may increase cost of the trial through requiring setting up the trial in larger
geographic areas or more clusters.
Randomization
Randomization Criteria
Alternative Designs
Although the focus in this chapter is on two-arm, parallel design trials, there are
many possible variations in CRT design. Factorial trials are fairly common; the large
cost of a CRT can be more easily justified if two interventions can be evaluated for
nearly the price of one, as can be done with a 2 × 2 factorial design (Montgomery
et al. 2003). It is rare, however, to see more than two levels of two factors used, due
to the required larger number of clusters and high per-cluster costs.
An increasingly popular design is the stepped wedge design, which is a one-way
crossover trial with staggered implementation, so that clusters are scheduled to go
from control phase to intervention phase at randomly assigned times (steps), until all
clusters have received the intervention. Choice of this design can be motivated by
political considerations, when it is desirable to show a steady march toward all clusters
receiving an intervention. Interpretation of results, however, can be fraught with
difficulties, primarily due to secular trends and variable lengths of implementation
(Kotz et al. 2012). Hussey and Hughes (2007) provide a useful framework for
modeling these trials, but such models have strong assumptions regarding correlation
structures and commonality of effects across clusters that need to be carefully consid-
ered (Thompson et al. 2017). It must be noted that subjecting a population to a trial that
leads to inconclusive results can be ethically problematic. This design is perhaps best
used in situations where an intervention is going to be rolled out in a population
anyway, so that randomizing the rollout affords an opportunity to obtain a better
evaluation of the program than might otherwise be possible. A useful set of articles on
this design, and its many variants, may be found in Trials (Torgerson 2015).
In the latter, the results of every pair must be the same (intervention clusters all do
better than their matched control clusters, or vice versa; in which case
$p = 2 \times 2^{-6} = 0.031$). Clearly, if in one of these designs just one cluster has
difficulties, perhaps withdrawing from the study, or undergoes substantial contam-
ination due to local events, the whole experiment is jeopardized.
Sample Size and Power

The basic concepts involved in sample size determination for CRTs are the same as
for individually randomized trials, except that for CRTs there are two sizes that need
to be designated for a study: the number of randomization units (clusters) in each
trial arm (assuming for simplicity equal numbers of clusters per arm), $N$, and the
number of individuals in each cluster, $n$, or $n_j$, $j = 1, \ldots, N$, if the sizes vary by cluster.
Sometimes, the total number of clusters, $2N$, is fixed, say the number of
convenient geopolitical units in an area, but it is possible to modify n – we may
have 40 districts to randomize and plan to measure the outcome among a random
sample of 200 individuals per district. In other circumstances, all individuals in a
cluster can be measured easily, say from electronic records in a clinic, but one needs
to decide on how many clinics to enter into the trial. It may be that both N and n are
fixed, but the follow-up time can be increased to obtain more person-years at risk and
hence more events. Typically, it costs less to increase cluster size than to increase the
number of clusters, as there can be extra per-cluster costs associated with commu-
nication, transportation, and gatekeeper signoff. Yet, as will be seen, there are
diminishing returns in terms of power with respect to increasing the size of clusters.
In individually randomized trials, a measure of response variability needs to be
specified: the individual level variance. In CRTs, this is also required, as well as
specification of a measure of between-cluster variability: either the coefficient of
variation k or the intracluster correlation coefficient ρ may be used.
The design effect in terms of $\rho$ may be written as $1 + (n-1)\rho$ (Kish 1965), an
increasing function of both cluster size and the ICC. Then the total number of
individuals in a trial arm can be found by multiplying the usual sample size formula
by this design effect:
$$Nn = \left(z_{\alpha/2} + z_{\beta}\right)^2 \frac{\sigma_0^2 + \sigma_1^2}{\left(\mu_0 - \mu_1\right)^2}\left[1 + (n-1)\rho\right].$$
Using the coefficient of variation $k$ to express variability across clusters yields the
similar formula:
$$N = 1 + \left(z_{\alpha/2} + z_{\beta}\right)^2 \frac{\left(\sigma_0^2 + \sigma_1^2\right)/n + k^2\left(\mu_0^2 + \mu_1^2\right)}{\left(\mu_0 - \mu_1\right)^2}.$$
These formulas can be modified to allow for unequal allocation of clusters in the
trial arms and to allow for differential levels of within-cluster correlation across trial
arms (Hayes and Bennett 1999). Other modifications include the special case of the
matched pairs design and accounting for varying cluster sizes: the greater the
variance of the cluster sizes, the greater the loss of power.
The formula based on k cannot be used directly when dealing with responses that
may be negative, e.g., with anthropometric Z-scores; the formula based on ρ is not
appropriate for Poisson or rate (based on person-years of observation) response vari-
ables. An advantage of k is that it is easily interpretable, and reasonable values for it
can be posited even in the absence of good background data. For example, it may be
known that in a given district the incidence rate of rotavirus diarrhea in infants is 5 per
10 infant-years and that it is highly unlikely for this rate to vary by more than twofold
(from the lowest to the highest) across subdistrict health centers. If the true rates are
approximately normally distributed, it might be reasonable to assume that about 95% of the
rates would lie between 3.33 and 6.67 per 10 infant-years, so the standard deviation
would be about $(6.67 - 5)/2 = 0.835$, giving $k = 0.835/5 = 0.17$.
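The following sketch reproduces this back-of-the-envelope value of k and plugs it into the rate-based analogue of the k formula above (after Hayes and Bennett 1999); the target rates and per-cluster person-years are hypothetical.

```python
import math
from scipy.stats import norm

def clusters_per_arm_rates(lam0, lam1, k, py, alpha=0.05, beta=0.20):
    """Clusters per arm for comparing rates (after Hayes and Bennett 1999).

    lam0, lam1: true rates in control and intervention arms;
    k: between-cluster coefficient of variation;
    py: person-years of follow-up per cluster.
    """
    z = (norm.ppf(1 - alpha / 2) + norm.ppf(1 - beta)) ** 2
    numer = (lam0 + lam1) / py + k**2 * (lam0**2 + lam1**2)
    return math.ceil(1 + z * numer / (lam0 - lam1) ** 2)

# Back-of-the-envelope k from the text: rates spanning 3.33 to 6.67 per 10 infant-years
k = ((6.67 - 5) / 2) / 5  # about 0.17
# Hypothetical design: reduce 0.5 to 0.35 episodes per infant-year,
# with 100 infant-years of follow-up per cluster
print(round(k, 3), clusters_per_arm_rates(0.5, 0.35, k, py=100))
```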
As already mentioned, when it comes to study power, there are greatly
diminishing returns to increasing cluster sizes for a specified number of clusters.
The following table gives an illustration of this for a continuous response variable,
diastolic blood pressure, which in a study population has a mean of 100 mmHg, with
a standard deviation of 10 mmHg. For a study design with 30 clusters in each arm,
power to detect a lowering to 95 mmHg is displayed as a function of cluster size for
several levels of k, with 5% Type I error.
[Table: power to detect a 5 mmHg decrease given 30 clusters in each of the intervention and control arms, population SD = 10 mmHg, and 5% Type I error; table body not preserved in this extraction.]
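Power values of the kind the table displayed can be recomputed by inverting the k-based sample size formula; the sketch below does so for a few hypothetical values of k and shows how quickly the gains flatten as cluster size grows.

```python
from scipy.stats import norm

def power_two_means(N, n, mu0, mu1, sd, k, alpha=0.05):
    """Approximate power for comparing two means with N clusters per arm of
    size n, obtained by inverting the k-based sample size formula above."""
    var_term = 2 * sd**2 / n + k**2 * (mu0**2 + mu1**2)
    z_beta = (((N - 1) * (mu0 - mu1) ** 2 / var_term) ** 0.5
              - norm.ppf(1 - alpha / 2))
    return norm.cdf(z_beta)

# 30 clusters per arm, means 100 vs. 95 mmHg, SD 10 mmHg, hypothetical k values
for k in (0.02, 0.05, 0.10):
    print(k, [round(power_two_means(30, n, 100, 95, 10, k), 2)
              for n in (5, 10, 20, 50, 100)])
```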
Statistical Analysis
Since the 1980s, flexible regression methods for the analysis of correlated data have
been readily accessible to researchers. Maximum likelihood estimation of random
effects models (Laird and Ware 1982), and generalized estimating equations (GEE)
with robust variance estimation (Liang and Zeger 1986), have been the main
approaches for handling longitudinal or multilevel data. These methods are also
appropriate for the analysis of many CRTs, provided there are sufficient numbers of
clusters, say 10–15 in each trial arm. For designs with fewer clusters, the cluster-
level methods discussed in the next section may be preferable.
In general, random effects models, which have a “subject-specific” or conditional
(on cluster) interpretation, are best used when there are sufficient numbers of
observations and/or events in each cluster. GEE models, on the other hand, were
designed for situations with small numbers of observations per cluster. GEE yields
“population-averaged” (Neuhaus et al. 1991) estimates that are usually close to or
identical to the marginal values obtained when ignoring clusters but corrects for
over- or under-dispersion via empirical variance estimation. Several adjustments for
GEE models have been devised to correct the Type I error, which can be inflated
when there are relatively few clusters (Scott et al. 2017).
Because these standard approaches for analyzing correlated data have been well
elaborated in the literature, we now focus on cluster-level methods.
Cluster-Level Methods
assumption is met; a Wilcoxon rank sum test can be used as a check and to
downweight undue influence of “outlying” clusters.
If there is a sufficient number of clusters, cluster summaries can be used as
responses in regression models that adjust for cluster-level covariates (e.g., whether
there is a tertiary care facility in a geographic cluster, or the median income of cluster
residents). More often, adjustment for individual-level covariates will be required.
Even with very few clusters, the following two-stage method can be employed
(Bennett et al. 2002; Hayes and Moulton 2017): (1) In the first stage, ignoring
clusters, a regression of the outcome variable on all the adjusting variables is fit,
but with the treatment arm indicator(s) omitted; (2) in the second stage, residuals
from the fit in the first stage are calculated for each cluster – then the standard
analysis, say a t-test, is conducted on the residuals. This approach is especially useful
in matched pairs designs, where individual-level regression modeling is problematic.
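A minimal sketch of this two-stage method for a continuous outcome, on simulated, hypothetical data:

```python
import numpy as np
from scipy import stats

def two_stage_test(y, X, cluster, arm):
    """Two-stage cluster-level analysis (cf. Bennett et al. 2002).

    Stage 1: regress y on individual-level covariates X, omitting the
    treatment indicator. Stage 2: average residuals by cluster and
    compare the trial arms with a t-test on the cluster means.
    """
    X1 = np.column_stack([np.ones(len(y)), X])     # add intercept
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # stage 1 OLS fit
    resid = y - X1 @ beta
    labels = np.unique(cluster)
    means = np.array([resid[cluster == c].mean() for c in labels])
    arms = np.array([arm[cluster == c][0] for c in labels])
    return stats.ttest_ind(means[arms == 1], means[arms == 0])

# Hypothetical simulated CRT: 20 clusters of 30, one covariate, effect 0.5
rng = np.random.default_rng(3)
cluster = np.repeat(np.arange(20), 30)
arm = np.repeat(rng.permutation([0, 1] * 10), 30)
x = rng.normal(size=600)
y = 0.5 * arm + 0.8 * x + rng.normal(0, 0.4, 20)[cluster] + rng.normal(size=600)
print(two_stage_test(y, x.reshape(-1, 1), cluster, arm))
```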
Reporting Results
Standardized reporting of trials has improved greatly with the publication of the
CONSORT Statement (see chapter “Reporting guidelines”) (Begg et al. 1996). A
specialized version has been produced for CRTs, the most recent of which is by
Campbell et al. (2012). Many checklist items are the same while calling for addi-
tional details on rationale for the cluster design, cluster definition, levels of masking,
how clustering is accounted for in the analyses, and estimates of the degree of
observed clustering. Not mentioned in this CONSORT, but desirable when there
are not too many clusters (say fewer than 40), is the display, in a table or figure, of
cluster-by-cluster outcome data, so that readers can identify the degree to which
clusters varied in size or in magnitude of response.
Ethics and Data Monitoring

Of the three precepts given in the Belmont Report (The National Commission for
the Protection of Human Subjects of Biomedical and Behavioral Research 1979)
– respect for persons, beneficence, and justice – it is respect for persons, embodied in the informed consent process, that can be the most problematic for CRTs. It may not be possible to obtain informed consent from individuals when the intervention is applied at the community level, say a public health information campaign. In such situations, community leaders, political figures, clinic directors, or other gatekeepers will need to be approached. This is often the case even when the intervention is delivered directly to individuals, so that consent at both the community and individual levels is required. The justice principle also calls for special consideration, as inequitable distribution (or perception thereof) of risks and benefits may occur, perhaps with certain communities becoming stigmatized for their role in the trial. Individuals who do not partake in the trial may suffer adverse effects related to the treatment of the community in which they live – for example, mass administration of an antibiotic to children may induce resistance in some organisms, making it difficult to cure adults whom those organisms infect.
The Ottawa Statement (Weijer et al. 2012) addresses further ethical details
specific to CRTs.
CRTs will typically have a Data Monitoring Committee (DMC), and perhaps also
a Steering Committee, Safety Monitor, or Data Monitor, each of which has some role
in overseeing or ensuring the ethical conduct of the study. For CRTs, the DMC has to
take a broader view than is done in individually randomized trials, considering the
impact on communities or clusters as a whole, which can include those not directly
involved in the trial. Monitoring for early evidence of effectiveness and early
stopping does not occur as frequently in CRTs as in individually randomized trials,
as long-term effects are often of interest, for example, reduction in secondary or
tertiary attack rates. When potential early stopping is of interest, the DMC needs to
be aware that information is accrued, relatively speaking, more rapidly in a CRT, due
to the diminishing returns within clusters of obtaining further information on
individuals in a cluster, as described above in the section on sample size (Hayes
and Moulton 2017).
Discussion
While this chapter has explained some introductory statistical notions relevant to cluster randomized trials, it has concentrated on the ways in which CRTs differ from individually randomized trials – aspects that should prove useful even to seasoned statisticians and trialists who have not worked with CRTs. In individually randomized trials, it is
usually the case that the intervention is applied at the individual level and data are
collected at the individual level. In CRTs, however, there may be three different
levels involved in randomization, intervention, and data collection. For example,
cluster, individual, and disease registry could be the levels, respectively. When
crossed with differing possible analysis populations, from intent-to-treat to per-
protocol, and different effects (e.g., indirect) of interest, the potential estimands
become myriad. As a consequence, investigators have to think long and hard
about exactly what answers a trial should be designed to provide. This will guide
cluster formation and specification of inclusion/exclusion criteria for enrollment,
intervention, and data collection.
By contrast, methods of analysis of CRTs are relatively straightforward. This
chapter has mentioned approaches for handling the particularly problematic ana-
lytic feature of CRTs, namely, that many trials involve small numbers of clusters.
That means methods relying on large-sample asymptotics may become suspect,
and we need to consider alternative methods or conduct additional sensitivity
analyses.
As new methods arise in the design and analysis of individually randomized trials, say for covariate specification or causal inference methods for handling loss to follow-up, they will find parallel application in CRTs. Such transfer of methodology, however, needs to be done carefully, especially in accounting for within-cluster correlation and small numbers of clusters.
Cross-References
References
Bailey RA (1987) Restricted randomization: a practical example. J Am Stat Assoc 82:712–719
Bailey RA, Rowley CA (1987) Valid randomization. Proc R Soc Lond A Math Phys Sci
410:105–124
Begg C, Cho M, Eastwood S, Horton R, Moher D, Olkin I, Pitkin R, Rennie D, Schulz KF, Simel D,
Stroup DF (1996) Improving the quality of reporting of randomized controlled trials. The
CONSORT statement. JAMA 276:637–639
Bennett S, Parpia T, Hayes R, Cousens S (2002) Methods for the analysis of incidence rates in
cluster randomized trials. Int J Epidemiol 31:839–846
Bruhn M, McKenzie D (2009) In pursuit of balance: randomization in practice in development field
experiments. Am Econ J Appl Econ 1:200–232
Campbell MJ, Walters SJ (2014) How to design, analyse and report cluster randomised trials in
medicine and health related research. Wiley, West Sussex
Campbell MK, Piaggio G, Elbourne DR, Altman DG, CONSORT Group (2012) Consort 2010
statement: extension to cluster randomised trials. BMJ 345. https://doi.org/10.1136/bmj.e5661
Diehr PD, Martin C, Koepsell T, Cheadle A (1995) Breaking the matches in a paired t-test for
community interventions when the number of pairs is small. Stat Med 14:1491–1504
Donner A, Klar N (2000) Design and analysis of cluster randomization trials in health research. Arnold, London
Ebola ça suffit Ring Vaccination Trial Consortium (2015) The ring vaccination trial: a novel cluster
randomised controlled trial design to evaluate vaccine efficacy and effectiveness during out-
breaks, with special reference to Ebola. BMJ 351. https://doi.org/10.1136/bmj.h3740
Eldridge S, Kerry S (2012) A practical guide to cluster randomised trials in health services research,
1st edn. Wiley, West Sussex
Ferebee SH, Mount FW, Murray FJ, Livesay VT (1963) A controlled trial of isoniazid prophylaxis
in mental institutions. Am Rev Respir Dis 88:161–175
Fisher RA (1947) The design of experiments, 4th edn. Hafner Publishing Company, New York
Halloran ME, Struchiner CJ (1991) Study designs for dependent happenings. Epidemiology
2:331–338
Halloran ME, Struchiner CJ, Longini IM Jr (1997) Study designs for evaluating different efficacy
and effectiveness aspects of vaccines. Am J Epidemiol 146:789–803
Hayes RJ, Bennett S (1999) Simple sample size calculation for cluster-randomized trials. Int J
Epidemiol 28:319–326
Hayes RJ, Moulton LH (2017) Cluster randomised trials, 2nd edn. Chapman & Hall, Boca Raton
Hussey MA, Hughes JP (2007) Design and analysis of stepped wedge cluster randomized trials.
Contemp Clin Trials 28:182–191
Kish L (1965) Survey sampling. Wiley, New York
Kotz D, Spigt M, Arts ICW, Crutzen R, Viechtbauer W (2012) Use of the stepped wedge design
cannot be recommended: a critical appraisal and comparison with the classic cluster randomized
controlled trial design. J Clin Epidemiol 65:1249–1252
Laird NM, Ware JH (1982) Random-effects models for longitudinal data. Biometrics 38:963–974
Li F, Turner EL, Heagerty PJ, Murray DM, Vollmer WM, DeLong ER (2017) An evaluation of
constrained randomization for the design and analysis of group-randomized trials with binary
outcomes. Stat Med 36:3791–3806
Liang KY, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika
73:13–22
Montgomery AA, Peters TJ, Little P (2003) Design, analysis and presentation of factorial
randomized controlled trials. BMC Med Res Methodol 3:26. https://doi.org/10.1186/1471-
2288-3-26
Morgan KL, Rubin DB (2012) Rerandomization to improve covariate balance in experiments. Ann
Stat 40:1263–1282
Moulton LH (2004) Covariate-based constrained randomization of group-randomized trials. Clin
Trials 1:297–305
Moulton LH, O’Brien KL, Kohberger R, Chang I, Reid R, Weatherholtz R, Hackell JG, Siber GR,
Santosham M (2001) Design of a group-randomized Streptococcus pneumoniae vaccine trial.
Control Clin Trials 22:438–452
Murray DM (1998) Design and analysis of group-randomized trials. Oxford University Press, New York
Neuhaus JM, Kalbfleisch JD, Hauck WW (1991) A comparison of cluster-specific and population-
averaged approaches for analyzing correlated binary data. Int Stat Rev 59:25–35
Raab GM, Butcher I (2001) Balance in cluster randomized trials. Stat Med 20:351–365
Scott JM, deCamp A, Juraska M, Fay MP, Gilbert PB (2017) Finite-sample corrected
generalized estimating equation of population average treatment effects in stepped wedge
cluster randomized trials. Stat Methods Med Res 26:583–597. https://doi.org/10.1177/
0962280214552092
The National Commission for the Protection of Human Subjects of Biomedical and Behavioral
Research (1979) Protection of human subjects; Belmont report: notice of report for public
comment. Fed Regist 44:23191–23197
Thompson JA, Fielding KL, Davey C, Aiken AM, Hargreaves JR, Hayes RJ (2017) Bias and
inference from misspecified mixed-effect models in stepped wedge trial analysis. Stat Med
36:3670–3682
Torgerson D (ed) (2015) Stepped wedge randomized controlled trials. Trials 16: articles 350–354, 358, 359
Weijer C, Grimshaw JM, Eccles MP, McRae AD, White A, Brehaut JC, Taljaard M, Ottawa Ethics
of Cluster Randomized Trials Consensus Group (2012) The Ottawa statement on the ethical
design and conduct of cluster randomized trials. PLoS Med 9. https://doi.org/10.1371/journal.
pmed.1001346
Xu Z, Kalbfleisch JD (2010) Propensity score matching in randomized clinical trials. Biometrics
66:813–823
Multi-arm Multi-stage (MAMS) Platform
Randomized Clinical Trials 78
Babak Choodari-Oskooei, Matthew R. Sydes, Patrick Royston, and
Mahesh K. B. Parmar
Contents
Introduction 1508
  Background 1508
  The MAMS Approach 1510
  Advantages of MAMS 1510
Example: STAMPEDE Trial 1511
MAMS Design 1513
  Design Specification 1513
  Steps to Design a MAMS Trial 1518
  Analysis at Interim and Final Stages 1519
  Choosing Pairwise Design Significance Level and Power 1519
  Intermediate and Definitive Outcomes 1520
  Operating Characteristics 1521
MAMS Selection Designs 1525
Adding New Research Arms and Comparisons 1527
Software and Example 1528
Considerations in Design, Conduct, and Analysis of a MAMS Trial 1532
  Design Considerations 1532
  Conduct Considerations 1535
  Analysis Considerations 1535
Summary 1537
Key Facts 1538
Cross-References 1539
References 1539
Abstract
Efficient clinical trial designs are needed to speed up the evaluation of new
therapies. The multi-arm multi-stage (MAMS) randomized clinical trial designs
have been proposed to achieve this goal. In this framework, multiple experimental arms are compared against a common control arm, and these pairwise comparisons can be carried out in several stages.
B. Choodari-Oskooei (*) · M. R. Sydes · P. Royston · M. K. B. Parmar
MRC Clinical Trials Unit at UCL, Institute of Clinical Trials and Methodology, London, UK
e-mail: [email protected]; [email protected]; [email protected];
[email protected]
Keywords
Multi-arm multi-stage randomized trials · MAMS designs · Platform protocols ·
Adaptive clinical trials · Intermediate outcome · STAMPEDE trial · Prostate
cancer
Introduction
Background
Randomized controlled trials (RCTs) are the gold standard for testing whether a new treatment is better than the current standard of care. Recent reviews of phase III trials showed a success rate of around 40% in oncology, and drugs entering phase I testing had only around a 7% chance of approval (Hay et al. 2014). In many disease
areas such as oncology, traditional randomized trials take a long time to complete
and are often expensive. Multi-arm, multi-stage (MAMS) trial designs have been
proposed to overcome these challenges. The MAMS design aims to speed up the
evaluation of new therapies and improve success rates in identifying effective ones
(Parmar et al. 2008). In this framework, a number of experimental arms are com-
pared against a common control arm and these pairwise comparisons can be made in
several stages. The multi-stage element of the MAMS design resembles the parallel-
group sequential designs, where the accumulating data are used to make a decision
whether to take a certain treatment arm to the next stage.
Fig. 1 Schematic representation of traditional approach (left) and a multi-arm, multi-stage design
(right) of evaluating experimental treatments (T) against control (C). Traditional approach has a set
of separate phase II studies (activity), some of which may not be randomized, with follow-up phase
III trials (efficacy) in some interventions that pass the phase II stage. Multi-arm, multi-stage
approach provides a platform in which many interventions can be assessed against the control
arm simultaneously and are randomized from the phase II component onwards
In this chapter, we discuss issues that arise in the design, conduct, and analysis of MAMS trials, and make suggestions for tackling them. Throughout, we use the acronym MAMS to refer to the multi-arm, multi-stage design described by Royston et al. (2003). Other approaches to MAMS designs will be briefly discussed in "Summary."

The MAMS Approach
Royston et al. (2003) developed a framework for a multi-arm multi-stage design for time-to-event outcomes which can be applied to designs in the phase II/III settings. In this design, an intermediate (I) outcome can be used at the
interim stages to further increase the efficiency of the MAMS design by stopping
recruitment to treatment arms for lack-of-benefit at interim stages. Using an I outcome
in this way allows interim analyses to be conducted sooner and so recruitment to
poorly performing arms can be stopped much earlier than if the primary outcome of
the trial was used throughout. Examples of intermediate and primary or definitive (D)
outcomes are progression-free survival (PFS) and overall survival (OS) for many
cancer trials, and CD4 count and disease-specific survival for HIV trials. When
using an I outcome at interim stages, each of the experimental arms is compared in
a pairwise manner with the control arm using the I-outcome measure, for example, the
PFS (log) hazard ratio. Section “Steps to Design a MAMS Trial” outlines the steps that
should be taken to design a MAMS trial. Section “Choosing Pairwise Design Signif-
icance Level and Power” explains how to choose the stagewise stopping rules and
design power, and section “Intermediate and Definitive Outcomes” provides guidance
on how to choose an I outcome within this framework.
This design has been extended to binary outcomes with the risk difference
(Bratton et al. 2013), and can easily be extended to odds ratio as the primary effect
measure (Abery and Todd 2019). The design has also been extended to include
stopping boundaries for overwhelming efficacy on the definitive (D) outcome of the
trial (Blenkinsop et al. 2019). It is one of the few adaptive designs being deployed
both in a number of trials and across a range of diseases in the phase II and III
settings, including cancer (Sydes et al. 2012), tuberculosis (TB), and surgical trials
(ClinicalTrials.gov Identifier: NCT03838575) (Sydes et al. 2012; ROSSINI 2018;
MRC Clinical Trials Unit at UCL). One example is the STAMPEDE trial for men
with prostate cancer which is used as an example in this chapter for illustration
(Sydes et al. 2012). In the remainder of this chapter, the MAMS designs that utilize the I outcome for the lack-of-benefit analysis at the interim looks are denoted by I ≠ D. Designs that monitor all the arms on the same definitive (D) outcome throughout the trial are denoted by I = D.
Advantages of MAMS
The MAMS design has several advantages. First, several primary hypotheses or
treatments can be evaluated under one (master) protocol. This maximizes the chance
of identifying a new treatment which is better than the current standard (Parmar et al.
2014). Second, all patients are randomized from the start, which ensures fair and contemporaneous comparisons. If an early-phase (phase II) element is built into the design under the same protocol, the trial runs seamlessly through to phase III and, where at all possible, the phase III evaluation includes the information from patients in the early evaluation. As a result, the overall trial duration will be markedly reduced compared to separate phase II and III studies, since on average many fewer patients will be required in most situations.
Furthermore, a MAMS design can be part of trials with master protocols such as
platform or umbrella trials. Master protocols allow major adaptations such as ceasing
randomization to an experimental arm or introducing new comparisons through the
addition of new experimental arms – see section “Adding New Research Arms and
Comparisons.” Platform trials provide notable operational efficiency since evalua-
tion of a new treatment within an existing trial will typically be much quicker than
setting up a new trial (Schiavone et al. 2019). Therefore, fewer patients tend to be
exposed to insufficiently effective or harmful treatments as these treatments are
eliminated quickly from the study. This shifts the focus to the more promising
treatments as the trial progresses.
Finally, MAMS designs tend to be popular with patients perhaps because the
increased number of active treatments means that they are more likely to receive the
new treatments. Recruitment tends to markedly improve over time in MAMS trials
while many traditional designs may struggle to accrue – particularly, within the
context of platform trials and when new treatments are to be tested for different,
often biomarker-defined, subgroups of a specific disease. In summary, MAMS
platform designs are efficient because they share a control arm, allow for early
stopping for lack-of-benefit and adding new research arms, and are operationally
seamless.
Example: STAMPEDE Trial

STAMPEDE is a multi-arm multi-stage (MAMS) platform trial for men with prostate cancer at high risk of recurrence who are starting long-term androgen depriva-
tion therapy (Sydes et al. 2009, 2012). In the initial 4-stage design, five experimental
arms with treatment approaches previously shown to be suitable for testing in a
phase II/III trial were compared to a control arm regimen. In the original design, all
patients received standard of care treatment, and further treatments were added to
this in the experimental arms. The primary analysis was carried out at the end of
stage 4, with overall survival as the primary outcome. Stages 1 to 3 used an
intermediate outcome measure of failure-free survival (FFS) to drop arms for
lack-of-benefit. As a result, the corresponding hypotheses at interim stages were
on lack-of-benefit on failure-free survival. Claims of efficacy could not be made
on this outcome because such a claim can only be made on the primary (D) outcome
of the trial.
Table 1 Design specification for the 6-arm 4-stage STAMPEDE trial. HR_1, ω_j, and α_j are the target hazard ratio (HR) under the alternative hypothesis for the experimental arms, the pairwise (design) power, and the significance level at each stage. The critical HR and the required control arm events for each stage are calculated given these design parameters

Stage (j)   Type       Outcome   HR_1   ω_j    α_j     Critical HR   Control arm events
1           Activity   FFS       0.75   0.95   0.50    1.00          113
2           Activity   FFS       0.75   0.95   0.25    0.92          223
3           Activity   FFS       0.75   0.95   0.10    0.88          350
4           Efficacy   OS        0.75   0.90   0.025   –             437
There would be little interest in continuing with an experimental therapy which was no better than the
control regimen (including those which are detrimental). Thus, the interim decision
rules are distinctly one-sided. Considering the final stage on the D-outcome, any
therapy that is likely to be detrimental in terms of final outcome is very unlikely to
have passed the interim stages. It therefore seems inappropriate to test for differences
in both directions at the final stage.
Recruitment to the original comparisons of the STAMPEDE trial began late in
2005 and was completed early in 2013. The design parameters for the primary
outcome at the final stage were a (one-sided) significance level of 0.025, power of
0.90, and the target hazard ratio of 0.75 on overall survival which requires 437 con-
trol arm deaths (i.e., events on overall survival). An allocation ratio of A = 0.5 was used for these original comparisons so that, over the long term, one patient was allocated to each experimental arm for every two patients allocated to control. Proportional hazards assumptions were made for both the FFS and OS outcomes.
Because distinct hypotheses were being tested in each of the five experimental
arms, the emphasis in the design for STAMPEDE was on the pairwise comparisons
of each experimental arm against the control arm, with emphasis on the control of the
pairwise type I error rate (PWER) – see section “Operating Characteristics” for
definition. Out of the initial five experimental arms, only three of them continued to
recruit through to their final stage. Recruitment to the other two arms was stopped at
the second interim look due to lack of sufficient activity. Since November 2011, new
experimental arms have been added to the original design with five new comparisons
added between 2011 and 2018.
MAMS Design
In this section, we present the MAMS design more formally and discuss how it can
be realized in practice. We present the design for a variety of outcome measures. We
also outline how the operating characteristics of the design can be calculated in
different scenarios.
Design Specification
Consider a J-stage trial where patients are randomized between K experimental arms (k = 1, ..., K) and a single control arm (k = 0). The parameter θ_jk represents the true difference in the outcome measure between experimental arm k and control at stage j, j = 1, ..., J.
For continuous outcomes, θ_jk could be the difference in the means of the two groups at stage j, μ_jk − μ_j0; for binary data, the difference in proportions, p_jk − p_j0; for survival data, a log hazard ratio, log(HR_jk). For simplicity of notation, we outline the design specification for the case where the same definitive (D) outcome is monitored throughout the trial, that is, I = D designs. Therefore, in all notation, θ and
Z represent the definitive primary outcome treatment effect and the corresponding Z-
test statistic comparing experimental arm k ¼ 1, 2. . .,K to the control arm.
The test statistic comparing experimental arm k against the control arm at stage j can be defined as

\[ Z_{jk} = \hat{\theta}_{jk}\,\sqrt{V_{jk}}, \]

where V_jk is the inverse of the variance of the treatment effect estimator at stage j for the pairwise comparison k – in statistical terms, V_jk is known as the Fisher's (observed) information. In the literature, I is used instead of V as the standard notation for Fisher's information. Since in this chapter the intermediate outcome is abbreviated as I, we use the rather nonstandard notation V for Fisher's information. A detailed discussion of information quantification in various outcome settings is provided in Lan and Zucker (1993). The Z-test statistic is (approximately) normally distributed, that is, \( Z_{jk} \sim N(\theta_{jk}\sqrt{V_{jk}},\, 1) \), and is standard normal under the null hypothesis, \( Z_{jk} \sim N(0, 1) \). For normal or binary data, V_jk depends on the number of subjects in the study arms. In time-to-event or survival settings, it depends on the number of events in the study arms. Table 2 presents the treatment effect measures for continuous, binary, and survival outcomes with the corresponding Fisher's (observed) information V_jk.
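As a worked illustration of our own (not in the original text), consider stage 2 of the STAMPEDE design in Table 1, which requires 223 control arm FFS events at a one-sided α_2 = 0.25. Assuming, roughly, that with allocation ratio A = 0.5 the experimental arm contributes about half as many events as the control arm (e_2k ≈ 112), the survival-outcome row of Table 2 gives

\[ V_{2k} \approx \left(\frac{1}{223} + \frac{1}{112}\right)^{-1} \approx 74.5, \qquad \text{critical } \log \mathrm{HR} = \frac{\Phi^{-1}(0.25)}{\sqrt{V_{2k}}} \approx \frac{-0.674}{8.63} \approx -0.078, \]

so the critical hazard ratio is exp(−0.078) ≈ 0.92, matching the stage 2 entry in Table 1.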
For example, in trials with continuous outcomes where the aim is to test whether the outcome of n_1 individuals on experimental treatment E_1 is on average better (here better means smaller, e.g., blood pressure) than that of n_0 individuals in the control group (C) at stage j, the null hypothesis \( H^0_{j1}: \mu_{j1} \ge \mu_{j0} \) is tested against the (one-sided) alternative hypothesis \( H^1_{j1}: \mu_{j1} < \mu_{j0} \). In this case, the one-sided type I error rate and power at stage j are α_j1 and ω_j1, where \( \alpha_{j1} = \Phi(-z_{\alpha_{j1}}) \) and \( \omega_{j1} = \Phi(-z_{\omega_{j1}}) \), with Φ(·) the standard normal cumulative distribution function and z_α its upper α-quantile. In the MAMS design, α_jk and ω_jk are the key stagewise design parameters, and are needed for sample size calculations – see section "Steps to Design a MAMS Trial" for the other design parameters.
Without loss of generality, assume that a negative value of θ_jk indicates a beneficial effect of treatment k. In trials with K experimental arms, a set of K null hypotheses is tested at each stage j,

\[ H^0_{jk}: \theta_{jk} \ge \theta^0_j \quad \text{versus} \quad H^1_{jk}: \theta_{jk} < \theta^0_j, \qquad k = 1, \ldots, K, \tag{1} \]

for some prespecified null effects θ_j^0. In practice, θ_j^0 is usually taken to be 0 on a relevant scale such as the log hazard ratio for survival outcomes or the mean difference for continuous outcomes. If the same definitive (D) outcome is monitored throughout the trial (I = D), then the true treatment effect θ_jk and θ_j^0 are assumed constant for all j. Otherwise, θ_Jk and θ_J^0 correspond to the true and null effects on the definitive outcome, and θ_jk and θ_j^0 correspond to the intermediate outcome for all j < J and are constant. For sample size and power calculations, a minimum target treatment effect (often the minimum clinically important difference), θ_j^1, is also required. For example, in the STAMPEDE design with five experimental arms and four stages, there are up to 20 sets of null and alternative hypotheses as above. In this
Table 2 Treatment effects, statistical information, and correlations between the test statistics of pairwise comparisons in trials with continuous, binary, and survival outcomes, with a common allocation ratio (A) in all pairwise comparisons. Also see section "Software and Example" (and "Correlation Structure Between Pairwise Comparisons") for an example when an intermediate outcome is used in a MAMS design and how to calculate the between-stage correlation structure in this case. In every row, the between-comparison correlation is Corr(Z_jk, Z_jk') = A/(A + 1); the between-stage correlation Corr(Z_jk, Z_j'k) is given for j' > j.

Continuous outcome: θ = μ_jk − μ_j0; V_jk = [σ²_j0/n_j0 + σ²_jk/(A n_j0)]⁻¹; Corr(Z_jk, Z_j'k) = √(n_jk/n_j'k).

Binary outcome (difference in proportions): θ = p_jk − p_j0; V_jk = [p_j0(1 − p_j0)/n_j0 + p_jk(1 − p_jk)/(A n_j0)]⁻¹; Corr(Z_jk, Z_j'k) = √(n_jk/n_j'k).

Binary outcome (log odds ratio): θ = log{[p_jk(1 − p_j0)]/[p_j0(1 − p_jk)]}; V_jk = [1/(n_j0 p_j0(1 − p_j0)) + 1/(A n_j0 p_jk(1 − p_jk))]⁻¹; Corr(Z_jk, Z_j'k) = √(n_jk/n_j'k).

Binary outcome (log risk ratio): θ = log(p_jk/p_j0); V_jk = [(1 − p_j0)/(n_j0 p_j0) + (1 − p_jk)/(A n_j0 p_jk)]⁻¹; Corr(Z_jk, Z_j'k) = √(n_jk/n_j'k).

Survival outcome (log hazard ratio): θ = log(HR_jk) = log(λ_jk/λ_j0); V_jk = [1/e_j0 + 1/e_jk]⁻¹; Corr(Z_jk, Z_j'k) = √(e_jk/e_j'k).
trial, the null θ_j^0 and target treatment effects θ_j^1 used in all the stages (and comparisons) were 0 and log(0.75), respectively. In MAMS designs that use an intermediate (I) outcome measure at interim stages, the primary null and alternative hypotheses, H_Jk^0 and H_Jk^1, concern θ_Jk, with the hypotheses at stage j ( j < J) playing a subsidiary role, mainly to calculate the interim stage sample sizes.
The joint distribution of the Z-test statistics therefore follows a multivariate normal distribution, \( \mathrm{MVN}(\theta\sqrt{V}, \Sigma) \), where θ and V are the J × K matrices of the mean treatment effects and the corresponding Fisher's (observed) information, \( V_{jk} = 1/\mathrm{var}(\hat{\theta}_{jk}) \), and Σ denotes the correlation matrix between the J × K test statistics – see the last two columns of Table 2; for example, in trials with time-to-event outcomes, \( \mathrm{Corr}(Z_{jk}, Z_{j'k}) = \sqrt{e_{jk}/e_{j'k}} \) for j' > j.
The one-sided stagewise significance level α_j plays two key roles in the MAMS design. Together with the power ω_j, it is used as the design parameter to calculate the required (cumulative) sample size at the end of stage j. Further, it acts as the stopping boundary for lack-of-benefit at the end of stage j. In principle, in a MAMS design different stopping boundaries can be specified for each pairwise comparison. For simplicity, here we assume the same stopping boundaries (α_j) for all pairwise comparisons. Section "Choosing Pairwise Design Significance Level and Power" explains how to choose the stagewise stopping rules and design power. The interim lack-of-benefit stopping boundaries can also be defined on the Z-test statistic, since there is a one-to-one correspondence between them, that is, l_j = −z_{α_j}, j = 1, ..., J − 1. For simplicity of notation, let L = (l_1, ..., l_{J−1}) be the stopping boundaries for lack-of-benefit prespecified for the interim stages, corresponding to the one-sided significance levels α_1, ..., α_{J−1} in all K comparisons – see section "Choosing Pairwise Design Significance Level and Power." For example, with survival outcomes, where the treatment effect is measured by the (log) hazard ratio, L forms an upper bound because the alternative treatment effect being targeted is a relative reduction in the (log) hazard compared to the control arm.
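To make the correspondence concrete (our illustration, based on Table 1): the STAMPEDE interim significance levels α = (0.50, 0.25, 0.10) translate into Z-scale lack-of-benefit boundaries

\[ L = (l_1, l_2, l_3) = \left(\Phi^{-1}(0.50),\ \Phi^{-1}(0.25),\ \Phi^{-1}(0.10)\right) \approx (0,\ -0.674,\ -1.282), \]

so an arm passes stage j only if its Z statistic lies below l_j; l_1 = 0 corresponds to the stage 1 critical hazard ratio of 1.00 in Table 1.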
In designs that include interim stopping boundaries for overwhelming efficacy on the primary outcome measure, another set of significance levels, α_j^(E), should be specified at the design stage. This may be desirable for both investigators and sponsors because being able to identify effective regimens earlier increases the efficiency of the design further by reducing the resources allocated to these arms. It may also result in stopping the trial early to progress efficacious arms to the subsequent phase of the testing process, or to seek regulatory approval and thus expedite uptake of the treatment by patients. Two popular efficacy stopping boundaries are the Haybittle-Peto and O'Brien-Fleming stopping rules. Blenkinsop et al. (2019) investigated the impact of the efficacy stopping rules on the operating characteristics of the MAMS design. Section "Software and Example" illustrates how to control the overall type I error rate in MAMS designs with both lack-of-benefit and efficacy stopping rules, using the STAMPEDE trial as an example. Let B = (b_1, b_2, ..., b_J) be the stopping boundaries for overwhelming evidence of efficacy on the primary outcome, where b_j = −z_{α_j^(E)} and b_J is the threshold for assessing efficacy at the final analysis, corresponding to α_J. The two sets of boundaries meet at the final stage J to ensure that a conclusion can be made regarding efficacy. In I = D designs, the primary outcome test statistic is compared to the stopping boundaries at each stage, where one of three outcomes can occur (assuming binding boundaries – see section "Binding/Nonbinding Stopping Boundaries"). At each interim stage j < J:

• If b_j < Z_jk < l_j, experimental arm k continues to the next stage (recruitment and treatment k continue).
• If Z_jk > l_j, experimental arm k is "dropped" for lack of benefit (recruitment and treatment k are stopped).
• If Z_jk ≤ b_j, the corresponding null hypothesis is rejected early and experimental arm k is stopped early for overwhelming efficacy.

At the final stage J:

• If Z_Jk > b_J, the test is unable to reject the final-stage null hypothesis H_Jk^0 at level α_J.
• If Z_Jk ≤ b_J, reject H_Jk^0 at level α_J and conclude efficacy for experimental arm k.
Steps to Design a MAMS Trial

The following steps should be taken to design a MAMS trial with both lack-of-benefit and efficacy stopping boundaries – see section "Considerations in Design, Conduct, and Analysis of a MAMS Trial" for further guidelines on some of the points:
1. Choose the number of experimental arms, K, and stages, J – see section “Design
Considerations.”
2. Choose the definitive D outcome, and (optionally) I outcome – see section
“Intermediate and Definitive Outcomes.”
3. Choose the null values for θ – for example, the (log) hazard ratios on the intermediate (θ_I^0) and definitive (θ_D^0) outcomes – see section "Software and Example."
4. Choose the minimum clinically relevant target treatment effect size – for example, in the time-to-event setting, the (log) hazard ratios on the intermediate (θ_I^1) and definitive (θ_D^1) outcomes.
5. Choose the control arm event rate (median survival) in trials with binary
(survival) outcome – see section “Software and Example.”
6. Choose the allocation ratio A, the number of patients allocated to each experimental arm for every patient allocated to the control arm. For a fixed-sample (1-stage) multi-arm trial, the optimal allocation ratio (i.e., the one that minimizes the sample size for a fixed power) is approximately A = 1/√K – see the worked example after this list.
7. In I 6¼ D designs, choose the correlation between the estimated treatment effects
for the I and D outcomes. An estimate of the correlation can be obtained by
bootstrapping relevant existing trial data – see sections “Correlation Structure
Between Pairwise Comparisons” and “Software and Example,” and Sect. 2.7.1
in Royston et al. (2011) for further details.
8. Choose the accrual rate per stage to calculate the trial's timelines – see section
“Software and Example.”
9. Choose a one-sided significance level for lack-of-benefit and the target power
for each stage (αjk, ωjk). The chosen values for αjk and ωjk are used to calculate
the required sample sizes for each stage – see sections “Choosing Pairwise
Design Significance Level and Power” and “Design Considerations.”
10. Choose whether to allow early stopping for overwhelming efficacy on the primary (D) outcome. If yes, choose an appropriate efficacy stopping boundary α_j^(E) on the D-outcome measure for each stage 1, ..., J, where α_J^(E) = α_J. Possible choices are the Haybittle-Peto or O'Brien-Fleming stopping boundaries used in group sequential designs, or one based on α-spending functions (Blenkinsop et al. 2019).
11. Given the above design parameters, calculate the control and experimental arm (effective) sample sizes required to trigger each analysis – that is, n_jk in trials with continuous and binary outcomes and e_jk in trials with time-to-event outcomes – and the operating characteristics of the design, namely the overall type I error rate and power. If the desired (prespecified) overall type I error rate and power have not been attained, for instance if the overall pairwise power is smaller than the prespecified value, steps 9–11 should be repeated until they are. Alternatively, if the overall type I error rate is larger than the prespecified value, one can choose a more stringent (lower) design alpha for the final stage, α_J, and repeat steps 9–11 until the desired overall type I error rate is achieved – see sections "Software and Example" and "Design Considerations."
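A quick worked example of the allocation-ratio rule in step 6 (ours, not from the text): for K = 5 experimental arms,

\[ A = \frac{1}{\sqrt{K}} = \frac{1}{\sqrt{5}} \approx 0.447, \]

which is close to the ratio A = 0.5 actually used for the original STAMPEDE comparisons (one patient allocated to each experimental arm for every two allocated to control).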
Analysis at Interim and Final Stages

In a MAMS design, the end of each stage is reached when the accumulated trial data attain the predetermined (effective) sample size for that stage. The effective sample size is the number of subjects in designs with binary and continuous outcomes (Bratton et al. 2013), and the number of required events in designs with time-to-event outcomes (Royston et al. 2011). Reaching the end of each stage triggers an interim analysis of the accumulated trial data. The outcome of the analysis is a decision to discontinue recruitment to a particular experimental arm for lack-of-benefit on I (or D), to terminate the trial for efficacy on D, or to continue.
At each interim analysis, the treatment effects are estimated using an appropriate
analysis method. For example, the Cox proportional hazards model can be used to
estimate the log hazard ratio, and calculate the corresponding test statistic and P-values.
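A hedged Stata sketch of one such stagewise pairwise analysis for a survival outcome (the variables time, event, and arm, with arm = 0 for control, are hypothetical):

  * Pairwise comparison of experimental arm 2 against control at an interim look.
  stset time, failure(event)
  stcox i.arm if inlist(arm, 0, 2)        // log HR for arm 2 vs control (arm 0)
  local z = _b[2.arm] / _se[2.arm]
  display "one-sided p = " normal(`z')    // benefit corresponds to HR < 1, z < 0
  * Recruitment to arm 2 stops for lack of benefit if this p-value exceeds
  * the stagewise significance level alpha_j.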
In I = D designs, the primary outcome test statistic is compared to the stopping boundaries at each stage, where one of the three outcomes set out in section "Design Specification" can occur. In I ≠ D designs with efficacy stopping boundaries for the D outcome, which utilize an I outcome for the lack-of-benefit analysis at interim stages, two sets of treatment effects, θ̂_jk^I (on I) and θ̂_jk^D (on D), are calculated, together with the corresponding test statistics and P-values. The outcome of the analysis is a decision to discontinue recruitment to a particular experimental arm for lack-of-benefit on I, to terminate the trial for efficacy on D, or to continue – see Sect. 2.1 in Blenkinsop and Choodari-Oskooei (2019) for further details on these decision rules.
At the final analysis J, the treatment effect is estimated on the primary outcome
for each experimental arm, and the observed P-value is compared against the final
stage significance level αJk. If the P-value is smaller than αJk, we reject the null
hypothesis corresponding to the definitive outcome and claim efficacy. Otherwise,
the corresponding null hypothesis cannot be rejected at the αJk level.
Section “Analysis Considerations” discusses the issue of analysis in more detail,
particularly the potential impact of the stopping rules on the average treatment effect.
Choosing Pairwise Design Significance Level and Power

The design stagewise type I error rate (α_jk) and power (ω_jk) are important in realizing a MAMS design. Together with the target effect sizes, they are the main drivers of the
stagewise sample sizes. The choice of their values is guided by two considerations.
First, it is essential to maintain a high overall pairwise power, ωk, in Eq. (2) in section
“Pairwise Type I Error Rate and Power” for each comparison in the trial. The
implication is that for testing the treatment effect at the interim analysis, the design
interim-stage power ωjk( j < J) should be high, for example, at least 0.95. For testing
the treatment effect on the definitive outcome, the design pairwise power at the final
stage, ωJk, should also be high, perhaps of the order of at least 0.90 which is higher
than many academic trials might select traditionally. The main cost of using a larger
number of stages is a (slight) reduction in the overall pairwise power (ωk). For
example, the overall pairwise power in the STAMPEDE trial with four stages is
about 0.84 under binding stopping rules for lack-of-benefit – see Table 1 for design
stagewise values for αjk and ωjk. Under the nonbinding rules ωk is equal to the final
stage design power ωJk, that is, 0.90. In section “Software and Example,” we
describe how to calculate the lower bound for the overall pairwise power in cases
where we do not have an estimate of the between-stage correlation structure.
Second, given the design stagewise power ωjk, the values chosen for the αjk
largely govern the effective sample sizes required to be seen at each stage and the
stage durations. Generally, larger-than-traditional (more permissive) values of αjk are
used at the interim stages, because a decision can be made on dropping/continuing
arms reasonably early, that is, with a relatively small sample size. It is necessary to
use descending values of α_jk; otherwise some of the stages become redundant. Royston et al. (2011) suggested a geometrically descending sequence of α_jk values, starting at α_1k = 0.5 and taking α_jk = 0.5^j for j < J, with α_Jk = 0.025. The latter mimics the conventional 0.05 two-sided significance level for tests on the D outcome. Further, Bratton (2015) proposed a family of α-functions and a systematic search procedure to find the stagewise (design) significance levels; these have been implemented in the nstagebinopt Stata program to find efficient MAMS designs (Choodari-Oskooei et al. 2022a). Section "Considerations in Design, Conduct, and Analysis of a MAMS Trial" discusses the relevant issues around the timing and frequency of interim analyses in more detail, including both design and trial conduct implications. Section "Design Considerations" addresses some of the challenges in choosing the stagewise (design) power and significance levels to increase the efficiency of a MAMS design.
Intermediate and Definitive Outcomes

The MAMS framework by Royston et al. (2003) allows the use of an intermediate (I) outcome at the interim stages, which can speed up the weeding out of insufficiently promising treatments. This markedly increases the efficiency of the design, since recruitment to the unpromising arms will be discontinued much faster than otherwise. Choosing appropriate and valid intermediate and definitive (D) outcomes is key to the success of the MAMS design (Royston et al. 2011).
The basic assumptions are that "information" on I accrues at the same or a faster rate than information on the D outcome, where information is defined as the
inverse of the variance of the treatment effect estimator. Another assumption is that
the I outcome is on the pathway between the treatments and the D outcome. If the
null hypothesis is true for I, it must also hold for D. In this setting, the I outcome
does not have to be a perfect or true surrogate outcome for the definitive outcome
as defined by Prentice (1989). In the absence of an obvious choice for I, a rational
choice of I at the interim stages might be D itself. As a result, in some settings the
efficiency of the design might be reduced since the interim analyses will be delayed
compared to when using I. In this case, each pairwise comparison resembles a
parallel group-sequential design. In the cancer context, typical intermediate and
definitive outcomes might be PFS and OS, respectively. Information on PFS is
usually available sooner in a study, and in many cancer sites, the treatment effect
on PFS is usually highly positively correlated with that on OS (Royston et al.
2011).
Operating Characteristics
The operating characteristics of a conventional two-arm trial are quantified using the
type I error rate and power of the design. In MAMS designs, type I error can be
controlled for each or a set (or family) of pairwise comparisons. These are quantified
by the pairwise (PWER) and familywise (FWER) type I error rates. The simplest
(and perhaps most useful) measure of power in a MAMS trial is the pairwise power
for the comparison of each experimental arm k against control. The correlation
structure between different test statistics at different stages is required to calculate
these quantities. In the following subsections, we explain how the correlation
structure can be estimated. We also define the pairwise/familywise type I error
rates and power for MAMS designs with (and without) an intermediate outcome measure.
Correlation Structure Between Pairwise Comparisons

In the STAMPEDE design, for example, the correlation between the test statistic of the FFS (I) outcome at interim stage j and that of the OS at the final stage will be calculated using \( \rho\,\sqrt{e^I_{jk}/e^D_{Jk}} \), j = 1, 2, 3, where ρ is the correlation between the estimated log hazard ratios on the two outcomes at a fixed time-point. Note that if the I and D outcomes are identical, then ρ = 1 and \( \rho\,\sqrt{e^I_{jk}/e^D_{Jk}} \) reduces to \( \sqrt{e_{jk}/e_{Jk}} \) (Royston et al. 2011). For other
outcomes a similar formula can be derived (Bratton et al. 2013 for binary outcomes
and Follmann et al. 2021 for continuous outcomes). Bootstrap analysis of individual
patient data from similar previous trials can be used to assess ρ in a particular setting
(Barthel et al. 2009). Section “Software and Example” explains how the correlation
structure can be calculated when an I outcome is used at the interim stages in the
STAMPEDE trial.
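One hedged way to implement such a bootstrap in Stata (our sketch; the variables pfs_time/pfs_event and os_time/os_event for the I and D outcomes, and the arm indicator treat, are hypothetical names in a previous trial's dataset):

  * Bootstrap the log hazard ratios on the I and D outcomes and correlate them.
  tempname sim
  postfile `sim' lhr_i lhr_d using boot_lhrs, replace
  forvalues b = 1/500 {
      preserve
      bsample
      quietly stset pfs_time, failure(pfs_event)
      quietly stcox treat
      local li = _b[treat]
      quietly stset os_time, failure(os_event)
      quietly stcox treat
      post `sim' (`li') (_b[treat])
      restore
  }
  postclose `sim'
  use boot_lhrs, clear
  correlate lhr_i lhr_d     // this correlation approximates rho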
Pairwise Type I Error Rate and Power

Under nonbinding stopping rules for lack-of-benefit, the maximum pairwise type I error rate for each pairwise comparison, α_max, is actually equal to the final stage significance level of the trial: α_max = α_J = 0.025. The familywise type I error rate (FWER) is computed under the global null hypothesis, H_0^G, that is, when the null hypothesis which maximizes pairwise alpha is true for all arms.
In a family of K "independent" pairwise comparisons, each with its own control group and a PWER of α_k, the overall type I error rate (FWER) is

\[
\begin{aligned}
\mathrm{FWER} &= \Pr\big(\text{reject at least one } H^k_0 \mid H^G_0\big) \\
&= \Pr\big(\text{reject } H^1_0 \text{ or } H^2_0 \ldots \text{ or } H^K_0 \mid H^G_0\big) \\
&= 1 - \Pr\big(\text{accept } H^1_0 \text{ and } H^2_0 \ldots \text{ and } H^K_0 \mid H^G_0\big) \\
&= 1 - \prod_{k=1}^{K} (1 - \alpha_k).
\end{aligned}
\tag{3}
\]
A Bonferroni correction can also be used to approximate the FWER. For example, if the family includes two (independent) pairwise comparisons, each with a (one-sided) PWER of α_1 = α_2 = 0.025, the FWER from Eq. (3) is 0.0494. In this case, the Bonferroni correction also provides a good approximation, that is, α_1 + α_2 = 0.05.
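As a further illustrative calculation of our own, a family of five independent comparisons (as in the original STAMPEDE design), each with a one-sided PWER of 0.025, would give

\[ \mathrm{FWER} = 1 - (1 - 0.025)^5 \approx 0.119, \]

slightly below the Bonferroni bound of 5 × 0.025 = 0.125.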
To allow for the correlation between the test statistics of different pairwise comparisons, one can replace the product term \( \prod_{k}(1-\alpha_k) \) in Eq. (3) with an appropriate quantity to reduce the FWER and gain some efficiency in scenarios where strong control of the FWER is required. Dunnett (1955) developed an analytical formula to calculate the FWER in multi-arm trials which takes care of this correlation structure. For designs with nonbinding stopping boundaries for lack-of-benefit, the maximum FWER can be computed using the Dunnett probability (Bratton et al. 2016), which in its simplest (single-stage) form, with between-comparison correlation ρ = A/(A + 1) as in Table 2 and φ(·) the standard normal density, is

\[ \mathrm{FWER} = 1 - \int_{-\infty}^{\infty} \phi(u) \prod_{k=1}^{K} \Phi\!\left(\frac{z_{\alpha_J} + \sqrt{\rho}\,u}{\sqrt{1-\rho}}\right) du. \tag{4} \]

Allowing for the correlation in this way yields a slightly less stringent design significance level (0.0126) than the corresponding Bonferroni correction, which results in lower sample size, hence increasing the efficiency of the design.
In multi-arm designs, two other types of power can be calculated: any-pair and
all-pair powers. Any-pair power is the probability of showing a statistically signif-
icant effect under the targeted effects for at least one comparison, and all-pairs power
is the probability of showing a statistically significant effect under the targeted
effects for all comparison pairs. The three measures of power will be identical in a
two-arm trial, but when considering a multi-arm design the power measure of
interest may depend on the objective of the trial (Choodari-Oskooei et al. 2020).
For more complex designs with both overwhelming efficacy and lack-of-benefit
stopping boundaries or designs where a new experimental arm is added at a later
stage, Eq. (4) can become quite complicated (Blenkinsop et al. 2019; Choodari-
Oskooei et al. 2020).
MAMS Selection Designs

In a MAMS design, all experimental arms can reach their final stage of recruitment if they pass each interim analysis activity boundary. As a result, the number of experimental arms recruiting at each stage cannot be predetermined, and the actual sample size of the trial can vary considerably, its maximum occurring when all the treatment arms reach the final stage. In trials with a large number of experimental arms, the maximum sample size might be too large either to achieve or for any funding agency to fund. In these cases, it may be more appropriate to prespecify the number of experimental arms that will be taken to each stage, alongside a criterion for selecting them. One example of such designs is the ROSSINI-2 surgical trial – see Fig. 2.
The selection of research arms can be made based on ranking of treatment effects,
or a combination of efficacy and safety results. Traditionally, selection of the most
promising treatments has been made in phase II trials where the strict control of
operating characteristics is not a particular concern. In a MAMS selection design, the
selection and confirmatory stages are implemented within one trial protocol, and
selection of the most promising treatments can be made in multiple stages – see
Fig. 2. Patients will be randomized from the start to all the experimental and control
arms, and the primary analysis of the experimental arms that reach the final stage
includes all randomized individuals from the start. The advantages of MAMS selection designs are (lower) maximum sample sizes and simpler planning. However, the selection process might become complicated if all arms look promising at the selection stage, which might affect the operating characteristics of the design (Stallard et al. 2015). In that case, though, interest lies less in the individual arms than in the process of carrying an appropriate treatment forward to the next stage of the trial.
In MAMS selection designs, the primary aim is to select the most promising treatments with a high probability of correct selection, while maintaining the strong control of the error rates required in the phase III setting. The probability of correct selection is driven by the underlying treatment effects, the timing of selection, and the number of comparisons.

Fig. 2 Schematic representation of the ROSSINI-2 selection design with seven experimental arms and three stages. There are two prespecified subset selection stages in this trial

Stallard et al. (2015) developed analytical derivations for the type I
and II error rates in a two-stage design – more details are in Stallard and Todd
(2003). Further research is needed to explore the operating characteristics of the
selection designs when an I-outcome measure is used for the selection of best
performing arms. In this case, the power of the design might be adversely affected
if the rankings of treatment effects on the I and D outcomes are not similar across
experimental treatments. This might also affect the average treatment effects in the
arms that reach the final stage. Simulations can be done to explore the operating
characteristics of the design as well as the extent of bias in the average treatment
effects. Key practical issues to consider in the simulations are the timing, the
selection criteria (e.g., ranking), and the number of experimental arms selected at
each stage.
When there are several experimental treatments, it is sometimes more efficient to
reduce the number of selected arms in multiple stages (Wason et al. 2017). Other-
wise, the probability of correct selection and power of the design might be adversely
affected. An example is the ROSSINI-2 design where treatment selection has been
done in two stages. It has been shown that implementing treatment selection in the
ROSSINI-2 trial could reduce the maximum sample size by up to 7% (by 370
patients) compared to the MAMS design with no selection imposed, without
adversely affecting the operating characteristics of the trial. Simulations can be
used to explore the impact of the number of treatment arms being selected at each
stage on the operating characteristics of the design.
In MAMS selection designs, power is only lost under extreme selection criteria. For example, in the ROSSINI-2 trial with seven research arms, at least four arms should be selected at the first interim analysis and at least three arms at the second stage to preserve power; in this case, the probability of selecting the truly best arm remains above 90%. The timing of the first selection stage is also important. The probability of correct selection is generally low if the selection of treatment arms is done too early in the course of the trial – for example, earlier than 15% of information time in some settings; see Stallard and Todd (2003) and Choodari-Oskooei et al. (2022b). For a design such as ROSSINI-2, it has been suggested not to select before a fifth of the total planned patients have been recruited.
Adding New Research Arms and Comparisons

Phase III randomized clinical trials can take several years to complete in some disease
areas, requiring considerable resources. During this time, new promising treatments
may emerge which warrant testing. The practical advantages of incorporating new
experimental arms into an existing trial protocol have been clearly stated in previous
studies, not least because it obviates the often lengthy process of initiating a new trial
and competing between trials to recruit patients (Ventz et al. 2017; Schiavone et al.
2019; Hague et al. 2019). Funding bodies and scientific committees may wish to
strategically encourage such collaboration. The MAMS design framework can be
implemented as a platform trial. One such example is the STAMPEDE trial which
has incorporated five new pairwise comparisons with more to follow, each starting
accrual more quickly than the original comparisons (Sydes et al. 2012).
When adding a new experimental arm, the two major (statistical) considerations
are as follows. First, the decision whether to control the type I error rate for both the
existing and new comparisons should be made, that is, multiplicity adjustment and
control of the FWER. The decision to focus control on the PWER or the FWER (for
a set of pairwise comparisons) depends on the type of research questions being posed
and whether they are related in some way, for example, testing different doses or
duration of the same therapy in which case the control of the FWER is required.
These are mainly practical considerations and should be determined on a case-by-
case basis in the light of the rationale for the hypothesis being tested and the aims of
the protocol for the trial. Second, how the type I error rate for a set of comparisons
can be calculated if the strong control of the FWER is required in this setting.
Choodari-Oskooei et al. (2020) developed a set of guidelines that can be used to
decide whether multiplicity adjustment is necessary when adding a new experimen-
tal arm – that is, whether to control the PWER or the FWER in a particular design,
see Fig. 2 in Choodari-Oskooei et al. (2020). The emerging consensus among the
broader scientific community is that in most multi-arm trials where the rationale for
research treatments (or combinations) in the existing and added comparisons is
derived separately, the greater focus should be on controlling each pairwise error
rate (Parker and Weir 2020; O’Brien 1983; Cook and Farewell 1996). In designs
where the FWER for the protocol as a whole is required to be controlled at a certain
level, then the overall type I error can be split accordingly between the original and
added comparisons, and each can be powered using their allocated type I error rate.
To calculate the overall type I error, the Dunnett probability in Eq. (4) can be
extended to control the FWER when new experimental arms are added to a MAMS
(platform) trial (Choodari-Oskooei et al. 2020). The idea is to adjust the correlated
test statistics by a factor that reflects the size of the shared control group that is used
in the pairwise comparisons. This allows the calculation of the operating character-
istics for a set of pairwise comparisons in a platform trial setting with both planned
and unplanned addition of a new experimental arm. Choodari-Oskooei et al. showed
that the FWER is driven more by the number of pairwise comparisons in the family
than by the timing of the addition of the new arms. The shared control arm
information common to comparisons (i.e., shared control arm individuals in contin-
uous and binary outcomes and shared control arm primary outcome events in time-
to-event outcomes), and the allocation ratio, are required to calculate the FWER.
However, the FWER can be approximated using the Bonferroni correction if there is no
substantial overlap between the new comparison and the existing ones, or
when the correlation between the test statistics of the new comparison and those of
the existing comparisons is less than 0.30, that is, the correlation in the last column of
Table 2. Finally, in a recent review, Lee et al. (2021) addressed different statistical
considerations that arise when adding a new research arm to platform trials.
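To give a rough sense of when the Bonferroni approximation is adequate, the sketch below (ours, with illustrative numbers rather than values from any trial) computes the Bonferroni-type upper bound 1 − (1 − α)^K on the FWER for K comparisons, each tested at one-sided level α:

* Bonferroni-type upper bound on the FWER for K = 5 pairwise comparisons,
* each tested at a one-sided alpha of 0.025 (illustrative values)
display 1 - (1 - 0.025)^5
* ~0.119; an exact calculation accounting for the shared control arm
* (Choodari-Oskooei et al. 2020) gives a smaller value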
The designs described in this chapter can be implemented using the nstage suite of
Stata commands, installed by issuing the following command to Stata: ssc install nstage.
Note that in this section we use a new (beta) version of the nstage program, which gives
more accurate sample sizes than previously reported. The new version will also be
available on the SSC archive.
We demonstrate how the nstage command can be used to design the STAM-
PEDE trial with time-to-event I and D outcomes. To illustrate the design, we
follow the same steps in section “Steps to Design a MAMS Trial.” The design
parameters in the original comparisons of the STAMPEDE trial were as follows.
In this MAMS trial, five experimental treatments were chosen to be compared
against the control arm in four stages, that is, specified using the arms(6 6 6 6) and
nstage(4) options in the nstage Stata command below. Each comparison was powered
to detect a target hazard ratio of 0.75 on both the I and D outcome measures, hr0
(1 1) hr1(0.75 0.75). From previous studies, an estimate of the correlation
between survival times on the I (excluding D) and D outcomes was available,
corr(0.60), as well as the median survival (in years) for the I and D outcomes, t
(2 4). Patients were allocated to the control arm with a 2:1 ratio, aratio(0.5), to
increase power. The accrual rate was assumed to be 500 patients per year in all
stages, accrue(500 500 500 500). The stopping boundaries for the lack-of-benefit
were chosen as 0.5, 0.25, 0.10, and 0.025, alpha(0.50 0.25 0.10 0.025). The
original design only included lack-of-benefit stopping boundaries – see Table 1.
Here, we also include the Haybittle-Peto efficacy stopping boundary for illustra-
tion, esb(hp).
Given the above design parameters, nstage calculates the stagewise sample sizes
and overall operating characteristics of the design. The first table after the nstage
command shows the stagewise design specifications and the overall operating
characteristics of the trial. The second and third columns of the operating character-
istics table report the chosen stopping boundaries required for stopping for lack-of-
benefit and efficacy at each interim stage. In this example, each stage requires
p ≤ 0.0005 on the definitive outcome measure to declare efficacy early, that is, the
Haybittle-Peto rule, shown under the column Alpha(ESB). There is no efficacy
boundary for the final stage, since it is equal to the final stage boundary for lack-of-
benefit, denoted in the column Alpha(LOB). Assuming exponential distribution
(and proportional hazards) for the FFS and OS, and based on the given median
survivals, the first of three interim looks is expected at 2.44 years and the final
analysis at 7.16 years. Since the design includes multiple pairwise comparisons, the
output also presents the maximum FWER, defined in Eq. (4), as the type I error
measure of interest. The upper bound for the overall pairwise power is 0.90 because
it assumes non-binding stopping boundaries for lack-of-benefit, under which the type I error
rate is maximized. However, if we have an estimate of the
correlation between the FFS and OS (log) hazard ratios, ρ, we can calculate a lower
bound for the pairwise power using Eq. (2) with correlation matrix R4. We refer to
this as the correlation between treatment effects on I and D within the trial, not across
cognate trials. The $(j, j')$th entry of $R_J$ for the interim stages, that is, the correlation
between the estimated FFS (log) hazard ratios at interim stages $j$ and $j'$, can be calculated
from the formula presented in Table 2, that is, $\sqrt{e^I_{jk} / e^I_{j'k}}$ for $j' > j$ and
$j, j' = 1, 2, 3$, where $e^I_{jk}$ is the number of I-outcome events at interim stage $j$ for the
$k$th comparison. Specifically, the correlation of the FFS (log) hazard ratios is time-dependent
and its value depends on the accumulated numbers of events at different times. However, the
correlation between the FFS (log) hazard ratios at the interim stages and that of the OS at the
final stage should be calculated as $\rho \sqrt{e^I_{jk} / e^D_{Jk}}$, $j = 1, 2, 3$ (Royston et al.
2011). For other outcomes a similar formula can be derived, for example,
$\rho \sqrt{n_{jk} / n_{Jk}}$ for continuous outcomes; see Follmann et al. (2021). In STAMPEDE,
assuming a correlation of ρ = 0.60 between the FFS and OS (log) hazard ratios, the $R_4$
correlation matrix is

$$
R_4 = \begin{pmatrix}
1 & 0.71 & 0.57 & 0.31 \\
0.71 & 1 & 0.80 & 0.43 \\
0.57 & 0.80 & 1 & 0.54 \\
0.31 & 0.43 & 0.54 & 1
\end{pmatrix}
$$
With this correlation structure, the lower bound for the overall pairwise power, ω,
is 0.83. If we do not have an estimate of the correlation between the treatment effects
on I and D, the lower bound for the overall pairwise power can be calculated
assuming ρ = 0, that is, no correlation between the I and D outcome measures.
In this case, the lower bound for the overall pairwise power is 0.77
(= 0.95 × 0.95 × 0.95 × 0.90). The all-pairs and any-pairs power can also be obtained with the
return list command (output not shown).
nstage, nstage(4) alpha(0.5 0.25 0.1 0.025) omega(0.95 0.95 0.95 0.9) hr0(1 1)
hr1(0.75 0.75) accrue(500 500 500 500) arms(6 6 6 6) t(2 4) corr(0.60) aratio(0.5)
esb(hp)
Operating characteristics

Stage  Alpha(LOB)a  Alpha(ESB)a  Power  HR|H0  HR|H1  Crit.HR(LOB)  Crit.HR(ESB)  Lengthb  Timeb
1      0.5000       0.0005       0.950  1.000  0.750  1.000         0.439         2.436    2.436
2      0.2500       0.0005       0.950  1.000  0.750  0.920         0.512         1.189    3.625
3      0.1000       0.0005       0.950  1.000  0.750  0.882         0.553         1.161    4.786
4      0.0250       .            0.900  1.000  0.750  0.840         .             2.375    7.161

Max. Pairwise Error Rate 0.0256    Pairwise Power 0.9000
Max. Familywise Error Rate (SE) 0.1056 (0.0003)

LOB lack of benefit, ESB efficacy stopping boundary
a All alphas are one-sided
b Length (duration of each stage) is expressed in periods and assumes survival times are exponentially distributed. Time is expressed in cumulative periods
...
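As a quick arithmetic check of the ρ = 0 lower bound quoted above (a sketch of ours), the overall pairwise power is simply the product of the stagewise powers:

* Lower bound for the overall pairwise power when rho = 0
display 0.95 * 0.95 * 0.95 * 0.90
* 0.77 to two decimal places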
Although the focus of the STAMPEDE trial was on the strong control of
the PWER, we demonstrate how the FWER could be controlled using this design.
The following command specifies that interim analyses should assess for efficacy and
the program should search for a design which controls the FWER at a maximum of
2.5%. The other design parameter inputs and options remain the same. The option for
controlling the FWER identified the final stage αJ required to ensure a maximum
FWER of 2.5% as 0.0043, which lengthened the time to final analysis. This could be
addressed by lengthening accrual, for instance, or an additional interim analysis could
be included in this case. Further simulations should be carried out to quantify the
impact of such changes on the operating characteristics of the design and trial
timelines, particularly on power, which can be marginally reduced. Moreover, the
trial timelines presented in the outputs assume an exponential distribution for both outcomes,
which is restrictive. However, the key design quantities, that is, the numbers of control-arm
events needed to trigger the interim and final analyses, assume only proportional
hazards for the I and D (log) hazard ratios. So, if the exponential assumption is
breached, the effect is only to reduce the accuracy of the projected times for each stage. In
practice, it is helpful to visually represent time to analysis and accrual using diagrams.
The following output shows the sample sizes required for the final stage of
the design, which has changed to achieve control of the FWER. In some settings,
the interim stage stopping boundaries also have to be updated to achieve control of the
FWER, for example, by choosing more stringent (lower) P-value thresholds for the
efficacy stopping boundaries. The number of control arm D-outcome events required
for the stage 4 analysis should be increased from 437 to 629 to ensure control of the
FWER at 2.5%. This 44% increase in the number of events required would require
substantially greater resources; for this reason investigators should consider care-
fully at the design stage whether control of the FWER is the focus of the design, or
that of the PWER.
nstage, nstage(4) alpha(0.5 0.25 0.1 0.025) omega(0.95 0.95 0.95 0.9) hr0(1 1)
hr1(0.75 0.75) accrue(500 500 500 500) arms(6 6 6 6) t(2 4) corr(0.60) aratio(0.5)
esb(hp) fwercontrol(0.025)
Operating characteristics

Stage  Alpha(LOB)a  Alpha(ESB)a  Power  HR|H0  HR|H1  Crit.HR(LOB)  Crit.HR(ESB)  Lengthb  Timeb
1      0.5000       0.0005       0.950  1.000  0.750  1.000         0.439         2.436    2.436
2      0.2500       0.0005       0.950  1.000  0.750  0.920         0.512         1.189    3.625
3      0.1000       0.0005       0.950  1.000  0.750  0.882         0.553         1.161    4.786
4      0.0043       .            0.901  1.000  0.750  0.824         .             4.164    8.950

Max. Pairwise Error Rate 0.0054    Pairwise Power 0.8999
Max. Familywise Error Rate (SE) 0.0252 (0.0002)

LOB lack of benefit, ESB efficacy stopping boundary
a All alphas are one-sided
b Length (duration of each stage) is expressed in periods and assumes survival times are exponentially distributed. Time is expressed in cumulative periods
Sample size and number of events
...

Stage 4      Overall   Control   Exper.
Arms         6         1         5
Acc. rate    500       143       357
Patientsa    4475      1279      3196
Eventsb      1939      629       1310

a Patients are cumulative across stages
b Events are cumulative across stages but are only displayed for those arms to which patients are still being recruited. Events are for the I-outcome at stages 1–3 and the D-outcome at stage 4
Design Considerations
The number of experimental arms that can be practically included in the trial is one
important consideration. There is no optimal number for this in a MAMS design as
the main drivers for the number of arms are: the number of treatments that are ready
and available for testing; the number of patients available; and the cost of undertak-
ing the protocol.
The criterion for stopping early for overwhelming benefit usually focuses on the need for a small P-value for the
“treatment effect” on the primary outcome measure. Furthermore, stopping for an
overwhelming benefit has direct implications, for the control arm in particular, and
potentially all of the other research arms since it will affect the assessment of other
pairwise comparisons. If the efficacious arm is found to be unsafe, both any-pair and
all-pair powers will be reduced. The reduction in power can be overcome by
increasing the sample size for the other comparisons using the conditional error
approach (Jaki and Magirr 2013).
In a MAMS design, the stagewise sample sizes drive the timings of the interim
analyses. In trials with continuous and binary outcomes, the timing of interim
analysis is typically based on observing a prespecified fraction of total sample size
required for the final analysis. However, in superiority designs with time-to-event
outcomes, the timing of the interim analysis can be based on a prespecified
fraction of the total number of events in the control arm. There are two reasons for this
approach. Firstly, an event rate different to that anticipated for the trial overall, across
all arms, could either arise due to a different underlying event rate in all arms or due
to a hazard ratio different to that targeted initially. This level of ambiguity is removed
by using the control arm event rate as the deciding factor for when to conduct the
analysis. Secondly, when more than one experimental arm is recruited to, it is
unlikely that we shall observe the same hazard ratio in all comparisons, giving
different total numbers of events for each comparison. It is practically expedient
for pairwise comparisons started at the same time to have their interim analyses at the
same time. However, the calculation for the overall number of events assumes the
same event rate in all comparisons in the experimental arms. Moreover, Dang et al.
(2020) showed that monitoring the control arm events provides unbiased estimates
of the (Fisher) “information fraction” in group sequential trials with time-to-event
outcomes.
Another important design consideration is the choice between the control of the
PWER or FWER. This is a major decision which drives the required patients and the
cost of undertaking the protocol. Consensus is emerging that the most important
consideration to decide whether to control the PWER or FWER is the relatedness of
the research questions in each pairwise comparison (Choodari-Oskooei et al. 2020;
Parker and Weir 2020; Proschan and Waclawiw 2000). There are cases such as
examining different doses (or duration) of the same drug where the control of the
FWER might be necessary to avoid offering a particular therapy an unfair advantage
of showing a beneficial effect. However, in most multi-arm trials where the rationale
for research treatments (or combinations) is derived separately, the greater focus
should be on controlling each pairwise error rate.
Furthermore, the sample size and power calculations in trials with time-to-event
outcomes assume the treatment effects follow proportional hazards (PH). In general,
if the PH assumption is false, power is reduced and interpretation of the hazard ratio
(HR) as the estimated treatment effect is compromised (Royston and Parmar 2020).
For example, when there is an early treatment effect – where the HR is <1 in the
early follow-up and increases later – the research treatment may pass the interim
stage lack-of-benefit threshold. But, this may cause important lack of power at the
final analysis of the D outcome. More serious are late treatment effects, where lack-of-benefit
is likely to appear at the intermediate stages even when there is a demonstrable
treatment effect at the final stage. This is a generic problem of trials with time-to-
event outcomes, and is an area of ongoing methodological research.
Finally, sample size calculations in trials with continuous and binary outcomes
depend on the outcome variance and the underlying control arm event rate. The
values of these parameters are generally specified based on previous studies. Depar-
ture from the assumed values can adversely affect the operating characteristics of a
MAMS design. The impact on power increases with the number of stages and when
the outcome variance or the control arm event rate is overestimated (Mehta and
Tsiatis 2001). It is, therefore, necessary to assess these design assumptions during the
trial. Several methods have been proposed in the literature (Betensky and Tierney
1997; Proschan 2005). One common approach, which has minimal impact on the
operating characteristics, is to recalculate sample size using the revised values for
these parameters without unblinding the treatment effect (Proschan 2005).
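As a minimal sketch of such a recalculation for a continuous outcome (illustrative numbers, not from any specific trial), the standard two-arm formula can simply be re-evaluated with the revised standard deviation:

* Per-arm sample size before and after a blinded update of the outcome SD
* (target difference 5; two-sided alpha = 0.05; power = 0.9)
display ceil(2 * (invnormal(0.975) + invnormal(0.90))^2 * 10^2 / 5^2)
display ceil(2 * (invnormal(0.975) + invnormal(0.90))^2 * 12^2 / 5^2)
* 85 and 122 patients per arm, respectively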
Conduct Considerations
Besides decisions on the statistical aspects of MAMS studies, there are a number of
practical issues to consider when conducting a MAMS study which are not the focus
of this chapter. They have been extensively discussed in the literature (James et al.
2008, 2012; Sydes et al. 2009, 2012; Schiavone et al. 2019; Hague et al. 2019). Such
challenges include, for example, ensuring adequate supply of the treatments under
investigation, which is much more complex due to the stochastic nature of the
demand on individual treatments; appropriately informing potential participants
before the trials; updating them with new information; and setting up and managing
scientific oversight committees. This highlights the need to garner large-scale col-
laboration bringing large parts of the research community together, to obtain signif-
icant and long-term funding, to obtain long-term commitment from the key research
leaders, to ensure that responsibilities (and also acclaim) are shared as widely as
possible, and to have operational structures and systems which allow the implemen-
tation of such long-term adaptive protocols. These challenges need to be addressed
when the protocol is at the design stage, as they will need to be resolved before any
funding is likely to be approved and released.
Analysis Considerations
A further challenge arises when estimating the treatment effects at the interim
analyses or the end of the study. While it is relatively easy to define statistical
bias, different definitions of an unbiased estimator are relevant in the MAMS design
(Robertson et al. 2021). An estimator is unconditionally unbiased, if it is unbiased
when averaged across all possible realizations of an adaptive trial. In contrast, an
estimator is conditionally unbiased if it is unbiased only conditional on the observed
adaptations, for example, on the comparison having continued to the final stage. In the
MAMS setting, bias in the estimate of treatment effect for comparisons that reach the final
stage is a more major consideration than that in comparisons stopped early.
Several unbiased estimators of the treatment effect have been proposed to correct
for selection bias in these cases (Stallard and Kimani 2018; Bowden and Glimm
2008; Sill and Sampson 2007). These were mostly proposed for two-stage designs with
continuous, conditionally normal outcome variables. However, the proposed unbi-
ased estimators might not be preferred to the slightly biased standard (MLE)
estimator because their mean square errors are likely to be larger. Simulations should
be used in these cases under different target treatment effects to assess the degree of
selection bias and probability of stopping under realistic treatment effect sizes.
Robertson et al. (2021) provide a comprehensive overview of proposed approaches
to remove or reduce the potential bias in point estimation of treatment effects in an
adaptive design, as well as illustrating how to implement them. They also propose a
set of guidelines for researchers around the choice of estimators and the reporting of
estimates following an adaptive design (Robertson et al. 2022). Moreover, the construction
of appropriate (simultaneous) confidence intervals is more complex, and specialized
methods need to be considered.
Finally, it is useful to note that, when analyzing the trial at the interim stages, power
may be increased, as in any trial, by adjusting for covariates and stratification
factors. Since the early stages of a MAMS trial will contain relatively few patients,
the trial population across the arms is more likely to be unbalanced in terms of
potentially confounding covariates such as age. Accounting for these known, influ-
ential covariates in the analysis may increase the robustness of the results.
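As a minimal sketch (hypothetical variable names), a covariate-adjusted analysis of one pairwise comparison with a time-to-event outcome might take the form:

* Covariate-adjusted Cox analysis of one pairwise comparison (names hypothetical)
stset time, failure(event)
stcox i.arm age i.region    // adjust for known influential covariates and stratification factors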
Summary

The MAMS designs described in this chapter are based on cumulative test statistics,
computed from all of the data accumulated up to each analysis. An alternative approach
combines multiplicity-adjusted P-values from the independent stages; such combination
designs are generally less efficient than designs based on cumulative test statistics.
Mehta and Patel (2006) discussed the pros and cons of this approach and showed that the
greater flexibility of these designs comes at the cost of large increases in expected sample
sizes. Recently, Ghosh et al. (2020) extended the cumulative MAMS designs
(with I = D) to permit data-dependent adaptations such as sample size
re-estimation, and compared them with those based on combining independent
multiplicity adjusted P-values from the different stages. They showed that the
power gain from the cumulative test statistics approach can be substantial, by up
to 18%, and increases with the heterogeneity of underlying treatment effects. They
also showed that the power gain is larger for designs with extreme interim stopping
boundaries, that is, when it is more difficult to drop arms. Their findings are
consistent with results published in Koenig et al. (2008), Friede and Stallard
(2008), and Magirr et al. (2014).
This chapter described a class of multi-arm multi-stage trial designs incorpo-
rating repeated tests for both lack-of-benefit and efficacy of a new treatment
compared with a control regimen. Importantly, the interim lack-of-benefit analysis
can be done with respect to an intermediate outcome measure at a relaxed signif-
icance level. If carefully selected, such an intermediate outcome measure can
further increase the efficiency of the design compared to the other alternatives
where the same primary outcome is used at the interim stages (Ghosh et al. 2020).
This chapter demonstrated the mathematical calculation of the operating charac-
teristics of the designs with/without an intermediate outcome at interim stages, and
outlined advantages of the MAMS design over other alternatives. It demonstrated
how the MAMS design speeds up the evaluation of new treatment regimens in
phase II and III trials.
Key Facts
Cross-References
Acknowledgments We are grateful to Professor Ian White for his helpful comments on the earlier
version of this chapter. This work is based on research arising from MRC grants MC_UU_00004/09
and MC_UU_12023/29.
References
Abery JE, Todd S (2019) Comparing the MAMS framework with the combination method in multi-
arm adaptive trials with binary outcomes. Stat Methods Med Res 28(6):1716–1730. https://doi.org/10.1177/0962280218773546
Barthel FMS, Parmar MKB, Royston P (2009) How do multi-stage multi-arm trials compare to the
traditional two-arm parallel group design – a reanalysis of 4 trials. Trials. https://doi.org/10.1186/1745-6215-10-21
Betensky RA, Tierney C (1997) An examination of methods for sample size recalculation during an
experiment. Stat Med 16:2587–2598
Blenkinsop A, Choodari-Oskooei B (2019) Multiarm, multistage randomized controlled trials with
stopping boundaries for efficacy and lack of benefit: an update to nstage. Stata J 19(4):782–802
Blenkinsop A, Parmar MKB, Choodari-Oskooei B (2019) Assessing the impact of efficacy stopping
rules on the error rates under the MAMS framework. Clin Trials 16(2):132–142. https://doi.org/10.1177/1740774518823551
Bowden J, Glimm E (2008) Unbiased estimation of selected treatment means in two-stage trials.
Biom J 50(4):515–527
Bratton DJ (2015) PhD thesis: design issues and extensions of multi-arm multi-stage clinical trials.
UCL, London
Bratton DJ, Phillips PPJ, Parmar MKB (2013) A multi-arm multi-stage clinical trial design for
binary outcomes with application to tuberculosis. Med Res Methodol 13:139
Bratton DJ, Parmar MKB, Phillips PPJ, Choodari-Oskooei B (2016) Type I error rates of multi-arm
multi-stage clinical trials: strong control and impact of intermediate outcomes. Trials 17:309.
https://doi.org/10.1186/s13063-016-1382-5
Choodari-Oskooei B, Parmar MKB, Royston P, Bowden J (2013) Impact of lack-of-benefit
stopping rules on treatment effect estimates of two-arm multi-stage (TAMS) trials with time
to event outcome. Trials 14:23
Choodari-Oskooei B, Bratton DJ, Gannon MR, Meade AM, Sydes MR, Parmar MK (2020) Adding
new experimental arms to randomised clinical trials: impact on error rates. Clin Trials 17(3):
273–284. https://doi.org/10.1177/1740774520904346
Choodari-Oskooei B, Bratton DJ, Parmar M (2022a) Facilities for optimising and designing multi-
arm multi-stage (MAMS) randomised controlled trials with binary outcomes. Stata J, submitted
Magirr D, Stallard N, Jaki T (2014) Flexible sequential designs for multi-arm clinical trials. Stat
Med 33:3269–3279
Mehta CR, Patel NR (2006) Adaptive, group sequential and decision theoretic approaches to sample
size determination. Stat Med 25:3250–3269. https://doi.org/10.1002/sim.2638
Mehta C, Tsiatis A (2001) Flexible sample size considerations using information-based interim
monitoring. Drug Inf J 35(4):1095–1112. https://doi.org/10.1177/009286150103500407
Meyer EL, Mesenbrink P, Mielke T, Parke T, Evans D, Konig F, EU-PEARL (EU Patient-cEntric
clinicAl tRial pLatforms) Consortium (2021) Systematic review of available software for multi-
arm multi-stage and platform clinical trial design. Trials 22:183. https://doi.org/10.1186/s13063-021-05130-x
MRC Clinical Trials Unit at UCL. RAMPART Trial. https://www.rampart-trial.org/
O’Brien PC (1983) The appropriateness of analysis of variance and multiple-comparison pro-
cedures. Biometrics 39(3):787–788
Parker RA, Weir CJ (2020) Non-adjustment for multiple testing in multi-arm trials of distinct
treatments: rationale and justification. Clin Trials 17(5):562–566. https://doi.org/10.1177/1740774520941419
Parmar MK, Barthel FM, Sydes M, Langley R, Kaplan R, Eisenhauer E, Brady M, James N,
Bookman MA, Swart AM, Qian W, Royston P (2008) Speeding up the evaluation of new agents
in cancer. J Natl Cancer Inst 100(17):1204–1214
Parmar MKB, Carpenter J, Sydes MR (2014) More multiarm randomised trials of superiority are
needed. Lancet 384(9940):283–284. https://doi.org/10.1016/S0140-6736(14)61122-3
Piantadosi S (2005) Clinical trials: a methodologic perspective, 2nd edn. Wiley, New York
Posch M, Koenig F, Branson M, Brannath W, Dunger-Baldauf C, Bauer P (2005) Testing and
estimation in flexible group sequential designs with adaptive treatment selection. Stat Med 24:
3697–3714. https://doi.org/10.1002/sim.2389
Prentice RL (1989) Surrogate endpoints in clinical trials: definition and operational criteria. Stat
Med 8(4):431–440. https://doi.org/10.1002/sim.4780080407
Proschan MA (2005) Two-stage sample size re-estimation based on a nuisance parameter: a review.
J Biopharm Stat 15(4):559–574. https://doi.org/10.1081/BIP-200062852
Proschan MA, Waclawiw MA (2000) Practical guidelines for multiplicity adjustment in clinical
trials. Control Clin Trials 21:527–539
Robertson DS, Choodari-Oskooei B, Dimairo M, Flight L, Pallmann P, Jaki T (2021) Point
estimation for adaptive trial designs. Stat Med, under review. https://arxiv.org/abs/2105.08836
Robertson DS, Choodari-Oskooei B, Dimairo M, Flight L, Pallmann P, Jaki T (2022) Point
estimation for adaptive trial designs II: practical considerations and guidance. Stat Med, under
review. https://arxiv.org/abs/2105.08836
ROSSINI 2: Reduction of surgical site infection using several novel interventions trial protocol,
Tech. rep (2018). https://www.birmingham.ac.uk/Documents/college-mds/trials/bctu/rossini-ii/R0SSINI-2-Protocol-V1.0-02.12.2018.pdf
Royston P, Parmar MK (2020) A simulation study comparing the power of nine tests of the
treatment effect in randomized controlled trials with a time-to-event outcome. Trials 21:315.
https://doi.org/10.1186/s13063-020-4153-2
Royston P, Parmar MK, Qian W (2003) Novel designs for multi-arm clinical trials with survival
outcomes with an application in ovarian cancer. Stat Med 22(14):2239–2256
Royston P, Barthel FM, Parmar MK, Choodari-Oskooei B, Isham V (2011) Designs for clinical
trials with time-to-event outcomes based on stopping guidelines for lack of benefit. Trials 12:81
Schiavone F, Bathia R, Letchemanan K et al (2019) This is a platform alteration: a trial management
perspective on the operational aspects of adaptive and platform and umbrella protocols. Trials
20:264. https://doi.org/10.1186/s13063-019-3216-8
Sidak Z (1967) Rectangular confidence regions for the means of multivariate normal distributions.
J Am Stat Assoc 62(318):626–633
Sill MW, Sampson AR (2007) Extension of a two-stage conditionally unbiased estimator of the
selected population to the bivariate normal case. Commun Stat Theory Methods 36:801–813
Sequential, Multiple Assignment, Randomized Trials (SMART)

Contents
Introduction  1544
Dynamic Treatment Regimens  1544
Scientific Questions about DTRs  1547
Sequential, Multiple Assignment, Randomized Trials  1547
Returning to the Scientific Questions  1549
Other SMART Designs  1551
Power Considerations and Analytic Methods for Primary Aims  1553
Additional Considerations for Designing and Implementing a SMART  1555
Summary and Conclusion  1557
Key Facts  1557
Cross-References  1558
References  1558
Abstract
A dynamic treatment regimen (DTR) is a prespecified set of decision rules that
can be used to guide important clinical decisions about treatment planning. This
includes decisions concerning how to begin treatment based on a patient’s
characteristics at entry, as well as how to tailor treatment over time based on
the patient’s changing needs. Sequential, multiple assignment, randomized trials
(SMARTs) are a type of experimental design that can be used to build effective
dynamic treatment regimens (DTRs). This chapter provides an introduction to
DTRs, common types of scientific questions researchers may have concerning the
development of a highly effective DTR, and how SMARTs can be used to address
such questions. To illustrate ideas, we discuss the design of a SMART used to
answer critical questions in the development of a DTR for individuals diagnosed
with alcohol use disorder.
Keywords
Dynamic treatment regimen · Adaptive intervention · Tailoring variable ·
Sequential randomization · Multistage randomized trial
Introduction
A dynamic treatment regimen (DTR) is a sequence of decision rules that can be used
to guide how treatment can be adapted and readapted to the individual in clinical
practice settings. These treatment adaptations can be in terms of the type of treat-
ment, mode of treatment delivery, treatment intensity or dose, or other intervention
components. As with other types of manualized interventions, the decision rules that
make up a DTR are prespecified and well operationalized; this helps to ensure that
they can be replicated by future clinicians or evaluated by future researchers. DTRs
are also referred to as adaptive interventions (Lei et al. 2012; Nahum-Shani et al.
2012b), adaptive treatment strategies (Murphy 2005; August et al. 2016; Nahum-
Shani et al. 2017), treatment policies (Lunceford et al. 2002; Wahed and Tsiatis
2004, 2006), multistage treatments (Thall and Kyle Wathen 2005), and multicourse
treatment strategies (Thall et al. 2002).

Fig. 1 Schematic of an example DTR for an adult receiving treatment for alcohol use disorder.
Nonresponse to treatment is defined as two or more heavy drinking days during the 8-week initial
study period. (In the figure, responders continue on naltrexone alone; nonresponders receive
behavioral intervention + medical management + naltrexone.)
To make the idea of a DTR more concrete, consider as an example the treatment
of patients with alcohol use disorder. Naltrexone is a medication that diminishes the
pleasurable effects of alcohol (Oslin et al. 2006). Response to naltrexone is
heterogeneous due to factors such as poor patient adherence, biological response to the
medication, low social support, and poor coping skills (Nahum-Shani et al. 2017).
As a result of this heterogeneity, it is important to offer a supportive intervention
along with the naltrexone medication. One such intervention is medical manage-
ment, a face-to-face clinical support intervention that includes monitoring for adher-
ence to treatment. A more intensive clinical support intervention is the combined
behavioral intervention, which includes components which target adherence to
medication and enhance the patient’s motivation for change. The intervention also
involves the patient’s family, when possible, and reinforces abstinence by empha-
sizing social support (Longabaugh et al. 2005; Lei et al. 2012). Hereafter, we refer to
the combined behavioral intervention as simply “behavioral intervention.”
Figure 1 illustrates an example DTR that involves the use of naltrexone, medical
management, and behavioral intervention. In this example DTR, the patient is
offered naltrexone alongside medical management for up to 8 weeks, with weekly
check-ins with the clinician as a part of medical management. If, at any of the weekly
check-ins during this 8-week period, the patient reports experiencing two or more
heavy drinking days, the patient is identified as a nonresponder and is offered
behavioral intervention in addition to naltrexone and medical management. If
instead the patient does not experience two or more heavy drinking days during
the 8-week period, then, at week 8, the patient is identified as a “durable responder”
and continues treatment with naltrexone but without medical management (Lei et al.
2012).
There are four main components of a DTR, all of them prespecified: (1) decision
points, (2) treatment options, (3) tailoring variables, and (4) decision rules. Decision
points are times in a patient’s care where a treatment decision is made. They can
occur at scheduled intervals, after a specific number of clinic visits, or be event-
based, such as the point at which a patient fails to respond or adhere to a treatment.
The timing of decision points should be based on scientific or practical consider-
ations which inform when treatment may need to be modified. For instance, in
adolescent weight loss, clinicians typically evaluate response to treatment after
about 3 months: This suggests a decision point should be placed at about this time
(Naar-King et al. 2016).
The second component of a DTR is the collection of treatment options available
at each decision point. This set may include aspects of treatment such as type of
treatment, intensity of treatment, and/or delivery method; see Lei et al. (2012) for
detailed examples. It may also include strategies for modifying treatment, such as
augmenting or intensifying an intervention, or staying the course (Pelham Jr. et al.
2016). The set of possible treatment options can be different at each decision point.
The third component is the tailoring variables, which are used to individualize
(“tailor”) treatment at each decision point. These could be static characteristics, such
as age or other demographic factors, known co-occurring conditions, or other
characteristics collected at intake. Tailoring variables could also be time-varying
characteristics that may change based on previous treatments, disease severity,
treatment preferences, or adherence.
The fourth component in a DTR is the decision rules. At each decision point, a
decision rule takes in the values of the tailoring variables and recommends a
treatment option (or set of options). The collection of decision rules over all decision
points is what makes up a DTR (Murphy and Almirall 2009).
In the alcohol use disorder DTR depicted in Fig. 1, there are two decision points
from the perspective of the clinician. The first is when treatment begins. The second
decision point is the first time the patient is identified as a nonresponder during the
first 8 weeks of treatment, or when the patient is identified as a responder at week
8. In the example DTR, there is only a single treatment option at the first decision
point: naltrexone with medical management. At the second decision point, there are
two treatment options: naltrexone with medical management and behavioral inter-
vention or naltrexone alone. In this example, there is a single tailoring variable,
which is the number of heavy drinking days reported by the patient following the
start of the initial intervention. This information is used to inform whether a patient
remains a responder for 8 weeks or triggers the nonresponse criterion (two or more
heavy drinking days) within the 8 weeks. The decision rule at the first decision point
is to offer all patients naltrexone with medical management. The decision rule at the
second decision point recommends withdrawing responders from medical manage-
ment at week 8 and offering behavioral intervention in addition to existing treatment
to any patient that triggers a nonresponse within the 8 weeks.
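To make the decision-rule component concrete, the second decision point of this example DTR can be written as a short piece of code; the sketch below uses hypothetical Stata variable names (heavy2plus indicating two or more heavy drinking days within the 8 weeks):

* Second decision point of the example DTR in Fig. 1 (variable names hypothetical)
generate str60 stage2_tx = cond(heavy2plus == 1, ///
    "naltrexone + medical management + behavioral intervention", ///
    "naltrexone alone (medical management withdrawn at week 8)")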
DTRs also have applications to other clinical settings, for example, in prevention
medicine, implementation, or in special education. In prevention applications, DTRs
could help operationalize the transition between “universal” preventive interven-
tions, which target a large section of the population, and “selected” then “indicated”
preventive interventions, which target populations at progressively higher risk of
developing a disorder (August et al. 2016; Hall et al. 2019). Implementation focuses
on the uptake or adoption of evidence-based practices by systems of providers (e.g.,
clinics); here, a DTR can be used to guide how best to adapt (potentially costly)
organizational-level interventions that seek to improve the health of individuals at
the organization (Kilbourne et al. 2014, 2018; Quanbeck et al. 2020). In special
education, DTRs can be used to guide how best to adapt interventions designed to
support individual students, for example, minimally verbal children with autism
(Kasari et al. 2014).
Fig. 2 Schematic of the ExTENd SMART. Circled R indicates randomization; treatments are
boxed. The stringent definition of nonresponse is triggered when the participant reports 2 or more
heavy drinking days in 1 week; the lenient definition, 5 or more heavy drinking days. (In the
figure, participants are first randomized between the stringent and lenient nonresponse
definitions, each combined with naltrexone + medical management. Responders are rerandomized
between naltrexone alone (subgroups A, E) and naltrexone + telehealth (B, F); nonresponders are
rerandomized between behavioral intervention + medical management + placebo (C, G) and
behavioral intervention + medical management + naltrexone (D, H).)
As stated before, the goal of SMART designs is to aid the development of DTRs.
Data collected in a SMART can be used to answer questions concerning which
intervention option to provide at critical decision points during care. For example, in
Table 1 Embedded dynamic treatment regimens (DTRs) in the ExTENd SMART (Fig. 2). The
stringent definition of nonresponse is triggered when the participant reports 2 or more heavy
drinking days in 1 week; the lenient definition, 5 or more heavy drinking days

DTR 1: stage 1: naltrexone + medical management, stringent nonresponse; responders: naltrexone; nonresponders: behavioral intervention + medical management + placebo (subgroups A, C)
DTR 2: stage 1: naltrexone + medical management, stringent nonresponse; responders: naltrexone; nonresponders: behavioral intervention + medical management + naltrexone (subgroups A, D)
DTR 3: stage 1: naltrexone + medical management, stringent nonresponse; responders: naltrexone + telehealth; nonresponders: behavioral intervention + medical management + placebo (subgroups B, C)
DTR 4: stage 1: naltrexone + medical management, stringent nonresponse; responders: naltrexone + telehealth; nonresponders: behavioral intervention + medical management + naltrexone (subgroups B, D)
DTR 5: stage 1: naltrexone + medical management, lenient nonresponse; responders: naltrexone; nonresponders: behavioral intervention + medical management + placebo (subgroups E, G)
DTR 6: stage 1: naltrexone + medical management, lenient nonresponse; responders: naltrexone; nonresponders: behavioral intervention + medical management + naltrexone (subgroups E, H)
DTR 7: stage 1: naltrexone + medical management, lenient nonresponse; responders: naltrexone + telehealth; nonresponders: behavioral intervention + medical management + placebo (subgroups F, G)
DTR 8: stage 1: naltrexone + medical management, lenient nonresponse; responders: naltrexone + telehealth; nonresponders: behavioral intervention + medical management + naltrexone (subgroups F, H)
Other SMART Designs

The ExTENd SMART, in which all participants were randomized initially and both
responders and nonresponders were rerandomized, is just one type of SMART
design. The defining feature of a SMART is that at least some participants are
randomized more than once; below, we introduce three additional common
SMART designs. SMARTs may include more than two stages of randomizations
and provide more than two interventions at each randomization. However, for
simplicity, the three SMART designs described below have only two stages and
two intervention options at each randomization.
Many SMARTs use a so-called “prototypical” design in which all participants are
randomized in the first stage, but subsequent randomizations are restricted only to
nonresponders (Sherwood et al. 2016; August et al. 2016; Gunlicks-Stoessel et al.
2016; Naar-King et al. 2016; Pelham Jr. et al. 2016; Schmitz et al. 2018). A
schematic is given in Fig. 3. Note that the tailoring variable could be reversed so
Fig. 3 "Prototypical" SMART design. All participants are randomized in the first stage; only
nonresponders are rerandomized. There are four DTRs embedded in this design. (In the figure,
participants are first randomized between treatments A and B; responders receive C, while
nonresponders are rerandomized between D and E.)
that responders are the group that is rerandomized. This type of SMART design may
be helpful in a scenario in which there is an open scientific question about either
responders or nonresponders, but not both. For example, in the SMART described
by Pelham Jr. et al. (2016), participants who responded to first-stage treatment
continued on that treatment: The trial was not motivated by a question about
second-stage treatment for responders. Nonresponders, however, were rerandomized
between an intensified version of their first-stage intervention, or augmentation of
the intervention with another component. It should be noted that it is not necessary
that responders and nonresponders to different first-stage treatments be given the
same second-stage intervention options: Nonresponders to B, for instance, might be
rerandomized between treatments F and G.
In some contexts, scientific, practical, or ethical considerations limit the
treatment options available as follow-up to a particular first-stage intervention.
This type of consideration is accommodated by the SMART design described in
Fig. 4, in which participant rerandomization depends on both their response
status and previous treatment (Almirall et al. 2016; Kasari et al. 2014; Kilbourne
et al. 2014). In Fig. 4, participants who respond to treatment A are not
rerandomized, but the nonresponders are rerandomized. In the branch where
participants receive treatment B as their first stage treatment, no one is
rerandomized. This may be used if there are no practical or ethical treatment
options available to offer nonresponders to B, for example. In the SMART
described in Kasari et al. (2014), it was not feasible to rerandomize participants
who did not respond to one of the initial interventions. For these participants, the
only feasible option was to intensify their initial treatment. There are three DTRs
embedded in this type of SMART.
79 Sequential, Multiple Assignment, Randomized Trials (SMART) 1553
Stage 1 Stage 2
Responders
C
A
D
R
Non-Responders
R E
Responders
F
B
G
Non-Responders
Fig. 4 A SMART design in which only nonresponders to a particular first-stage treatment are
rerandomized. There are three DTRs embedded in this design
Not all SMARTs involve restricted randomization: In some designs, all par-
ticipants are rerandomized regardless of their response to previous treatment
(Fig. 5) (Chronis-Tuscano et al. 2016). In this scenario, investigators might
collect information on one or more candidate-tailoring variables but do not use
them when rerandomizing: The tailoring variable is not embedded in the design.
In the SMART shown in Fig. 5, all participants are randomized to either treatment
A or B, and then rerandomized to either treatment C or D regardless of their
response to first-stage treatment. There are four treatment paths embedded in this
design, but because there is no embedded tailoring variable, these are not DTRs
per se. These so-called “unrestricted” SMARTs are sequential, fully-crossed,
2 × 2 factorial designs. In this case, the factors are A versus B at stage 1 for all
participants, crossed with C versus D at stage 2 for all individuals. However, as
above, second-stage treatment options may depend on the first randomization
(i.e., individuals who receive B might be rerandomized between E and F rather
than C and D).
Power Considerations and Analytic Methods for Primary Aims

Like any other randomized trial, a SMART should be powered based on the primary
aim of the study. Here, we revisit three common primary aims for a SMART and
discuss power considerations and analysis for each. For simplicity, we restrict our
focus to two-stage studies with an outcome observed at the end of the trial and in
which all randomizations occur between two treatment options with equal probabil-
ity. More general situations are described by, e.g., Ogbagaber et al. (2016).
A common primary aim is the comparison of initial treatment options, averaged
over subsequent treatment. This is a two-group comparison: In the context of Fig. 2,
1554 N. J. Seewald et al.
for example, this compares the mean outcome across subgroups A, B, C, and D to
the mean outcome across subgroups E, F, G, and H. As such, standard two-group
comparison methods can be used for both analysis and power considerations. In the
continuous-outcome case, linear regression with an indicator for first-stage treatment
along with any prognostic baseline (prior to first-stage randomization) covariates can
be used; the minimum sample size can be calculated using the standard formula
$$N \ge \frac{4\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\delta^2}$$
where δ is the smallest clinically relevant standardized effect size the investigator
wishes to detect using a test with type I error rate α/2 and power 1 − β. We use $z_p$ to
denote the p-th quantile of the standard normal distribution.
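For example (an illustration of ours, not a calculation from ExTENd), with δ = 0.5, two-sided α = 0.05, and 90% power:

* Total SMART sample size for comparing first-stage options, delta = 0.5
display ceil(4 * (invnormal(0.975) + invnormal(0.90))^2 / 0.5^2)
* 169 participants in total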
A second common primary aim is the comparison of second-stage treatment
options among responders or nonresponders, averaged over initial treatment assign-
ment. In a prototypical SMART (Fig. 3), this involves comparing nonresponders
who received treatment D to those who received treatment E in stage 2. Again, this is
simply a two-group comparison among nonresponders, so we can use standard
methods for analysis restricted to the nonresponders. The formula for the total
sample size for the SMART is the same as above, upweighted by nonresponse
probabilities:
$$N \ge \frac{2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\delta^2}\left(\frac{1}{1 - P(R_A = 1)} + \frac{1}{1 - P(R_B = 1)}\right)$$

where $R_A$ and $R_B$ indicate response to first-stage treatments A and B, respectively.
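Continuing the illustration with assumed response probabilities P(R_A = 1) = P(R_B = 1) = 0.5 and δ = 0.5, the weighting term equals 4:

* Total SMART sample size for comparing second-stage options among nonresponders
display ceil(2 * (invnormal(0.975) + invnormal(0.90))^2 / 0.5^2 * (1/0.5 + 1/0.5))
* 337 participants; larger than above because only nonresponders inform the comparison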
A third common primary aim is the comparison of two embedded DTRs. In the SMART
shown in Fig. 4, for example, this is a comparison of individuals who are consistent with the
DTR which recommends A then C for responders and D for nonresponders against those
consistent with the DTR which recommends B initially, then F for responders and G for nonresponders.
This comparison is often done using a regression model which allows for the
simultaneous estimation of mean outcomes under each of the embedded DTRs and
accounts for the facts that (1) some participants may be consistent with more than
one DTR, and (2) not all participants are randomized more than once. This can be
achieved using a so-called “weighted and recycled” approach (Nahum-Shani et al.
2012b).
In a prototypical SMART, responders are randomized only once whereas non-
responders are randomized twice. Therefore, there is imbalance by design in the
numbers of responders and nonresponders consistent with each embedded DTR. We
can correct for this imbalance with inverse-probability-of-treatment weights:
Assuming equal randomization, responders receive a weight of $(1/2)^{-1} = 2$ and
nonresponders receive a weight of $(1/2 \times 1/2)^{-1} = 4$. Furthermore, responders to
treatment A are consistent with two embedded DTRs: The first recommends A, C for
responders, and D for nonresponders; the second recommends A, C for responders,
and E for nonresponders. The same holds for responders to treatment B. Regression
approaches which simultaneously estimate mean outcomes for all embedded DTRs
must account for this; see Appendix A of Nahum-Shani et al. (2012b) for additional
details.
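A minimal sketch of this weight-and-replicate bookkeeping (all variable names hypothetical; see Nahum-Shani et al. (2012b) for the full estimation approach):

* r = response indicator; responders were randomized once, nonresponders twice
generate w = cond(r == 1, 2, 4)      // inverse-probability-of-treatment weights
expand 2 if r == 1, generate(dup)    // responders are consistent with two embedded DTRs
* after assigning each replicate to one of its two consistent DTRs, mean outcomes
* under the embedded DTRs can be estimated by weighted regression, e.g.,
* regress y i.dtr [pweight = w], vce(cluster id)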
Sample size formulae for a comparison of two embedded DTRs are surprisingly
straightforward and build on the standard formulae given above. The total sample
size for the SMART is
$$N \ge DE \times \frac{4\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\delta^2}$$
where DE is a “design effect” that accounts for differential randomization of
responders and nonresponders in the second stage. In a prototypical SMART,
DE = 2 − P(R = 1), assuming a common response rate across first-stage treatments.
In the SMART shown in Fig. 4, DE = (3 − P(R = 1))/2; in an ExTENd-style
SMART in which all participants are rerandomized (Fig. 2), DE = 2 (Oetting et al.
2011).
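For instance (illustrative values again), in a prototypical SMART with a common response probability of 0.4, DE = 2 − 0.4 = 1.6:

* Total sample size for comparing two embedded DTRs in a prototypical SMART
display ceil((2 - 0.4) * 4 * (invnormal(0.975) + invnormal(0.90))^2 / 0.5^2)
* 269 participants in total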
Additional Considerations for Designing and Implementing a SMART

The SMART designs discussed above are representative of most of the SMARTs that
have been implemented to date. To our knowledge, most SMARTs in the field have
two stages with randomizations limited to two treatment options. However, as
mentioned above, SMART designs may include more than two stages of randomi-
zation, or more than two treatment options after a randomization. As with any trial design,
such extensions should be driven by the scientific questions of interest. SMARTs are also
sometimes confused with adaptive trial designs, in which accumulating data are used to
modify aspects of the ongoing trial; a SMART instead uses fixed, sequential randomizations
to develop a strategy for how to adapt treatment to the changing needs of the individual.
In adaptive trials, the trial
is adaptive; in SMARTs, the focus is on developing an adaptive treatment strategy
(a DTR). More recently, statisticians have begun to develop randomized trial designs
that are both sequentially randomized and adaptive (Cheung et al. 2015).
Readers interested in more in-depth information about SMARTs and DTRs might
see the books by Chakraborty and Moodie (2013), Kosorok and Moodie (2015), or
Tsiatis et al. (2019). In addition, Nahum-Shani et al. (2012b) and Ogbagaber et al.
(2016) provide tutorials on analytic strategies for comparing embedded DTRs in a
SMART with a continuous, end-of-study outcome. Nahum-Shani et al. (2020) also
provide a tutorial for analyzing SMARTs with longitudinal outcomes. For analytic
and sample size considerations for SMARTs with binary outcomes, see Kidwell et al.
(2018); survival outcomes, Feng and Wahed (2009), Li and Murphy (2011); and
continuous longitudinal outcomes, Lu et al. (2016), Li (2017), Dziak et al. (2019),
and Seewald et al. (2020). Recently, methods have been developed for clustered
SMARTs for developing clustered DTRs (NeCamp et al. 2017). Finally, for infor-
mation on estimating optimal DTRs from a SMART see Moodie et al. (2007),
Murphy (2003), Nahum-Shani et al. (2012a), or Zhao and Laber (2014).
Summary and Conclusion

Dynamic treatment regimens provide a guide for the type of sequential intervention
decision-making that arises naturally in clinical settings (Lavori and Dawson 2014).
Sequential, multiple assignment, randomized trials (SMARTs) are one type of exper-
imental design that can be used by researchers for developing DTRs. This chapter
discussed the components that make up DTRs, and scientific questions that researchers
may have about them. It then described how a SMART can be used to address these
scientific questions. The ExTENd SMART study – designed to develop a DTR for
adults with alcohol use disorder – was used to illustrate these ideas.
For clinical trial researchers interested in developing efficient and effective DTRs,
the SMART may be a useful design to consider. As discussed in the chapter, there are
different types of SMART designs. Ultimately, for researchers who choose to use a
SMART, the type of SMART design they choose should be grounded in the scientific
questions they are seeking to answer.
Key Facts
Cross-References
References
Almirall D, DiStefano C, Chang Y-C, Shire S, Kaiser A, Lu X, Nahum-Shani I, Landa R, Mathy P,
Kasari C (2016) Longitudinal effects of adaptive interventions with a speech-generating
device in minimally verbal children with ASD. J Clin Child Adolesc Psychol 45(4):442–456.
https://doi.org/10.1080/15374416.2016.1138407
Almirall D, Nahum-Shani I, Lu W, Kasari C (2018) Experimental designs for research on adaptive
interventions: singly and sequentially randomized trials. In: Collins LM, Kugler KC (eds)
Optimization of behavioral, biobehavioral, and biomedical interventions: advanced topics,
Statistics for social and behavioral sciences. Springer International Publishing, Cham, pp
89–120. https://doi.org/10.1007/978-3-319-91776-4_4
August GJ, Piehler TF, Bloomquist ML (2016) Being ‘SMART’ about adolescent conduct problems
prevention: executing a SMART pilot study in a juvenile diversion agency. J Clin Child Adolesc
Psychol 45(4):495–509. https://doi.org/10/ghpbrn
Cable N, Sacker A (2007) Typologies of alcohol consumption in adolescence: predictors and adult
outcomes. Alcohol Alcoholism 43(1):81–90. https://doi.org/10/fpmm33
Chakraborty B, Moodie EEM (2013) Statistical methods for dynamic treatment regimes. Statistics
for biology and health. Springer, New York. https://doi.org/10.1007/978-1-4614-7428-9
Cheung YK, Chakraborty B, Davidson KW (2015) Sequential multiple assignment randomized
trial (SMART) with adaptive randomization for quality improvement in depression treatment
program: SMART with adaptive randomization. Biometrics 71(2):450–459. https://doi.org/10.1111/biom.12258
Chronis-Tuscano A, Wang CH, Strickland J, Almirall D, Stein MA (2016) Personalized treatment
of mothers with ADHD and their young at-risk children: a SMART pilot. J Clin Child Adolesc
Psychol 45(4):510–521. https://doi.org/10/gg2h36
Collins LM, Nahum-Shani I, Almirall D (2014) Optimization of behavioral dynamic treatment
regimens based on the sequential, multiple assignment, randomized trial (SMART). Clin Trials
11(4):426–434. https://doi.org/10/f6cjxm
Dragalin V (2006) Adaptive designs: terminology and classification. Drug Inf J 40(4):425–435.
https://doi.org/10/ghpbrt
Dziak JJ, Yap JRT, Almirall D, McKay JR, Lynch KG, Nahum-Shani I (2019) A data analysis
method for using longitudinal binary outcome data from a SMART to compare adaptive
interventions. Multivar Behav Res 0(0):1–24. https://doi.org/10/gftzjg
Feng W, Wahed AS (2009) Sample size for two-stage studies with maintenance therapy. Stat Med
28(15):2028–2041. https://doi.org/10.1002/sim.3593
Gunlicks-Stoessel M, Mufson L, Westervelt A, Almirall D, Murphy SA (2016) A pilot SMART for
developing an adaptive treatment strategy for adolescent depression. J Clin Child Adolesc
Psychol 45(4):480–494. https://doi.org/10/ghpbrv
Hall KL, Nahum-Shani I, August GJ, Patrick ME, Murphy SA, Almirall D (2019) Adaptive
intervention designs in substance use prevention. In: Sloboda Z, Petras H, Robertson E, Hingson
R (eds) Prevention of substance use, Advances in prevention science. Springer International
Publishing, Cham, pp 263–280. https://doi.org/10.1007/978-3-030-00627-3_17
Heilig M, Egli M (2006) Pharmacological treatment of alcohol dependence: target symptoms and
target mechanisms. Pharmacol Ther 111(3):855–876. https://doi.org/10/cfs7df
Kasari C, Kaiser A, Goods K, Nietfeld J, Mathy P, Landa R, Murphy SA, Almirall D (2014)
Communication interventions for minimally verbal children with autism: a sequential multiple
assignment randomized trial. J Am Acad Child Adolesc Psychiatry 53(6):635–646. https://doi.
org/10.1016/j.jaac.2014.01.019
Kidwell KM, Seewald NJ, Tran Q, Kasari C, Almirall D (2018) Design and analysis considerations
for comparing dynamic treatment regimens with binary outcomes from sequential multiple
assignment randomized trials. J Appl Stat 45(9):1628–1651. https://doi.org/10.1080/02664763.
2017.1386773
Kilbourne AM, Almirall D, Eisenberg D, Waxmonsky J, Goodrich DE, Fortney JC, JoAnn
E. Kirchner, et al. (2014) Protocol: adaptive implementation of effective programs trial
(ADEPT): cluster randomized SMART trial comparing a standard versus enhanced implemen-
tation strategy to improve outcomes of a mood disorders program. Implement Sci 9(1):132.
https://doi.org/10/f6q9fc
Kilbourne AM, Smith SN, Choi SY, Koschmann E, Liebrecht C, Rusch A, Abelson JL et al (2018)
Adaptive school-based implementation of CBT (ASIC): clustered-SMART for building an
optimized adaptive implementation intervention to improve uptake of mental health interven-
tions in schools. Implement Sci 13(1):119. https://doi.org/10/gd7jt2
Kosorok MR, Moodie EEM (eds) (2015) Adaptive treatment strategies in practice: planning trials
and analyzing data for personalized medicine. Society for Industrial and Applied Mathematics,
Philadelphia, PA. https://doi.org/10.1137/1.9781611974188
Laber EB, Lizotte DJ, Qian M, Pelham WE, Murphy SA (2014) Dynamic treatment regimes:
technical challenges and applications. Electron J Stat 8(1):1225–1272. https://doi.org/10/
gg29c8
Lavori PW, Dawson R (2004) Dynamic treatment regimes: practical design considerations. Clin
Trials 1(1):9–20. https://doi.org/10/cqtvnn
Lavori PW, Dawson R (2014) Introduction to dynamic treatment strategies and sequential multiple
assignment randomization. Clin Trials 11(4):393–399. https://doi.org/10.1177/
1740774514527651
Lei H, Nahum-Shani I, Lynch K, Oslin D, Murphy SA (2012) A ‘SMART’ design for building
individualized treatment sequences. Annu Rev Clin Psychol 8(1):21–48. https://doi.org/10.
1146/annurev-clinpsy-032511-143152
Li Z (2017) Comparison of adaptive treatment strategies based on longitudinal outcomes in
sequential multiple assignment randomized trials. Stat Med 36(3):403–415. https://doi.org/10.
1002/sim.7136
Li Z, Murphy SA (2011) Sample size formulae for two-stage randomized trials with survival
outcomes. Biometrika 98(3):503–518. https://doi.org/10.1093/biomet/asr019
Longabaugh R, Zweben A, Locastro JS, Miller WR (2005) Origins, issues and options in the
development of the combined behavioral intervention. J Stud Alcohol Suppl (15):179–187.
https://doi.org/10/ghpb9f
Lu X, Nahum-Shani I, Kasari C, Lynch KG, Oslin DW, Pelham WE, Fabiano G, Almirall D (2016)
Comparing dynamic treatment regimes using repeated-measures outcomes: modeling consider-
ations in SMART studies. Stat Med 35(10):1595–1615. https://doi.org/10/gg2gxc
1560 N. J. Seewald et al.
Monte Carlo Simulation for Trial Design Tool
80
S. Ankolekar, C. Mehta, R. Mukherjee, S. Hsiao, J. Smith, and T. Haddad

Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1564
Monte Carlo Simulations and Trial Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1565
Case Study 1: The VALOR Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1567
Motivation of Adaptive Sample Size Re-Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1567
Statistical Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1568
VALOR Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1570
Practicalities of Running an Adaptive Trial (With Reference to VALOR) . . . . . . . . . . . . . . . . 1574
Case Study 2: SPYRAL HTN OFF-MED Trial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1575
Motivation of Bayesian Design with Discount Prior Methodology . . . . . . . . . . . . . . . . . . . . . . . 1575
Statistical Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1576
Discount Prior Methodology in the Context of the SPYRAL Trial Design . . . . . . . . . . . . . . . 1577
Compare . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1577
Discount . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1577
Combine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1578
Estimate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1579
S. Ankolekar (*)
Cytel Inc, Cambridge, MA, USA
Maastricht School of Management, Maastricht, Netherlands
e-mail: [email protected]
C. Mehta
Cytel Inc, Cambridge, MA, USA
Harvard T.H. Chan School of Public Health, Boston, MA, USA
e-mail: [email protected]
R. Mukherjee · S. Hsiao
Cytel Inc, Cambridge, MA, USA
J. Smith
Sunesis Pharmaceuticals Inc, San Francisco, CA, USA
T. Haddad
Medtronic Inc, Minneapolis, MN, USA
Abstract
Clinical trials often involve design issues with mathematically intractable com-
plexity. Being part of multi-phase drug development programs, the trial designs
need to incorporate prior information in terms of historical data from earlier
phases and available knowledge about related trials. Some trials with inherent
limits on data collection may need augmentation with simulated pseudo-data. For
planning of interim looks, group sequential and adaptive trials require accurate
timeline predictions of reaching clinical milestones involving a complex set of
operational and clinical models. In general, clinical trial design is an interactive
process involving the interplay of models, data, assumptions, insights,
and experiences to address specific design issues before and during the trial. This
offers a rich context for simulation-centric modeling, the theme of this chapter.
We will focus on practical considerations of applying simulation modeling tools
and techniques to design and implementation of clinical trials. This will be
achieved through two real-life case studies and relevant illustrative examples
drawn from literature and our practical experience.
Keywords
Adaptive design · Bayesian discount · Sample-size re-estimation · Design
simulation · Power prior · Discount function · O’Brien-Fleming efficacy
Introduction
This chapter focuses on application of Monte Carlo simulations for clinical trial
design. In view of the emphasis of the book on principles and practice, we will focus
on practical considerations of applying simulation modeling tools and techniques to
design and implementation of clinical trials. This will primarily be achieved through
two real-life case studies and relevant illustrative examples drawn from literature and
our practical experience.
The chapter is organized in five sections. The next section introduces the basic
simulation concepts and relates them to clinical trials. We deliberately take a
simulation-centric view of clinical trials in the section and make a case for enhanced
role of simulation techniques in their design and implementation. This will be
followed by two detailed sections covering two real-life case studies, one completed
and the other currently ongoing. Finally, we conclude the chapter with a few remarks.
Case Study 1: The VALOR Trial
Acute myelogenous leukemia (AML) is a disease of the bone marrow with poor
prognosis and few available therapies, a continued area of unmet need. The VALOR
study was a phase 3, double-blind, placebo-controlled trial conducted at 101 inter-
national sites in 711 patients with AML. Patients were randomized 1:1 to vosaroxin
plus cytarabine (vos/cyt) or placebo plus cytarabine (pla/cyt) stratified by disease
status, age, and geographic location. The primary and secondary efficacy endpoints
were overall survival (OS) and complete response rate. This study is registered at
clinicaltrials.gov (NCT01191801).
Motivation of Adaptive Sample Size Re-Estimation

Prior to designing the VALOR trial, Sunesis Pharmaceuticals completed a single arm
Ph 2 trial in relapsed refractory AML. Observed mean OS in that trial was approx-
imately 7 months. Assuming an expected OS of 5 months in the control arm, the
VALOR trial was powered at 90% for hazard ratio (HR = 5/7 = 0.71) requiring 375
events and 450 patients. However, there can be uncertainty around Ph 2 estimates.
While HR greater than 0.71 (say 0.75) could still be clinically meaningful, powering
for this smaller effect size would require a large, initially unfeasible number of
patients. An adaptive approach allows the sample size to be conditional on observed
data accumulating in the first part of the Ph 3 trial, avoiding unnecessarily enrolling
patients at the start if the true HR is close to 0.71 and allowing additional patients to
be enrolled later if the effect size is smaller but still of meaningful magnitude. This
flexible approach allowed the exposure of patients and the expenditure of resources
to be conditional on observed results at the interim.
Statistical Methodology
VALOR was a two-arm trial with time to death as the primary endpoint. It was required
to have 100(1 − β)% = 90% power to detect an improvement in median survival from
5 months on pla/cyt (the control arm) to 7 months on vos/cyt (the experimental arm)
(HR = 0.71) using a one-sided log-rank test at the significance level α = 2.5%. The
trial was designed with one interim analysis when 50% of the death events were
observed, at which point one of four decisions could have been taken, as described below.
In trials with a time to event endpoint, the power is driven by number of events,
say D, and not by sample size. Sample size plays an indirect role, however, since the
larger the sample size, the earlier the required D events and the shorter will be the
expected study duration. To meet the 90% power requirement, the target number of
death events D is given by the formula
" #2
zα þ zβ
D¼ ½IF
lnðHRÞ
where zγ is the upper (1γ) 100% percentile of the standard normal distribution
and IF is an inflation factor to recover power loss due to spending some of the
available α for possible early stopping at the interim analysis. Values of IF for
different α-spending functions are available in Jennison and Turnbull (2000). An
analytical relationship between required number of events, sample size, patient
enrollment rates, and study duration is available in Kim and Tsiatis (1990). Based
on these considerations, the planned initial enrolment was for 450 patients and 375
events with the possibility of increasing both the planned events and sample size by
50% if the results of the interim analysis fell in the promising zone (see below). Note
the total sample size was increased to allow for 5% dropouts and an effective sample
size of either 450 or 676 patients. Enrolment assumptions were tested periodically by
simulating the pooled study data (blinded to the treatment assignment) so that
accurate assessments of dates for the interim and final analyses could be obtained.
These simulations are described later in VALOR simulation.
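As a concrete check of the event-count formula above, the short sketch below plugs in the VALOR design parameters. The inflation factor IF is an assumed round value, not the trial's exact figure, so the result only approximately reproduces the planned 375 events.

```python
# Sketch: required death events for VALOR-like design parameters.
from scipy.stats import norm
import math

alpha, beta = 0.025, 0.10        # one-sided type-1 error and type-2 error
hr = 0.71                        # hazard ratio under the alternative
IF = 1.05                        # assumed inflation factor for the interim alpha spend

z_alpha = norm.ppf(1 - alpha)    # 1.960
z_beta = norm.ppf(1 - beta)      # 1.282

D = (2 * (z_alpha + z_beta) / math.log(hr)) ** 2 * IF
print(round(D))                  # ~376, close to the planned 375 events
```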
The decision to terminate for efficacy at the interim analysis time point would be
based on the O’Brien-Fleming efficacy boundary derived from the Lan and DeMets
(1983) α-spending function, invoked when 50% of the death events were observed,
and the appropriate amount of α was spent to ensure that the overall one-sided type-1
error remained 0.025. This approach results in a one-sided significance level of
0.001525 for the interim analysis (with 187 of the planned 375 events) and 0.0245
for the final analysis (with 375 events). The overall significance level of this test
procedure was guaranteed to be 0.025 (one-sided).
If the efficacy criterion was not met, one of the remaining three decisions would
be taken based on the conditional power or probability of achieving statistical
significance at the end of the trial conditional on the results observed at the interim
analysis. Precise formulae for conditional power are available in Mehta and Pocock
(2011) and Gao et al. (2008). The trial would now be modified as follows:
• Fix a maximum upper limit of 562 for the number of events and 676 for the
sample size.
• Compute the conditional power with the number of events being increased to 562,
based on a hazard ratio of 0.71 as specified at the design stage. If the conditional
power (CP-plan-562) so computed is less than 50%, the DSMB would recom-
mend stopping for futility. If the futility criterion was not met, continue as
discussed below.
• Compute the conditional power at 187 events, with the number of events equal to
375 (CP-obs-375) at the final analysis as initially specified, based on the hazard
ratio estimated at the interim analysis.
– If CP-obs-375 ≤ 30%, the results are considered unfavorable. Continue the
trial with no further change until 375 events are reached, and perform the final
analysis.
– If 30% < CP-obs-375 ≤ 90%, the results are considered promising. Increase
the number of events to 562 and sample size to 676.
– If CP-obs-375 > 90%, the results are considered favorable. Continue the trial
with no further change until 375 events are reached, and perform the final
analysis.
The DSMB was allowed to exercise clinical judgement, based on its access to
unblinded safety and efficacy data to make minor adjustments to the sample size
obtained by the above rules.
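The decision logic above can be summarized compactly. The following sketch encodes the four outcomes as a function; the function name and the z-value for the interim efficacy boundary (corresponding to one-sided p = 0.001525) are illustrative, and the DSMB's clinical-judgement adjustments are not modeled.

```python
from scipy.stats import norm

Z_EFFICACY = norm.ppf(1 - 0.001525)   # O'Brien-Fleming interim boundary, ~2.96

def interim_decision(z1, cp_plan_562, cp_obs_375):
    """Classify the interim result of a VALOR-type promising-zone design."""
    if z1 >= Z_EFFICACY:
        return "stop for efficacy"
    if cp_plan_562 < 0.50:            # CP at 562 events, assuming HR = 0.71
        return "stop for futility"
    if cp_obs_375 <= 0.30:            # unfavorable zone
        return "continue to 375 events (unfavorable)"
    if cp_obs_375 <= 0.90:            # promising zone
        return "increase to 562 events and 676 patients (promising)"
    return "continue to 375 events (favorable)"

print(interim_decision(z1=1.2, cp_plan_562=0.80, cp_obs_375=0.55))
```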
Because the number of events could potentially be increased in a data-dependent
manner at the interim look, the final analysis would not use the conventional log-
rank statistic to determine if statistical significance is reached. Instead it used the
weighted statistic proposed by Cui et al. (1999) in which the independent log-rank
statistics of the two stages are combined by prespecified weights that are equal to the
planned proportion of total events at which the interim analysis would be taken if
there were no change in the design. In the present case, the trial was designed for 375
events with an interim analysis at 187 events. The planned proportion was 0.5 for
each stage and the log-rank statistics for the two stages were combined with weights
that equal the square root of 0.5. Thus, if Z1 and Z2 are the standardized log-rank
statistics from the data before and after the interim analysis, the combined statistic
for the final analysis was
$$Z_f = \sqrt{0.5}\,Z_1 + \sqrt{0.5}\,Z_2$$
In order to ensure preservation of type-1 error, the two weights $\sqrt{0.5}$ and $\sqrt{0.5}$ for
the two stages must be used even if the total number of events was increased at the
interim analysis. This could result in a slight loss of efficiency, which is offset by the
increase in events.
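A worked illustration of the weighted combination follows; the stage-wise z-values are hypothetical inputs, not trial results.

```python
import math
from scipy.stats import norm

w1 = w2 = math.sqrt(0.5)      # prespecified: 187/375 ~ 0.5 of events planned per stage
z1, z2 = 2.10, 1.40           # hypothetical stage-wise log-rank statistics
z_f = w1 * z1 + w2 * z2       # weights stay fixed even if events were increased
print(round(z_f, 3), z_f >= norm.ppf(1 - 0.0245))   # 2.475, compare with final boundary
```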
VALOR Simulations
Simulated operating characteristics of the adaptive design are displayed in
Table 1 for hazard ratios of 0.71, 0.74, and 0.77. The operating characteristics
include probabilities, conditional powers, trial durations, and sample sizes asso-
ciated with unfavorable, promising, and favorable zones at the interim look. For
comparison purposes the operating characteristics of the two-look nonadaptive
group sequential design, with 375 maximum events, 450 patients, and no
reassessment of events or sample size are also displayed. Both designs have an
O’Brien-Fleming efficacy boundary and a futility boundary for terminating at the
interim look if the conditional power based on HR = 0.71 is below 50%.
Average power gains of 3% to 6% are obtained with the adaptive design at an
average cost of 50–70 additional subjects and an average increase in study
duration of 2–4 months. The real benefit of the adaptive design, however, lies
in its ability to learn from the interim results and avoid an underpowered trial.
This is evident from an examination of the zone-wise powers. For example,
under the pessimistic scenario HR = 0.77, the study is underpowered at 71%.
But if the interim results land in the promising zone, the conditional power of the
adaptive designs is boosted up to 90% but remains 71% for the nonadaptive
design. This gain in power does come with a cost in terms of increased sample
size to 675 instead of 450 and the study duration is 38 months instead of 29.
However, these additional resource commitments would have to be made only
after observing the interim data, if promising. The simulation model adequately
supports the sample size reestimation decision, if any, as it consistently shows
increased power associated with promising zone over the range of hazard ratio
scenarios.
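A condensed Monte Carlo sketch of such operating characteristics is given below. It uses normal approximations to the stage-wise log-rank statistics and a simplified conditional-power shortcut, so it will not reproduce the trial's Table 1 exactly; the boundaries and zone rules follow the description above, and all shortcuts are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def cond_power(z1, d2, hr, z_crit, w1, w2):
    # P(w1*Z1 + w2*Z2 >= z_crit | z1); approx. E[Z2] = -ln(hr) * sqrt(d2 / 4)
    mu2 = -np.log(hr) * np.sqrt(d2 / 4)
    return norm.cdf(mu2 - (z_crit - w1 * z1) / w2)

def simulate(hr_true, n_sims=200_000):
    w1 = w2 = np.sqrt(0.5)
    z_crit = norm.ppf(1 - 0.0245)                  # final-analysis boundary
    z1 = rng.normal(-np.log(hr_true) * np.sqrt(187 / 4), 1.0, n_sims)
    hr_hat = np.exp(-z1 / np.sqrt(187 / 4))        # interim estimate of HR
    cp_plan = cond_power(z1, 562 - 187, 0.71, z_crit, w1, w2)
    cp_obs = cond_power(z1, 375 - 187, hr_hat, z_crit, w1, w2)

    efficacy = z1 >= norm.ppf(1 - 0.001525)        # O'Brien-Fleming interim boundary
    futility = ~efficacy & (cp_plan < 0.50)
    promising = ~efficacy & ~futility & (cp_obs > 0.30) & (cp_obs <= 0.90)

    d2 = np.where(promising, 562 - 187, 375 - 187) # stage-2 events by zone
    z2 = rng.normal(-np.log(hr_true) * np.sqrt(d2 / 4), 1.0)
    reject = efficacy | (~futility & (w1 * z1 + w2 * z2 >= z_crit))
    return reject.mean(), promising.mean()

for hr in (0.71, 0.74, 0.77):
    power, p_promising = simulate(hr)
    print(f"HR = {hr}: power ~ {power:.2f}, P(promising) ~ {p_promising:.2f}")
```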
Fig. 2 Clinical events predictions after observed 115 events for realized enrolment of 303 patients
Practicalities of Running an Adaptive Trial (With Reference to VALOR)

There are regulatory guidances on adaptive designs by both the FDA (Adaptive
designs for clinical trials of drugs and biologics, draft guidance, September 2018)
and EMA (Reflection paper on methodological issues in confirmatory clinical trials
planned with an adaptive design, adoption by CHMP October 2007) that emphasize
the need to prespecify analysis methods, minimize operational bias, and control the
type I error, as well as the unbiased point estimation of treatment effect.
When there is an interim look at efficacy, as in the promising zone methodology,
there are several practical considerations to minimize operational bias. First, con-
sider strict control around the availability and communication of interim results.
Decide as a sponsor what the message will be (if any) following the interim analysis.
Will a change in the total sample size be announced to sites, operational entities, or
investors? Various stakeholders (legal, regulatory, operational, medical, etc.) can be
consulted up front to ensure that there is agreement about the planned communica-
tion strategy and an understanding of any implications. The VALOR trial made use
of a special Access Control Execution System to both control and document the flow
of data and reports created at the interim and shared with DSMB members. The
promising zone as originally intended would increase the sample size by an amount
proportionate to the observed results at the interim. The VALOR trial employed an
all or nothing 50% sample size increase instead of a proportionate increase thus
limiting the ability to back-calculate interim results based on the planned increase in
sample size.
The promising zone methodology allows for strict control of the type I error as
described in Jennison and Turnbull (2000). However, there are some practical con-
siderations that should be understood in the conduct and analysis of the trial. First, in
this design, the test statistic Z1 for the data at the interim is combined with the test
statistic Z2 for the data post-interim with prespecified weights as shown in the
equation for Zf at the end of the Statistical Methodology part of this case study. In
simulation and in theory, the data supporting the interim test statistic Z1 do not
change between the time of the analysis at the interim and final analysis while in
practice they may. Additional follow-up (censor dates change) or data cleaning may
alter the value of the test statistic between the time of the interim analysis, upon which
the sample size adjustment was based, and the time of the final analysis. In the
VALOR trial, the value of the test statistic Z1 for the interim time point was
recomputed on final data before being combined with Z2 to produce the final
adjusted statistic. Thereby the test statistic Z1 was most representative of the final
cleaned data and did not require creating and submitting data packages of the interim
data to support an interim value incorporated into the final analysis. Second, a
secondary analysis of OS in the VALOR trial was a stratified log rank. It was
determined that the weighted Cui et al. (1999) method could also be applied to
interim and post-interim test statistics after the stratification.
Case Study 2: SPYRAL HTN OFF-MED Trial
Recognizing the potential for temporal bias and other unknown factors that may
impact the similarity of effect sizes in the two phases of the study, a Bayesian
discount prior method is used for the primary efficacy analysis of the pivotal study
as described in Haddad et al. (2017), whereby data from the PoC phase form the
basis of an informative prior distribution for the pivotal study. The prior information
is dynamically discounted with a factor between 0 and 1, based on the extent to
which the prior data is dissimilar to the data from the pivotal study.
The pivotal study is currently ongoing, and not all details of the statistical design
are publicly available at the time of writing. Our description of the design, simula-
tions, and operating characteristics should be considered illustrative of the general
approach and may not fully agree with the statistical plan of the trial.
Statistical Design
The study plans to randomize up to 433 patients total including both the PoC and
pivotal phases.
The primary efficacy analysis of the pivotal trial uses a baseline adjusted com-
parison of change in SBP from baseline to 3 months post-procedure. Let xi denote
the baseline SBP and yi the SBP change from baseline for the i-th patient. The linear
model of interest is
$$y_i = \beta_c I_i\{\text{control}\} + \beta_t I_i\{\text{test}\} + \beta_x x_i + \epsilon_i, \quad \epsilon_i \sim \mathrm{Normal}(0, \sigma^2), \qquad (1)$$
where $I_i\{\text{test}\}$ is the indicator for the test group (1 for test and 0 for control)
and $I_i\{\text{control}\} = 1 - I_i\{\text{test}\}$. The main parameter of interest is
$\beta = \beta_t - \beta_c$, representing the baseline adjusted treatment effect. The primary
efficacy hypothesis is $H_0: \beta \ge 0$ versus $H_A: \beta < 0$.
The analysis to evaluate the efficacy hypothesis in the pivotal trial assumes
separate power-prior (Ibrahim et al. 2015) normal distributions on βc and βt and
uniform prior on log(σ), a standard choice for non-informative prior distribution on
the variance term in a normal model. The prior distribution assumes zero correlation
among model coefficients. The power-prior approach allows the amount of borrow-
ing from historical data to be specified in terms of one parameter for the test group
(αt) and one parameter for the control group (αc). The parameter values range
between 0 and 1, with 0 indicating no borrowing and 1 indicating full borrowing
from historical data. These power prior parameters are calculated as part of the
discount prior method as described in the next section. The posterior distribution of β
obtained via this approach will then be used to estimate the posterior probability that
β < 0. The success criterion for this trial is that this posterior probability is greater than
0.975. This criterion aligns with the classical frequentist rule of using a one-sided
test at 2.5% level of significance.
Multiple interim analyses are planned for this study. At each interim analysis, the
decision to continue enrollment or stop enrollment for expected success or futility
will be based on the predictive probability of success, which is derived by imputing
the incomplete data from the posterior distributions of model parameters given
interim data, and then recalculating the posterior probability of success. This com-
pletion process is repeated several times. The proportion of runs where the posterior
probability for β < 0 achieves the success criteria (> 0.975) is the predictive
probability of success. For efficacy, imputations are carried out for patients who
have enrolled prior to a particular interim analysis, hence their baseline SBP values
are available, but have not yet completed their 3-month follow-up. For futility,
imputations are carried out for patients who have been enrolled prior to the interim
analysis but have not yet completed their 3-month follow-up, as well as for patients who have not
yet been enrolled, up to the maximum sample size. Enrollment is stopped for
expected success if the predictive probability of success with the currently enrolled
patients is greater than 90%, and enrollment is stopped for futility if the predictive
probability of success at the maximum sample size is less than 5%. A similar
approach to interim decision-making is described in Berry (2011).
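The predictive-probability machinery can be sketched for a deliberately simplified model: flat priors, known outcome standard deviation, and no baseline adjustment or borrowing, unlike the trial's actual analysis. All data, names, and numbers below are hypothetical.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

def ppos(y_t, y_c, n_new_t, n_new_c, sigma=12.0, n_draws=5000):
    """Predictive probability that the completed trial meets P(beta<0|data) > 0.975.
    Flat priors, known sigma, no baseline adjustment: a deliberate simplification."""
    succ = 0
    for _ in range(n_draws):
        # draw arm means from their interim posteriors
        mu_t = rng.normal(y_t.mean(), sigma / np.sqrt(len(y_t)))
        mu_c = rng.normal(y_c.mean(), sigma / np.sqrt(len(y_c)))
        # impute the not-yet-observed 3-month outcomes
        full_t = np.concatenate([y_t, rng.normal(mu_t, sigma, n_new_t)])
        full_c = np.concatenate([y_c, rng.normal(mu_c, sigma, n_new_c)])
        # completed-data posterior for beta = mu_t - mu_c (normal, flat priors)
        diff = full_t.mean() - full_c.mean()
        se = sigma * np.sqrt(1 / len(full_t) + 1 / len(full_c))
        if norm.cdf(-diff / se) > 0.975:   # P(beta < 0 | completed data)
            succ += 1
    return succ / n_draws

# hypothetical interim data: SBP change (mmHg), 2:1 test-to-control enrollment
y_t = rng.normal(-6, 12, 120)
y_c = rng.normal(-2, 12, 60)
print(ppos(y_t, y_c, n_new_t=80, n_new_c=40))
```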
Discount Prior Methodology in the Context of the SPYRAL Trial Design

The discount prior method (Haddad et al. 2017) used in the SPYRAL trial was
developed collaboratively by statisticians from the sponsor and the United States
Food and Drug Administration (FDA) as part of the Medical Device Innovation
Consortium (MDIC). An R package (bayesDP) developed by Musgrove and Haddad
(2017) implementing this method is available. The method as it applies in the context
of the trial is described here. The reader is referred to the referenced papers for details
on the general methodology, and further to the R package documentation for details
on implementation.
The analysis to evaluate the primary efficacy hypothesis in the pivotal trial
assumes separate power-prior (Ibrahim et al. 2015) normal distributions on βc and
βt and uniform prior on log(σ). The power parameter of the power-prior for the test
group (αt) and for the control group (αc) are calculated as part of the discount prior
method, which comprises four steps: compare, discount, combine, and estimate.
Compare
The test and the control group data are separately used to fit the following model,
using combined data from both phases of the study in the given arm:
$$y_i = \tilde{\beta}_0 + \tilde{\beta}_1 I_i\{\text{current}\} + \tilde{\beta}_x x_i + \epsilon_i, \quad \epsilon_i \sim \mathrm{Normal}(0, \sigma^2),$$
where Ii{current} equals 0 if the data is from the pivotal phase and equals 1 if the
data is from the PoC phase. Here a joint uniform prior is assumed for log(σ) and the
model coefficients. The degree of agreement between the two phases can be measured
by $p = P(\tilde{\beta}_1 > 0 \mid y)$. A value of p close to 0.5 indicates agreement, while
deviation from 0.5 on either side indicates lack of agreement in terms of the
distribution of the response variable after adjusting for covariates. Thus, we transform
p to $2p$ if $p < 0.5$ and to $2(1-p)$ if $p \ge 0.5$, so that higher transformed values of
p indicate higher levels of agreement. These calculations are carried out separately
for each arm, resulting in pc and pt.
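A sketch of the compare step for a single arm follows, using a large-sample normal approximation to the posterior of the phase coefficient in place of full Bayesian sampling; the function and variable names, and the toy data, are illustrative.

```python
import numpy as np
from scipy.stats import norm

def agreement(y, x, is_poc):
    """Transformed agreement measure for one arm.
    y: outcome; x: baseline SBP; is_poc: 1.0 for PoC-phase rows, 0.0 for pivotal."""
    X = np.column_stack([np.ones_like(x), is_poc, x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2 = np.sum((y - X @ beta) ** 2) / (len(y) - X.shape[1])
    cov = s2 * np.linalg.inv(X.T @ X)
    p = norm.sf(0, loc=beta[1], scale=np.sqrt(cov[1, 1]))  # ~ P(beta1 > 0 | y)
    return 2 * p if p < 0.5 else 2 * (1 - p)               # higher = more agreement

rng = np.random.default_rng(0)
x = rng.normal(150, 15, 120)
is_poc = (np.arange(120) < 40).astype(float)
y = -4 - 0.02 * x + rng.normal(0, 10, 120)   # toy data with phases in agreement
print(agreement(y, x, is_poc))
```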
Discount
The similarity measures pc and pt are each mapped to a discount value in the interval
[0, 1] by a discount function F( p). Examples include the identity function F( p) = p
and the Weibull function $F(p) = 1 - e^{-(p/\lambda)^k}$. The power prior parameters for each arm are
defined as αt = αmaxF( pt) and αc = αmaxF( pc), where αmax is a parameter between
0 and 1 defined at the beginning of the study to control the maximum level of
borrowing from PoC data. The discount function and any accompanying parameters
are also predefined to achieve certain operating characteristics. The same discount
function is used for both arms, which may yield different levels of discount based on
the values of pt and pc. Use of the Weibull function facilitates exploration of a wide
range of discount profiles using just two parameters, the shape k and scale λ. The
discount function used for one of our designs with k = 3 and λ = 0.5 is shown in Fig. 3.
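For reference, the Weibull discount function with the quoted shape and scale is trivial to evaluate; the value αmax = 1 below is an assumption for illustration.

```python
import math

def weibull_discount(p, k=3.0, lam=0.5):
    # F(p) = 1 - exp(-(p / lambda)^k), mapping agreement p to a discount in [0, 1]
    return 1.0 - math.exp(-((p / lam) ** k))

alpha_max = 1.0   # assumed maximum level of borrowing
for p in (0.1, 0.3, 0.5, 0.8, 1.0):
    print(f"p = {p}: alpha = {alpha_max * weibull_discount(p):.3f}")
```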
Combine
The power prior method is used to combine the PoC and pivotal data, whereby
informative normal priors based on PoC data are used for the linear model coeffi-
cients while applying a suitable level of discount according to the degree of
similarity with the pivotal data. Thus, in the linear model (1), we use independent
priors $\beta_c \sim N(\hat{\beta}_{0c}, \hat{\tau}^2_{0c}/\alpha_c)$ and
$\beta_t \sim N(\hat{\beta}_{0t}, \hat{\tau}^2_{0t}/\alpha_t)$, where $\hat{\beta}_{0c}$
($\hat{\beta}_{0t}$) and $\hat{\tau}^2_{0c}$ ($\hat{\tau}^2_{0t}$) are the maximum
likelihood estimates of the model parameters and their variances for the control group
(test group) using PoC data. For the baseline variable, we do not apply a discount,
and use $\beta_x \sim N(\hat{\beta}_{0x}, \hat{\tau}^2_{0x})$, where $\hat{\beta}_{0x}$
and $\hat{\tau}^2_{0x}$ are the estimated baseline parameter and its variance from the
linear model (1) fitted to the PoC data. For the variance term in (1), a flat prior on
log(σ) is used. With these prior specifications, joint posterior samples for $\beta_c$
and $\beta_t$ are drawn conditional on the pivotal trial outcomes (y), from which we
generate a posterior sample for $\beta = \beta_t - \beta_c$, the difference in mean SBP
change between the test and sham groups.
Estimate
Using the posterior distribution from the combined pivotal and PoC data, the
probability of a treatment effect favoring the test group is estimated as
$$P(\beta < 0 \mid y, y_0).$$
Here $y_0$ denotes the PoC data, which is needed for prior specification and
determination of the prior discount levels $\alpha_t$ and $\alpha_c$.
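Because the priors and likelihood summaries are normal, the combine and estimate steps can be sketched with a conjugate shortcut: a power prior N(b0, τ0²/α) combined with a pivotal-data summary N(b̂, se²) yields a normal posterior with precision-weighted mean. All numbers below are hypothetical, and a real analysis would sample the full joint posterior (e.g., via bayesDP) rather than use this shortcut.

```python
import numpy as np

rng = np.random.default_rng(11)

def posterior_arm(b0, tau0, a, bhat, se, n=100_000):
    """Posterior draws for one arm coefficient under a discounted normal prior."""
    prec = a / tau0**2 + 1 / se**2                  # prior precision is scaled by alpha
    mean = (a * b0 / tau0**2 + bhat / se**2) / prec
    return rng.normal(mean, np.sqrt(1 / prec), n)

# hypothetical PoC estimates (b0, tau0), discounts (a), pivotal estimates (bhat, se)
beta_t = posterior_arm(b0=-8.0, tau0=2.0, a=0.6, bhat=-7.0, se=1.5)
beta_c = posterior_arm(b0=-2.0, tau0=2.5, a=0.4, bhat=-1.5, se=1.8)
beta = beta_t - beta_c
print((beta < 0).mean())   # Monte Carlo estimate of P(beta < 0 | y, y0); success if > 0.975
```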
Role of Simulation
Simulations are critical in both the planning and implementation of the SPYRAL
study. In the planning stage, operating characteristics are evaluated in order to
optimize the design parameters and to facilitate discussion with regulatory author-
ities when seeking alignment on the design. The optimization process in this case
was not a formal procedure involving objective functions (such an approach, while
more rigorous, would have been computationally infeasible), but rather was iterative
and informal, whereby simulations were performed under several combinations of
realistic design parameters – such as sample size, timing and number of interim
looks, discount function parameters, early stopping thresholds – under a range of
plausible effect sizes including the null scenario (β = 0), and results were compared
across scenarios to determine the parameter combination(s) that provided the best
balance of type 1 error rate, power, and interim stopping probabilities in the
judgment of the study team. One advantage of the discount prior approach is the
flexibility to adjust the discount function to keep type 1 error rate at an acceptable
level without needing to change the success criterion.
While the trial is ongoing, simulations are used for repeatedly imputing the
incomplete data to derive estimates of the predictive probability of success for
interim decision-making. Furthermore, ad hoc simulations may be requested by
the Data Monitoring Committee should there be questions on how the efficacy
analysis would look if the trial were to progress under particular scenarios of interest.
Validation of the programs for the Bayesian computations and trial simulations was
performed so as to provide assurance to the sponsor and regulatory reviewers that
the tools for design and implementation are working as intended in a manner
consistent with documentation. An independent team tested specific functions
from the bayesDP package that are intended to be used in the Bayesian analysis,
Fig. 4 Power versus treatment effect (difference in baseline adjusted SBP change at 3 months,
measured in mmHg), using the identity discount function
and tested the simulation program used to establish the trial performance character-
istics that are described in the statistical analysis plan.
For testing of bayesDP functions, specific test cases were derived such that the
output being tested can be determined exactly using theoretical knowledge when
possible. In cases where this cannot be done, outputs were compared with simulation
results obtained by independent means, or evaluated for consistency with known
theoretical properties.
For example, if the discount parameters are defined such that no borrowing from
the prior is allowed (e.g., by setting αmax = 0), then the posterior distribution of the
treatment effect β follows a scaled t-distribution, hence posterior samples generated
by bayesDP were compared with their expected theoretical values. The difference
(stochastic error) between the posterior sample and the theoretical distribution was
quantified using the Kolmogorov-Smirnov distance, and the average and maximum
distance over several runs was summarized in the validation report. On the other
Fig. 5 Power versus treatment effect (difference in baseline adjusted SBP change at 3 months,
measured in mmHg), using the Weibull discount function with shape 3 and scale 0.5
hand, if no restrictions are placed on the amount of borrowing from the prior except
as dictated by the discount function (αmax = 1), then the posterior distribution of β no
longer has an analytically convenient form. In this case, the posterior samples from
bayesDP were compared with posterior samples obtained using Stan, a Bayesian
computation tool in common use.
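The no-borrowing check can be mimicked in a few lines: draw posterior samples, then measure the Kolmogorov-Smirnov distance to the assumed theoretical t posterior. The distribution parameters below are made-up stand-ins for what theory would dictate in a given test case.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
df, loc, scale = 40, -5.0, 1.2         # assumed theoretical posterior of beta
samples = stats.t.rvs(df, loc=loc, scale=scale, size=20_000, random_state=rng)
ks = stats.kstest(samples, stats.t.cdf, args=(df, loc, scale))
print(ks.statistic)                     # KS distance; summarize average/max over runs
```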
The strategy for validating the simulation program consisted of code review to
map the logic of a single simulated trial, code review to map the logic of the
execution and summary of multiple simulations, and using the program to run
repeated simulations under different scenarios and ensuring operating characteristics
behave as expected as the input parameters are allowed to vary. In particular, it was
verified that power is a monotone function of sample size and effect size.
The simulation program, bayesDP source code (available for download from the
Comprehensive R Archive Network), and validation report were made available to
regulatory authorities for review.
Key Facts
Cross-References
References
Antonijevic Z, Pinheiro J, Fardipour P, Lewis RJ (2010) Impact of dose selection strategies used in
phase II on the probability of success in phase III. Stat Biopharm Res 2(4):469–486
Arnold B, Hogan D, Colford J, Hubbard A (2011) Simulation methods to estimate design power: an
overview for applied research. BMC Med Res Methodol 11:94
Benda N, Branson M, Maurer W, Friede T (2010) Aspects of modernizing drug development using
clinical scenario planning and evaluation. Drug Inf J 44:299–315
Berry SM (ed) (2011) Bayesian adaptive methods for clinical trials. Chapman & Hall/CRC
biostatistics series. CRC Press, Boca Raton. 305 p
Bhatt DL, Kandzari DE, O’Neill WW, D’Agostino R, Flack JM, Katzen BT (2014) A controlled
trial of renal denervation for resistant hypertension. N Engl J Med 370:1393–1401
Chang M (2011) Monte Carlo simulation for the pharmaceutical industry: concepts, algorithms, and
case studies. Chapman & Hall/CRC biostatistics series. CRC Press, Boca Raton
Cui L, Hung HMJ, Wang S (1999) Modification of sample size in group sequential clinical trials.
Biometrics 55:853–857
Dmitrienko A, Pulkstenis E (2017) Clinical trial optimization using R. Chapman & Hall/CRC
biostatistics series. CRC Press, Boca Raton
East 6 (2018) Statistical software for the design, simulation and monitoring clinical trials. Cytel Inc.,
Cambridge, MA
Evans SR (2010) Fundamentals of clinical trial design. J Exp Stroke Transl Med 3(1):19–27
Friede T, Nicholas R, Stallard N, Todd S, Parsons NR, Valdes-Marquez E, Chataway J (2010)
Refinement of the clinical scenario evaluation framework for assessment of competing devel-
opment strategies with an application to multiple sclerosis. Drug Inf J 44:713–718
Gao P, Ware J, Mehta C (2008) Sample size re-estimation for adaptive sequential design in clinical
trials. J Biopharm Stat 18:1184–1196
Haddad T, Himes A, Thompson L, Irony T, Nair R (2017) Incorporation of stochastic engineering
models as prior information in Bayesian medical device trials. J Biopharm Stat 27:1089–1103
Ibrahim JG, Chen M-H, Gwon Y, Chen F (2015) The power prior: theory and applications. Stat Med
34(28):3724–3749
Jennison C, Turnbull BW (2000) Group sequential methods with applications to clinical trials.
Chapman and Hall/CRC, London
Jiang Z, Song Y, Shou Q, Xia J, Wang W (2014) A Bayesian prediction model between a biomarker
and clinical endpoint for dichotomous variables. Trials 15:500
Kim K, Tsiatis AA (1990) Study duration for clinical trials with survival response and early
stopping rule. Biometrics 46:81–92
Lan KKG, DeMets DL (1983) Discrete sequential boundaries for clinical trials. Biometrika
70:659–663
Mehta CR, Pocock SJ (2011) Adaptive increase in sample size when interim results are promising: a
practical guide with examples. Stat Med 30:3267–3284
Muller P, Berry D, Grieve A, Smith M, Krams M (2007) Simulation-based sequential Bayesian
design. J Stat Plann Inference 137:3140–3150
Musgrove D, Haddad T (2017) bayesDP: tools for the Bayesian discount prior function. https://CRAN.R-project.org/package=bayesDP
Paux G, Dmitrienko A (2016) Mediana: clinical trial simulations. R package version 1.0.4. http://
gpaux.github.io/Mediana/
Robert C, Casella G (2010) Introducing Monte Carlo methods with R. Springer, New York
Suess E, Trumbo B (2010) Introduction to probability simulation and Gibbs sampling with R.
Springer, New York
Thompson JR (1999) Simulation: a modeler’s approach. Wiley, New York
Townsend RR, Mahfoud F, Kandzari DE, Kario K, Pocock S, Weber MA (2017) Catheter-based
renal denervation in patients with uncontrolled hypertension in the absence of antihypertensive
medications (SPYRAL HTN-OFF MED): a randomised, sham-controlled, proof-of-concept
trial. Lancet 390(10108):2160–2170
Part VII
Analysis
Preview of Counting and Analysis Principles
81
Nancy L. Geller
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1590
Who Counts? Everyone Randomized . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1590
What Happens when Things Are Not Perfect? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1591
Missing Outcome Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1591
Analyses Other than Intention to Treat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1592
Analysis Principles in Complex Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1593
Other Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1594
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1595
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1595
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1595
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1596
Abstract
This chapter provides an introduction to the Section on Analysis. The chapters in
this section range from elementary design and analysis considerations to many
more advanced topics. In this preview, each chapter is briefly mentioned in turn
and the reader is invited to delve more deeply into the individual chapter for
details.
Keywords
Analysis of clinical trials · Intention to treat · Missing outcome data · Non-
compliance · Statistical analysis plan · Analyses other than intent-to-treat
N. L. Geller (*)
National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD, USA
e-mail: [email protected]
Introduction
The chapters in this section cover a broad range of topics, some elementary and some
advanced. This introduction gives a broad overview, with brief reference to the
contents of each chapter.

Who Counts? Everyone Randomized
A fundamental tenet of clinical trial methodology is that the analysis of a clinical trial
should account for everyone randomized. The basic principle of including all
randomized subjects in the primary analysis according to their randomized treatment
assignment (and not according to treatment received) is known as the intention-to-
treat (ITT) principle. Good discussions on this topic are found in Lachin (2000) and
Friedman et al. (2015). A summary of ITT and alternatives is given in ▶ Chap. 82,
“Intention to Treat and Alternative Approaches,” by Goldberg. There are also three
other chapters in this book that briefly discuss ITT: ▶ Chaps. 100, “Causal Inference:
Efficacy and Mechanism Evaluation,” ▶ 93, “Adherence Adjusted Estimates in
Randomized Clinical Trials,” and ▶ 84, “Estimands and Sensitivity Analyses.”
Because randomization assures balance on average in baseline factors, both
known and unknown, an ITT analysis is an unbiased comparison between the
treatments among all randomized subjects, presumably defined by the eligibility
criteria of the trial. An ITT analysis evaluates a treatment policy. It ignores
non-adherence, withdrawal from the trial, and treatment stoppages (even if the
protocol allows them). Even dropping randomized subjects who receive no treatment
can lead to biased results; for example, in a trial of treatment compared to placebo,
those assigned to active treatment who do not take the treatment make treatment
look more like placebo. Further, it is likely that adherers differ from non-adherers in
ways that are difficult to assess.
A proper ITT analysis requires outcomes on all randomized subjects, and one
should do the best possible to obtain these outcomes, whether or not the subject
remains on trial or takes the assigned treatment as planned.
In some trials, subjects have been excluded from the trial after they are random-
ized (whether or not they received their assigned treatments). This may well lead to
spurious results. A classic example of excluding randomized subjects from analysis
is the Anturane reinfarction trial (The Anturane Reinfarction Trial Research Group
1978, 1980) which was a double-blind placebo-controlled trial of sulfinpyrazone in
recent post-myocardial infarction (post-MI) patients. The primary endpoint was
sudden cardiac death within six months. Only those who received therapy for at
least seven days were considered “analyzable.” Of the 1620 patients randomized,
this eliminated 145 patients, more-or-less equally distributed in the two treatment
groups. Results reported were overwhelmingly positive for sulfinpyrazone. Since
sulfinpyrazone was an approved drug only for gout, the sponsor went before the
FDA to have the drug approved for a new indication, sudden cardiac death in
patients who were within 6 months post-MI. The FDA review of the data revealed
problems with the exclusion of randomized patients from the analysis, and the new
indication was not approved (Temple and Pledger 1980).

What Happens when Things Are Not Perfect?
Much can go wrong, making the ITT principle not so easy to implement, even in the
two-armed randomized comparison with a well-defined single endpoint. Two simple
examples are missing outcome data and non-compliance. Often when there are only
a few subjects with missing outcome data, they are censored at their last follow
up. This inherently assumes that the reason for the missingness is unrelated to treatment
assignment, which usually has no basis. Non-compliance can make the outcomes of
the treatments look more alike. Of course, a pharmaceutical company is interested in
the effect of its product in those who take it.
These two issues have led to a great deal of statistical methodology to evaluate
clinical trial results accounting for missing outcome data and non-compliance.
Missing Outcome Data

Many trials ignore missing outcome data, often citing that only a small percent of those
randomized have been lost to follow up and have missing outcomes. Such an analysis
assumes that data are missing completely at random (MCAR), that is, that the trial
results would not change if we did have those missing data. That is a strong assump-
tion, as subjects are “lost” for reasons often related to treatment assignment. Examples
vary from experiencing toxicity (whether or not related to the trial) to just being “sick
and tired” of all of the necessary visits, to development of other conditions that require
the subject’s attention. One way in which the MCAR assumption is justified is to
compare baseline data of those missing outcome data to those with outcome data. Of
course, not finding a difference does not guarantee there is no difference.
In a case where a few subjects provide baseline data and no follow up
(i.e., subjects drop out after baseline) and, further, there is balance in drop outs
between treatment arms, the Guidance for Clinical Trials (1998) suggested that
dropping such patients from the trial analysis may be justified. This is acceptable in some cases
(c.f. Choi et al. 2020), but usually there is some data beyond baseline even when
subjects are lost to follow up. There is a vast literature dealing with the problem of
missing outcome data, starting with the classical book by Little and Rubin (2002,
second edition). An introduction to methods for missing data is given in ▶ Chap. 86,
“Missing Data,” by Tong, Li, and Allen.
A common way to deal with missing outcome data is to use the data in the trial to
substitute or impute values for the missing data. A simple way to do this is to
consider the best and worst case scenarios. That is, the best case scenario would give
the most favorable values (among trial outcomes) for one treatment group and the
least favorable values for the other treatment group. The worst case scenario would
do the reverse. If there are few missing outcomes, these two methods could lead to
the same trial result, which would be highly satisfying. However, these simple
methods can give vastly different estimates of treatment effect, more so with more
missing data. They also underestimate the standard errors because the uncertainty in
the missing values is not considered. Methods that substitute one value for missing
data are called single imputation methods.
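A minimal sketch of the best-case/worst-case approach for a binary success outcome follows; the data, labels, and function name are hypothetical.

```python
import numpy as np

def impute_extremes(outcome, arm):
    """outcome: 1/0/np.nan success indicator; arm: 'active' or 'control' labels."""
    best, worst = outcome.copy(), outcome.copy()
    miss = np.isnan(outcome)
    best[miss] = np.where(arm[miss] == "active", 1.0, 0.0)   # favors the active arm
    worst[miss] = np.where(arm[miss] == "active", 0.0, 1.0)  # favors the control arm
    return best, worst

outcome = np.array([1, 0, np.nan, 1, np.nan, 0], dtype=float)
arm = np.array(["active", "control", "active", "control", "control", "active"])
best, worst = impute_extremes(outcome, arm)
print(best)   # missing filled with successes on active, failures on control
print(worst)  # the reverse
```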
Rather than replacing missing data with one substituted value, the distribution of
observed values may be used by devising models to predict the outcome variable
based on the complete data and then using these models to estimate the missing
outcome data. Doing this multiple times yields an estimate of each outcome as well
as a better estimate of its standard error than single imputation methods. There are
several different multiple imputation methods that may be used (Sterne et al. 2009).
Others advocate supervised learning as far superior for imputation (Chakrabortty and
Cai 2018). Thus, there is no definitive way to deal with missing outcome data. Six
different methods are described by Badr (2019). All methods require some statistical
assumptions and thus will often be controversial. The best that investigators can do is
to set forth the primary method that will be used for the primary analysis, as well as a
number of sensitivity analyses to give confidence to the primary results. How to
interpret results if some sensitivity analyses lead to a different trial outcome also
should be considered. Most important is to plan for how missing data will be dealt
with in the statistical analysis plan, because missing data are almost inevitable.
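The multiple-imputation idea, including Rubin's rules for pooling, can be sketched on toy data as below. This simplified version regresses the outcome on a complete baseline covariate and, unlike a full implementation, does not redraw the regression coefficients from their posterior for each imputation; real analyses would use a purpose-built package.

```python
import numpy as np

rng = np.random.default_rng(5)

n, M = 200, 20
x = rng.normal(0, 1, n)                   # complete baseline covariate
y = 2.0 + 1.5 * x + rng.normal(0, 1, n)
y[rng.random(n) < 0.2] = np.nan           # ~20% of outcomes missing

def impute_once(x, y):
    obs = ~np.isnan(y)
    X = np.column_stack([np.ones(obs.sum()), x[obs]])
    beta, *_ = np.linalg.lstsq(X, y[obs], rcond=None)
    sd = np.std(y[obs] - X @ beta)
    y_imp = y.copy()
    miss = ~obs
    y_imp[miss] = beta[0] + beta[1] * x[miss] + rng.normal(0, sd, miss.sum())
    return y_imp

est, var = [], []
for _ in range(M):
    y_m = impute_once(x, y)
    est.append(y_m.mean())                # the target quantity: mean outcome
    var.append(y_m.var(ddof=1) / n)

qbar = np.mean(est)                       # pooled estimate across imputations
total_var = np.mean(var) + (1 + 1 / M) * np.var(est, ddof=1)   # Rubin's rules
print(qbar, np.sqrt(total_var))
```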
Some still call an analysis which ignores missing data (e.g., patient was lost to
follow up and dropped from the primary analysis) an ITT analysis, which is both
misleading and a misuse of the term.
Analyses Other than Intention to Treat

Many who undertake clinical trials, notably in the pharmaceutical industry, are
interested in other analyses than ITT analyses because the ITT estimates of treatment
effect “may not provide an intuitive or clinically meaningful estimate of treatment
effects” (Ruberg and Akacha 2017).
A treatment effect of primary interest, which may not be the ITT estimate, is
called an estimand. A description of estimands, including the need for careful
definition and for planning sensitivity analysis is provided in ▶ Chap. 84,
“Estimands and Sensitivity Analyses,” by Russek-Cohen and Petullo. Adherence-
adjusted analyses are also discussed in ▶ Chap. 92, “Statistical Analysis of Patient-
Reported Outcomes in Clinical Trials,” by Mazza and Dueck, and causal estimands
are discussed in ▶ Chap. 100, “Causal Inference: Efficacy and Mechanism Evalua-
tion,” by Landau and Emsley.
Ruberg and Akacha suggest that ITT analyses do not adjust for confounding
factors post-randomization (such as drug discontinuation or addition of rescue
medication). They require an explicit definition of what treatment effect is of primary
interest (“relevant and meaningful”), which they call the estimand. They define four
estimands other than the ITT estimand. One is based on a composite variable which
combines a change in a symptom score with discontinuation of study drug due to an
adverse event. “Success” is improvement in symptom score and completion of
taking the study drug. A second estimand is the treatment effect if all subjects
adhered to study medication for the period of the trial. A third is the effect on those
who can take the study drug(s) without adverse events. The fourth is the
treatment effect for each subject before an adverse event or discontinuation.
Ruberg and Akacha claim that the probability of discontinuation of a study drug
due to either adverse events or lack of efficacy (or both) may be quantified and
treatment comparisons made using usual statistical methods if reasons for discon-
tinuation are carefully defined. They do not consider “physician choice” or “loss to
follow up” to be sufficiently detailed. They consider administrative discontinuation
(e.g., the patient moves away from the center) to be missing completely at random,
so that they don’t include such patients. To estimate efficacy and safety for those able
to adhere to study treatment, adherence must be first defined and would be trial
specific, perhaps taking 70% or 80% of the drug. The authors claim that this
combination of probability of discontinuing study drug due to AE, probability of
discontinuation for lack of efficacy and efficacy in adherers provides a more
complete and meaningful description of drug effect than the ITT estimate.
While the FDA demands an ITT analysis, perhaps allowing other analyses, the
European Medicines Agency (EMA) has produced ICH E9 (R1), Addendum on
estimands and sensitivity analysis in clinical trials to the guideline on statistical
principles for clinical trials (ema.europa.eu/documents/scientific-guideline/ich-e9-
r1-addendum-estimands-sensitivity-analysis-clinical-trials-guideline-statistical_
en.pdf). EMA allows treatment effect to be specified through an estimand (rather
than through the ITT estimate):
The definition of a treatment effect, specified through an estimand, should consider whether
values of the variable after an intercurrent event are relevant, as well as how to account for
the (possibly treatment-related) occurrence or non-occurrence of the event itself. More
formally, an estimand defines in detail what needs to be estimated to address a specific
scientific question of interest.
Analysis Principles in Complex Trials

Several chapters deal with elementary statistical methods to perform hypothesis tests
and parameter estimation (see the ▶ Chaps. 83, “Estimation and Hypothesis Testing,”
▶ 87, “Essential Statistical Tests,” and ▶ 88, “Nonparametric Survival Analysis”). So why
is there need for so much more statistical methodology?
Over time, clinical trials have become more complex and often attempt to answer
more complicated questions than two-armed trials with one primary endpoint. In
many cases, incorporating covariates into the primary analysis increases the power to
detect differences. An introduction to regression methods for dichotomous or
Other Analyses
Potential prognostic factors are among the data collected in clinical trials for use in the
primary analysis or for modeling response. In Sect. 7.10, Li presents insight into the
process of finding prognostic factors as well as suggesting use of prognostic factors to
increase the power of the primary statistical analysis. The stability of statistical models
may be assessed by resampling procedures and this is described by Sauerbrei and
Boulesteix (▶ Chap. 96, “Use of Resampling Procedures to Investigate Issues of
Model Building and Its Stability”). ▶ Chapter 101, “Development and Validation of
Risk Prediction Models,” describe risk prediction models and how to develop and
validate them.
Several other statistical problems require specialized analyses. The nonlinear
nature of pharmacokinetic and pharmacodynamic processes requires special analyses
to relate drug exposure to response and several methods are described in ▶ Chap. 98,
“Pharmacokinetic and Pharmacodynamic Modeling,” by Kalaria, Wang, and
Gobburu. The potential for adverse events after a drug or device receives marketing
approval has led to pharmacovigilance, the study of adverse effects of drugs post-
marketing. ▶ Chapter 99, “Safety and Risk Benefit Analyses,” by Guo describes
many methods of benefit-risk analysis.
The main analysis principle here is that the analysis should reflect the specific
goals of the question being answered. There are often multiple methods that might be
used and the onus is on the investigators, in particular the statistical investigators, to
choose the one she or he considers most suitable, or even to derive new methods.
Although controversial, ▶ Chap. 94, “Randomization and Permutation Tests,” advocates
randomization tests for many situations, in particular, for when the assumptions
of the usual approaches are unlikely to be met.
Summary and Conclusion

Clinical trials should be well designed to test a carefully posed primary hypothesis
and/or estimate a well-defined primary parameter. A statistical analysis plan (SAP)
should describe the methodology that will be used to answer the questions posed, both
primary and secondary. The plan should account for all randomized patients, even if
some are missing outcome data. The SAP should be completed before the data are
unblinded. There are many choices for statistical analysis for a given situation and they
are mentioned here and further described in the following chapters.
Key Facts
From the simple two-armed randomized trial with a single primary endpoint, clinical
trials have become more complex, with investigators working in broad research
areas, such as longitudinal data, multiple endpoints, multiple treatment arms, and
data with special characteristics, such as safety data and quality of life data. This
chapter covers many aspects of basic clinical trial analysis as well as many recent
developments.
Cross-References
References
Badr W (2019) 6 different ways to compensate for missing values in a dataset (Data Imputation with
examples). https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-
values-data-imputation-with-examples-6022d9ca0779
Chakrabortty A, Cai T (2018) Efficient and adaptive linear regression in semi-supervised settings.
Ann Stat 46:1541–1572. https://doi.org/10.1214/17-AOS1594
Choi IJ, Kim CG, Lee JY, Kim Y-I, Kook M-C, Park B, Joo J (2020) Family
history of gastric cancer and Helicobacter pylori treatment. N Engl J Med 382:427–436. https://
doi.org/10.1056/NEJMoa1909666
Friedman LM, Furberg CD, DeMets D, Reboussin DM, Granger CB (2015) Fundamentals of
clinical trials, 5th edn. Springer. Chapter 18. ISBN 978-3-319-18539-2
ICH E9 (R1), Addendum on estimands and sensitivity analysis in clinical trials to the guideline on
statistical principles for clinical trials. http://ema.europa.eu/documents/scientific-guideline/ich-
e9-r1-addendum-estimands-sensitivity-analysis-clinical-trials-guideline-statistical_en.pdf
ICH Harmonized Tripartite Guideline Statistical Principles for Clinical Trials E9 (1998). https://
database.ich.org/sites/default/files/E9_Guideline.pdf
Jeffries NO, Troendle JF, Geller NL (2018) Detecting treatment differences in group sequential
longitudinal studies with covariate adjustment. Biometrics 74:1072–1081. https://doi.org/10.
1111/biom.12837
Lachin JM (2000) Statistical considerations in the intent-to-treat principle. Control Clin Trials 21:
167–189
Little RJA, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, Hoboken
Ruberg SJ, Akacha M (2017) Considerations for evaluating treatment effects from RCTs. Clin
Pharmacol Ther. https://doi.org/10.1002/cpt.869
Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR
(2009) Multiple imputation for missing data in epidemiological and clinical research: potential
and pitfalls. Br Med J 338:b2393. https://doi.org/10.1136/bmj.b2393
Temple R, Pledger G (1980) The FDA’s critique of the Anturane Reinfarction trial. N Engl J Med
303:1488–1492
The Anturane Reinfarction Trial Research Group (1978) Sulfinpyrazone in the prevention of cardiac
death after myocardial infarction – the Anturane Reinfarction trial. N Engl J Med 298:289–295
The Anturane Reinfarction Trial Research Group (1980) Sulfinpyrazone in the prevention of sudden
death after myocardial infarction. N Engl J Med 302:250–256.
Intention to Treat and Alternative
Approaches 82
Judith D. Goldberg
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1598
Randomized Controlled Clinical Trials (RCTs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1600
Examples of RCTs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1600
Example: Salk Vaccine Trial – Vaccine Efficacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1600
Example: The HIP Breast Cancer Screening Study – Screening for Early
Detection of Disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1601
Example: The Polycythemia Vera Study Group PVSG-01 – A Randomized
Multicenter Trial (Open Label) for Chronic Disease . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1602
Example: Randomized Phase III Trial in Chronic Disease – MPD-RC 112 Phase III
Trial of Frontline Pegylated Interferon Alpha-2a (PEG) Versus Hydroxyurea (HU)
in High-Risk Polycythemia Vera (PV) and Essential Thrombocythemia (ET):
NCT01258856 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1604
ITT Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1604
Alternatives to ITT Population for Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1606
Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1607
Alternative Approaches to Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1608
Noninferiority and Equivalence Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1609
Cluster Randomized Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1610
Example: Cluster Randomized Trials and ITT – Online Wound Electronic Medical
Record to Reduce Lower Extremity Amputations in Diabetics – A Cluster
Randomized Trial [AHRQ: R01 HS019218-01] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1610
Some Additional Design Considerations for ITT Analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1610
Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1611
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1612
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1612
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1612
J. D. Goldberg (*)
Department of Population Health and Environmental Medicine, New York University School of
Medicine, New York, NY, USA
e-mail: [email protected]
Abstract
“Intention to treat” or “intent to treat” (ITT) is the principal approach for the
evaluation of the treatment or intervention effect in a randomized clinical trial
(RCT). In an RCT, patients or subjects are randomized to one or more study
interventions according to a formal protocol that describes the entry criteria,
study treatments, follow-up plans, and statistical analysis approaches. In an
ideal trial, all randomized patients or subjects have the correct diagnosis, are
randomized correctly, comply with the treatment, and are evaluated according
to the study plan. These patients would have complete data and follow-up. In
this case, the ITT analysis that respects the randomization principle provides
unbiased tests of the null hypothesis that there is no treatment or intervention
effect. The goal in many cases is to establish the efficacy of a treatment or
intervention: does the planned treatment work? In practice, however, because
of the many ways in which the ideal is not the reality, an ITT analysis provides
a comparative evaluation of the effectiveness of the randomized intervention
strategy (does the strategy work), rather than of the efficacy of the planned
intervention itself. Examples of blinded, unblinded, screening, and drug clin-
ical trials are provided. Approaches to handling deviations from the ideal are
described.
Keywords
Intent to treat (intention to treat) · Randomized controlled trial (RCT) ·
Compliance · Protocol · Efficacy · Effectiveness · Causality · Missing data
Introduction
The randomized clinical trial (RCT) is the gold standard that is used to establish the
efficacy or effectiveness of a new treatment or intervention. In an RCT, participants
(subjects or patients) are assigned to one or more treatment or intervention groups
using a prespecified random allocation scheme. Ideally, allocation is double blind:
both participants and treating (and evaluating) staff are blinded (masked) to
treatment assignment. Further,
under these ideal circumstances, only subjects who have met all of the eligibility
criteria for entry into the trial would be randomized as close to the initiation of the
treatment or intervention as possible. This set of all randomized patients or subjects
comprises what is generally defined as the “intent-to-treat” or “intention-to-treat”
(ITT) population. Patients in the ITT population are included in the analysis in the group to
which they were assigned (see, e.g., Ellenberg 1996; Piantadosi 1997; DeMets 2004;
Goldberg and Belitskaya-Levy 2008a; Friedman et al. 1998). Adherence to intention
to treat in the analysis requires that all randomized patients or subjects be included in
the analyses regardless of whether or not they received the assigned treatment,
complied with the trial requirements, completed the trial, or even met the entry
criteria for the trial. This approach is preferred for the analysis of RCTs since it
respects the principle of randomization and provides unbiased tests of the null
hypothesis that there is no treatment or intervention effect, although the estimate
of the treatment effect may still be biased (Harrington 2000). There are, however,
multiple ways in which the actual deviates from the ideal. RCTs come in many
flavors, have different objectives, and are conducted with varying levels of quality.
Different trial objectives, issues in trial implementation and conduct that include
missing data, patient/subject noncompliance, and differing degrees of follow-up lead
to deviations from the ideal that have to be recognized and handled in the analysis of
such trials.
Under the ITT principle, in a randomized trial, patients remain in the trial under
the following circumstances which have implications for analysis and interpretation
(Goldberg and Belitskaya-Levy 2008a, b):
• If the patient is found to not have the disease under study. This can occur when
final verification of disease status is based on special tests that are completed after
randomization or on a central review of patient eligibility.
• If the patient never receives a single dose of the study drug.
• If the patient does not comply with the assigned treatment regimen or does not
complete the course of treatment.
• If the patient withdraws from the study for any reason.
This chapter reviews concepts for randomized trials, the issues regarding imple-
mentation of the operational definition of ITT in specific trials, and the implications
of the operational definition on the statistical analysis as the trial proceeds. These
issues range from evaluation of the impact of the deviation from the ITT model to the
consideration of other potential paradigms based on differing definitions of the
population included in the analysis. These alternatives range from various modifi-
cations of ITT (mITT), such as all treated patients, to all treated patients with the
correct diagnosis and to all treated patients who complied with assigned treatment,
among others. Further, errors in treatment allocation and diagnosis at the time of
randomization as well as missing outcomes and errors or misclassification of out-
comes and errors of measurement or misclassification of covariates including strat-
ification factors, multicenter deviations and heterogeneity, and missing data of all
kinds need to be considered.
Traditional approaches to handle these issues as well as recent alternative
approaches to analysis are described. Note that the deviations from the planned
randomization and the ITT paradigm move the randomized clinical trial to an
observational trial setting that leads to additional considerations in analysis.
Examples that illustrate the evolution of the concept of “intention to treat” and its
implications are provided to frame these issues. In addition, the interpretation of ITT
in noninferiority and equivalence RCTs is discussed, as are analogues to ITT
analyses in nonrandomized trials and observational studies.
The discussion in this chapter also includes considerations for the statistical
analysis under the ITT paradigm for different types of clinical trials with different
types of objectives.
In the context of medical research and the search to improve treatments or other
interventions to improve outcomes for patients or participants, the RCT provides the
controlled experimental setting to evaluate the efficacy or effectiveness of the “new”
treatment or intervention compared to control (either placebo or another active
treatment) in an unbiased, ideally blinded manner. An RCT is conducted under a
clinical protocol that explicitly defines the trial objectives; the primary outcome(s);
how, when, and on whom the outcome(s) will be measured; and the measures of the
effects of the intervention (National Research Council 2010) with a focus on
prevention of missing data of all types. The benefits of this controlled experimental
approach are that any observed differences between the two (or more) groups with
respect to the outcome are attributable to the intervention. Both confounding and
selection bias are removed since neither the subject nor the investigator chooses the
treatment assignment (see, e.g., Harrington 2000). In what follows, several examples
of RCTS are provided. These trials illustrate many of the issues that arise in the
analysis and interpretation in the ITT framework and its alternatives.
Examples of RCTs
Example: Salk Vaccine Trial – Vaccine Efficacy
The Salk polio vaccine trial, a classic example of a prevention trial, established the
efficacy of the new killed virus vaccine to provide protection against paralysis or
death from poliomyelitis (Brownlee 1955; Francis et al. 1955; Meier 1957, 1989).
While there were safety issues associated with the use of a killed virus vaccine
(Meier 1957, 1989), the National Foundation for Infantile Paralysis (NFIP) advisory
committee agreed that the Salk vaccine was safe and could produce desired antibody
levels in children who had been tested. Thus, “it remained to prove that the vaccine
actually would prevent polio in exposed individuals. It would be unjustified to
release such a vaccine for general use without convincing proof of its effectiveness,
so it was determined that a large-scale ‘field-trial’ should be undertaken” (Meier
1989).
Various approaches to the design of such a trial were considered that included the
vital statistics approach, an observed control approach, and lastly, an RCT with
randomization to a placebo control group. While the ideal design was the RCT, the
general reluctance to randomize children to placebo injections led to the choice of an
observed control study in which children in grade 2 would receive the vaccine and
children in grades 1 and 3 would be observed for the occurrence of polio. The final
study, however, also included a double blind RCT in which 750,000 children were
randomized to injections with placebo or with the vaccine. The trial was conducted
in a relatively short timeframe with endpoints observed within the time period.
While the results of the observed control portion of the trial favored the vaccine,
the results of the RCT portion were unequivocal and provided compelling evidence
of the effectiveness of the vaccine. The primary results of the RCT were based on the
ITT analysis of all randomized children who were included in their assigned
treatment group regardless of whether or not they received the injections as planned;
that is, the primary comparison included all subjects randomized to the polio
vaccine or the placebo in each of the randomized groups, including those who were
never vaccinated. In the observed control study, it was
known who received and who did not receive the vaccination among the intervention
subjects, but not among the control subjects. The control group then inherently could
consist of subjects who would have received the vaccine and those who would not
have received it, so any fair comparison must consider all subjects in each group.
Example: The HIP Breast Cancer Screening Study – Screening for Early Detection of Disease
A classic example of a randomized trial of screening for the early detection of breast
cancer is the HIP Breast Cancer Screening Study. This RCT was designed to evaluate
the effectiveness of mammography, at the time an untested tool for early detection, in
combination with a clinical examination, to be compared with “usual care.” The
primary question was whether a screening program that incorporated mammography
could reduce mortality from breast cancer. The study was conducted in the Health
Insurance Plan of Greater New York, one of the first health maintenance organiza-
tions (HMO) in the USA. Sixty-two thousand women were randomly chosen across
all of the clinical sites in New York City. Of these women, approximately 30,000 were
randomized to be invited for an initial screening examination and 3 subsequent
annual examinations. The remaining 30,000 women were followed for diagnosis of
and mortality from breast cancer. This trial, initiated in 1963, predates the require-
ments for Institutional Review Board approvals and informed consent requirements
that have since become the norm. The trial design and results are described by
Shapiro et al. (1974, 1988).
Of the 30,000 women who were randomized to be invited for screening, 20,000
accepted the initial invitation; 59% of these women completed all 4 examinations. In
the 5 years of the screening study, 299 cases of breast cancer were diagnosed among
the women randomized to the screening group: 225 of these cases were detected
among women who had a screening examination; 74 cases were detected among
those who refused the invitations. There were 285 cases detected in the control group
(Shapiro et al. 1974). Table 1 shows the cumulative numbers of deaths in the first
5 years: the women who refused screening had a higher observed death rate from
breast cancer than the women who were screened; death rates for all causes reflect
the same phenomenon. The primary results of the trial rest on the comparison of the
5-year death rates in the total group randomized to screening (1.3/1000) and the
total control group (2.0/1000).

Table 1 HIP Breast Cancer Screening Study: cumulative deaths in the first 5 years from entry

Group                            Number of women   # BC deaths   BC deaths/1000   # All other deaths   All deaths/1000
Total randomized to screening    31,000            39            1.3              837                  27
  Screened                       20,200            23            1.1              428                  21
  Refused                        10,800            16            1.5              409                  38
Total randomized to control      31,000            63            2.0              879                  28
The control group consists of a group of women who would have accepted the
invitation to be screened and those who would have refused. Within the control
group, who falls into each of these groups is unknown, and, therefore, the only fair
comparison is of the rates in the total randomized groups, the ITT comparison. Fink
et al. (1968) studied the "reluctant" participants who refused screening in the
group randomized to screening and identified differences between those who did
and did not participate that were related to socioeconomic status and education, as well as
the presence of other health issues that took priority for them. While this trial was
randomized, it was not blinded in the sense that it was known which women were in
each group; the control group received usual care.
evaluated because of the nature of the HMO that did have medical records for all
participants. There was central review of biopsy and surgery records. Follow-up was
carried out for all women in the trial from these records and from death records.
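To make the arithmetic of the ITT comparison concrete, the sketch below recomputes the Table 1 rates in Python. The two-proportion z statistic shown is a standard textbook contrast included here for illustration, not necessarily the analysis used in the original HIP reports.

```python
from math import sqrt

def rate_per_1000(deaths, n):
    return 1000 * deaths / n

# Breast cancer deaths / women randomized, from Table 1
itt_screening = (39, 31_000)   # all randomized to screening (ITT)
itt_control   = (63, 31_000)   # all randomized to control
screened_only = (23, 20_200)   # self-selected subgroup who accepted screening

print(rate_per_1000(*itt_screening))   # 1.26 -> 1.3 per 1000
print(rate_per_1000(*itt_control))     # 2.03 -> 2.0 per 1000
print(rate_per_1000(*screened_only))   # 1.14: biased low by self-selection

# ITT contrast: all randomized vs all randomized (two-proportion z statistic)
d1, n1 = itt_screening
d2, n2 = itt_control
p1, p2 = d1 / n1, d2 / n2
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
print(f"difference = {p1 - p2:.5f}, z = {(p1 - p2) / se:.2f}")
```

Comparing the screened subgroup with the control group instead would mix the screening effect with the self-selection documented by Fink et al. (1968).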
Screening trials have additional considerations that will not be discussed here.
Note, however, that self-selection of participants, handled primarily through the ITT
primary analysis, can still be an issue in interpretation. Other issues that impact the
analysis and interpretation of the results of these trials include the distinctions between
diagnosis at an initial screen, reflecting disease prevalence, and diagnosis at subse-
quent screens, identifying incident cases; the definition of a positive screen in a multi-
modality setting; the lack of true follow-up of negatives on screening (false negatives);
lead time bias; length-biased sampling; and misclassification of disease status.
Example: Randomized Phase III Trial in Chronic Disease – MPD-RC 112 Phase III Trial of Frontline Pegylated Interferon Alpha-2a (PEG) Versus Hydroxyurea (HU) in High-Risk Polycythemia Vera (PV) and Essential Thrombocythemia (ET): NCT01258856
In this randomized open label (unblinded) phase III trial, the primary objective was
to compare complete hematologic response rates determined by blinded central
review in patients randomized to treatment by PEG (new treatment) and HU
(standard of care, readily available by prescription) by the end of 12 months of
therapy with planned analyses in each of the two disease strata (PV and ET). Patients
were to be within 1 year of initial diagnosis and to be treatment naïve with less than
3 months of HU therapy. The trial was originally designed to randomize 612 patients
with 2 planned interim analyses, the first to occur when 25% of the accrued patients
had adequate time on study to be evaluable for response. Randomization began in September 2011 and
continued through June 2016 at 24 centers in 6 countries. The study was amended
multiple times because of slow enrollment to a final sample size of 170 patients
across the 2 disease strata with 1 interim analysis planned to be conducted when 75
patients were evaluable for response. Entry criteria were relaxed so that allowable
prior duration of disease was lengthened from less than 1 year in the original
protocol to less than 5 years in a situation where the diagnosis of the disease could
be made only at the time of an identified complication. Thus, this amendment would
enroll more patients with indolent disease who did not have an early complication
that would have rendered them ineligible for the study. This trial illustrates many
additional difficulties in the conduct of an RCT when one of treatments is the current
standard of care available outside of the trial. In fact, the sponsor of the experimental
arm stopped drug supply for administrative reasons. While treatment assignments
were implemented using a blinded randomization scheme, the actual assigned
treatment was known to both the investigators and patients because of the different
methods of delivery. In this trial, 7% of the 86 HU patients never received any
treatment, while all of the 82 PEG patients received treatment. When the study was
closed with the reduced sample size, the final ITT CR rates were 37.2% in the HU
group and 35.4% in the PEG group (Mascarenhas et al. 2018).
ITT Principle
Alternatives to ITT Population for Analysis
The alternatives frequently used in place of the ITT principle to define the study
population for comparison in an RCT include modifications to the ITT (mITT), per
protocol (PP), as treated (AT), and variations among these and other options to define
the groups that are being compared in a trial. Each of these alternatives requires
careful definition within a trial protocol to ensure that there is clarity with respect to
the details.
mITT is often only vaguely defined and requires specificity to even evaluate how
it would operationally impact the analysis of an RCT. For example, patients ran-
domized as eligible, but subsequently found to be ineligible, could be excluded. Or,
the modification could be just to include all patients randomized to a trial who
actually received at least one dose of study treatment as randomized. The closest to
the ITT paradigm would be to include these patients as randomized. In randomized
trials of infectious diseases, a modified ITT approach is often used. In this case,
outcomes in the two treatment groups are compared for patients who actually have
diseases caused by organisms sensitive to the treatments. But, in practice, treatment
is given presumptively since the results of sensitivity testing are often not available
at the initiation of treatment. In this case, the ITT approach provides an approxima-
tion to the treatment strategy that would be implemented in practice.
Per protocol (PP) populations are defined to include patients who met the
protocol criteria for entry, complied with the treatment regimen based on the trial
definition of compliance, and completed follow-up for the outcome. The PP approach
includes, for example, the analysis of patients evaluable for response in oncology trials,
in which only those patients who received sufficient treatment to be evaluated at the
primary response outcome assessment are included in the analysis, eliminating those
patients who might have deteriorated on treatment and gone on to other treatments or
died prior to the evaluation time. Clearly, this can provide misleading results
depending on the distributions of these patients in the treatment groups being
compared.
As treated (AT) populations would include patients with assignment to the
treatment actually received rather than assignment to treatment as randomized. An
as-treated (AT) analysis assigns subjects or patients to the treatment-taken group
regardless of the randomized assignment. As Ellenberg (1996) points out, these AT
analyses assign subjects to groups based on their compliance in the randomized trial.
And the definition of compliance to the assigned treatment can be subjective. In an
AT analysis, for example, a subject assigned to the new treatment may actually take
the standard treatment. In a blinded RCT, subjects can take the standard treatment
on their own only if it is available outside the trial, a common issue in trials that use
available treatments as the standard. In the absence of blinding, when one or both treatments are available
outside the trial setting (such as MPD-RC 112 with both hydroxyurea and PEG
interferon available outside the trial), the problem is exacerbated. In fact, patients
and/or their physicians will comply with the assigned treatment only if it is not
available to them in other ways. In these settings, compliance rates on the random-
ized regimen often differ for the two (or more) treatment arms. The ITT analysis in
this setting provides an unrealistic assessment of the treatment effects, but any other
analyses are biased by the selection process associated with compliance.
Note that safety analyses, however, are conducted appropriately on an AT basis
with subjects assigned to the regimen actually taken, as distinct from the ITT
approach for effectiveness.
There are multiple variants of these broad approaches to identifying the
populations for analysis in an RCT, each of which has limitations. If these multiple
analyses do not differ in any substantive way, however, the risk of incorrectly attributing
efficacy (or lack of efficacy) to a new treatment is reduced.
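As a concrete illustration of how these populations differ, the following sketch builds ITT, mITT, PP, and AT populations from hypothetical per-patient records; the field names and the particular mITT rule (all patients who received at least one dose) are assumptions made for the example, since each trial protocol must define these precisely.

```python
import pandas as pd

# Hypothetical per-patient records (all field names are invented)
trial = pd.DataFrame({
    "id":         [1, 2, 3, 4, 5, 6],
    "randomized": ["new", "new", "new", "std", "std", "std"],
    "received":   ["new", "std", None, "std", "new", "std"],  # None = never dosed
    "eligible":   [True, True, False, True, True, True],
    "compliant":  [True, False, False, True, True, False],
})

itt  = trial                                          # everyone, as randomized
mitt = trial[trial["received"].notna()]               # one common mITT variant
pp   = trial[trial["eligible"] & trial["compliant"]]  # per protocol
at   = trial[trial["received"].notna()].assign(arm=trial["received"])  # as treated

for name, pop in [("ITT", itt), ("mITT", mitt), ("PP", pp), ("AT", at)]:
    print(name, len(pop), "patients")
```

Note that the AT population regroups patients by the treatment actually received (the `arm` column), which is why it is appropriate for safety analyses but breaks the randomization for efficacy comparisons.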
In the setting of a noninferiority or equivalence trial, the use of the ITT population
for analysis reduces any differences between the two groups, favoring the conclusion
of noninferiority or equivalence (see, e.g., Kim and Goldberg 2001; Sanchez and
Chen 2006).
Missing Data
The consensus among clinical trialists is that the gold standard remains the ITT
analysis for the randomized trial. The primary source of deviation from the ITT
paradigm arises from missing data of some type. That said, the NRC report (2010)
and many other authors focus on the need to develop a careful protocol that considers
primary outcomes and includes plans to minimize missing data of all kinds. Missing
data can occur at every stage of a trial with differing implications for analysis.
That missing data are unrelated to treatment or outcome and are missing
completely at random (MCAR) is generally not the case in RCTs. Rather, mis-
singness can be related to treatment but not outcome, missing at random (MAR), a
possible scenario. Such an assumption can lead to overly optimistic estimates of
treatment effects (see ▶ Chap. 84, “Estimands and Sensitivity Analyses”) but can be
useful. Lastly, missingness can be related to both treatment and outcome, that is, not
missing at random (NMAR), a scenario as noted in ▶ Chap. 84, “Estimands and
Sensitivity Analyses,” that is a more likely occurrence than one would like. The
▶ chapter 86, “Missing Data” provides details of approaches to incorporate missing
data into the analysis of an RCT.
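A small simulation can make the distinction concrete. The sketch below (an invented example, not taken from the chapters cited) shows that a complete-case treatment estimate is roughly unbiased when dropout is completely at random but attenuated when missingness depends on the outcome itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
treat = rng.integers(0, 2, n)
y = 1.0 * treat + rng.normal(size=n)      # true treatment effect = 1.0

mcar = rng.random(n) < 0.3                # 30% missing, completely at random
nmar = rng.random(n) < 0.6 * (y > 1)      # larger outcomes more likely missing

def complete_case_effect(missing):
    keep = ~missing
    return y[keep & (treat == 1)].mean() - y[keep & (treat == 0)].mean()

print(complete_case_effect(np.zeros(n, bool)))  # ~1.00 with full data
print(complete_case_effect(mcar))               # ~1.00 under MCAR
print(complete_case_effect(nmar))               # attenuated: missingness tracks y
```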
Patients can be randomized to the incorrect treatment, in incorrect strata, or with
incorrect diagnoses at the outset. In some trials, randomization and treatment occur
based on a presumptive diagnosis while additional testing and review of the entry
criteria continue. At the study design stage, the goal is to randomize as close to the
initiation of treatment as possible with as much confirmed information as possible.
The handling of these types of errors impacts the analysis. In the ITT paradigm,
subjects or patients remain in the trial as randomized. Similarly, if the subject or
patient never receives the randomized treatment, the subject remains in the analysis
as randomized. If the patient does not comply with the assigned treatment, the patient
remains as randomized. And, if the patient withdraws from the study for any reason
at any point, the patient remains in the trial as randomized.
The emphasis in the NRC report (2010) is on minimizing missing data of any
kind. Some baseline data can be missing in any trial. Trials with a single treatment or
intervention encounter and an immediate assessment of the outcome have the
smallest potential for missing data. As the duration of the intervention increases,
the potential for missing data increases and includes subject withdrawal for many
potential reasons that may be related to the intervention (e.g., side effects). As the
length of the follow-up period after the completion of treatment increases, the
potential for loss to follow-up as well as for the use of alternative treatments
increases. Short-term treatment and follow-up minimize missing data; long-term
treatment with end-of-treatment follow-up, and especially long-term post-treatment
follow-up, provides more opportunities for missing data. Post-randomization missing
data can become a major problem for analysis and interpretation (NRC 2010;
Little and Kang 2015).
In short-term trials with, for example, a one-time intervention (e.g., vaccine,
screening test) and a short-term single outcome assessment, missing outcome data
should not pose a major problem. Most clinical trials, however, involve multiple
dosing or treatments over time with follow-up at planned intervals during the active
intervention phase and then long-term follow-up after the intervention is complete.
Because the likelihood that complete data are obtained as planned is reduced, there is
recent interest in extending the concept of the ITT approach through a focus on how
to evaluate the multiple objectives, including the primary objective, of the trial and
choice of the appropriate outcomes (measurements) to be used for these evaluations.
▶ Chapter 84, “Estimands and Sensitivity Analyses,” provides an overview and
summarizes the strategies for the choice of “estimands” for different scenarios
beyond the ITT analysis. In this context, the ITT estimand estimates the effect of
the randomization to treatment on outcome. Among strategies to address intercurrent
events (post randomization) including compliance/noncompliance (Little et al.
2009) are treatment policy estimands similar to the ITT approach, composite end-
points that include intercurrent events on treatment, hypothetical estimands, princi-
pal stratification causal estimates (Frangakis and Rubin 2002; Little and Rubin 2000),
and on-treatment estimates. Compliance can be incorporated into analyses in various
ways including as a covariate. However, the definition of compliance has to be clear
and consistent and defined in the protocol. These concepts are elucidated from a
regulatory standpoint in ICH E9 R1 (2017) and elsewhere.
Alternative Approaches to Analysis
The primary ITT analysis in an RCT provides, as we note above, an estimate of the
treatment strategy defined by a protocol that includes all randomized patients as
randomized. The reality in a trial is often quite different. In addition, different kinds
of trials have different requirements. The AT and PP populations can be analyzed, but
each has different interpretations with respect to trial results. Composite endpoints can
provide a single summary in an ongoing trial that includes events on treatment. For
example, such an endpoint in a survival-type trial could be based on progression-free
survival, the time to disease progression or death, whichever occurs first. The analysis
of long-term outcomes can also be confounded with the use of rescue medications, side
effects, and response or lack of efficacy itself if any of these events contribute to
missing data, particularly with differential rates in the groups being compared. Of
course, it is still possible to have incomplete information with respect to the earlier
endpoint so that there is bias introduced when the first event is death that may or may
not have been preceded by disease progression. Analyses of multiple endpoints and
competing risk analyses can shed some light on these potential biases.
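A minimal sketch of the progression-free survival construction mentioned above: the composite time is the earlier of progression and death, censored if neither is observed by the censoring time. The function and its argument names are hypothetical.

```python
def pfs(t_progression, t_death, t_censor):
    """Return (time, event indicator) for the PFS composite endpoint."""
    event_times = [t for t in (t_progression, t_death) if t is not None]
    if event_times and min(event_times) <= t_censor:
        return min(event_times), 1    # progression or death observed first
    return t_censor, 0                # censored without a PFS event

print(pfs(t_progression=8.0, t_death=None, t_censor=12.0))   # (8.0, 1)
print(pfs(t_progression=None, t_death=None, t_censor=12.0))  # (12.0, 0)
```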
In longitudinal trials with repeated measurements of outcomes such as blood
pressure over time, mixed effects regression models and generalized estimating equation
(GEE) models can be used. However, again, the assumptions of such methods (e.g., missing
at random) can easily be violated by differences in the distributions and types of
intercurrent events between the treatment groups (Little and Kang 2015). Hogan
et al. (2004) summarize approaches to handling dropouts in the longitudinal setting.
Various approaches to analysis of the different populations (AT, PP, compliers)
have been traditionally employed with known limitations. For example, single
imputation in an analysis based on PP patients can be viewed as a variant of what
is known as a “completers” or “complete case” analysis. The other extreme of this
approach is to use the first observation carried forward, best observation carried
forward, or “last observation carried forward” (LOCF) to replace missing outcome
data. In an LOCF analysis, the last available observation is used for each subject in
the analysis; this observation could even be the baseline pre-treatment observation.
These kinds of analyses are flawed and yield results that are biased in different ways.
While these methods have mostly been replaced by mixed effects regression models
and various other approaches, the differences in the results from all of these methods
can provide useful sensitivity analyses (see Thabane et al. 2013).
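For concreteness, the sketch below shows the mechanics of LOCF on a toy longitudinal data set (values invented); it is included to illustrate the scheme, not to recommend it.

```python
import numpy as np
import pandas as pd

visits = pd.DataFrame(
    {"week0": [150.0, 160.0], "week4": [145.0, np.nan], "week8": [np.nan, np.nan]},
    index=["patient_A", "patient_B"],
)
locf = visits.ffill(axis=1)   # carry the last available value forward
print(locf)
# patient_A's week8 becomes 145.0; patient_B's week4 and week8 fall back to the
# baseline 160.0, which is exactly the concern raised in the text.
```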
Other approaches that have been proposed for the analysis of RCTs that address
many of these issues are beyond the scope of this chapter. Bayesian methods can be
used to incorporate additional treatment information such as rescue medication for
treatment failure using data augmentation algorithms (Shaffer and Chinchilli 2004).
Selection models allow formal incorporation of potential outcomes, and pattern
mixture models can model associations between observed exposures and outcomes
(Goetghebeur and Loeys 2002). Causal effect models can be used for realistic
individualized treatment rules when the experimental treatment assignment (ETA)
assumption is violated (Van der Laan and Petersen 2007).
Noninferiority and Equivalence Trials
In the case of noninferiority or equivalence trials, an ITT analysis can bias the results
in favor of noninferiority or equivalence. This occurs because the effective sample
size is reduced by the inclusion of ineligible and noncompliant patients and the
difference between the groups is decreased favoring “no difference.” Hybrid ITT/PP
analyses that exclude noncompliant patients and incorporate the impact of missing
data in this setting have been proposed by Sanchez and Chen (2006).
Cluster Randomized Trials
The Online Wound Electronic Medical Record (OWEMR) is an informatics tool
that synthesizes diabetic foot ulcer (DFU) data to inform treatment decisions. The
primary objective of this two-arm cluster RCT was to assess the impact by 6 months
of the OWEMR and standard of care (OWEMR+SOC) compared to SOC alone on
lower limb amputation or death. In a cluster randomized trial, intervention or
treatment assignment is randomized to a group of individuals defined, for example,
by a classroom, a school, or a center in a multicenter trial. In such a trial, the cluster
or group is randomized, and all members of the group or unit receive the same
treatment assignment. Thus, the unit of randomization is the cluster. Sample size is
based on the number of clusters and, only in part, on the number of observations
within a cluster. Analyses are based on the cluster summary and can also be based on
the individual observations nested within cluster. In a cluster randomized trial, the
ITT analysis can be thought of as the analysis of all randomized clusters. However,
there can be instances in which within a cluster, individuals do not uniformly adhere
to the assigned treatment. Within a cluster, individual outcomes are often correlated.
This RCT was originally designed to include 3504 patients in 12 centers (clusters;
292 patients/cluster) each of which would be randomized to the tool+SOC or SOC
alone. Enrollment began in August 2011 and was expected to complete in September
2013. Six of the original 12 study sites were closed for poor accrual. Additional sites
were identified; 16 study sites (ranging in size from 0 to 295 patients) were included
in this study. Of the 1608 subjects who signed informed consent in these sites, 47
were screen failures and 1561 were enrolled in the trial. OWEMR+SOC centers
enrolled 1 to 295 subjects (total, 977; outcome rate, 14.7%), and SOC alone centers
enrolled 0 to 169 subjects (total, 584; primary outcome rate, 11.8%). When early
terminations were included as failures, the composite failure rates were, respectively,
36.9% and 42.1%. For the primary endpoint, the results favored SOC, but the
dropout and early termination rates were, in fact, higher in the SOC arm. This
illustrates the difficulty in attracting centers (and patients) to remain on a trial
when they are not randomized to the new intervention.
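For illustration, a cluster-level ITT analysis can be as simple as comparing cluster summaries between arms, with the number of clusters, not the number of patients, driving the degrees of freedom. The sketch below uses invented outcome rates, not the OWEMR data, and a plain two-sample t-test on the cluster means as one simple choice of comparison.

```python
import numpy as np
from scipy import stats

# Invented 6-month outcome rate (e.g., amputation or death) per center
intervention_clusters = np.array([0.12, 0.18, 0.10, 0.16])
control_clusters      = np.array([0.14, 0.09, 0.13, 0.11])

t, p = stats.ttest_ind(intervention_clusters, control_clusters)
print(f"t = {t:.2f}, p = {p:.3f}")   # df depend on the number of clusters
```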
Some Additional Design Considerations for ITT Analyses
Careful consideration of all of the design details in the RCT protocol can facilitate
analysis and interpretation of the results. Some possible design modifications could
retain the features of the planned ITT analyses.
Often in RCTs, substudies using new and potentially expensive technologies are
incorporated into the trial. Frequently, these substudies are carried out in conve-
nience samples or “wherever data are available.” While a relatively large sample size
is often required for the primary and secondary endpoints, the inability to measure all
variables on all subjects/patients introduces missing data that may or may not be
missing at random. Nested random subsamples of the overall study population can
be used to measure these more expensive classes of variables, with smaller subsamples
as the cost of collection increases. These SMAR-type designs (Belitskaya-Levy et al.
2008; Goldberg 2006) enable integrated analyses in a setting of planned monotone
missingness with data missing at random.
In RCTs with response-dependent changes in treatment, patients can be ran-
domized to complete regimens at the initial randomization, that is, a treatment
strategy randomization whose arms specify whether patients continue treatment
or receive additional treatment if required. For example, in
combination therapy trials for hypertension, patients could be randomized to the
new treatment, standard treatment, or placebo at the initial stage. Subsequent
randomizations would allow various combinations within each of the original
treatment groups. Such a design would allow an unbiased comparison within
each original randomization group of patients with and without additional therapy
after the first stage.
Sequential Multiple Assignment Randomized Trials (SMARTs) allow the com-
parison of dynamic treatment strategies (DTS) or sets of decision rules for patient
management (see Almirall et al. 2014). In the SMART framework, patients are
randomized to different treatment branches that separate DTSs. In the classic ITT
framework, randomization occurs at the start of the trial with subsequent treatment
changes after the initial randomization that are not governed by randomization.
A SMART, in contrast, uses randomization at each point where treatment decisions
would change. A patient's information contributes to one or more DTSs until the
patient leaves the DTS. These designs are gaining traction particularly in areas such
as behavioral intervention trials for smoking cessation.
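A toy sketch of the two-stage randomization that distinguishes a SMART from a classic single-randomization design; the arm labels and the decision rule are invented for illustration.

```python
import random

def assign_stage1():
    return random.choice(["treatment A", "treatment B"])   # baseline randomization

def assign_stage2(stage1, responded):
    if responded:
        return "continue " + stage1                         # responders stay the course
    return random.choice(["augment " + stage1, "switch"])   # re-randomize nonresponders

arm1 = assign_stage1()
# ... treat, then observe response at the stage-2 decision point ...
arm2 = assign_stage2(arm1, responded=False)
print(arm1, "->", arm2)
```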
Summary and Conclusions
Intention to treat (ITT) is the preferred approach for the statistical analysis of
randomized clinical trials. A careful, unambiguous protocol should be developed
prior to the trial initiation, and a plan for statistical analysis should be in place prior
to the conclusion of the trial and its unblinding. Investigators must provide a careful
accounting of all patients randomized, all patients treated, and all post-randomiza-
tion exclusions (if any) for lack of efficacy, lack of compliance, or lack of safety (see,
e.g., DeMets 2004; Begg 2000; Lachin 2000). The ITT principle provides a para-
digm for the conduct of RCTs that focuses on reducing any biases in patient/subject
assignment or evaluation of outcomes. That said, the realities of clinical trial conduct
often make it necessary to carefully consider deviations from the ideal model in the
analysis. The NRC report (2010) and the ICH E9 guidance (2017) review the
alternatives for analysis based on the goals of the specific trial and the advantages and
limitations of these alternatives. Regardless of the primary method of analysis,
sensitivity analyses should be conducted to evaluate the effects of missing and/or
erroneous data at all stages of the trial process from randomization errors to missing
outcome data as well as the impact of lack of compliance.
Key Facts
Cross-References
References
Almirall D, Nahum-Shani I, Sherwood NE, Murphy SA (2014) Introduction to SMART designs for
the development of adaptive interventions: with application to weight loss research. Transl
Behav Med 4:260–274. PMCID: PMC4167891
Begg CB (2000) Commentary: ruminations on the intent-to-treat principle. Control Clin Trials
21:241–243
Belitskaya-Levy I, Shao Y, Goldberg JD (2008) Systematic missing-at-random (SMAR) design and
analysis for translational research studies. Int J Biostat 4(1):Article 15. https://doi.org/10.2202/
1557-4679.1046. PubMed PMID: 20231908; PubMed Central PMCID: PMC2835456
Bell ML, Fiero M, Horton NJ, Hsu C-H (2014) Handling missing data in RCTs; a review of the top
medical journals. BMC Med Res Methodol 14:118
Berk PD, Goldberg JD, Silverstein MN et al (1981) Increased incidence of acute leukemia in
polycythemia vera associated with chlorambucil therapy. NEJM 304:441–447
Berk PD, Wasserman LR, Fruchtman SM, Goldberg JD (1995) Treatment of polycythemia vera; a
summary of clinical trials conducted by the polycythemia vera study group. In: Wasserman LR,
Berk PD (eds) Polycythemia vera and the myeloproliferative disorders. Chapter 15. N.
Saunders, Berlin, pp 166–194
Brownlee KA (1955) Statistics of the 1954 polio vaccine trials. J Am Stat Assoc 50(272):
1005–1013
DeMets DL (2004) Statistical issues in interpreting clinical trials. J Intern Med 255:529–537
Ellenberg J (1996) Intent-to-treat analysis vs as-treated analysis. Drug Inf J 30:535–544
Ellenberg J (2005) Intention to treat analysis: basic. Encyclopedia of biostatistics. John Wiley and
Sons, Ltd
Fink R, Shapiro S, Lewison J (1968) The reluctant participant in a breast cancer screening program.
Public Health Rep 83(6):479–490
Frangakis CE, Rubin DB (2002) Principal stratification in causal inference. Biometrics 58:21–29
Francis T Jr et al (1955) An evaluation of the 1954 poliomyelitis vaccine trials – summary report.
Am J Public Health 45(5):1–63
Friedman LM, Furberg CD, DeMets DL (1998) Fundamentals of clinical trials, 3rd edn. Springer,
New York
Goetghebeur E, Loeys T (2002) Beyond intention to treat. Epidemiol Rev 24:85–90
Goldberg JD (1975) The effects of misclassification on the bias in the difference between two
proportions and the relative odds in the fourfold table. J Am Stat Assoc 70:561–567
Goldberg JD (2006) The changing role of statistics in medical research: experiences from the past
and directions for the future. Invited paper, Proc Amer Stat Assoc. 1963–1969
Goldberg JD, Belitskaya-Levy I (2008a) In: Melnick E, Everitt BS (eds) Intent-to-treat principle.
Encyclopedia of quantitative risk assessment. Wiley, Chichester
Goldberg JD, Belitskaya-Levy I (2008b) In: Melnick E, Everitt BS (eds) Randomized controlled
trials. Encyclopedia of quantitative risk assessment. Wiley, Chichester
Goldberg JD, Belitskaya-Levy I (2008c) In: Melnick E, Everitt BS (eds) Efficacy. Encyclopedia of
quantitative risk assessment. Wiley, Chichester
Goldberg JD, Koury KJ (1989) In: Berry DA (ed) Design and analysis of multicenter trials. Chapter 7
in statistical methodology in the pharmaceutical sciences. Marcel Dekker, New York, pp 201–237
Goldberg JD, Shao YS (2008) In: Melnick E, Everitt BS (eds) Comparative efficacy trials (phase III
studies). Encyclopedia of quantitative risk assessment. Wiley, Chichester
Harrington DB (2000) The randomized clinical trial. J Am Stat Assoc 95:312–315
Hogan JW, Roy J, Korkontzelou C (2004) Tutorial in biostatistics: handling drop-out in longitudinal
studies. Stat Med 23:1455–1497
ICH (1998) E9: guideline on statistical principles for clinical trials. www.ich.org
ICH (2017) E9 R1: Addendum on estimands and sensitivity analyses in clinical trials. Step 2. www.
ich.org
Kim MY, Goldberg JD (2001) The effects of outcome misclassification and measurement error on
the design and analysis of therapeutic equivalence trials. Stat Med 20(14):2065–2078. PubMed
PMID: 1143942
Lachin JM (2000) Statistical considerations in the intent-to-treat principle. Control Clin Trials
21:167–189
Little R, Kang S (2015) Intention-to-treat analysis with treatment discontinuation and missing data
in clinical trials. Stat Med 34:2381–2390
Little RJA, Rubin DB (2000) Causal effects in clinical and epidemiological studies via potential
outcomes: concepts and analytical approaches. Annu Rev Public Health 21:121–145
Little RJA, Long Q, Lin X (2009) A comparison of methods for estimating the causal effects of a
treatment in randomized clinical trials subject to noncompliance. Biometrics 65:640–649
Mascarenhas J et al (2018) Results of the myeloproliferative neoplasms – research consortium
(MPN-RC) 112 randomized trial of pegylated interferon alfa-2a (PEG) versus hydroxyurea
(HU) therapy for the treatment of high risk polycythemia vera (PV) and high risk essential
thrombocythemia (ET). Blood 132:577. https://doi.org/10.1182/blood-2018-99-111946
Meier P (1957) Safety testing of poliomyelitis vaccine. Science 125:1067–1071. https://doi.org/
10.1126/science.125.3257
Meier P (1989) The biggest public health experiment ever: the 1954 field trial of the Salk
poliomyelitis vaccine. In: Tanur JM, Mosteller F, Kruskal WH, Lehmann EL, Link RF, Pieters
RS, Rising GR (eds) Statistics: a guide to the unknown, 3rd edn. Duxbury
National Research Council (2010) The prevention and treatment of missing data in clinical trials.
The National Academies Press, Washington, DC
Piantadosi S (1997) Clinical trials: a methodologic perspective. Wiley-Interscience, New York
Royes J, Sims J, Ogollah R, Lewis M (2015) A systematic review finds variable use of the intention-
to-treat principle in musculoskeletal randomized controlled trials with missing data. J Clin
Epidemiol 68:15–24
Sanchez MM, Chen X (2006) Choosing the analysis population in non-inferiority studies. Stat Med
25:1169–1181
Shaffer M, Chinchilli V (2004) Bayesian inferences for randomized clinical trials with treatment
failures. Stat Med 23:1215–1228
Shapiro S, Goldberg JD, Hutchison GB (1974) Lead time in breast cancer detection and implica-
tions for periodicity of screening. Am J Epidemiol 100(5):357–366. PubMed PMID: 4417355
Shapiro S, Venet W, Strax P, Venet L (1988) Periodic screening for breast cancer: the health
insurance plan project and its sequelae, 1963–1986. Johns Hopkins, Baltimore
Stuart EA, Perry DF, Le H-N, Ialongo NS (2008) Estimating intervention effects of prevention
programs: accounting for noncompliance. Prev Sci 9:288–298
Thabane L, Mbuagbaw L, Zhang S, Samaan Z, Marcucci M, Ye C, Thabane M, Giangregorio L,
Dennis B, Kosa D, Debana VB, Dillenburg R, Fruci V, Bawor M, Lee J, Wells G, Goldsmith CH
(2013) A tutorial on sensitivity analyses in clinical trials: the what, why, when and how. BMC
Med Res Methodol 13:92. http://www.biomedcentral.com/1471-2288/13/92. Accessed 28 Apr
2018
US FDA (2016) Non-inferiority clinical trials to establish effectiveness: guidance for industry.
https://www.fda.gov/downloads/Drugs/Guidances/UCM202140
Van der Laan MJ, Petersen ML (2007) Causal effect models for realistic individualized treatment
and intention to treat rules. Int J Biostat 3(1):Article 3. http://www.bepress.com/ijb/vol3/iss1/3
Estimation and Hypothesis Testing
83
Pamela A. Shaw and Michael A. Proschan
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1616
Estimation and Uncertainty for Continuous Endpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1617
Estimation and Uncertainty for Noncontinuous Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1618
Estimation of the Difference Between Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1619
Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1620
Special Topics in Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1623
Exact Tests and Other Considerations for Choosing a Hypothesis Test . . . . . . . . . . . . . . . . . . . 1623
Multiple Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1625
Noninferiority Versus Superiority . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1626
Controversies in Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1627
Two-Sided Versus One-Sided Controversy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1627
The P-Value Controversy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1628
Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1629
Key Facts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1629
Cross-References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1630
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1630
Abstract
This chapter presents basic elements of parameter estimation and hypothesis
testing. The reader will learn how to form confidence intervals for the mean,
and more generally, how to calculate confidence intervals for the one parameter
setting and for the difference between two groups. Principles of hypothesis testing
are detailed, including the choice of the null and alternative hypotheses, the
significance level, and implications for choosing a one-sided versus two-sided
P. A. Shaw (*)
University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
e-mail: [email protected]
M. A. Proschan
National Institute of Allergy and Infectious Diseases, Bethesda, MD, USA
e-mail: [email protected]
test. The p-value is defined, and a discussion of controversies that have arisen over
its use is included. After reading this chapter, the reader will have a better
understanding of the necessary steps to set up a hypothesis test and make valid
inference about the quantity of interest. Other topics in this chapter include exact
hypothesis tests, which may be preferable for small sample settings, and the
choice of a parametric versus nonparametric test. The chapter also includes a brief
discussion of the implications of multiple comparisons on hypothesis testing and
considerations of hypothesis testing in the setting of noninferiority trials.
Keywords
Confidence interval · Hypothesis testing · Noninferiority trial · Parameter ·
p-value · Power · Significance · Standard error · Test statistic · Type I error ·
Type II error
Introduction
Estimation and Uncertainty for Noncontinuous Outcomes
Two other common parameters of interest for clinical trials are proportions from
binary outcomes and hazard ratios from time-to-event outcomes. For these param-
eters, the principles of the CI are the same. We can again rely on the Central Limit
Theorem and form the confidence interval by adding and subtracting the desired
quantile multiple of the appropriate SE. For proportions, the sample mean of the
binary outcome is the estimated proportion $\hat{p}$, and the estimated SE is $[\hat{p}(1-\hat{p})/n]^{1/2}$.
The 95% confidence interval for a population proportion is thus $\hat{p} \pm 1.960\,[\hat{p}(1-\hat{p})/n]^{1/2}$.
The log hazard ratio $\hat{\beta}$ and its SE can be directly estimated from the Cox
model, and it is on the log scale that the hazard ratio estimator is approximately
normal (Cox 1972; Hosmer et al. 2011). One can form the 95% confidence interval
as $\hat{\beta} \pm 1.960 \times \text{SE}$. This is also referred to as the Wald confidence interval for the
log hazard ratio. For this and other statistics estimated with likelihood techniques, there
are other methods besides the Wald approach to estimate confidence intervals, such
as those that rely on the score or the likelihood ratio statistics (Casella and Berger
2002). While for large sample sizes, these three methods will produce similar results,
at smaller sample sizes there will be noticeable differences. The score statistic can be
a more efficient method than the more commonly used Wald technique; that is, with
the same data set, the score interval can be narrower, which is a desirable feature for a
confidence interval. See Lin et al. (2016) for a comparison of the different methods
of hazard ratio confidence interval estimation.
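The Wald intervals described above are short enough to compute by hand; the sketch below does so in Python with illustrative values that are not taken from the text (packaged implementations, such as statsmodels' proportion_confint, are also available).

```python
from math import sqrt, exp

# Wald CI for a proportion
p_hat, n = 0.30, 200                            # illustrative values
se = sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - 1.960 * se, p_hat + 1.960 * se)   # (0.236, 0.364)

# Wald CI for a hazard ratio, built on the log scale
beta_hat, se_beta = -0.35, 0.15                 # assumed log HR and SE from a Cox fit
lo, hi = beta_hat - 1.960 * se_beta, beta_hat + 1.960 * se_beta
print(exp(lo), exp(hi))                         # back-transformed to the HR scale
```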
For discrete data that take on many values that have a natural order, such as count
data or an outcome taken from an ordinal scale, such as the Likert scale, it is common
to treat these outcomes like a continuous outcome to summarize a treatment effect.
That is, the mean value $\bar{x}$ and sample standard deviation $s$ are calculated for each arm,
and a 95% confidence interval is formed using methods for a continuous outcome,
seen in the previous section. For an ordinal scale, one must first assign numeric
values; most commonly the positive integers are used. For example, for the 5-level
Likert scale, one can assign 1 = strongly disagree, 2 = disagree, 3 = neither agree
nor disagree, 4 = agree, and 5 = strongly agree. For data such as these, that start off
far from normally distributed, it may take larger sample sizes for the estimate of
interest, say $\bar{x}$, to be well-approximated with a normal distribution. An alternative
approach, called nonparametric statistics, is discussed in a later section of this
chapter.
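A small sketch of the coding step just described: map the Likert labels to the integers 1–5 and summarize the coded values as if continuous. The responses shown are invented.

```python
import numpy as np

coding = {"strongly disagree": 1, "disagree": 2, "neither agree nor disagree": 3,
          "agree": 4, "strongly agree": 5}
responses = ["agree", "agree", "strongly agree", "disagree",
             "neither agree nor disagree"]
x = np.array([coding[r] for r in responses], dtype=float)
print(x.mean(), x.std(ddof=1))   # sample mean and SD, treated as continuous
```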
Estimation of the Difference Between Groups
A formal comparison of the difference between two arms of a clinical trial is often
desired. For quantities like the sample mean or proportion, which are approximately
normally distributed, this will be relatively straightforward. The difference is
approximately normally distributed, as any linear combination of two (jointly)
normal statistics will also be normally distributed. Thus, we can apply our general
confidence interval technique to form the CI for the difference. We can then examine
this confidence interval and see whether it contains the value 0, which would indicate
the data are consistent with no difference in the parameter of interest between
groups.
For example, suppose in the weight loss trial the mean weight change from
baseline is $\bar{x} = -6$ for arm A and $\bar{y} = -2$ for arm B. Suppose further arm A has a
sample standard deviation $s_x = 20$ and $m = 100$ people, and arm B has standard
deviation $s_y = 15$ and $n = 90$ people. The estimated mean weight change difference
between arms is $\bar{x} - \bar{y} = -4$. If we assume the true SD in each arm may be different,
then the SE for this difference is calculated as the square root of the sum of the
variances of each mean, $s_{\bar{x}-\bar{y}} = \sqrt{20^2/100 + 15^2/90} = 2.550$. An approximate
95% CI could be formed again using quantiles from a t distribution, but in this case,
the degrees of freedom (df) for the difference of means must be estimated. The
common approach is to use Satterthwaite's formula (Rosner 2015), which yields

$$df = \left(\frac{s_x^2}{m} + \frac{s_y^2}{n}\right)^2 \Bigg/ \left[\frac{(s_x^2/m)^2}{m-1} + \frac{(s_y^2/n)^2}{n-1}\right].$$

The 95% confidence interval is then $\bar{x} - \bar{y} \pm t_{df,\,0.975} \times s_{\bar{x}-\bar{y}}$ for the true difference
in mean weight change between the two arms. If one assumed that the two arms had a common true
variance, one could use a more efficient estimate of the common variance by pooling
data from the two arms and estimating a single SD. Since we generally do not know
whether arms have a common SD, the Satterthwaite method is preferable. A
common, yet faulty, approach is to use the same data to first conduct a hypothesis
test for equal variances between the two arms and then, based on results of that test,
decide which estimate of the SE to use. This procedure can yield slightly anti-
conservative confidence intervals.
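The following sketch reproduces the Satterthwaite (Welch) interval from the example above, using only the summary statistics given in the text:

```python
import numpy as np
from scipy import stats

# Summary statistics from the weight loss example.
x_bar, s_x, m = 6.0, 20.0, 100   # arm A
y_bar, s_y, n = 2.0, 15.0, 90    # arm B

diff = x_bar - y_bar
se = np.sqrt(s_x**2 / m + s_y**2 / n)   # 2.550

# Satterthwaite (Welch) degrees of freedom.
df = (s_x**2 / m + s_y**2 / n) ** 2 / (
    (s_x**2 / m) ** 2 / (m - 1) + (s_y**2 / n) ** 2 / (n - 1)
)                                        # about 182

t_q = stats.t.ppf(0.975, df)
print(f"95% CI: ({diff - t_q * se:.2f}, {diff + t_q * se:.2f})")  # (-1.03, 9.03)
```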
In the case of paired differences for a continuous outcome, one can first form the within-person difference $d = x - y$ for each individual and then follow the usual procedure for confidence intervals for a single continuous outcome.
For a clinical trial with binary outcomes, we can take a similar approach to
forming the confidence interval for the difference in proportions between two
independent groups. Denote the difference in sample proportions by $\hat{p}_x - \hat{p}_y$. Since each $\hat{p}$ is approximately normally distributed, so is the difference in proportions. The SE for $\hat{p}_x - \hat{p}_y$ can be estimated as

$$ SE_{\hat{p}_x - \hat{p}_y} = \sqrt{\frac{\hat{p}_x\left(1-\hat{p}_x\right)}{m} + \frac{\hat{p}_y\left(1-\hat{p}_y\right)}{n}}, $$

and the 95% CI becomes $\hat{p}_x - \hat{p}_y \pm 1.960\, SE_{\hat{p}_x - \hat{p}_y}$. To determine whether the data are consistent with no between-arm difference, one can again consider whether the CI contains the value zero.
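A minimal sketch of this two-proportion Wald interval, with invented event counts (the helper function is ours):

```python
import numpy as np
from scipy import stats

def wald_ci_diff_proportions(x_events, m, y_events, n, level=0.95):
    """Wald CI for the difference of two independent proportions."""
    p_x, p_y = x_events / m, y_events / n
    se = np.sqrt(p_x * (1 - p_x) / m + p_y * (1 - p_y) / n)
    z = stats.norm.ppf(1 - (1 - level) / 2)
    diff = p_x - p_y
    return diff - z * se, diff + z * se

# Invented counts: 40/100 events in arm A, 28/90 in arm B.
print(wald_ci_diff_proportions(40, 100, 28, 90))
```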
When estimating the difference between intervention groups for other outcomes,
one simply needs to formulate a parameter in a statistical model which represents this
difference and estimate it. The 1-α confidence interval could be formed with the
upper and lower α/2 quantile of the probability distribution appropriate for that
statistical model. A common approach parameterizes this difference in a regression
model. Estimates for both the parameter and its SE are straightforward. For censored
survival data, the hazard ratio is inherently a parameter for the difference between
arms, in this case a ratio. For a ratio, the value representing no difference between
arms is 1. Consequently, a confidence interval for the HR containing 1, or equiva-
lently a confidence interval of the log-hazard containing 0, is consistent with no
difference between arms.
Hypothesis Testing
In the previous section, the confidence interval was used to answer questions about
the parameter such as whether data were consistent with no difference between two
intervention arms. We can also directly answer questions about the value of a
parameter with a process called hypothesis testing. Confidence intervals and hypoth-
esis testing are intimately linked. In fact, as will be explained below, in many
situations, there is a 1–1 correspondence between the conclusion made from a
confidence interval for the value of a parameter (such as whether a value of zero is
consistent with the data) and the results of a hypothesis test. The hypothesis testing
framework provides a way to formalize the language and process for drawing
conclusions about parameter values from the data.
The hypothesis test can be described as consisting of five steps: specifying the null hypothesis, specifying the alternative hypothesis, choosing a test statistic, computing the p-value, and drawing a conclusion at a prespecified significance level.

The null hypothesis is a statement about a value for the parameter, which the collected data will be used to assess. For the parameter of interest $\mu$, the null value is represented $\mu_0$. For example, if $\mu = \mu_A - \mu_B$ is the parameter for the difference in the average weight change between arms A and B, one may set the null to be one of no difference, i.e., $H_0: \mu_0 = 0$. For this one-dimensional parameter, the alternative hypothesis ($H_A$) can take on three possible forms: (i) $\mu < 0$ (individuals on Arm A have smaller weight change), (ii) $\mu > 0$ (individuals on Arm A have bigger weight change), and (iii) $\mu \neq 0$ (the average weight change for arms A and B is different).
The first two are examples of “one-sided” (also called “one-tailed”) alternative
hypotheses and the third is a “two-sided” alternative hypothesis. This distinction will
matter in terms of evaluating the strength of evidence against the null hypothesis. In
a randomized clinical trial, even though showing a difference in one direction is
more of interest (such as that the novel intervention A is superior to intervention B),
it is most typical to have a two-sided hypothesis. The issue of one-sided versus two-
sided has been the source of continued debate and some controversy, as will be
explained later in this chapter.
The basic idea of conducting a hypothesis test is to calculate a test statistic that
estimates how far the sample estimate of the parameter of interest is from its target
value under the null hypothesis (μ0). The assumed probability distribution for the
data is used to calculate the likelihood of a test statistic value at least as extreme as
the observed value, assuming the null hypothesis is true. This probability is called
the p-value. R.A. Fisher is credited with giving the p-value its dominance in the scientific literature through his seminal book of 1925 (Fisher 1925; Kyriacou 2016). For approximately normal data, we can measure the departure from the null hypothesis in terms of numbers of standard errors. That is, we form the statistic $t = (\hat{\mu} - \mu_0)/SE$, where $\hat{\mu}$ is the sample value for the parameter of interest and SE is the standard error for $\hat{\mu}$. When $\hat{\mu}$ is a mean or difference of means, the test statistic $t$ has an approximate $t$ distribution, with degrees of freedom that can again be approximated by the Satterthwaite formula. In other settings, it is common to rely on the approximate normality of $t$ to calculate the p-value. Test statistics can also take on different functional forms, for which the appropriate null distribution must be derived in order to calculate the p-value.
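As a sketch, the t statistic and two-sided p-value for the weight loss example can be computed from the summary numbers above (the df of 182 is the Satterthwaite value from the earlier calculation):

```python
from scipy import stats

# t statistic and two-sided p-value for the weight loss example.
mu_hat, mu_0, se, df = 4.0, 0.0, 2.550, 182

t_stat = (mu_hat - mu_0) / se
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)
print(f"t = {t_stat:.2f}, p = {p_two_sided:.3f}")  # t = 1.57, p ~ 0.118
```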
For the weight loss trial, one can set up the null and alternative hypotheses for the
between-arm difference in average change in weight (μ). The null and alternative
hypotheses are $H_0: \mu = 0$ and $H_A: \mu \neq 0$, respectively. Recall $\hat{\mu} = \bar{x} - \bar{y} = 4$ and $SE = 2.550$, so $t = 4/2.550 = 1.57$ and the two-sided p-value is approximately 0.12; the null is not rejected at the 0.05 level. Now suppose the trial had been designed assuming the true between-arm difference in average weight change was 5 lb. Having good power
helps interpret a null result. For an adequately powered study, one with a high chance
of rejecting the null in favor of the alternative of interest, a null result indicates the
data are not consistent with that alternative. If power was 92.9% in the weight loss
trial, and the null was not rejected, this is reasonably strong evidence that the
between-arm difference in weight change is smaller than 5 lb. In this example, the
observed sample standard deviations were 15 and 20 lb. in the two arms. If the true underlying SD in each arm were 18 lb. and the true treatment difference 5 lb., power with 95 per arm would be only 48%. Thus, with such low power, rejecting and failing to reject the null are about equally likely. Therefore, failure to reject the null in this trial would
not provide reliable evidence that the alternative was false. In many settings, the
sample size is chosen so that power for an alternative of interest is at least 80% (20%
type II error rate). In definitive settings, such as a large phase III trial, 90% power is
often desirable.
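The 48% figure can be checked with a normal-approximation power calculation; this sketch uses the scenario from the text (SD 18 lb per arm, true difference 5 lb, 95 subjects per arm):

```python
import numpy as np
from scipy import stats

# Normal-approximation power for a two-sided, alpha = 0.05 two-sample test.
sd, delta, n_per_arm, alpha = 18.0, 5.0, 95, 0.05

se = sd * np.sqrt(2 / n_per_arm)
z = stats.norm.ppf(1 - alpha / 2)
ncp = delta / se
power = stats.norm.cdf(ncp - z) + stats.norm.cdf(-ncp - z)
print(f"power = {power:.2f}")   # about 0.48, matching the text
```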
The validity of a hypothesis test relies on choosing the correct method or probability
distribution for the test statistic. This will depend on the distribution of the variables
being measured and the study design. For instance, if data are highly skewed and the
sample size is small, then using the t-test will likely result in an incorrect p-value.
Even if data are not severely skewed, small samples may mean that one cannot rely
on approximate normality of the test statistic to calculate the p-value. In this case, it
would be better to consider an exact method – one that does not rely on approximate
normality but rather uses the correct probability distribution of the test statistic.
One exact test is based on permuting the labels of treatment and control obser-
vations. Consider the strong null hypothesis that treatment has no effect on anyone.
The idea is to fix the data at their observed values, permute the treatment labels, and
compute the value of the test statistic assuming the permuted treatment labels were
the actual ones. After all, under the null hypothesis of no effect of treatment, the
same data would have been observed regardless of treatment received. Repeat this
process for all possible, or at least a large number of, permutations to generate a
reference distribution for the test statistic under the null hypothesis. The p-value is
the proportion of test statistic values in the reference distribution at least as extreme
as the observed test statistic value. For a one-sided test to determine whether
treatment produces larger outcome values than control, reference values “at least
as extreme” are those that are at least as large. For example, if the observed value of
the test statistic is 2.5, and only 1% of the reference distribution is 2.5 or larger, the p-
value is 0.01.
The permutation test can be used in many settings. When the outcome is
binary and the test statistic is the difference in sample proportions, the permuta-
tion reference distribution can be computed theoretically using probability, and
the permutation test is equivalent to Fisher’s exact test. If the sample size is large,
the permutation test is nearly identical to the z-test of proportions, which is
equivalent to the chi-squared test. When the outcome is continuous and the test
statistic is the difference in sample means, the permutation test is nearly identical
to the t-test if the sample size is large. The advantage of the permutation test is
that, without any further assumptions, it provides a valid test of the strong null
hypothesis that treatment has no effect on anyone.
A disadvantage of permutation tests is that, although they give nearly the same
answer as t-tests or z-tests of proportions when sample sizes are large, they can be
quite conservative for smaller sample sizes. For instance, with only 3 patients per
arm, the smallest possible two-sided p-value is 0.10 (i.e., a statistically significant
result at the conventional alpha level of 0.05 is impossible). Many would say that
such conservatism is appropriate if the sample sizes are that small.
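A minimal sketch of an exact permutation test for a difference in means; the helper function is ours, and the toy data are chosen so the 3-per-arm limit mentioned above is visible:

```python
from itertools import combinations
import numpy as np

def permutation_pvalue_two_sided(x, y):
    """Exact two-sided permutation test for a difference in means."""
    data = np.array(list(x) + list(y))
    n, m = len(x), len(data)
    observed = abs(np.mean(x) - np.mean(y))
    count = total = 0
    for idx in combinations(range(m), n):  # every way to relabel "treatment"
        mask = np.zeros(m, dtype=bool)
        mask[list(idx)] = True
        stat = abs(data[mask].mean() - data[~mask].mean())
        if stat >= observed - 1e-12:       # "at least as extreme"
            count += 1
        total += 1
    return count / total

# With 3 subjects per arm there are C(6,3) = 20 relabelings, so the smallest
# attainable two-sided p-value is 2/20 = 0.10, matching the text.
print(permutation_pvalue_two_sided([10.0, 11.0, 12.0], [1.0, 2.0, 3.0]))
```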
A common error when determining the distribution of a test statistic is failure to
account for correlation between observations. For example, suppose one were
interested in comparing the efficacy of two weight loss interventions and married
couples were recruited and assigned to the same intervention. Since married indi-
viduals tend to share meals, their weight loss may be positively correlated. Failure to
account for the correlation leads to a higher than intended probability of falsely
declaring benefit of a diet. Interestingly, a permutation test that adheres to the
original randomization (i.e., both members of the couple receive the same treatment)
automatically accounts for such correlation and provides a valid p-value. Further
discussion of permutation tests is provided in the section on ▶ Chap. 94, “Random-
ization and Permutation Tests” in the Analysis chapter.
Multiple Comparisons
The problem that multiple comparisons create for hypothesis testing is best illus-
trated with the following analogy. In the popular game of darts, a circular target
board is placed at a certain distance from a player, which makes throwing a dart and
hitting a target difficult. The bullseye, a small ring in the center of the board, is worth
the most points. Compare two players. One hits the bullseye on the first attempt and
one takes 100 attempts to hit the bullseye. Though the first player could have been
lucky, it seems clear that the second player is not particularly good at hitting the
target. If the second player reported that he hit the target without specifying how
many attempts it had taken him, it would be difficult to conclude how good a player
he was. One might even incorrectly assume he had only thrown the dart once.
Similarly, suppose a large study examining whether a certain compound was effica-
cious at preventing cancer reported that treatment had a significantly lower incidence
of stomach cancer than control. It would be important for investigators of this study
to disclose how many different cancers were subjected to a hypothesis test comparing treatment with control. Looking at 10 cancers increases the probability that one
hypothesis test would be significant by chance alone, even if the risk for none of the
cancers was influenced by the treatment. The alpha level, below which a p-value is
declared significant, must be adjusted for multiple comparisons in order to preserve
the type I error rate. Many methods exist for adjusting the testing procedure to
accommodate multiple comparisons so that it maintains the desired type I error rate
(Hochberg and Tamhane 1987). Issues of multiple comparisons are considered
further in the section on ▶ Chap. 85, “Confident Statistical Inference with Multiple
Outcomes, Subgroups, and Other Issues of Multiplicity” in the Analysis chapter.
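The inflation from testing 10 cancers can be sketched directly, assuming independent tests; the Bonferroni adjustment shown is only one of the many methods referenced above:

```python
# Chance of at least one significant test among k independent tests at
# level alpha when all null hypotheses are true.
alpha, k = 0.05, 10
print(1 - (1 - alpha) ** k)       # about 0.40 for 10 cancer sites

# Bonferroni: test each hypothesis at alpha / k to restore the error rate.
print(1 - (1 - alpha / k) ** k)   # about 0.049, back below 0.05
```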
Noninferiority Trials

Sometimes the goal of a clinical trial is to show not that the new treatment is superior
to, but rather that it is almost as good as, the standard treatment. Such a design is
called a noninferiority trial. Noninferiority trials are appealing if the standard
treatment is onerous or has serious side effects. Even if the new treatment is almost
as good as the standard, it may be preferred by patients. For example, the ACTG 076 trial in the United States and France had already demonstrated the benefit of a longer course of AZT in preventing mother-to-infant HIV transmission, but the longer course was prohibitively expensive for developing countries. A superiority trial randomizing HIV-infected mothers to a shorter course
of AZT or placebo drew criticism on ethical grounds (Lurie and Wolfe 1997). Some
critics argued that a trial demonstrating noninferiority of the short course to the
longer course was more ethical and could have indirectly shown that the short course
was superior to placebo.
In a noninferiority trial, a new treatment N is compared to a standard treatment S.
In a noninferiority setting, S has already been shown superior to placebo in a
previous trial by some amount $M_1$. That is, if $p_S$ and $p_0$ denote the proportions with events, say a heart attack, in the standard and placebo arms of the previous trial, $p_0 - p_S = M_1 > 0$. Suppose one can show that N is not worse than S by more than a prespecified noninferiority margin M, i.e., $p_N - p_S \le M$. Then $p_N - p_0 = (p_N - p_S) + (p_S - p_0) \le M - M_1$. As long as M is smaller than $M_1$, one can conclude that N would have beaten the placebo (that is, $p_N - p_0 < 0$) had the current trial used a placebo. The
noninferiority design begins by prespecifying a noninferiority margin M. A common
choice for the noninferiority margin is half of the known effect of S relative to placebo, $M = M_1/2$. That way, demonstration of noninferiority shows that the new
treatment preserves at least half of the benefit of the standard treatment seen in the
previous trial.
The null and alternative hypotheses are essentially reversed in a noninferiority
trial. The null hypothesis is that the new treatment is worse than the standard by more than M, $H_0: p_N - p_S > M$, and the alternative is that the new treatment is worse than the standard by no more than M, $H_1: p_N - p_S \le M$. Rejection of the null in favor of the alternative hypothesis at the given alpha level demonstrates noninferiority. The procedure is equivalent to constructing a $1 - 2\alpha$ confidence interval for $p_N - p_S$ and declaring noninferiority if the upper limit of the interval is M or less. For example, if
the alpha level of the test of noninferiority is 0.05, the procedure is equivalent to
constructing a 90% confidence interval and declaring noninferiority if the upper limit
of the interval is M or less.
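A sketch of this confidence-interval procedure for noninferiority, with invented counts and margin; the function name and inputs are ours:

```python
import numpy as np
from scipy import stats

def noninferior(new_events, n_new, std_events, n_std, margin, alpha=0.05):
    """Declare noninferiority if the upper limit of the 1 - 2*alpha Wald CI
    for p_new - p_std is at or below the prespecified margin."""
    p_n, p_s = new_events / n_new, std_events / n_std
    se = np.sqrt(p_n * (1 - p_n) / n_new + p_s * (1 - p_s) / n_std)
    z = stats.norm.ppf(1 - alpha)           # 90% CI when alpha = 0.05
    upper = (p_n - p_s) + z * se
    return upper, upper <= margin

# Invented event counts and a margin of 0.05 (5 percentage points).
print(noninferior(45, 500, 40, 500, margin=0.05))
```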
One of the biggest drawbacks of noninferiority designs is that things that ought to
be bad in a clinical trial actually help demonstrate noninferiority. For instance,
suppose that the new drug is so ineffective that 100% of patients in arm N abandon
the new treatment and start taking the standard treatment. Then the observed
difference between N and S will be close to 0, making it easier to establish
noninferiority. That can’t be good! For this reason, even though intent-to-treat is
the primary analysis method for superiority trials, an as-treated analysis is often the
primary analysis in noninferiority trials. Another downside of noninferiority trials is that the design assumes the effect of the standard treatment relative to placebo is unchanged from the previous to the current trial, the so-called constancy assumption. That is why the noninferiority margin is often taken to be half of the "known" effect of S relative to placebo.
Because noninferiority trials are inherently problematic, they should be avoided
whenever the question can be answered in another way. For example, when a
placebo is considered unethical, one could provide everyone the standard treatment
and test whether the new treatment has additional benefit. Another alternative to a
noninferiority design is a superiority design in patients who do not benefit from, or
cannot tolerate, the standard treatment. Noninferiority designs are also discussed
further in the “Equivalence and Noninferiority Designs” section in the Advanced
Topics in Trial Design chapter.
One-Sided Versus Two-Sided Tests

In 1988, many cardiologists believed that patients with a prior heart attack and
cardiac arrhythmias could reduce their risk of cardiac arrest and sudden death by
suppressing those arrhythmias. After all, studies showed clearly that heart attack
patients with arrhythmias were at increased risk of sudden death. Therefore, when
the Cardiac Arrhythmia Suppression Trial (CAST) tested the “suppression hypoth-
esis,” their original alternative hypothesis was that the antiarrhythmic arm would
have a lower risk of cardiac arrest/sudden death than the placebo arm. More
specifically, with λ denoting the log-hazard ratio for sudden death/cardiac arrest in
the antiarrhythmic arm relative to placebo, the null and alternative hypotheses were
$H_0: \lambda = 0$ (no effect) and $H_1: \lambda < 0$ (the antiarrhythmic arm is superior). As described in Friedman et al. (1993), at its first review meeting, the Data and Safety Monitoring Board (DSMB) recommended switching to the two-sided alternative hypothesis $H_1: \lambda \neq 0$, which allows a decrease or increase in the risk of cardiac arrest/sudden death
in the antiarrhythmic arm. This was a prescient move; the trial stopped early because
the event rate was much higher in the antiarrhythmic arm (CAST Investigators
1989). CAST reminds us that interventions can cause harm. The prevailing view is
that one should always use a two-sided alternative hypothesis. Some medical
journals have gone so far as not allowing one-sided testing.
A counter-argument to two-sided testing is that there is no interest in proving harm: if results were trending in the wrong direction, the trial would be stopped before the evidence was sufficient to show actual harm. But this was not true in CAST. The widespread misconception about the benefits of suppressing cardiac arrhythmias needed to be dispelled before medical practice could change.
A better argument against two-sided tests is that the two errors, (1) falsely
declaring treatment beneficial and (2) falsely declaring treatment harmful, are very
different with vastly different consequences. Declaring a drug harmful when it
actually has no effect may not have serious consequences because that drug should
not be used anyway. On the other hand, declaring a drug beneficial when it is
ineffective is problematic because patients may eschew truly effective treatments
for the ineffective treatment. Therefore, it is important to consider each of the one-
sided error rates.
The Role of the P-value

P-values have been viewed in the medical literature as the definitive measure of evidence for many years. A counter-movement is now underway to eliminate them. Both viewpoints can be regarded as overreactions.
One must first understand what a p-value is and what role it plays. Imagine 10
people exposed to a level of radiation that is known to be 95% fatal. They are given
a new treatment, and half survive. How compelling is the evidence that the new
treatment saves lives? Are observed results consistent with chance? It can be
shown that the probability of 5 or more people out of 10 surviving a condition
that is 95% fatal is only 0.00006. The two possible conclusions are (1) the new
treatment saves lives or (2) the new treatment does not save lives, but an incredibly
rare event occurred. Chance is not a plausible explanation for the observed results.
A small p-value does not necessarily imply that the treatment effect was large. For
instance, suppose that 1,000 people had been exposed to a radiation level that is
known to be 95% fatal, and 80 people survived. The p-value in that case would be
0.00003, yet 92% still died. The observed treatment effect was small, but it was
large enough to effectively rule out chance. The sole purpose of a p-value is to see
whether results are consistent with chance; it is imperative to supplement p-values
with estimates and confidence intervals for the size of the treatment effect to
appreciate whether the effect is both statistically significant and clinically
important.
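Both tail probabilities quoted above can be verified with the binomial distribution; this sketch assumes the survival probability under the null is 1 − 0.95 = 0.05:

```python
from scipy.stats import binom

# sf(k - 1, n, p) = P(X >= k) for X ~ Binomial(n, p).
print(binom.sf(4, 10, 0.05))     # P(5 or more of 10 survive), about 0.00006
print(binom.sf(79, 1000, 0.05))  # P(80 or more of 1000 survive), about 0.00003
```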
A criticism commonly levied against the p-value is that it is not reproducible. If
we repeat the same trial, the p-value may be completely different. This is especially
true if the true treatment effect is small. For example, if treatment has no true effect,
the p-value for many tests is uniformly distributed between 0 and 1, meaning that it is
equally likely to be large, medium, or small. Therefore, if treatment has no true
effect, we might see a relatively small p-value in the first trial and a much larger one
in the next trial. The p-value is less variable if treatment is truly effective. Repro-
ducibility worries are somewhat ameliorated by the common practice of lumping p-
values above 0.05, declaring them “not statistically significant.” One must bear in
mind that the purpose of a p-value is to determine whether chance provides a
plausible explanation for observed results.
Another criticism of the p-value is that it depends not only on the observed results
but also on what action we would have taken if other results had been observed. In
other words, to compute a p-value, one must define what results are at least as
extreme as the observed results. Critics question the logic of computing the proba-
bility of the actual result or a more extreme result, when no more extreme result
occurred.
The p-value has limitations. Nonetheless, the p-value is useful for its intended
purpose. Its primary competitor is Bayesian methodology, which has its own criti-
cisms. Although not covered in this chapter, Bayesian methodology has received
considerable attention in clinical trials (Berry et al. 2010).
Summary

Two principal aims of statistics are to use data to (1) provide an estimate of a population parameter and (2) test whether two populations may differ with respect to this parameter. Sample estimates have uncertainty that can be
expressed with their associated confidence interval. Hypothesis testing is used to
make statements about the value of a parameter, such as whether treatment A is
superior to treatment B. To conduct a reliable hypothesis test, one must specify in
advance the null and alternative hypothesis for the parameter of interest and choose a
study design that has good power to reject the null in favor of the specified
alternative. The p-value summarizes the evidence against the null hypothesis. The
validity of the hypothesis test relies on calculating the p-value with a correctly
specified probability distribution. The study design and distribution of the study
outcome will determine which distribution is appropriate. In some cases, an exact or
nonparametric test may be desired to avoid unnecessary assumptions. When
interpreting results, one must remember that no reasonable hypothesis test has zero
type I or type II error. In many settings, other evidence, such as results from other clinical trials or mechanistic laboratory studies, is useful in evaluating the totality of evidence for the question under study.
Key Facts
• The confidence interval contains values of the parameter that are consistent with
study data. For a study repeated many times, the 95% confidence interval is
expected to contain the true value 95% of the time.
• A hypothesis test requires specifying a null and alternative hypothesis for the
parameter.
• The p-value is the proportion of test statistic values in the reference distribution at
least as extreme as the observed test statistic value. The chosen alternative
hypothesis determines whether this is a one-sided or two-sided p-value.
• Type I error rate, denoted by α, is the probability of rejecting the null hypothesis
when it is true. Type II error rate, denoted by β, is the probability of failing to
reject the null hypothesis when it is false.
• Power helps us interpret a null result; if power was high and the null was not
rejected, we can be reasonably confident that the effect was not as strong as
originally hypothesized. If the trial was not well-powered, a null result is difficult
to interpret.
Cross-References
References
Berry SM, Carlin BP, Lee JJ, Muller P (2010) Bayesian adaptive methods for clinical trials. CRC
Press, Boca Raton
Cardiac Arrhythmia Suppression Trial (CAST) Investigators (1989) Preliminary report: effect of
encainide and flecainide on mortality in a randomized trial of arrhythmia suppression after
myocardial infarction. NEJM 321(6):406–412
Casella G, Berger RL (2002) Statistical inference. Duxbury, Pacific Grove
Cox DR (1972) Regression models and life-tables. J R Stat Soc Ser B 34(2):187–202
Fisher RA (1925) Statistical methods for research workers. Oliver and Boyd, Edinburgh
Friedman LM, Bristow JD, Hallstrom A et al (1993) Data monitoring in the cardiac arrhythmia
suppression trial. Online J Curr Clin Trials, Doc. No. 79 [5870 words; 53 paragraphs]
Hackshaw A, Kirkwood A (2011) Interpreting and reporting clinical trials with results of borderline
significance. BMJ 343:d3340
Hochberg Y, Tamhane AC (1987) Multiple comparison procedures. Wiley, Hoboken
Hosmer DW Jr, Lemeshow S, May S (2011) Applied survival analysis: regression modeling of
time-to-event data. Wiley, Hoboken
Kyriacou DN (2016) The enduring evolution of the p value. JAMA 315(11):1113–1115
Lin DY, Dai L, Cheng G et al (2016) On confidence intervals for the hazard ratio in randomized
clinical trials. Biometrics 72(4):1098–1102
Lurie P, Wolfe SM (1997) Unethical trials of interventions to reduce perinatal transmission of the
human immunodeficiency virus in developing countries. NEJM 337(12):853–856
Rosner B (2015) Fundamentals of biostatistics. Brooks/Cole, Boston
Wendl MC (2016) Pseudonymous fame. Science 351(6280):1406
Estimands and Sensitivity Analyses
84
Estelle Russek-Cohen and David Petullo
Contents
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1632
Randomization and Randomized Clinical Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1634
Causal Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1635
Estimand Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1635
Intent to Treat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1636
Types of Trials and Measurements in Trials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1636
Strategies for Addressing Intercurrent Events when Formulating Estimands . . . . . . . . . . . . . . . . . 1637
Treatment Policy Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1638
Hypothetical Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1639
Principal Stratification Strategy: Estimands and Causal Inference . . . . . . . . . . . . . . . . . . . . . . . . 1639
Some Cautions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1641
Other Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1643
Importance of Selecting an Estimand at the Planning Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1643
Role of Covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1644
Types of Missingness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1644
Estimands and Safety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1645
Estimands in Studies with Time-to-Event Endpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1645
Estimands in Complex Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1646
This chapter reflects the views of the authors and should not be construed to represent FDA's views or policies.
E. Russek-Cohen (*)
Office of Biostatistics, Center for Drug Evaluation and Research,
U.S. Food and Drug Administration, Silver Spring, MD, USA
e-mail: [email protected]
D. Petullo
Division of Biometrics II, Office of Biostatistics Office of Translational Sciences,
Center for Drug Evaluation and Research, U.S. Food and Drug Administration,
Silver Spring, MD, USA
e-mail: [email protected]
© This is a U.S. Government work and not under copyright protection in the U.S.; 1631
foreign copyright protection may apply 2022
S. Piantadosi, C. L. Meinert (eds.), Principles and Practice of Clinical Trials,
https://doi.org/10.1007/978-3-319-52636-2_115
Abstract
An estimand is a quantity used to define a treatment effect in a clinical trial. In many cases, clinical trial planners have skipped the step of defining the estimand in their rush to pick a test statistic and calculate planned sample size(s), sometimes leading to ambiguity in how the results of a trial were to be interpreted. In this chapter we describe
estimands in detail and explain the importance of defining estimands when planning
randomized trials and doing this before picking a test statistic to use in evaluating trial
outcomes. The estimand is key to defining the scientific question the trial needs to
address. When patients drop out or fail to follow a planned regime within a random-
ized clinical trial and stakeholders disagree on how this impacts the analysis of the
trial, interpretability of this trial can be called into question. A clear definition of
treatment effect ought to capture how dropouts and protocol violators will be handled.
In this chapter sensitivity analyses are tied to the definition of the estimand in a
trial. In practice, sensitivity analyses are often ad hoc and only addressed after a
study is completed. Considering both estimands and sensitivity analyses in
planning will improve the interpretation of results from completed randomized
trials. While regulators (e.g., the US Food and Drug Administration) have been
particularly interested in advancing these ideas, utilization of these ideas ought to
improve the interpretability of randomized trials more generally.
Keywords
Intent to treat · Intercurrent events · Protocol violations · Tipping point analyses ·
Treatment effect
Introduction
In other parts of this text, considerable attention is given to the planning of clinical trials, the estimation of key summary measures, and finally the reporting of clinical trial results.
The topic of an estimand may never enter into those discussions, and in clinical trial
textbooks written over a decade ago, it seems unlikely that the topic of an estimand
would be covered. Yet estimands are defined as quantities used to capture treatment
effects within a clinical trial, and they are not always carefully considered during the
planning stage. Sensitivity analyses may be something you have seen before but, as
with estimands, are not often covered in a systematic way in textbooks. Discussions
on estimands force a clinical trial planning team to define the scientific question to be
answered in the trial. Sensitivity analyses are often used at the end of a trial to
confirm the results and the assumptions of any statistical methods used, but ought
to follow from first defining the estimand of interest. A systematic approach to
sensitivity analyses set up prior to starting the study is preferable to generating a
laundry list of data analyses after the study is over. This chapter stresses the
importance of selecting an appropriate estimand(s) and sensitivity analyses at the
planning stage to allow for a cleaner interpretation of results once the study is
completed.
If all studies went exactly “as planned” and everyone completed the trial without
exception to protocol guidelines, defining an estimand could be a trivial task and
possibly left till the end. However, as noted in a survey by Fletcher et al. (2017), the
overwhelming majority of clinical trials have missing data or protocol deviations.
Waiting until the study is over to decide how these issues will be addressed when determining treatment effects is not good science and, frankly, naive.
Furthermore, to the extent that clinical trials mimic real-world use of a product,
dropouts and failure to take doses as prescribed are a common occurrence and one
should not be surprised by this at the end of the study. Regulators such as the US
Food and Drug Administration (FDA) and companies wanting to market a medical
product often negotiate success criteria for a clinical trial. The choice of an estimand
and showing how an estimate of treatment effect follows from it will be important.
However, if both groups are not on the same page, it would be painful to discover
this after a rather expensive clinical trial has been completed. So the desire
to prespecify estimands is of importance to regulators. It should be noted that in some cases there could be more than one acceptable estimand, so regulators and companies need to communicate early in the development process.
The FDA commissioned a report by the National Academy of Sciences (NAS)
and its research arm, the National Research Council (NRC 2010), dealing with the
prevention and treatment of missing data in clinical trials. One motivation for the
NAS report was the arbitrary use of “last observation carried forward” (LOCF) as a
way of filling in missing values when subjects drop out of a clinical trial submitted to
the FDA. LOCF was easy to calculate, and there may be settings in which it makes sense, but the option was often used without justification. One general recom-
mendation that came out of the NAS document was that one should design trials
that minimize the amount of missing data and estimands ought to be defined when
planning the trial. However, in spite of the NAS report, regulators realized that
current practice had not moved forward (LaVange and Permutt 2016). In 2014
multiple regulatory agencies and their industry counterparts under the umbrella of
the International Council for Harmonisation (ICH) agreed to develop an addendum
to an important international guideline on statistical principles in clinical trials (ICH
1998, 2014). The focus of the addendum is estimands and sensitivity analyses.
While regulators and their industry counterparts are now considering estimands
earlier in their deliberations, it would be unfortunate to think these discussions are
solely related to medical product approval. There are many clinical trials sponsored
by others (e.g., National Institutes of Health) that have public health impacts and
thinking about how results will be interpreted as one plans a study should improve
the science. Practices such as increasing the sample size to account for an expected
dropout rate without thinking about why dropouts occur are a missed opportunity to
plan a better study.
Randomization and Randomized Clinical Trials

The majority of clinical trials reported in a drug or medical device label are
randomized, and such trials are considered the gold standard in establishing treat-
ment differences. In this chapter, for simplicity, we focus on clinical trials with two
treatment groups, most commonly a treatment group and a control group. However,
the principles here ought to have relevance to other kinds of trials, e.g., traditional
trials with more than two arms, pragmatic clinical trials (Ford and Norrie 2016) that
may harness electronic health records, and/or relax eligibility requirements to assess
something closer to real-world effectiveness of an intervention and to trials that rely
more heavily on data from other sources (e.g., using external control data).
Randomization in most trials should result in comparability of subjects in the two
treatment groups with respect to baseline characteristics, but comparability can be
lost depending on events that occur post-randomization. What is often missing from
the characterization of a treatment effect is how post-randomization events were
handled. The new ICH E9 R1 document defines events such as leaving a study early,
use of rescue medications, and so on as “intercurrent” events. When these events are
not balanced across treatment arms or the reasons for why these occur are not the
same, the interpretability of the study may be problematic. Therefore, when choos-
ing an estimand, one should consider all relevant intercurrent events. In many
therapeutic areas, these can impact a substantial portion of the study subjects. The
NAS report (NRC 2010) encourages FDA to explore which post-randomization
events (i.e., intercurrent events) are common and in what settings so future clinical
trials can be better planned. This kind of activity is still going on at FDA.
Randomized clinical trials have appeal because one can attribute causation,
namely, if the randomization was done appropriately and the study went according
to plan, observed significant treatment differences can be attributed to the difference
in treatments under investigation. However, in long-term studies or any trial where
there are more than a few dropouts and/or protocol violators, treatment effects are
harder to interpret. The issue is worse if the number of dropouts or the reasons for
dropouts and protocol violations differs among treatment arms. For example, a
dropout on a placebo arm could be due to ineffectiveness of the intervention, but
dropouts on the arm with a new drug could be due to serious side effects.
Causal Inference
The term estimand appears in the literature associated with causal inference (Little and Rubin 2000), where the role of confounding was recognized and treatment effects from anything other than a randomized clinical trial had to be interpreted with caution. At the heart of many causal inference discussions, the reader is asked to
imagine how the same subject would respond if they were assigned to one treatment
group and then if they were assigned to the other treatment and think of the treatment
effect for that subject as the difference in the two values (Y(trt)-Y(control)). The
value for that subject in the unobserved arm is the unobserved potential outcome. In
most instances one only gets to observe the outcome on one treatment, but under
randomization, one could imagine the two groups being comparable at baseline and
treatment effect could logically be interpreted to be the mean of the subject level
treatment effects.
Estimand Framework
It is common to see a statistical analysis plan (SAP) state that because the trial
sponsor anticipated a 20% dropout rate, they would increase the sample size by
some corresponding value (e.g., using 125% of the calculated sample size), ignoring
why missing data occurs or that it may be imbalanced among the treatment arms.
This is incorrect even if one does need an increase in sample size, since a failure to
account for intercurrent events may still result in an uninterpretable trial. Others
involved in planning trials may ignore missing data issues altogether. Sponsors of
clinical trials would regularly use terms like intent to treat (ITT) that were interpreted
differently by other stakeholders. Some interpreted ITT as only recording data while
on treatment, yet others would follow subjects till the end irrespective of whether
they complied with the assigned treatment regime. In reality, many analysis plans
did not consider the basis of missing data and instead picked a method that used
simpler and possibly unrealistic assumptions such as missing completely at random
(see section “Types of Missingness”; Little and Rubin 2014).
Intent to Treat
The term “intent to treat” or ITT became much more commonly used in the clinical
trial literature after the publication of the ICH E9 document Statistical Principles in
Clinical Trials in 1998. This ICH guideline is used globally to assist companies in
designing clinical trials to generate evidence in support of drug approval. However,
people used the term ITT inconsistently after the 1998 document was published
and confused two concepts, namely, the need to account for all subjects enrolled in a
trial and the need to randomize subjects to avoid confounding due to imbalances
in baseline covariates. The new ICH document (ICH 2017) distinguishes these concepts to clarify what treatment effect is being measured. See the chapter on "Intent to Treat."
Types of Trials and Measurements in Trials

While there are many kinds of clinical trials, the topic of estimands has gained more
attention in the context of longitudinal studies where patients are randomized to one
of two (or more) treatments at baseline and patients are repeatedly assessed using the
same measurement at fixed time points prespecified in the protocol (e.g., every
month for 6 months). In O’Neill and Temple (2012), these have been called
symptom trials though the outcomes could be laboratory measurements. For exam-
ple, glycosylated hemoglobin or hemoglobin A1C (HgA1C) is measured at sched-
uled visits after subjects are randomized to treatment arms in trials that evaluate
drugs to treat diabetes. Subjects may drop out at various times, but most are likely to
stay until the end. These types of longitudinal studies are common in the assessment
of treatments for diabetes, depression, pain, allergies, and other possibly chronic
disorders. For such studies, the estimand is often defined in terms of a treatment
effect at the end of the observed time period (e.g., the difference in average HgA1C
after 6 months on assigned treatment). Information collected at earlier times may
improve the precision of the estimate of treatment effect, particularly when a subject
discontinues before 6 months on treatment. For other settings, perhaps the interest
may be the average treatment difference over the observed time period (e.g.,
evaluating a treatment for symptom relief for seasonal allergies).
One alternative class of trials would be outcome trials (O’Neill and Temple 2012),
and these focus on a single event for each subject but may fall into two categories
based on the endpoint utilized. For example, in infectious disease trials, the primary
focus may be on whether the treatment cures a subject of a disease and outcomes
correspond to subject status at a given time point (disease present or not at 6 months
after start of therapy). Dropouts are often regarded as treatment failures, and
dropouts are an important consideration when evaluating a trial. For several thera-
peutic areas, including oncology, time to a prespecified major clinical event is the most common form of endpoint; for example, it could be time until disease progression (with an agreed-upon definition of progression) or time until death due to any cause (overall survival) (FDA-NIH 2018). In some instances (e.g., in drug trials
in cardiology), time to event is a composite outcome (e.g., time until stroke, heart
attack, or death, whichever comes first). In these settings, when subjects have not
had the event of interest, an observation is considered censored. How protocol
violators are handled or whether some of these events can be treated as censored
should be considered when planning a trial much as intercurrent events are addressed
in a longitudinal symptom trial.
Strategies for Addressing Intercurrent Events when Formulating Estimands

The ICH E9 R1 addendum (2017) defines a set of five strategies for selecting
estimands. These are referred to as treatment policy, composite, hypothetical, prin-
cipal stratum, and while on treatment. Statisticians, clinicians, and others with an
understanding of the disease including epidemiologists may need to weigh in on the
choice of an estimand. This would include selection of meaningful endpoints,
identifying clinically relevant intercurrent events likely to occur and then defining
the estimands in the presence of these intercurrent events. These strategies are
discussed below.
The strategies presented are not exhaustive nor are they mutually exclusive. For
example, Mallinckrodt et al. (2012) define de Jure and de Facto estimands with de
Jure focusing on what might have been if subjects completed the planned course of
treatment and de Facto estimands focusing on what is actually observed. However,
these do not directly correspond to the five categories we provide below. The paper
by Phillips et al. (2017) notes that even though Mallinckrodt equates de Jure estimands with efficacy and de Facto estimands with effectiveness, the restricted nature of who enrolls in trials, relative to who may use a particular intervention once in practice, means that effectiveness may not be well characterized in the most common clinical trials. Others have provided approaches that may not map exactly onto the categories we define below (Permutt 2016).
Hypothetical Strategy
A scenario is envisaged in which the intercurrent event would not occur. One would
choose a value to reflect the scientific question of interest assuming a particular
hypothetical scenario, i.e., what a subject’s pain score would have been at the end
of the study had they completed 12 weeks of treatment. Assuming a subject is
comparable to a placebo subject once treatment is discontinued (Mehrotra et al.
2017) or was never on a treatment (as might be implied when using baseline
observation carried forward (BOCF)) (Phillips et al. 2017) may be sensible.
However, each of these approaches to dealing with an intercurrent event is distinct, and the estimators that follow from these estimands may yield different estimates of treatment effect.
Principal Stratification Strategy: Estimands and Causal Inference

Principal stratification is a means of adjusting for variables that are observed after randomization. An overview of these methods is provided by Frangakis and Rubin (2002), but several basic tutorials are available (e.g., Baker et al. 2016; Stuart et al. 2008; Dunn et al. 2005).
2008; Dunn et al. 2005). Methods vary depending on the context of the study, the
kind of post-randomization event, and the properties of the variable that describes
the post-randomization event (e.g., dichotomous or continuous) and the primary
outcome variable for the trial. However, the theory behind principal stratification
relies heavily on the concept of causal estimands which we described briefly earlier.
Namely, one needs to imagine there is a potential outcome for every subject on each
treatment, but one only gets to observe one of them. Additional assumptions specific
to principal stratification imply the potential outcomes for each person do not depend
on the treatment status of others in the study. If subjects were not blinded to their
treatment assignment, their behavior could be influenced by the outcomes of others.
This section briefly describes some simpler examples. In the basic setting, subjects are randomized to one treatment or the other. Dunn et al. (2005) start off with this same setting but then add discussions of missing data. The context for these examples is one in which patients either comply with the treatment group they are assigned to via randomization ("compliers") or show distinct preferences: some subjects would always take the active treatment ("always takers") even if assigned to the control, and some would always stay with the control treatment ("never takers," so named because they would never take the active treatment).
Defiers would always go to the other treatment irrespective of which treatment they
are assigned to. The model assumes the distribution of always takers, never takers,
and compliers is the same in the two treatment arms, a logical result of randomiza-
tion. An additional assumption is that “always takers” and “never takers” will not
contribute to an overall treatment effect. Namely, the subject-specific causal
estimand is zero for an “always taker” and zero for a “never taker.” The estimate
of treatment effect in compliers is then an estimate of the treatment policy estimand
(namely, an estimate of treatment effect in everyone) divided by the estimate of the
fraction that would comply with their randomization assignment. Generally, the
treatment effect in compliers is expected to be larger than one derived using a
treatment policy estimand. This very basic model does not account for dropouts,
loss to follow-up, and so on. However, it does not require that we label a subject as a
complier prior to randomization.
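A numerical sketch of the complier adjustment just described, with invented summary numbers:

```python
# A minimal sketch of the complier-average estimate described above.
itt_effect = 2.0        # treatment policy (ITT) estimate of effect in everyone
compliance_rate = 0.8   # estimated fraction of compliers

complier_effect = itt_effect / compliance_rate
print(complier_effect)  # 2.5: larger than the ITT estimate, as the text notes
```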
A hypothetical example in Dunn et al. (2005) illustrates the concepts. Patients are
randomized to be cared for as a day patient or an inpatient, whereas the treatment
received is either to be cared for as a day patient or an inpatient. Patients could in
theory elect to choose a treatment option other than that assigned via randomization,
and this could be the result of previous experience, costs, severity of the condition
being treated, ability to get to and from day care, etc. This in turn could impact their
outcome (e.g., going back to work within a week or not, a dichotomous outcome).
Compliance is all or none (they do comply with one of the two treatments and they
do not switch between day care and inpatient once they start a particular treatment).
This may not always be realistic so one should think through the assumptions before
selecting a model. Dunn et al. (2005) only consider compliance (and not the factors
that may drive compliance) and a binary response variable. The latent classes for this
problem are as in the paragraph above but translated to this particular problem.
Compliers will stay with the randomized treatment. “Always takers” will be those
that are day patients irrespective of how they were assigned, and “never takers” are
those that are inpatient no matter how they were assigned via randomization. As with
Stuart et al. (2008) and Baker et al. (2016), closed-form equations for an estimate of the causal effect are presented under the same assumption that only compliers have a non-zero subject-specific treatment effect.
Stuart et al. (2008) point to a two-stage regression model that can be used to
generalize beyond this simple example including incorporation of covariates that
can predict participation with assigned treatments and covariates that impact the
outcome or variable of interest. These have been implemented in statistics packages
but are beyond the scope of this chapter.
The covariates in each model need not be the same since there may be covariates
that drive one to stay in the trial and other covariates that help predict the outcome
variable. The model is fit iteratively, and Dunn et al. have proposed an approach
based on a variant of maximum likelihood called “ML EM.” The nice part is that one
can fit such models in a number of statistical packages though the fitting options are
apt to vary.
Some Cautions
Since latent variables are not “observed,” these approaches could be challenged
in that different assumptions regarding latent variables could yield different causal
estimates of treatment effect. However, ignoring imbalances in dropouts or arbi-
trarily using a treatment policy estimand without thinking about compliance is not sensible. So, these models provide estimates of treatment effect while allowing one to account for compliance behavior.

Composite Strategy
The occurrence of an intercurrent event is taken to be a component of the outcome variable, i.e.,
the intercurrent event is integrated with one or more other clinical measures of
interest. For example, in a rare disease setting, the focus is on a treatment to treat
multiple symptoms of the disease (e.g., headaches, diarrhea, etc.). The estimand may
be defined in terms of the average number of days per week without a symptom. The
use of rescue medications to alleviate any one symptom could be considered a day
with a symptom. Assuming the rare disease is chronic (so subjects may be on the
medication for a very long time in practice), one may wish to focus on the change
from baseline versus the last week on a treatment for a weekly average number of
days without symptoms.
Another approach that falls within a composite strategy would be a “responder”
analysis that bins subjects into two classes, namely, success and failure. Multiple
features can be considered in defining success or failure but dropouts are considered
failures. While this results in a loss of information and reduced power, how treatment
effect will be captured is quite clear. This should not be the sole consideration in
picking an estimand. Several estimands including composite estimands are illus-
trated in an example later in this chapter.
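A minimal sketch of a responder analysis in one arm, with invented data; the success threshold is hypothetical, and dropouts count as failures as described above:

```python
import numpy as np

# Each subject is a success only if symptom improvement meets a threshold
# AND the subject completed the trial; dropouts are counted as failures.
improvement = np.array([5.0, 1.0, 7.0, 3.0, 6.0, 2.0])   # symptom scale
completed = np.array([True, True, False, True, True, False])

success = (improvement >= 4.0) & completed
print(success.mean())   # proportion of responders in this arm
```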
Other Considerations

Importance of Selecting an Estimand at the Planning Stage
By acknowledging which estimands are to be assessed, one can properly state what
data needs to be collected in the protocol. Choice of an estimand may mean one
needs to document why subjects leave, and trial sponsors may need to consider
incentives to keep subjects in the trial. Some advice on protocol writing is available on an NIH website, but it is geared toward investigators needing to come to FDA to have their study plans approved (NIH-FDA 2017). It is common practice to have a protocol complete prior to finalizing an SAP. But it is logical to say an initial draft of an SAP ought to be
evaluated with the protocol to be sure the right information will be collected. In the
early stages of most clinical trials, both the protocol and SAP are refined, but they
should be in sync with respect to the primary analyses.
Role of Covariates
The elements of an estimand (see Table 1) do not explicitly call for covariates. One can consider covariates when the estimator or the estimate of treatment effect is defined. Use of appropriate covariates can greatly improve the precision of an estimate of treatment effect and the power of hypothesis tests. However, one should favor covariates that are reliably collected at screening or baseline, so that their use does not create a bigger missing data problem. Covariates can also be very useful when deciding among approaches for imputing missing data (Little and Rubin 2014).
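A short sketch of covariate adjustment, assuming a hypothetical dataset with the treatment arm coded 0/1 and a baseline measurement; the point is only that the adjusted model typically yields a smaller standard error for the treatment term:

```python
# Covariate adjustment (ANCOVA) sketch. Column names are hypothetical;
# "arm" is assumed coded 0/1 so the coefficient is the treatment effect.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("trial.csv")

unadjusted = smf.ols("change ~ arm", data=df).fit()
adjusted = smf.ols("change ~ arm + baseline", data=df).fit()

# Adjusting for a reliably collected baseline covariate usually tightens
# the standard error of the treatment-effect estimate.
print(unadjusted.bse["arm"], adjusted.bse["arm"])
```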
Types of Missingness
When selecting an estimand, one needs to think through the analysis options and the various patterns of missing data that may occur. These could be influenced by the types of intercurrent events that are anticipated and the factors that influence whether or not they occur. Of course, for estimands that do not rely on data after a patient stops taking the assigned medication, data at that stage are not considered missing and may not always be collected (nor should they be imputed). Similarly, there is no need to impute values for life after death in a sensible estimand. See "▶ Chap. 86, Missing Data," and Little and Rubin (2014).
Missing completely at random (MCAR) usually permits an analysis that ignores the reason for the missing data. As discussed in the section on "Principal Stratification Strategy: Estimands and Causal Inference," if missing data are thought to be unrelated to treatment assignment, intercurrent events may still affect the precision of treatment effects, but the resulting estimator and estimate may still be sensible. However, in most settings, MCAR is the least realistic assumption. Missing at random (MAR) usually means the chance of being missing is a function of quantities measured during the trial and incorporated into the analysis. Mehrotra et al. (2017) note that MAR in a longitudinal study with repeated measures involves the assumption that subjects who drop out are like the subjects with the same observed values up until the time dropout occurs. In drug trials subjects often drop out because they are not doing especially well on their assigned treatment, so this assumption can be misleading and can result in an overly optimistic estimate of treatment effect. One of the most common approaches to analyzing the longitudinal studies we describe here uses a mixed model with repeated measures (MMRM), which does not explicitly impute the gaps in the dataset. However, because the analysis is consistent with assuming the data are MAR, it typically generates an overly optimistic estimate of treatment effect.
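A rough sketch of a repeated-measures fit in the spirit of MMRM, assuming a hypothetical long-format dataset. A true MMRM typically uses an unstructured within-subject covariance (as in SAS PROC MIXED); the random-intercept model below is only an approximation, shown to illustrate that missing visits are not explicitly imputed:

```python
# Repeated-measures sketch in the spirit of MMRM. MixedLM fits a
# random-intercept approximation, not a true unstructured-covariance MMRM.
# Long-format columns (one row per subject-visit) are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

long = pd.read_csv("visits.csv")  # missing visits are simply absent rows

model = smf.mixedlm("pain ~ C(week) * arm + baseline",
                    data=long, groups=long["subject"]).fit()
print(model.summary())
# No rows are imputed for missing visits; inference is valid only if the
# missingness is MAR, which is why the estimate can be overly optimistic.
```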
Missing not at random (MNAR) refers to missing data that are neither MAR nor MCAR and is probably more common than one would like to admit. In studies with large amounts of missing data, an analysis that is consistent with an MAR assumption would be especially problematic, even though one sees this in the literature on a regular basis. The reality is that values that would have occurred after a subject leaves a study are at best a guess, since these values do not really exist; analysis methods that minimize the assumptions made about unobserved values are therefore the most appropriate.
For longitudinal studies, the term "monotone missingness" is sometimes used: once a subject drops out, they no longer contribute data and do not return. In the section on "Sensitivity Analysis," an approach that assumes monotone missingness (as the primary source of missing data in a trial) is discussed. If clinicians are likely to use certain laboratory measurements to take a subject off a treatment, how that information is reflected in the data analysis should be thought out, but it would be inappropriate to treat such data as MCAR (Holzhauer et al. 2015).
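A small sketch, under assumed wide-format data with visit columns in time order, of checking how closely a dataset follows a monotone missingness pattern:

```python
# Check whether missingness is monotone: once a subject misses a visit,
# they never return. Columns (week1, ..., week16) are hypothetical and
# assumed to be in chronological order.
import pandas as pd

wide = pd.read_csv("visits_wide.csv")
visit_cols = [c for c in wide.columns if c.startswith("week")]

observed = wide[visit_cols].notna()
# Monotone pattern: the observed indicator never flips from False back to True.
monotone = ~(observed.astype(int).diff(axis=1) > 0).any(axis=1)
print(f"{monotone.mean():.0%} of subjects have a monotone missingness pattern")
```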
Although estimands ought to be spelled out prior to selecting an estimator, there
likely will be an iterative process involved. Missing data and/or protocol violations
will have to be part of the conversation. This is normal when planning a clinical trial.
There can be one estimand (or more) for efficacy and other estimands for safety. For
example, in vaccine trials submitted to FDA in support of vaccines for healthy
subjects, only subjects completing a prescribed regimen are typically included in
the efficacy calculations. Those included in the safety assessments are those who
receive at least one dose of the treatment assigned. So, efficacy and safety estimands
differ and the resulting estimates may not involve using data from the same subjects.
For trials where safety is a primary outcome, such as a safety study to rule out an
elevated cardiovascular risk for a drug to treat diabetes (FDA 2008), the estimand may be defined in terms of a treatment policy (intention-to-treat) strategy where all subjects are included, whether or not they adhere to the assigned treatment regimen. This could
differ from a study in which a more traditional efficacy endpoint is being used, and
there are no prespecified hypotheses associated with safety.
In oncology and many cardiovascular trials, time to event data, such as time until
death from any cause (overall survival), are common endpoints of interest. The most
common measure of treatment efficacy in that setting is a hazard ratio (e.g., using a
Cox proportional hazards regression). It would be hard to represent the hazard ratio
as a causal estimand (see section on “Causal Inference”). But it would be inappro-
priate to think that some of the principles we have identified here do not apply.
In many time-to-event studies, not all patients will die (or have the event of interest) by the time the study ends, and that must be considered in the analysis. Administrative censoring refers to subjects not having had the event of interest at the time the study ends; it would likely be considered non-informative (namely, the time to event and the time at which censoring occurs are regarded as independent). However, one would need to specify how other types of censoring are handled, e.g., how one would treat patients who move onto another treatment because of disease progression in a trial with overall survival as an endpoint. For example, (1) patients who move onto another active treatment for the indication in question may be considered differently than (2) patients who move from the active treatment to the control arm.
Clarity about endpoints and reasons for censoring is relevant in other time-to-event settings as well (FDA-NIH 2018). Planning in advance for how these are handled is also better science.
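A brief sketch of a proportional hazards fit using the lifelines package, with hypothetical columns; the substantive decision, namely how to code the event and censoring indicators for treatment switchers, must be made before any such code is run:

```python
# Cox proportional hazards sketch. How subjects who switch treatments are
# coded (censored vs. followed for survival) is an estimand decision that
# must be prespecified; it is not a software default. Columns hypothetical.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("survival.csv")
# time  = follow-up time; event = 1 if death observed, 0 if censored;
# arm   = treatment indicator coded 0/1.
cph = CoxPHFitter()
cph.fit(df[["time", "event", "arm"]], duration_col="time", event_col="event")
cph.print_summary()  # the hazard ratio for 'arm' is exp(coef)
```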
Adaptive Designs
Adaptive designs are more common today, and some designs can alter the estimand after the start of the trial. One obvious case is adaptive enrichment (Rosenblum et al. 2016), where an interim analysis is planned and, based on that analysis, future recruitment is restricted to a prespecified subgroup of patients, e.g., those with more severe disease. This may change the estimand in that it changes the intended population; it may also alter the frequency of certain outcomes (e.g., deaths may occur at a higher frequency in a study of severely ill patients). The decision to study a subgroup should not jeopardize the overall integrity of the study (FDA 2019).
Meta-Analysis
Meta-analyses are common when there are multiple trials designed to answer similar questions. One common objective in a meta-analysis is to provide a global estimate of treatment effect, combining information from several trials. It would be challenging to include trials with different estimands (and perhaps different considerations of intercurrent events) in a meta-analysis. Trials of different duration are a challenge, particularly if the effect size is apt to change with duration. In addition, intercurrent events may be more likely when patients are observed over a longer time period. It may be necessary to obtain line data and reanalyze them with a common estimand and a consistent approach to handling missing data in mind. This could be a challenge, since many meta-analyses use summary measures from journal articles. Note this concern is distinct from the issue of relying solely on published studies.
Network Meta-Analysis
NI studies are common in a regulatory environment where one wants to allow drugs and other medical products to compete with products already on the market by showing comparability rather than superiority (see chapter on "Non-inferiority"). NI studies use an already established therapy (e.g., a medical product already approved at FDA) to serve as an "active control." Previous studies are used to formulate a margin, namely, by (1) considering how the active control compares to a placebo and (2) defining how to compare a new treatment to the active control. The margin needs to be small enough to demonstrate that the novel treatment is still effective relative to placebo, and part of planning can involve trying to capture what proportion of that effect needs to be conserved with the new product. The US guidance on NI studies for drugs and biologics (FDA 2016) provides advice on determining an NI margin, though in other settings the margin can be determined in other ways. In the guidance, the margin is often determined using a meta-analysis that compares a proposed active control arm to a placebo. It would be useful for all the trials used in the meta-analysis and the planned trial to use the same estimand, or at least to factor the use of different estimands into which trials are or are not included when the margin is determined.
Because a treatment effect close to zero is usually declared a success in this setting, non-inferiority studies with large numbers of intercurrent events could be suspect, and minimizing intercurrent events is critical (Rothmann et al. 2011).
Estimands need to be selected such that once a trial is completed and one uses
estimators consistent with an estimand, the results are interpretable. Regulators are
obligated to approve products that are both safe and effective, and estimates of
treatment effect will drive decision-making. However, in medical practice, there may
be an interest in comparing different treatments for a given patient. Different
estimands and different ways of estimating treatment effects could hamper the use of summary data from various trials to decide which treatments are best.
Benefit Risk
While most trials handle effectiveness and safety separately, there is value in considering benefit and risk together. For example, in treating patients for certain types of heart disease, drugs that reduce the risk of clots can come with an elevated risk of serious bleeding, and one may wish to define an endpoint and an estimand that formally weigh both kinds of events. Literature in this space is limited.
Sensitivity Analyses
An Example
An example from the US FDA is presented (the authors are not permitted to share the line data). It illustrates the discussion of estimands and sensitivity analyses (namely, tipping point analyses) in the context of formal decision-making. First, some issues that arise when designing clinical trials to evaluate pain medications are discussed.
In such trials, subjects who respond to the experimental drug during an open-label phase are randomized into the double-blind treatment phase. Randomized subjects either stay on active treatment or switch to the control group. If the control group is a placebo, subjects are tapered off the active drug to minimize discontinuations due to withdrawal symptoms and to assist in maintaining the blinding of the study. This is referred to as an enriched enrollment withdrawal design. Even with this design, subjects still discontinue active treatment for lack of efficacy and adverse events (Katz 2009).
In chronic pain trials, subjects who discontinue treatment will most likely switch to other therapies. This makes an estimand based on a treatment policy strategy difficult to interpret, since it measures the impact of "treatment plus other therapies" versus "placebo plus other therapies." An estimand using a composite strategy that counts treatment discontinuations as failures is more commonly used. The focus is on the difference between the two treatments at Week 12, as this is a chronic condition, although, if the drug is approved, subjects may stay on it considerably longer.
Sometimes one sees a responder analysis in this setting. A responder is a subject who shows a prespecified improvement from baseline pain, such as a 30% or 50% improvement deemed clinically relevant. The responder definition may also restrict use of rescue medication, such as no use or less than a specified amount. For example, if a subject uses rescue medication for 7 consecutive days or longer, they are considered a non-responder in the primary analysis (Dworkin et al. 2005). So, subjects are either responders or non-responders. This approach has often been criticized as having less power, but it is easy to implement.
A method that retains more information than a dichotomy of response utilizes
a continuous responder curve with an appropriate corresponding analysis. We
illustrate this in our FDA example below.
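A sketch of such a cumulative proportion-of-responders curve in the spirit of Farrar et al. (2006), with hypothetical column names and dropouts assigned no improvement, consistent with a composite strategy:

```python
# Cumulative proportion-of-responders curve: for every improvement cutoff,
# plot the share of each arm at or above it. Columns are hypothetical.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("trial.csv")
pct = 100 * (df["baseline_pain"] - df["week16_pain"]) / df["baseline_pain"]
pct = pct.where(df["completed"].astype(bool), 0.0)  # dropouts -> 0% improvement

cutoffs = np.linspace(0, 100, 101)
for arm, grp in pct.groupby(df["arm"]):
    share = [(grp >= c).mean() for c in cutoffs]
    plt.plot(cutoffs, share, label=str(arm))
plt.xlabel("Percent improvement in pain (cutoff)")
plt.ylabel("Proportion of subjects at or above cutoff")
plt.legend()
plt.show()
```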
This example was submitted to FDA in a New Drug Application (NDA). Statistical and clinical reviews are provided by FDA at the Drugs@FDA weblink (FDA 2020). The study is briefly described below, along with a discussion of possible estimands and estimators and the corresponding estimates of treatment effect. These were previously presented (Petullo 2016).
This was an 18-week (4-week dose-adjustment phase, 12-week fixed-dose
maintenance phase, 1-week taper phase, 1-week follow-up phase) randomized,
double-blind, placebo-controlled, multicenter trial. Subjects were started on
150 mg per day or placebo and titrated to a target dose range of 150–600 mg per
day. Subjects were required to have a diagnosis of traumatic spinal cord injury (SCI) of at least 1-year
duration with central neuropathic pain that had persisted continuously for at least
3 months or with remissions and relapses for at least 6 months. Concomitant
analgesics were allowed if subjects were on a stable dose regimen prior to
randomization. Subjects must have had an average daily pain score of at least 4 (0–
10 NRS) during the 7 days prior to randomization.
The study randomized 219 subjects, 108 in the placebo arm and 111 in the active arm. Approximately 15% of subjects had missing data at Week 16 (15% in the placebo arm and 17% in the active arm). Similar numbers dropped out for adverse events and for lack of efficacy in the two treatment arms.
When this study was planned, conducted, and reviewed by FDA, a primary estimand of interest was not explicitly stated, but the estimand that corresponds to the analysis supporting approval is presented here, along with how it was estimated. A composite strategy was used to define the estimand, i.e., to define the effect of the active drug for treating neuropathic pain associated with SCI. This estimand did not require follow-up of subjects after treatment discontinuation, as they were considered treatment failures or non-responders. The four components of this estimand are described below:
(A) Population: Subjects with traumatic SCI of at least 1-year duration with central
neuropathic pain that had persisted continuously for at least 3 months or with
remissions and relapses for at least 6 months.
(B) Variable: Change from baseline in pain score at Week 16. (Baseline pain was defined as the pain prior to the OL titration phase.)
(C) Intercurrent Event: The subject did not complete 16 weeks of treatment. Subjects who experienced the intercurrent event were considered as having no improvement in baseline pain.
(D) Population-Level Summary: Difference in mean change in baseline pain at
Week 16 comparing subjects with neuropathic pain associated with spinal
cord injuries assigned to active drug versus those assigned to placebo.
For the estimand in the example above, subjects who discontinued treatment were considered treatment failures, and imputation was simple. If the primary analysis instead uses a more optimistic assumption, namely that those who dropped out were like those who stayed in up until the time they left, an analysis consistent with an MAR assumption can be the primary analysis. A tipping point analysis evaluating how sensitive the results are to this assumption is then worth considering.
Table 2 Tipping point analysis. Entries are the difference from placebo in mean change from baseline at Week 16, mean (95% CI). Columns give the shift applied to missing placebo outcomes; rows give the shift applied to missing Lyrica outcomes

| Lyrica shift | 0 | 0.5 | 1.0 | 1.5 | 2.0 | 2.5 |
|---|---|---|---|---|---|---|
| 0 | 0.8 (0.3, 1.4) | 0.9 (0.4, 1.5) | 1.0 (0.4, 1.5) | 1.1 (0.5, 1.6) | 1.1 (0.6, 1.7) | 1.2 (0.7, 1.8) |
| −0.5 | 0.8 (0.2, 1.3) | 0.8 (0.3, 1.4) | 0.9 (0.4, 1.5) | 1.0 (0.4, 1.5) | 1.1 (0.5, 1.6) | 1.1 (0.6, 1.7) |
| −1.0 | 0.7 (0.1, 1.2) | 0.8 (0.2, 1.3) | 0.8 (0.3, 1.4) | 0.9 (0.3, 1.5) | 1.0 (0.4, 1.5) | 1.1 (0.5, 1.6) |
| −1.5 | 0.6 (0.0, 1.2) | 0.7 (0.1, 1.2) | 0.7 (0.2, 1.3) | 0.8 (0.3, 1.4) | 0.9 (0.3, 1.5) | 1.0 (0.4, 1.5) |
| −2.0 | 0.5 (0.1, 1.1) | 0.6 (0.0, 1.1) | 0.7 (0.1, 1.2) | 0.7 (0.2, 1.3) | 0.8 (0.2, 1.4) | 0.9 (0.3, 1.5) |
| −2.5 | 0.4 (−0.1, 1.0) | 0.5 (−0.1, 1.0) | 0.6 (0.0, 1.2) | 0.7 (0.1, 1.2) | 0.7 (0.1, 1.3) | 0.8 (0.2, 1.4) |
Starting from an MAR assumption, the tipping point analysis varies the pain scores for subjects with missing outcomes. The missing outcomes on each treatment arm are allowed to vary independently, including scenarios where dropouts on the active arm have worse outcomes than dropouts on control. The goal is to explore the plausibility of the missing data assumptions under which there is no longer evidence of a treatment effect. As seen in Table 2, the analysis only tips, i.e., p-value >0.05, if missing data for active subjects are assumed to have much worse outcomes than missing data for placebo subjects. In this example, the various analyses did not appear to impact the final conclusions. When they do, those involved in the analysis and interpretation of the trial need to determine what this means. The choice of a particular primary analysis and sensitivity analysis could depend on the degree of missing data anticipated; MAR as a starting assumption could be quite plausible in settings where dropouts are infrequent, but it may not be ideal in pain studies where dropouts are common.
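A schematic tipping point loop, assuming hypothetical columns and a deliberately crude single fill-in (real analyses typically use multiple imputation); the search reports the arm-specific shifts at which significance is first lost:

```python
# Tipping point sketch: fill in missing Week 16 changes under an MAR-like
# assumption (crudely, each arm's completer mean), then shift by arm-specific
# deltas and retest until significance is lost. Columns are hypothetical.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("trial.csv")
miss = df["change"].isna()

for d_active in np.arange(0, -3.0, -0.5):      # worsen missing active subjects
    for d_placebo in np.arange(0, 3.0, 0.5):   # improve missing placebo subjects
        filled = df.copy()
        for arm, delta in [("active", d_active), ("placebo", d_placebo)]:
            m = miss & (filled["arm"] == arm)
            base = filled.loc[(filled["arm"] == arm) & ~miss, "change"].mean()
            filled.loc[m, "change"] = base + delta
        act = filled.loc[filled["arm"] == "active", "change"]
        plc = filled.loc[filled["arm"] == "placebo", "change"]
        t, pval = stats.ttest_ind(act, plc)
        if pval > 0.05:
            print(f"tips at active shift {d_active}, placebo shift {d_placebo}")
```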
Conclusion
The importance of considering both estimands and sensitivity analyses, and their role in the planning of clinical trials, has been described here. It is possible to have more than one plausible estimand and more than one analysis. But it is expected that all those involved in trial planning and assessment will understand what is going to be presented when a trial is completed.
It is hoped that greater consideration at the planning stage of things that can go awry will lead to better protocols and analysis plans. Thinking through the options before a study starts can determine what data need to be collected and what is regarded as missing, and will limit the number and kinds of analyses that are reasonable given the question of interest. One cannot anticipate everything that can happen in a clinical trial, but that does not mean planning to prevent or minimize problems that have occurred in related trials is unimportant.
Some of the attention to this topic has been motivated by regulators, but the
principles described are not solely applicable to trials for medical product approvals.
Finally, we recognize that more complex analytical approaches to handling
missing data exist. See “▶ Chap. 86, Missing Data.”
Key Facts
Cross-References
▶ Missing Data
References
ACTTION (2002) Analgesic, Anesthetic, and Addiction Clinical Trial Translations, Innovations,
Opportunities, and Networks (ACTTION). www.acttion.org. Accessed Jan 2019
Alosh M, Fritsch K, Huque M, Mahjoob K, Pennello G, Rothmann M, Russek-Cohen E, Smith F,
Wilson S, Yue L (2015) Statistical considerations on subgroup analysis in clinical trials.
Stat Biopharm Res 7:286–303
Baker SG, Kramer BS, Lindemen KS (2016) Latent class instrumental variables: a clinical and
biostatistical perspective. Stat Med 35:147–160
Campbell G, Pennello G, Yue L (2011) Missing data in the regulation of medical devices.
J Biopharm Stat 21:180–195
Chiba Y, VanderWeele TJ (2011) A simple method for principal strata effects when the outcome is truncated due to death. Am J Epidemiol 173:745–751
Dunn G, Maracy M, Tomenson B (2005) Estimating treatment effects from randomized clinical
trials with noncompliance and loss to follow-up: the role of instrumental variable methods.
Stat Methods Med Res 14:369–395
Dworkin RH, Turk DC, Farrar JT, Haythornthwaite JA, Jensen MP, Katz NP, Kerns RD, Stucki G,
Allen RR, Bellamy N, Carr DB, Chandler J, Cowan P, Dionne R, Galer BS, Hertz S, Jadad AR,
Kramer LD, Manning DC, Martin S, McCormick CG, McDermott MP, McGrath P, Quessy S,
Rappaport BA, Robbins W, Robinson JP, Rothman M, Royal MA, Simon L, Stauffer JW,
Stein W, Tollett J, Wernicke J, Witter J (2005) Core outcome measures for chronic pain trials:
IMMPACT recommendations. Pain 113(1–2):9–19
Efthimiou O, Debray TPA, van Valkenhoef G, Trelle S, Panayidou K, Moons KGM, Reitsma JB, Shang A, Salanti G et al (2016) GetReal in network meta-analysis: a review of the methodology. Res Synth Methods 7:236–263
Farrar JT, Dworkin RH, Max MB (2006) Use of the cumulative proportion analysis graph to
present data over a range of cut-off points: making clinical trial data more understandable.
J Pain Symptom Manag 31(4):369–377
FDA (2020) Drugs@FDA. http://www.accessdata.fda.gov/scripts/cder/daf
FDA-NIH Biomarker Working Group (2018) BEST (Biomarkers, EndpointS, and other Tools) Resource [Internet]. Food and Drug Administration (US), Silver Spring; co-published by National Institutes of Health (US), Bethesda. Glossary created 28 Jan 2016, updated 2 May 2018
Fletcher C, Tsuchiya S, Mehrotra D (2017) Current practices in choosing estimands and sensitivity analyses in clinical trials: results of the ICH E9 survey. Ther Innov Regul Sci 51:69–76
Ford I, Norrie J (2016) Pragmatic trials. N Engl J Med 375:454–463
Frangakis CE, Rubin DB (2002) Principal stratification in causal inference. Biometrics 58:21–29
Holzhauer B, Akacha M, Bermann G (2015) Choice of estimand and analysis methods in diabetes
trials with rescue medication. Pharm Stat 14:433–447
ICH (1998) E9: Guideline on statistical principles for clinical trials. http://www.ich.org
ICH (2014) E9 concept paper on estimands and sensitivity analyses. http://www.ich.org
ICH (2017) E9 R1: Addendum on estimands and sensitivity analyses in clinical trials. Step 2. www.
ich.org
IMMPACT (2011) Initiative on methods, measurement, and pain assessment in clinical trials
(IMMPACT). www.immpact.com. Accessed Jan 2019
Katz N (2009) Enriched enrollment randomized withdrawal trial designs of analgesics: focus on
methodology. Clin J Pain 25(9):797–807
Kurland BF, Johnson LL, Egleston BL, Diehr PH (2009) Longitudinal data with follow-up
truncated by death: match the analysis method to research aims. Stat Sci 24:211–222
Lavange LM, Permutt T (2016) A regulatory perspective on missing data in the aftermath of the
NRC report. Stat Med 35:2853–2864
Leuchs A, Zinserling J, Brandt A, Wirtz D, Benda N (2015) Choosing appropriate estimands in clinical trials. Ther Innov Regul Sci 49:584–592
Little RJA, Rubin DB (2000) Causal effects in clinical and epidemiological studies via potential outcomes: concepts and analytical approaches. Annu Rev Public Health 21:121–145
Little RJA, Rubin DB (2014) Statistical analysis with missing data, 2nd edn. Wiley, New York.
408pp
Mallinckrodt CH, Lin Q, Lipkovich I, Molenberghs G (2012) A structured approach to choosing estimands and estimators in longitudinal clinical trials. Pharm Stat 11:456–461
Mehrotra D, Liu F, Permutt T (2017) Missing data in clinical trials: control based mean imputation
and sensitivity analysis. Pharm Stat 16:378–392
National Research Council (2010) The prevention and treatment of missing data in clinical trials.
National Academies of Science Press, Washington, DC
O’Neill RT, Temple R (2012) The prevention and treatment of missing data in clinical trials: an
FDA perspective on the importance of dealing with it. Clin Pharmacol Ther 91:550–554
Ouyang J, Carroll KJ, Koch G, Li J (2017) Coping with missing data in phase III pivotal registration
trials: Tolvaptan in subjects with kidney disease, a case study. Pharm Stat 16:250–266
Permutt T (2016) A taxonomy of estimands for regulatory clinical trials with discontinuations. Stat Med 35:2865–2875
Permutt T, Li F (2017) Trimmed means for symptom trials with dropouts. Pharm Stat 16:20–28
Petullo D (2016) Statistical review and evaluation. https://www.accessdata.fda.gov/drugsatfda_
docs/nda/2012/021446Orig1s028StatR.pdf
Petullo D, Permutt T, Li F (2016) An alternative to data imputation in analgesic clinical trials.
American Pain Society Conference on Analgesic Trials, Austin Texas
Phillips A, Abellan-Andres J, Soren A, Bretz F, Fletcher C, France L, Garrett A, Harris R, Kjaer M, Keene O, Morgan D, O'Kelly M, Roger J (2017) Estimands: discussion points from the PSI estimands and sensitivity expert group. Pharm Stat 16:6–11
Rosenblum M, Qian T, Du Y, Qiu H, Fisher A (2016) Multiple testing procedures for adaptive
enrichment designs: combining group sequential and reallocation approaches. Biostatistics
17(4):650–662
Rothmann MD, Wiens BL, Chan ISF (2011) Design and analysis of non-inferiority trials. Chapman & Hall/CRC Press, Boca Raton. 454pp
Scharfstein D, McDermott A, Olson W, Wiegand F (2014) Global sensitivity analyses with
informative dropouts: a fully parametric approach. Stat Biopharm Res 6:338–348
Stuart EA, Perry DF, Le H-N, Ialongo NS (2008) Estimating intervention effects of prevention
programs: accounting for noncompliance. Prev Sci 9:288–298
US FDA (2008) Guideline for industry: diabetes mellitus-evaluating cardiovascular risk in new
antidiabetic therapies to treat Type 2 diabetes. Dec 2008. https://www.fda.gov/downloads/
Drugs/Guidances/UCM071627
US FDA (2016) Non-inferiority clinical trials to establish effectiveness: guidance for industry.
Nov 2016. https://www.fda.gov/downloads/Drugs/Guidances/UCM202140
US FDA (2018) Product label for Cymbalta. https://www.accessdata.fda.gov/drugsatfda_docs/
label/2008/022148lbl.pdf. Accessed Jan 2019
US FDA (2019) Adaptive designs for clinical trials for drugs and biologics: guidance for industry.
https://www.fda.gov/media/78495/download. Accessed 24 April 2020
US National Institutes of Health-Food and Drug Administration (2017) NIH-FDA Protocol tem-
plate. https://osp.od.nih.gov/clinical-research/clinical-trials. Accessed 9 Nov 2018
85 Confident Statistical Inference with Multiple Outcomes, Subgroups, and Other Issues of Multiplicity
Contents
Introduction 1660
Patient Targeting for a Targeted Therapy 1660
  Strength of Error Rate Controls in Patient Targeting 1661
Respecting Logical Relationships 1664
Three Kinds of Outcome Measures 1664
  A Common Misconception 1665
  Binary Outcome 1666
  Time-to-Event Outcome 1669
  The Subgroup Mixable Estimation (SME) Principle 1671
Effect of the Prognostic Factor on Permutation Tests 1672
  Data Model 1672
  Predictive Null Hypothesis 1673
  Test Statistic and Reference Distributions 1674
  Numerical Study 1675
Summary and Conclusions 1678
Key Facts 1678
Cross-References 1678
References 1679
S. Kil
LSK Global Pharmaceutical Services, Seoul, Republic of Korea
E. Kaizar
The Ohio State University, Columbus, OH, USA
e-mail: [email protected]
S.-Y. Tang
Roche Tissue Diagnostics, Oro Valley, AZ, USA
e-mail: [email protected]
J. C. Hsu (*)
Department of Statistics, The Ohio State University, Columbus, OH, USA
e-mail: [email protected]
Abstract
This chapter starts with a thorough discussion of different multiple comparison error rates, including weak and strong control for multiple tests and noncoverage probability for confidence sets. With multiple endpoints as an example, it describes which error rate controls translate into control of the incorrect decision rate. Then, using targeted therapy as the context, it discusses a potential issue with some efficacy measures in terms of respecting logical relationships among subgroups, and a statistical principle that helps avoid this issue is described. As another example of multiplicity-induced issues to be aware of, it is shown that a permutation test for patient targeting may not control the Type I error rate in some situations. Finally, a list of the key points and a summary of the conclusions are given.
Keywords
Subgroups · Multiple comparisons · Prognostic effect · Permutation tests
Introduction
Multiplicity issues arise in clinical trials due to having multiple treatments, endpoints, and subgroups. Using precision medicine as the context, this chapter describes multiple comparison principles that help ensure proper error rate control. To start, the extent to which each type of multiple comparison Type I error rate control ensures control of the incorrect decision rate is discussed. Then, efficacy measures that respect natural logical relationships among patient subgroups are enumerated. The Subgroup Mixable Estimation (SME) principle for achieving statistical inference that respects such logic is described. It is also shown that permutation testing is a technique to be avoided, as it does not produce a valid null distribution with a discrete outcome, even with only one subgroup classifier. Finally, a list of key points is given, and a summary of the conclusions is provided.
Patient Targeting for a Targeted Therapy
Targeted therapies, which as Woodcock (2015) states are sometimes called "personalized medicine" or "precision medicine," target specific pathways.
Suppose a companion diagnostic test (CDx) divides patients into a marker-positive (g+) subgroup and its complementary marker-negative (g−) subgroup. Call the entire patient population {g+, g−} "all-comers." If all-comers or a patient subgroup can confidently be inferred to receive clinically meaningful efficacy, then a decision is made to target all-comers or that subgroup. If no patient group can be confidently inferred to receive clinically meaningful efficacy, then no group is targeted.
Strength of Error Rate Controls in Patient Targeting
Statistical methods for patient targeting should control the probability of incorrect
targeting, the probability that a targeted patient group does not derive clinically
meaningful efficacy from the treatment it is given.
Null Control
Some methods for patient targeting offer error rate control under what can be called the null null hypothesis: that Rx has exactly the same effect as C for all biomarker subgroups. In other words, there is no treatment effect or biomarker effect whatsoever.
If the outcome is survival time, say, then under the null null, all patients come from a single group with the same survival curve, regardless of whether they are given Rx or C, or what biomarker value they have.
In a lucid paper written before the adoption of modern concepts of multiple comparison error rate control, Miller and Siegmund (1982) suggest forming 2 × 2 tables of Rx vs. C and responder vs. nonresponder at every cut-point and then selecting the cut-point with the maximum chi-square statistic value. Their critical value calculation, which tests whether the observed differential efficacy between the g+ patients and g− patients at the sample maximum chi-square statistic is just due to random sampling fluctuation, is performed under the null null hypothesis.
As John W. Tukey would say, “controlling the Type I error rate testing a null null
is a null guarantee,” because there will surely be some difference between Rx and C
effects, if measured to enough decimal places. A null null is even more restrictive
than a complete null used in weak control.
Weak Control
The complete null is where all the null hypotheses are true. Controlling the Type I
error rate under the complete null is termed weak control.
When there are subgroups, the null hypothesis for each subgroup is that Rx and C have the same effect in that subgroup. So, under the complete null, the biomarker can have an effect (a prognostic effect), but the effects of Rx and C do not differ in any of the subgroups.
Thus, weak control controls the probability of inferring Rx and C have different
effects in at least one subgroup, when in fact Rx and C have the same effect in all the
subgroups.
Suppose there is originally a single primary endpoint E1. If only weak Type I error rate control is required, then one can game the system by artificially introducing a second primary endpoint E2, a co-primary endpoint, so that Rx is approved only if its efficacy relative to C is shown in both endpoints E1 and E2. It would seem that adding E2 could only make getting Rx approved harder, not easier.
Suppose the Statistical Analysis Plan (SAP) is a two-step process, as follows:
Step 1 Test, at the 5% level, the complete null hypothesis that there is no difference
between Rx and C for either endpoint; if the complete null hypothesis is rejected,
then go to Step 2; otherwise Stop.
Step 2 Infer Rx is better than C.
This procedure controls the Type I error rate weakly at 5%, but what is its
incorrect decision rate?
Suppose E2, a clinically meaningless endpoint, is chosen on the basis that it is a
sure bet that efficacy in this endpoint can be proven. Then, at the end of the study,
the complete null will surely be rejected, and Rx is guaranteed to be approved by
this procedure (regardless of what the data indicate). If Rx is in fact slightly worse than C on endpoint E1, then the probability of an incorrect decision can in fact be close to one-half. Clearly, weak control of the Type I error rate may not translate to control of the incorrect decision rate. Requiring strong control ameliorates this concern.
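A simulation sketch of this gaming argument under illustrative settings: here E2 is made a near sure bet, so a Bonferroni min-p test of the complete null rejects essentially always and Rx is "approved" despite being slightly worse on E1 (choosing a weaker E2 that succeeds about half the time would put the incorrect decision rate near one-half):

```python
# Weak control can be gamed by adding a "sure bet" endpoint. Step 1 tests the
# complete null with a Bonferroni min-p test at the 5% level; a huge effect on
# E2 makes rejection all but certain. All settings are purely illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sims, hits = 100, 2000, 0
for _ in range(sims):
    e1_rx = rng.normal(-0.1, 1, n)   # Rx slightly WORSE than C on E1
    e1_c = rng.normal(0.0, 1, n)
    e2_rx = rng.normal(2.0, 1, n)    # sure-bet, clinically meaningless E2
    e2_c = rng.normal(0.0, 1, n)
    p1 = stats.ttest_ind(e1_rx, e1_c).pvalue
    p2 = stats.ttest_ind(e2_rx, e2_c).pvalue
    if min(p1, p2) < 0.05 / 2:       # complete null rejected (weak control)
        hits += 1                    # Step 2: infer Rx better -> incorrect
print(f"Rx 'approved' in {hits / sims:.1%} of trials despite being worse on E1")
```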
Similarly, weak control is insufficient to control the incorrect targeting rate, because it does not control the probability of inferring Rx is better than C for g− patients when Rx and C have the same effect in g−, if it is true that Rx is better than C for g+ patients, for example.
Even with these known limitations, some methods for subgroup identification rely on weak control. The machine learning approach of Lipkovich et al. (2011) and the likelihood ratio testing approach of Jiang et al. (2007) compute their null distributions by permutation which, as explained in Xu and Hsu (2007) and Kaizar et al. (2011), requires the subtle MDJ (marginals-determine-the-joint) assumption even to control the Type I error rate weakly. (The crux of the matter is that permutation generates a null distribution assuming the joint distributions of the observations are identical under Rx and C across all biomarker values, while the complete null only specifies that the marginal distributions are the same.)
MDJ If a subset of the null hypotheses stating that the marginal distributions of the observations under Rx and C are identical are true, then the joint distributions of the observations are identical under Rx and C for that subset as well.
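A minimal sketch of how a permutation null distribution is generated by shuffling treatment labels; its validity as a null distribution rests on the MDJ assumption just stated, which can fail with discrete outcomes and subgroups (the data here are purely illustrative):

```python
# Two-sample permutation test sketch: shuffle treatment labels and recompute
# the statistic to form the reference distribution. Treating this as the null
# distribution implicitly assumes MDJ. Data are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
rx = rng.binomial(1, 0.5, 60)   # binary responses under Rx
c = rng.binomial(1, 0.5, 60)    # binary responses under C
obs = rx.mean() - c.mean()

pooled = np.concatenate([rx, c])
perm_stats = []
for _ in range(5000):
    rng.shuffle(pooled)                      # permute treatment labels
    perm_stats.append(pooled[:60].mean() - pooled[60:].mean())
pval = np.mean(np.abs(perm_stats) >= abs(obs))
print(f"permutation p-value: {pval:.3f}")
```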
Strong Control
Strong control of the Type I error rate means that even if some of the null hypotheses are false, the probability of rejecting at least one true null hypothesis is controlled. Strong control would control the probability of an incorrect decision if the null hypotheses are appropriately formulated.