Introduction
The term clinical trial is preferred over clinical experiment because the
latter may connote disrespect for the value of human life. (Piantadosi, 2005)
Clinical trials are used to develop and test interventions in nearly all areas of
medicine and public health. In many countries, approval for marketing new
drugs hinges on efficacy and safety results from clinical trials. Similar
requirements exist for the marketing of vaccines. The U.S. Food and Drug
Administration (FDA) now requires manufacturers of new or high-risk medical
devices to provide data demonstrating clinical safety and effectiveness (Scott,
2004). Surgical interventions pose unique challenges since surgical
approaches are typically undertaken for patients with a good prognosis and
may not be amenable to randomization or masking investigators and patients
to the intervention, all conditions which can lead to biases. Clinical trials are
useful for demonstrating efficacy and safety of various medical therapies,
preventative measures and diagnostic procedures if the treatment can be
applied uniformly and potential biases controlled.
Many studies have a window of opportunity during which they are most
feasible and will have the greatest impact on clinical practice. For comparative
trials, the window usually exists relatively early in the development of a new
therapy. If the treatment becomes widely accepted or discounted based on
anecdotal experience, it may become impossible to formally test the efficacy
of the procedure. Even when clinicians remain unconvinced of efficacy or
relative safety, patient recruitment can become problematic.
Some important medical advances have been made without the formal
methods of controlled clinical trials, i.e., without randomization, statistical
design, and analysis. Examples include the use of vitamins, insulin, some
antibiotics, and some vaccines.
2. The study subjects have to provide valid observations for the biological
question
1.2 - Summary
In this first lesson, Clinical Trials as Research, we learned to:
References:
Annas GJ and Grodin MA. (1992) The Nazi doctors and the Nuremberg Code:
Human Rights in Human Experimentation. New York: Oxford University Press.
Carter RL, Scheaffer RL, Marks RG. (1986) The role of consulting units in
statistics departments. Am. Stat. 40:260-264.
Friedman, L.M., Furberg, C.D., DeMets, D., Reboussin, D.M., Granger, C.B.
(2015). Chapter 2 Ethical Issues. In: Friedman, L.M., Furberg, C.D., DeMets,
D., Reboussin, D.M., Granger, C.B. Fundamentals of Clinical Trials. 5th ed.
Switzerland: Springer International Publishing. (Notes will refer to Friedman
et al 2015)
Piantadosi, S. (2005) Clinical trials as research, Why clinical trials are
ethical, Contexts for clinical trials. In: Piantadosi, S. Clinical Trials: A
Methodologic Perspective. 2nd ed. Hoboken, NJ: John Wiley and Sons, Inc.
One area of ethical dilemma of physicians and health care workers can be
attributed to the conflicting roles of helping the patient and gaining scientific
knowledge, as stated by Schafer (1982):
In his traditional role of healer, the physician's commitment is exclusively to
his patient. By contrast, in his modern role of scientific investigator, the
physician engaged in medical research or experimentation has a commitment
to promote the acquisition of scientific knowledge.
Clinical trials are only one of several settings in which the physician's duty
extends beyond his responsibility to the individual patient. For example,
vaccinations against communicable disease are promoted by physicians, yet
the individual vaccinated incurs a small risk to benefit the population as a whole.
Triage is another situation where for the sake of maximizing benefit to the
whole, the needs of an individual may not be met.
All clinical investigators should have training in research ethics. The US NIH
website has resources for training in the areas of scientific integrity, data,
publication, peer review, mentor/trainee relationships, collaboration, human
and animal subjects and conflict of interest. Many funding sources, including
the US NIH and NSF require responsible conduct of research training for all
students, trainees, fellows, scholars and faculty utilizing their funds to conduct
research.
Conflict of Interest:
Competence:
At the time of the Nuremberg trial, there were no international standards for
ethical conduct in human experimentation. This resulted in the Nuremberg
Code, or directives for human experimentation, adopted in 1947:
6. The degree of risk for the patient should not exceed the humanitarian
importance of the problem to be solved.
Respect for persons (individual autonomy) means that patients have the
right to decide what should be done for them with respect to their illness
unless the result would be clearly detrimental to others. Respect for persons
means that potential subjects for clinical trials are informed of alternative
therapies and risks and benefits of participation in a particular trial before they
volunteer to participate in that study. Since clinical trials often require
participants to surrender some measure of autonomy in order to be
randomized to treatment and follow the established protocol, these aspects
will be described to the potential subject, along with their freedom to choose to
discontinue the study at any time.
Justice addresses the question of fairly distributing the benefits and burdens
of research. Compensation for injury due to research is an application of
justice. Injustice occurs when benefits are denied without good reason or
when burdens are unduly imposed on particular individuals, such as the poor
or uninsured.
These principles are applied in the requirements for informed consent of
subjects, in assessment of risks and benefits and fair procedures and
outcomes and in the selection of subjects.
The U.S. National Institutes of Health Policies for the Protection of Human
Subjects (1966) established the IRB (Institutional Review Board) as a
mechanism for the protection of human participants in research. In 1981, U.S.
regulations required IRB approval for all drugs or products regulated by the
US Food and Drug Administration (FDA), without regard to the funding source,
the research volunteers, or the location of the study. In 1991, core US DHHS
regulations (45 CFR Part 46, Subpart A) were adopted by most Departments
and Agencies involved in research with human subjects. This Federal Policy
for the Protection of Human Subjects, known as the "Common
Rule," requires Institutional Review Board (IRB) review for all research
funded in whole or in part by the U.S. federal government.
6. The privacy of the participants and the confidentiality of the data are
protected
2. Reports of any serious adverse events in the human subjects when they
occur.
Informed Consent:
The principle of respect for persons implies that each study participant will be
made aware of potential risks, benefits and costs prior to participating in a
clinical study. To document this, study participants (or parents/legally
authorized representatives) sign an informed consent document prior to
participation in a research study. The patient assents to having been informed
of the potential risks and benefits resulting from their participation in the
clinical study, to understanding their treatment alternatives, and to the
voluntary nature of their participation. There are numerous examples of studies in which patients have been
exposed to potentially or definitively harmful treatments without being fully apprised of the risk. The
consent document should be presented without any coercion. Even so, ill or dying patients and their
families are vulnerable, and it is questionable how much technical information about new treatments they
can truly understand, especially when it is presented to them quickly.
In the United States and many other countries, an IRB must evaluate and
approve the informed consent documents prior to beginning a study.
7. The voluntary nature of the study and the possibility of withdrawal at any
time
Study Question:
The study question also involves the choice of the study population. Which
population can answer the question? Does the potential benefit outweigh the
risk to these study subjects? Is the selection just?
Study Sites:
The choice of the study population is also related to the place(s) the trial will
be conducted. There is greater generalizability and potentially faster
enrollment if a trial is conducted in multiple and varied geographic locales.
The concept of justice should be applied: is the location selected due to
prevalent disease and relevance of the results? Or for sponsor convenience,
such as lower cost and fewer administrative and regulatory burdens? Is the
standard of care in this country less than optimal care and thus event rates
higher? What obligations do the trial sponsors have to the participants or to
residents of the country once the trial is complete? Will the treatment be
available in this locale once the trial is complete?
Randomization:
Patients and physicians with firm preferences for treating a particular disease,
even those based on weak evidence, should not participate in a clinical trial
involving that disease. Patients with strong convictions about preferred
treatments are likely to become easily dissatisfied with randomization to a
treatment in the clinical trial. Physicians with strong convictions could bias the
clinical trial in a different direction, especially if they are not blinded to
treatment assignment.
Control Group:
Should the new intervention be compared with the best known therapy or with
placebo? Will a placebo control result in significant harm to subjects? What if
there is no accepted optimal therapy? What if the optimal therapy is very
costly or not available in some locations? The selection of the control group
has many ethical considerations.
Confidentiality:
The U.S. Department of Health and Human Services (HHS) issued the
Standards for Privacy of Individually Identifiable Health Information (the
Privacy Rule) under the Health Insurance Portability and Accountability Act of
1996 (HIPAA) to provide the first comprehensive Federal protection for
the privacy of personal health information. This became effective on April
14, 2003.
While certain provisions of the Rule specifically concern research and may
affect research activities, the Privacy Rule recognizes that the research
community has legitimate needs to use, access, and disclose Protected
Health Information (PHI) to carry out a wide range of health research
protocols and projects. The Privacy Rule protects the privacy of such
information while providing ways in which researchers can access and use
PHI when necessary to conduct research. The DHHS web site
(http://privacyruleandresearch.nih.gov/) should be examined for further
information about HIPAA requirements on research.
2.5 - Conduct
Recruitment:
Monitoring:
During the course of a comparative trial, evidence may become available that
one treatment is superior. Interim statistical analyses may be incorporated
into the study design to provide periodic investigations of treatment superiority
prior to study completion without sacrificing the statistical integrity of the trial
(discussed later in this course). Should patients receiving an inferior treatment
continue in this manner? If there is evidence that a particular type of patient is
unlikely to respond to therapy, should entrance criteria be modified? Is the
adverse experience profile markedly worse for one therapy? Investigators are
required to report such circumstances to their IRB.
Data Integrity:
Authorship:
'Ghost authorship' occurs when people writing a paper are not fully disclosed
(i.e. draft written by contract writer) or when authors are included who did not
actually participate in the research project (for example, an influential name).
Journals combat such deception by asking authors to specify the contribution
of each person listed as an author.
2. Inform each participant about the nature and sponsorship of the project
and intended uses of the data
2.8 - Summary
In this second lesson, Ethics of Clinical Trials, we learned:
Clinical trial design has its roots in classical experimental design, yet has
some different features. The clinical investigator is not able to control as many
sources of variability through design as a laboratory or industrial experimenter.
Human responses to medical treatments display greater variability than
observations from experiments in genetically identical plants and animals or
measuring effects of tightly-controlled physical and chemical processes. And
of course, ethical issues are paramount in clinical research. To study a clinical
response with adequate precision, a trial may require lengthy periods for
patient accrual and follow-up. It is unlikely that all of the study subjects will be
enrolled on the same day. There is opportunity for study volunteers to decide to no longer
participate.
1. State 6 general objectives that will be met with proper trial design.
4. Compare and contrast the following study designs with respect to the
ability of the investigator to minimize bias: Case report or case series,
database analysis, prospective cohort study, case-control study, parallel
design clinical trial, crossover clinical trial.
4. Controls precision
Piantadosi (2005) states that clinical trial design should accomplish the
following:
Prospective studies tend to have fewer design problems and less bias than
retrospective studies, but they are more expensive with respect to time and
cost.
A classic example of a cohort study: the U.S. National Heart, Lung, and Blood
Institute Framingham Heart Study
Temperature and pressure in the chemical experiment are two factors that
comprise a two-way design in which it is of interest to examine various
combinations of temperature and pressure. Some clinical trials may have
a two-way factorial design, such as in oncology where various combinations
of doses of two chemotherapeutic agents comprise the treatments.
An incomplete factorial design may be useful if it is inappropriate to assign
subjects to some of the possible treatment combinations, such as no
treatment (double placebo). We will study factorial designs in a later lesson.
Age Stratum    Treatment A    Treatment B
18 - 30        12             13
31 - 50        23             23
51 - 65        6              7
It is not necessary to have the same number of patients within each age
stratum. We do, however, want to have balance in the number on each
treatment within each age group. This is accomplished by blocking, in this
case, within the age strata. Blocking is a restriction of the randomization
process that results in a balance of the numbers of patients on each treatment
after a prescribed number of randomizations. For example, blocks of 4 within
these age strata would mean that after 4, 8, 12, etc. patients in a particular age
group had entered the study, the numbers assigned to each treatment within
that stratum would be equal.
Even ineffective treatments can appear beneficial in some patients. This may
be due to random fluctuations, or variability in the disease. If, however, the
improvement is due to the patient's expectation of a positive response, this is
called a "placebo effect". This is especially problematic when the outcome is
subjective, such as pain or symptom assessment. Placebo effect is widely
recognized and must be removed in any clinical trial. For example, rather than
constructing a nonrandomized trial in which all patients receive an
experimental therapy, it is better to randomize patients to receive either the
experimental therapy or a placebo. A true placebo is an inert or inactive
treatment that mimics the route of administration of the real treatment, e.g., a
sugar pill.
Placebos are not acceptable ethically in many situations, e.g., in surgical
trials. (Although there have been instances where 'sham' surgical procedures
took place as the 'placebo' control.) When an accepted treatment already
exists for a serious illness such as cancer, the control must be an active
treatment. In other situations, a true placebo is not physically possible to
attain. For example, a few trials investigating dimethyl sulfoxide (DMSO) for
providing muscle pain relief were conducted in the 1970s and 1980s. DMSO
is rubbed onto the area of muscle pain, but leaves a garlicky taste in the
mouth, so it was difficult to develop a placebo.
Validity
External validity in a human trial refers to how well study results can be
generalized to a broader population. External validity is irrelevant if internal
validity is low. External validity in randomized clinical trials is enhanced by
using broad eligibility criteria when recruiting patients.
Large simple and pragmatic trials emphasize external validity. A large simple trial
attempts to discover small advantages of a treatment that is expected to be used in a
large population. Large numbers of subjects are enrolled in a study with simplified
design and management. There is an implicit assumption that the treatment effect is
similar for all subjects with the simplified data collection. In a similar vein,
a pragmatic trial emphasizes the effect of a treatment in practices outside academic
medical centers and involves a broad range of clinical practices.
Studies of equivalency and noninferiority have different objectives than the usual trial
which is designed to demonstrate superiority of a new treatment to a control. A study to
demonstrate non-inferiority aims to show that a new treatment is not worse than an
accepted treatment in terms of the primary response variable by more than a pre-
specified margin. A study to demonstrate equivalence has the objective of
demonstrating the response to the new treatment is within a prespecified margin in both
directions. We will learn more about these studies when we explore sample size
calculations.
3.4 - Clinical Trial Phases
When a drug, procedure, or treatment appears safe and effective based on
preclinical studies, it can be considered for trials in humans. Clinical studies of
experimental drugs, procedures, or treatments in humans have been
classified into four phases (Phase I, Phase II, Phase III, and Phase IV) based
on the terminology used when pharmaceutical companies interact with the
U.S. FDA. Greater numbers of patients are assigned to treatment in each
successive phase.
Phase I trials investigate the effects of various dose levels on humans. The
studies are usually done in a small number of volunteers (sometimes persons
without the disease of interest or patients with few remaining treatment
options) who are closely monitored in a clinical setting. The purpose is to
determine a safe dosage range and to identify any common side effects or
readily apparent safety concerns. Data may be collected to provide a
description of the pharmacokinetics and pharmacodynamics of the compound,
estimate the maximum tolerated dose (MTD), or evaluate the effects of
multiple dose levels. Many trials in the early stage of therapy development
either investigate treatment mechanism (TM) or incorporate dose-finding (DF)
strategies.
A Phase III trial is a rigorous clinical trial with randomization, one or more
control groups and definitive clinical endpoints. Phase III trials are often multi-
center, accumulating the experience of thousands of patients. Phase III trials
address questions of comparative treatment efficacy (CTE). A CTE
trial involves a placebo and/or active control group so that precise and valid
estimates of differences in clinical outcomes attributable to the investigational
therapy can be assessed.
If things go well during Phase III, the company with the license for the
compound will submit an application for approval to market the drug. U.S.
FDA approval hinges on adequate and well-controlled pivotal Phase III
studies that are convincing of safety and efficacy.
The terminology of phase I, II, III, and IV trials does not work well for non-
pharmacologic treatments and does not account for translational trials.
Some studies performed prior to large scale clinical trials are characterized
as translational studies. Translational studies have as their primary outcome
a biological measurement or target that has been derived from an accepted
model of the disease process. The results of the translational study may
provide evidence of a mechanism of action for a compound. Target validation
can be an objective of such a study. Large effects on the target are sought.
For example, a large change in the level of a protein, or the activity of an
enzyme might support therapeutic activity of a compound. There is an
understanding that translational work may cycle from preclinical lab to a
clinical setting and back again. Although the translational studies have a
written protocol, the treatment may be modified during the study. The protocol
should clearly define what would be considered lack of effect and the next
experimental step for any possible outcome of the trial.
Vaccine investigations are a type of primary prevention trial. They require large
numbers of patients and are very costly because of the numbers and the
length of follow-up that is required.
Every clinical trial experiences violations of the protocol. Some violations are
due to differences in interpretation, some are due to carelessness, and some
are due to unforeseen circumstances. Some protocol deviations are
inconsequential but others can affect the validity of the trial. For instance, a
patient might be unaware of a condition that is present in its early or latent
stage, or a patient may mislead a researcher intentionally, thinking they will
receive special treatment from participating in a study; both result in
violations of the patient exclusion criteria established in the research protocol.
Protocol amendments are common as a long-term multi-center study
progresses. The most serious violations are those which may affect the
conclusions of the study.
3.7 - Summary
In this lesson, among other things, we learned:
to compare and contrast the following study designs with respect to the
ability of the investigator to minimize bias: Case report or case series,
database analysis, prospective cohort study, case-control study, parallel
design clinical trial, crossover clinical trial.
Systematic error or bias refers to deviations that are not due to chance alone.
The simplest example occurs with a measuring device that is improperly
calibrated so that it consistently overestimates (or underestimates) the
measurements by X units.
Bias, on the other hand, has a net direction and magnitude so that averaging
over a large number of observations does not eliminate its effect. In fact, bias
can be large enough to invalidate any conclusions. Increasing the sample size
is not going to help. In human studies, bias can be subtle and difficult to
detect. Even the suspicion of bias can render judgment that a study is invalid.
Thus, the design of clinical trials focuses on removing known biases.
Random error corresponds to imprecision, and bias to inaccuracy.
2. State how the significance level and power of a statistical test are
related to random error.
If the data approximately follow a normal distribution or are from large enough
samples, then a two-sample t test is appropriate for comparing groups A and
B, testing H0: μA = μB against H1: μA ≠ μB. With observed mean changes of 7.3
and 4.8 and an estimated standard error of 1.2 for their difference, the
observed test statistic is

\( t_{obs} = (7.3 - 4.8)/1.2 = 2.1 \)
Each t value has associated probabilities. In this case, we want to know the
probability of observing a t value as extreme or more extreme than the t value
actually observed, if the null hypothesis is true. This is the p-value. At the
completion of the study, a statistical test is performed and its corresponding p-
value calculated. If the p-value < α, then H0 is rejected in favor of H1.
Two types of errors can be made in testing hypotheses: rejecting the null
hypothesis when it is true or failing to reject the null hypothesis when it is
false. The probability of making a Type I error, represented by α (the
significance level), is determined by the investigator prior to the onset of the
study. Typically, α is set at a low value, say 0.01 or 0.05.
The possible outcomes of the test are summarized below:

                     H0 is true                    H0 is false
Reject H0            Type I error (probability α)  Correct decision (power = 1 − β)
Fail to reject H0    Correct decision              Type II error (probability β)
In our example, the p-value = [probability that |t| > 2.1] = 0.04
Thus, the null hypothesis of equal mean change in the two populations is
rejected at the 0.05 significance level. The treatments were different in the
mean change in serum cholesterol at 8 weeks.
Note that β (the probability of not rejecting H0 when it is false) did not play a
role in the test of hypothesis.
The importance of β came into play during the design phase when the
investigator attempted to determine an appropriate sample size for the study.
To do so, the investigator had to decide on the effect size of interest, i.e., a
clinically meaningful difference between groups A and B in average change in
cholesterol at 8 weeks. The statistician cannot determine this but can help the
researcher decide whether he has the resources to have a reasonable chance
of observing the desired effect or should rethink his proposed study design.
The sample size should be determined such that there exists good statistical
power (β = 0.1 or 0.2) for detecting this effect size with a test of hypothesis
that has significance level α.
A sample size formula that can be used for a two-sided, two-sample test with
α = 0.05 and β = 0.1 (90% statistical power) is:

\( n_A = n_B = 21\sigma^2/\Delta^2 \)

where σ is the common standard deviation and Δ is the effect size. For
example, with σ² = 16 and Δ² = 9,

\( n_A = n_B = 21\sigma^2/\Delta^2 = (21 \times 16)/9 = 37 \)
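As a quick numeric check of this shortcut, here is a Python sketch using scipy (the course itself uses SAS; the function name is our own):

```python
from scipy.stats import norm

def n_per_group(sigma, delta, alpha=0.05, power=0.90):
    """Per-group n for a two-sided, two-sample z test with equal allocation."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # 1.28 for 90% power
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2

# 2 * (1.96 + 1.28)^2 is approximately 21, which is where the 21 comes from:
print(n_per_group(sigma=4, delta=3))  # ~37.4; the notes round this to 37
```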
Many studies suffer from low statistical power (large Type II error) because the
investigators do not perform sample size calculations.
If a study has very large sample sizes, then it may yield a statistically
significant result without any clinical meaning. Suppose in the serum
cholesterol example that \( \bar{x}_A = 7.3 \) and \( \bar{x}_B = 7.1 \) mg/dl,
with nA = nB = 5,000. The two-sample t test may yield a p-value = 0.001,
but \( \bar{x}_A - \bar{x}_B = 7.3 - 7.1 = 0.2 \) mg/dl is not clinically
interesting.
Confidence Intervals
The 95% confidence interval for the difference in mean change between the
two treatment groups is:

\( 2.5 \pm (1.96 \times 1.2) = [0.1, 4.9] \)
Note that the 95% confidence interval does not contain 0, which is consistent
with the results of the 0.05-level hypothesis test (p-value = 0.04). 'No
difference' is not a plausible value for the difference between the treatments.
Notice also that the length of the confidence interval depends on the standard
error. The standard error decreases as the sample size increases, so the
confidence interval gets narrower as the sample size increases (hence,
greater precision).
1. Selection bias
5. Assessment bias
1. Selection Bias
The estimates of the response from the sample are clearly biased below the
population values. However, the observed difference between treatment and
control is of the same magnitude as that in the population. In other words, it
could be that the observed treatment difference accurately reflects the population
difference, even though the observations within the control and treatment
groups are biased.
Post-entry exclusion bias can occur when the exclusion criteria for subjects
are modified after examination of some or all of the data. Some enrolled
subjects may be recategorized as ineligible and removed from the study. In
the past, this may have been done for the purposes of manufacturing
statistically significant results, but would be regarded as unethical practice
now.
Bias due to selective loss of data is related to post-entry exclusion bias. In this
case, data from selected subjects are eliminated from the statistical analyses.
Protocol violations (including adding on other medications, changing
medications or withdrawal from therapy) and other situations may cause an
investigator to request an analysis using only the data from those who adhered
to the protocol or who completed the study on their assigned therapy.
The latter two types of biases can be extreme. Therefore, statisticians prefer
that intention-to-treat analyses be performed as the main statistical analysis.
5. Assessment bias
As discussed earlier, clinical studies that rely on patient self-assessment or
physician assessment of patient status are susceptible to assessment bias. In
some circumstances, such as in measuring pain or symptoms, there are no
alternatives, so attempts should be made to be as objective as possible and
invoke randomization and blinding. What is a mild cough for one person might
be characterized as a moderate cough by another patient. Not knowing
whether or not they received the treatment (blinding) when making these
subjective evaluations will help to minimize this self-assessment or
assessment bias.
\( s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2 \)

and

\( v^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2 \)
4.4 - Summary
In this lesson, among other things, we learned:
Check to see if there are any homework problems associated with this lesson.
The endpoints (or outcomes), determined for each study participant, are the
quantitative measurements required by the objectives. In the Dansinger
weight loss study, the primary endpoint was identified to be mean absolute
change from baseline weight at 1 year. In a cancer chemotherapy trial the
clinical objective is usually improved survival. Survival time is recorded for
each patient; the primary outcome reported may be median survival time or it
could be five-year survival.
"Hard" endpoints are well-defined in the study protocol, definitive with respect
to the disease process, and require no subjectivity. "Soft" endpoints are those
that do not relate strongly to the disease process or require subjective
assessments by investigators and/or patients. Some endpoints fall between
these two classifications. For example: the grading of x-rays by radiologists
and the grading of cell and tissue lesions/tumors by pathologists. There is
some degree of subjectivity, but they are valid and reliable endpoints in most
settings.
This lesson will help to differentiate between these types of objectives and
endpoints. Ready, let's get started!
5.1 - Endpoints
The endpoints used in a clinical trial must correspond to the scientific
objectives of the study and the methods of outcome assessment should be
accurate (free of bias).
A wide variety of endpoints are used in clinical trials, as displayed below.
Some endpoints are assessed many times during the study, leading to
repeated measurements.
Event times often are useful endpoints in clinical trials. Examples include
survival time from onset of diagnosis, time until progression from one stage of
disease to another, and time from surgery until hospital discharge. In each
case time is measured from study entry until the event occurs. With an
endpoint that is based on an event time, there always is the chance
of censoring. An event time is censored if there is some amount of follow-up
on a subject, but the event is not observed because of loss-to-follow-up, death
from a cause other than the trial endpoint, study termination, and other
reasons unrelated to the endpoint of interest. This is known as right censoring
and occurs frequently in studies of survival.
Right-censoring example
Consider the table above which displays time until infection for Patients 1-6. In
some cases the event did not occur: Patient 1 (from the top) was followed for a
year and was censored at the end of the study. The second patient
experienced an infection at approximately 325 days. Patients 3 and 6 dropped
out of the study and were censored when this occurred.
Left censoring occurs when the initiation time for the subject, such as time of
diagnosis, is unknown. Interval censoring occurs when the subject is not
followed for a period of time during the trial and it is unknown if the event
occurred during that period.
There are three types of right censoring that are described in the statistical
literature.
Type I censoring occurs when all subjects are scheduled to begin the study at
the same time and end the study at the same time. This type of censoring is
common in laboratory animal experiments, but unlikely in human trials.
Type II censoring occurs when all subjects begin the study at the same time
and the study is terminated when a predetermined proportion of subjects have
experienced the event.
Type III censoring occurs when the censoring is random, which is the case in
clinical trials because of staggered entry (not every patient enters the study on
the first day) and unequal follow-up on subjects.
Statistical methods appropriate for event time data, survival analyses, do not
discard the right-censored observations. Instead, the methods account for the
knowledge that the event did not occur in a subject up to the censoring time.
Survival methods include life table analysis, Kaplan-Meier survival curves,
logrank and Wilcoxon tests, and proportional hazards regression (more
discussion on these in a later lesson).
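For a flavor of how right-censored observations enter such an analysis, here is a minimal Kaplan-Meier sketch using the Python lifelines package (an assumption on our part; the course itself uses SAS, and the durations and event indicators below are made-up illustration values):

```python
from lifelines import KaplanMeierFitter

durations = [365, 325, 150, 400, 365, 210]  # days of follow-up per patient
event_observed = [0, 1, 0, 1, 1, 0]         # 1 = event observed, 0 = censored

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=event_observed)
print(kmf.survival_function_)  # censored times are retained, not discarded
```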
At first glance, death primarily due to the disease appears to be the most
appropriate. It is, however, susceptible to bias because the assumption of
independent causes of death may not be valid. For example, subjects with a
life-threatening cancer are prone to death due to myocardial infarction. It can
also be very difficult to determine the exact cause of death.
3. It yields the same statistical inference as that for the definitive endpoint
Disease >>> Surrogate Endpoint >>> Definitive Endpoint
The disease affects the surrogate endpoint, which in turn affects the definitive
endpoints.
Term: Meaning

Treatment mechanism (TM): Early developmental trial that investigates the
mechanism of treatment effect, e.g., a pharmacokinetics study of the
distribution and elimination of the drug from the human body
Phase I: Imprecise term for dose-ranging studies
Dose-escalation: Design or component of a design that specifies methods for
increases in dose for subsequent subjects
Dose-ranging: Design that tests some or all of a prespecified set of doses
(fixed design points)
Dose-titration: Design that titrates dose to a prespecified optimum based on
biological or clinical considerations
Dose-finding (DF) trials are Phase I studies with the objective of determining
the optimal biological dose (OBD) of a drug. In order to determine the dose
with highest potential for efficacy in the patient population that still meets
safety criteria, dose-finding studies are typically conducted by
administering sequentially rising doses to successive groups of individuals.
Such studies may be conducted in healthy volunteers or in patients with
disease.
An optimal dose can be selected on the basis of efficacy alone, such as when
a minimum effective dose (MED) is chosen for a pain-relieving medication,
and defined as the dose which eliminates mild-to-moderate pain in 80% of trial
participants. In another case, the optimal dose might be selected as the
highest dose that is associated with serious side effects in no more than 1 of
20 patients. This would be a maximum nontoxic dose (MND). In cancer
therapeutics, the optimal dose for a cytotoxic drug designed to shrink tumors
could be defined as the level that yields serious but reversible toxicity in no
more than 30% of the patients. This is a maximum tolerated dose (MTD). Care
in defining the conditions for optimality is critical to a dose-finding study.
Most DF trials are sequential studies such that number of subjects is itself an
outcome of the trial. Convincing evidence characterizing the relationship of
dose and safety can be obtained after studying a small set of patients. Hence
sample size is not a major concern in DF trials.
Most likely, your answer is no because you would not want to risk being
assigned to the highest dose level of this unproven drug as your first
treatment. There is a principle here: it is unethical to treat humans at high
doses of a drug without any prior knowledge of their responses at lower
levels. Furthermore, ethics compel a design that minimizes the numbers of
patients treated with both low ineffective doses and high toxic doses.
5.5 - Summary
In this lesson, among other things, we learned how to:
Look for any homework assignments listed for this lesson in the ANGEL
course site.
Usually, sample size is calculated with respect to two circumstances. The first
involves precision for an estimator, e.g., requiring a 95% confidence interval
for the population mean to be within δ units. The second involves statistical
power for hypothesis testing, e.g., requiring 0.80 or 0.90 statistical power
(1 − β) for a hypothesis test when the significance level (α) is 0.05 and the
effect size (the clinically meaningful effect) is Δ units.
The formulae for many sample size calculations will involve percentiles from
the standard normal distribution. The graph below illustrates the
2.5th percentile and the 97.5th percentile.
Fig. 1 Standard normal distribution centered on zero.
For a two-sided hypothesis test with significance level α and statistical power
1 − β, the percentiles of interest are \( z_{1-\alpha/2} \) and \( z_{1-\beta} \).

\( z_{0.995} = 2.58,\ z_{0.99} = 2.33,\ z_{0.975} = 1.96,\ z_{0.95} = 1.65,\ z_{0.90} = 1.28,\ z_{0.80} = 0.84 \)
Also, we may base the sample size calculation on a t statistic for a hypothesis
test, which assumes an exact normal distribution of the outcome variable
when it only may be approximately normal.
Estimate the sample size required for a confidence interval for p for
given α and δ, using normal approximation and Fisher's exact methods.
Estimate the sample size required for a confidence interval for μ for
given α and δ, using normal approximation when the sample size is
relatively large.
References:
Friedman, Furberg, DeMets, Reboussin and Granger. (2015) Sample size. In:
FFDRG. Fundamentals of Clinical Trials. 5th ed. Switzerland: Springer.
Piantadosi, S. (2005) Sample size and power. In: Piantadosi, S.
Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ:
John Wiley and Sons, Inc.
The simplest example occurs when the outcome response is binary (success
or failure). Let p denote the true (but unknown) proportion of successes in the
population that will be estimated from a sample.
\( \hat{p} = r/n \)

If the sample size is large enough, then the 100(1 − α)% confidence interval
can be approximated as:

\( \hat{p} \pm z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n} \)
Prior to the conduct of the study, however, the point estimate is undetermined
so that an educated guess is necessary for the purposes of a sample size
calculation.
\( n = z_{1-\alpha/2}^{2}\,p(1-p)/\delta^2 \)

where δ is the desired half-width of the confidence interval.
If you want the confidence interval to be tighter, remember that splitting
width of the confidence interval in half will involve quadrupling the number of
subjects in the sample size!
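A small Python sketch of this calculation (scipy assumed; the p = 0.5 and δ values are illustrative):

```python
from math import ceil
from scipy.stats import norm

def n_for_proportion_ci(p, delta, alpha=0.05):
    """n so the normal-approximation CI for p has half-width delta."""
    z = norm.ppf(1 - alpha / 2)
    return ceil(z ** 2 * p * (1 - p) / delta ** 2)

print(n_for_proportion_ci(p=0.5, delta=0.10))  # 97
print(n_for_proportion_ci(p=0.5, delta=0.05))  # 385: half the width, ~4x the n
```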
\( n\hat{p}(1-\hat{p}) \ge 5 \)
In the exact binomial method, the lower 100(α/2)% confidence limit for p is
determined as the value pL that satisfies

\( \alpha/2 = \sum_{k=r}^{n} C(n,k)\,p_L^{\,k}(1-p_L)^{n-k} \)

and the upper limit pU satisfies

\( \alpha/2 = \sum_{k=0}^{r} C(n,k)\,p_U^{\,k}(1-p_U)^{n-k} \)
SAS PROC FREQ provides the exact and asymptotic 100(1 − α)% confidence
intervals for a binomial proportion, p.
\( \hat{p} = 0.16 \)

\( n\hat{p}(1-\hat{p}) = 19(0.16)(0.84) = 2.55 < 5 \)
The 95% confidence interval for p, based on the exact method, is [0.03, 0.40].
The 95% confidence interval for p, based on the normal approximation, is [-
0.01, 0.32], which is modified to [0.00, 0.32] because p represents the
probability of success that is supposed to be restricted to lie within the [0, 1]
interval. Even with the correction to the lower endpoint, the confidence interval
based on the normal approximation does not appear to be very accurate in
this example.
Modify the SAS program above to reflect 11 successes out of 75 trials, then
run the program. Do the results round to (0.08, 0.25) for the 95% exact
confidence limits?
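For those without SAS, a hypothetical Python cross-check of the exact (Clopper-Pearson) limits, built from the standard beta-quantile form of the binomial sums above:

```python
from scipy.stats import beta

def exact_ci(r, n, alpha=0.05):
    """Clopper-Pearson exact confidence interval for a binomial proportion."""
    lower = beta.ppf(alpha / 2, r, n - r + 1) if r > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, r + 1, n - r) if r < n else 1.0
    return lower, upper

print(exact_ci(11, 75))  # ~(0.076, 0.247), which rounds to (0.08, 0.25)
```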
Using the exact confidence interval for a binomial proportion is the better
option when the normal approximation to the binomial is questionable, i.e.,
when \( n\hat{p}(1-\hat{p}) < 5 \).
SAS PROC FREQ (trial-and-error) indicates that the exact one-sided 95%
upper confidence limit for p, when 0 out of 14 successes are observed, is
0.19. Thus, if the treatment fails in each of the first 14 patients, then the study
is terminated.
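When r = 0, the exact upper limit also has a simple closed form, since the defining equation reduces to (1 − pU)^n = α. This shortcut is standard, though the notes obtain the value by trial and error:

```python
alpha, n = 0.05, 14
p_upper = 1 - alpha ** (1 / n)  # solves (1 - pU)^n = alpha for pU
print(round(p_upper, 2))        # 0.19, matching the PROC FREQ result
```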
Work out your answer first, then compare it with the solution.
What is the upper 95% one-sided confidence limit for p when you have seen
no successes in 5 trials?
Work out your answer first, then compare it with the solution.
Here is another one to try... How many straight failures would it take to rule out
a 30% success rate?
When the outcome is a continuous measurement, the 100(1 − α)% confidence
interval for the population mean is

\( \bar{Y} \pm z_{1-\alpha/2}\,(\sigma/\sqrt{n}) \)

where \( \bar{Y} \) denotes the sample mean. If the desired half-width of the
interval is δ, then

\( n = z_{1-\alpha/2}^{2}\,\sigma^2/\delta^2 \)
For example, the necessary sample size for estimating the mean reduction in
diastolic blood pressure, where σ = 5 mm Hg and δ = 1 mm Hg, is n =
(1.96)²(5)²/(1)² = 96.
\( Z = (\bar{Y}_1 - \bar{Y}_2)\big/\,\sigma\sqrt{1/n_1 + 1/n_2} \)

which follows a standard normal distribution when the null hypothesis is true.
If the alternative hypothesis is two-sided, i.e., H1: μ1 − μ2 ≠ 0, then the null
hypothesis is rejected for large values of |Z|.
Suppose we let AR = n1/n2 denote the allocation ratio (in most cases we
will assign AR = 1 to get equal sample sizes). If we wish to have a large
enough sample size to detect an effect size Δ with a two-sided, α-significance
level test with 100(1 − β)% statistical power, then

\( n_2 = \left(\frac{AR+1}{AR}\right)(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2/\Delta^2 \)

and n1 = AR·n2.
Note this formula matches the sample size formula in our FFDRG text on p.
180, assuming equal allocation to the two treatment groups and multiplying
the result here by 2 to get 2N, which FFDRG use to denote the total sample
size.
Although this sample size formula assumes that the standard deviation is
known so that a z test can be applied, it works relatively well when the
standard deviation must be estimated and a t test applied. A preliminary guess
of σ must be available, however, either from a small pilot study or a report in
the literature. For smaller sample sizes (n1 ≤ 30, n2 ≤ 30), percentiles from a t
distribution can be substituted, although this results in both sides of the
formula involving n2 so that it must be solved iteratively:

\( n_2 = \left(\frac{AR+1}{AR}\right)(t_{n_1+n_2-2,\,1-\alpha/2} + t_{n_1+n_2-2,\,1-\beta})^2\,\sigma^2/\Delta^2 \)
Thus, the total sample size required is n1 + n2 = 189 + 189 = 378. SAS
Example (7.4_sample_size__normal_.sas): This is a program that illustrates
the use of PROC POWER to calculate sample size when comparing two
normal means.
Notice that the 2:1 allocation, when compared to the 1:1 allocation, requires
an overall larger sample size (429 versus 382).
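Here is a Python sketch of the z-based formula with an allocation ratio (scipy assumed; σ/Δ = 3 is an illustrative guess chosen to land near the example above, and SAS PROC POWER's t-based answers differ by a few subjects):

```python
from math import ceil
from scipy.stats import norm

def two_sample_sizes(sigma, delta, alpha=0.05, power=0.90, ar=1.0):
    """(n1, n2) for a two-sided, two-sample z test with ar = n1/n2."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    n2 = ((ar + 1) / ar) * z ** 2 * sigma ** 2 / delta ** 2
    return ceil(ar * n2), ceil(n2)

print(two_sample_sizes(sigma=3, delta=1, ar=1))  # (190, 190): total 380
print(two_sample_sizes(sigma=3, delta=1, ar=2))  # (284, 142): total 426
```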
Try it yourself!
Work out your answer first, then compare it with the solution.
Here is another one to try... How many subjects are needed to have 80%
power in testing equality of two means when subjects were allocated 2:1,
using an α = 0.05 two-sided test? The standard deviation is 10 and the
hypothesized difference in means is 5.
6A.7 - Example 2: Comparative
Treatment Efficacy Studies
What if the primary response variable is binary?
When the outcome in a CTE trial is a binary response and the objective is to
compare the two groups with respect to the proportion of success, the results
can be expressed in a 2 2 table as
           Group #1    Group #2
Success    r1          r2
Failure    n1 - r1     n2 - r2
There are a variety of methods for performing the statistical test of the null
hypothesis H0: p1 = p2, such as a z test using a normal approximation, a χ² test
(basically, the square of the z test), a χ² test with continuity correction, and
Fisher's exact test.
The normal and χ² approximations for comparing two proportions are relatively
accurate when these conditions are met:

\( n_1(r_1+r_2)/(n_1+n_2) \ge 5,\quad n_2(r_1+r_2)/(n_1+n_2) \ge 5, \)
\( n_1(n_1+n_2-r_1-r_2)/(n_1+n_2) \ge 5,\quad n_2(n_1+n_2-r_1-r_2)/(n_1+n_2) \ge 5 \)
Basically when the expected number in each cell is greater than 5, the normal
or Chi Square approximation is useful.
Otherwise, Fisher's exact test is recommended. All of these tests are available
in SAS PROC FREQ and will be discussed later in the course.
A sample size formula for comparing the proportions p1 and p2 using the
normal approximation is given below:
\( n_2 = \left(\frac{AR+1}{AR}\right)(z_{1-\alpha/2} + z_{1-\beta})^2\,\bar{p}(1-\bar{p})/(p_1-p_2)^2 \)

where p1 − p2 represents the effect size and

\( \bar{p} = (AR\,p_1 + p_2)/(AR + 1) \)
(Note this formula is the same as p. 173 in our text FFDRG if you assume the
allocation ratio is 1:1 and double the sample size here to get the total sample
size 2N as calculated in FFDRG.)
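A sketch of this normal-approximation formula in Python (scipy assumed; the 60% vs. 75% success rates are made-up illustration values):

```python
from math import ceil
from scipy.stats import norm

def n2_binary(p1, p2, alpha=0.05, power=0.90, ar=1.0):
    """Group-2 size for comparing two proportions (normal approximation)."""
    p_bar = (ar * p1 + p2) / (ar + 1)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(((ar + 1) / ar) * z ** 2 * p_bar * (1 - p_bar) / (p1 - p2) ** 2)

print(n2_binary(p1=0.60, p2=0.75))  # ~205 per group with 1:1 allocation
```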
SAS PROC POWER for Fisher's exact test yields n1 = 85 and n2 = 85 for AR =
1, and n1 = 171 and n2 = 57 for AR = 3.
Work out your answer first, then compare it with the solution.
What would be the sample size required to have 80% power to detect that a
new therapy has a significantly different success rate than the standard
therapy success rate of 30%, if it was expected that the new therapy would
result in at least 40% successes? Use a two-sided test with 0.05 significance
level.
The hazard ratio is defined as the ratio of two hazard functions, λ1(t) and λ2(t),
corresponding to two treatment groups. Typically, we assume proportional
hazards, i.e., ψ = λ1(t)/λ2(t) is a constant function independent of time. The
graphs on the next two slides illustrate the concept of proportional hazards.
A hazard function may be constant, increasing, or decreasing over time, or
even be a more complex function of time. In trials in which survival time is the
outcome, an increasing hazard function indicates that the instantaneous risk
of death increases throughout the trial.
The required total number of events is

\( E = \left(\frac{(AR+1)^2}{AR}\right)(z_{1-\alpha/2} + z_{1-\beta})^2\big/(\log_e \psi)^2 \)
(Note this formula above matches the FFDRG text p. 185 simple formula, if it is
assumed that all participants will have an event. However, we most often have
censored data, that is, a number of participants who do not experience the
event before the trial ends.)
Since we do not expect all persons in the trial to experience an event, the
sample size must be larger than the required number of events.
Suppose that p1 and p2 represent the anticipated event rates in the two
treatment groups. Then the sample sizes can be determined from n2 = E/
(AR·p1 + p2) and n1 = AR·n2.
Example
\( E = (4)(1.96 + 1.28)^2/\{\log_e(2.29)\}^2 = 62 \)
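The same calculation in a short Python sketch (scipy assumed; the function name is our own):

```python
from math import ceil, log
from scipy.stats import norm

def events_logrank(psi, alpha=0.05, power=0.90, ar=1.0):
    """Total events needed to detect hazard ratio psi with the logrank test."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return ceil(((ar + 1) ** 2 / ar) * z ** 2 / log(psi) ** 2)

print(events_logrank(psi=2.29))  # 62, matching the worked example
```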
SAS PROC POWER for the logrank test requires information on the accrual
time and the follow-up time. It assumes that if the accrual (recruitment) period
is of duration T1 and the follow-up time is of duration T2, then the total study
time is of duration T1 + T2. It assumes, however, if a patient is recruited at time
T1/2, then the follow-up period for that patient is T1/2 + T2instead of T2. This
assumption may be reasonable for observational studies, but not for clinical
trials in which follow-up on each patient is terminated when the patient
reaches time T2. Therefore, for a clinical trial situation, set accrual time in SAS
PROC POWER equal to a very small positive number. For the given example,
SAS PROC POWER yields n1 = 109 and n2 = 109.
If each of the m subjects in a cohort has event rate λ, the number of events D
follows a Poisson distribution:

\( \Pr[D = d] = (\lambda m)^d \exp(-\lambda m)/d! \)

We require the probability of observing at least one event,

\( \pi = \Pr[D \ge 1] = 1 - \Pr[D = 0] = 1 - \exp(-\lambda m) \)

to be relatively large. With respect to the cohort size, this means that m should
be selected such that

\( m = -\log_e(1 - \pi)/\lambda \)
Example
(note the value π in this problem is a probability, not quite the same as that
we use in calculating power)
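A two-line Python illustration of this cohort-size formula (the event rate λ = 0.05 per subject and π = 0.90 are made-up values):

```python
from math import ceil, log

lam, pi = 0.05, 0.90          # assumed per-subject event rate and target probability
m = -log(1 - pi) / lam        # cohort size so P(at least one event) = pi
print(ceil(m))                # 47 subjects
```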
6A.10 - Adjustment Factors for Sample
Size Calculations
When calculating a sample size, we may need to adjust our calculations due
to multiple primary comparisons or for nonadherence to therapy.
If there is more than one primary outcome variable (for example, co-
primary outcomes) or more than one primary comparison (for example,
3 treatment groups), then the significance level should be adjusted to
account for the multiple comparisons in order not to inflate the overall false-
positive rate.
For example, suppose a clinical trial will involve two treatment groups and a
placebo group. The investigator may decide that there are two primary
comparisons of interest, namely, each treatment group compared to placebo.
The simplest adjustment to the significance level for each test is the
Bonferroni correction, which uses α/2 instead of α.
Thus, a further adjustment to the sample size estimate may be made based
on the anticipated drop-out and drop-in rates in each arm (see Wittes,
2002). A similar formula is on p. 179 of FFDRG.
Suppose a study has two treatment groups and will compare test therapy to
placebo. With only one primary comparison, we do not need to adjust the
significance level for multiple comparisons. Suppose that the sample size for
a certain power, significance level and clinically important difference works to
be 200 participants/group or 400 total.
These are relatively simple calculations to introduce the idea of adjusting for
noncompliance as well as for multiple comparisons. More complicated
processes can be modeled.
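As a sketch of the kind of adjustment described above, here is the n/(1 − R0 − RI)² style inflation from Wittes/FFDRG in Python (the drop-out and drop-in rates are illustrative assumptions):

```python
from math import ceil

def adjust_for_nonadherence(n, drop_out, drop_in):
    """Inflate a per-group sample size for crossover between arms."""
    return ceil(n / (1 - drop_out - drop_in) ** 2)

# 200 per group with 10% drop-out and 5% drop-in (assumed rates):
print(adjust_for_nonadherence(200, drop_out=0.10, drop_in=0.05))  # 277 per group
```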
Finally, when estimating a sample size for a study, an iterative process may be
followed (adapted from Wittes, 2002):
2. What is the desired type I error rate and power? If more than one primary
outcome or comparison, make required adjustments to Type 1 error.
4. If the study is measuring time to failure, how long is the followup period?
What assumptions should be made about recruitment?
7. Select a sample size. Plot power curves as the parameters range over
reasonable values.
8. Iterate as needed.
Which of these adjustments (or others, such as modeling dropout rates that
are not independent of outcome) is important for a particular study depends
on the study objectives. Not only must we consider whether there is more than
one primary outcome or multiple primary comparisons, we must also consider the
nature of the trial. For example, if the study results are headed to a regulatory
agency, using a primary intention-to-treat analysis, it is important to
demonstrate an effect of a certain magnitude. Adjusting the sample size to
account for non-adherence makes sense. On the other hand, in a comparative
effectiveness study, the objective may be to estimate the difference in effect
when the intervention is prescribed vs the control, regardless of adherence. In
this situation, the dilution of effect due to nonadherence may be of little
concern.
6A.11 - Summary
In this lesson, among other things, we learned to:
Estimate the sample size required for a confidence interval for p for
given and , using normal approximation and Fisher's exact methods
Estimate the sample size required for a confidence interval for for
given and , using normal approximation when the sample size is
relatively large
This week we continue exploring the issues of sample size and power, this
time with regard to the differing purposes of clinical trials. Often the objective
of the trial is to establish that a therapy is efficacious, but what is the proper
control group? Can superiority to placebo be clearly established when there
are other effective therapies on the market? These questions lead to special
considerations based on whether the trial has an objective of establishing
superiority, equivalence or non-inferiority. So, let's move ahead!
- objectives
- control group
Active control groups are often used because placebo control groups are
unethical, such as when:
In the MIRACLE trial [1], the standard-of-care for the eligible patients with heart
failure during the course of the trial consisted of some combination of the
following medications:
Diuretic
Digitalis
Beta-blocker
versus
standard-of-care + pacemaker
versus
There are two other possibilities to consider for designing the clinical trial,
namely, an equivalence trial and a non-inferiority trial.
Thus, the difference in means between the two therapies does not exceed 2
mm Hg. Let's suppose that we are willing to accept this level.
The investigator needs to select an active control therapy for the equivalence
trial that has been proven to be superior to placebo. An important assumption
is that the active control would be superior to placebo (had placebo been a
treatment arm in the current trial).
The U.S. Food and Drug Administration (FDA) and the National Institutes of
Health (NIH) typically require intent-to-treat (ITT) analyses in placebo-
controlled trials. In an ITT analysis, data on all randomized patients are
included in the analysis, regardless of protocol violations, lack of adherence,
withdrawal, incorrectly taking the other treatment, etc. The ITT analysis
reflects what will happen in the real world, outside the realm of a controlled
clinical trial.
Assume that the larger response is the better response. The one-sided zone
of non-inferiority is defined by −δ, i.e., the difference in population means
between the experimental therapy and the active control, μE − μA, should lie
within (−δ, +∞).
Many of the same issues that are critical for designing an equivalence trial
also are critical for designing a non-inferiority trial, namely, appropriate
selection of an active control and appropriate selection of the zone of clinical
non-inferiority defined by δ.
Hypertensive Example
The researchers may decide that the experimental drug is clinically not inferior
to the standard drug if its mean reduction in diastolic blood pressure is at least
3 mm Hg (δ = 2). Thus, the difference in population means between the
experimental therapy and the active control therapy, μE − μA, should lie within
(−δ, +∞). It does not matter if the experimental drug is much better than the
active control drug, provided that it is not inferior to the active control drug.
Because a non-inferiority trial design allows for the possibility that the
experimental therapy is superior to the active control therapy, the non-
inferiority design is preferred over the equivalence design. The equivalence
design is useful when evaluating generic drugs.
In terms of the population means, the hypotheses for testing equivalence are
expressed as:

\( H_0: \{\mu_E - \mu_A \le -\delta \ \text{or}\ \mu_E - \mu_A \ge +\delta\} \)

vs.

\( H_1: \{-\delta < \mu_E - \mu_A < +\delta\} \)

In terms of the population means, the hypotheses for testing non-inferiority are
expressed as:

\( H_0: \mu_E - \mu_A \le -\delta \) vs. \( H_1: \mu_E - \mu_A > -\delta \)
With respect to two-sample t tests, reject the null hypothesis of inferiority if:

\( t_{inf} = (\bar{Y}_E - \bar{Y}_A + \delta)\big/\,s\sqrt{1/n_E + 1/n_A} > t_{n_E+n_A-2,\,1-\alpha} \)

and reject the null hypothesis of superiority if:

\( t_{sup} = (\bar{Y}_E - \bar{Y}_A - \delta)\big/\,s\sqrt{1/n_E + 1/n_A} < -t_{n_E+n_A-2,\,1-\alpha} \)

where the pooled sample variance is

\( s^2 = \left(\sum_{i=1}^{n_E}(Y_{Ei} - \bar{Y}_E)^2 + \sum_{j=1}^{n_A}(Y_{Aj} - \bar{Y}_A)^2\right)\Big/(n_E + n_A - 2) \)
The confidence interval used for testing equivalence has limits
min{0, (ȲE − ȲA) − t·SE} and max{0, (ȲE − ȲA) + t·SE}, where
t = t_{nE+nA−2,1−α} and SE = s√(1/nE + 1/nA). This confidence interval does
provide 100(1 − α)% coverage (see Berger RL, Hsu JC. Bioequivalence trials,
intersection-union tests, and equivalence confidence sets. Statistical Science
1996, 11: 283-319).
If the 100(1 − α)% confidence interval lies entirely within (−δ, +δ), then the null
hypothesis of non-equivalence is rejected in favor of the alternative hypothesis
of equivalence at the α significance level.
For a non-inferiority trial, the two-sample t statistic labeled tinf, previously
discussed, can be applied to test the non-inferiority hypotheses given above.
The FDA typically is more stringent than is required in non-inferiority tests. The
FDA typically requires companies to use α = 0.025 for a non-inferiority trial, so
that the one-sided test or lower confidence limit is comparable to what would
be used in a two-sided superiority trial.
Fig.: Confidence interval positions corresponding to equivalence,
non-equivalence, non-inferiority, and inferiority.
Example
\( \bar{Y}_E = 17.4,\ \bar{Y}_A = 20.6,\ s = 6.5,\ n_E = n_A = 30 \)
The t percentile, t58,0.95, can be found from the TINV function in SAS as
TINV(0.95,58), which yields that t58,0.95 = 1.67. Thus, using the formulas in the
section above, the lower limit = min{0, -3.2 - 2.8} = min{0, -6.0} = -6.0; the
upper limit = max{0, -3.2 + 2.8} = max{0, -0.4} = 0.0. This yields the 95%
confidence interval for testing equivalence of μE − μA is (-6.0, 0.0). Because the
95% confidence interval for μE − μA does not lie entirely within (−δ, +δ) = (-4,
+4), the null hypothesis of non-equivalence is not rejected at the 0.05
significance level. Hence, the investigator cannot conclude that the
experimental therapy is equivalent to the active control.
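A Python re-derivation of this example (nE = nA = 30 is implied by the 58 degrees of freedom; scipy assumed):

```python
from math import sqrt
from scipy.stats import t

y_e, y_a, s, n_e, n_a, delta = 17.4, 20.6, 6.5, 30, 30, 4.0
diff = y_e - y_a                      # -3.2
se = s * sqrt(1 / n_e + 1 / n_a)      # ~1.68
t_crit = t.ppf(0.95, n_e + n_a - 2)   # ~1.67, as from SAS TINV(0.95, 58)
lower = min(0.0, diff - t_crit * se)  # -6.0
upper = max(0.0, diff + t_crit * se)  #  0.0
print(round(lower, 1), round(upper, 1))
print(-delta < lower and upper < delta)  # False: equivalence not established
```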
A real example of a non-inferiority trial is the VALIANT [3] trial in patients with
myocardial infarction and heart failure. Patients were randomized to valsartan
monotherapy (nV = 4,909), captopril monotherapy (nC = 4,909), or valsartan +
captopril combination therapy (nVC = 4,885). The primary outcome was death
from any cause. One objective of the VALIANT trial was to determine if the
combination therapy is superior to each of the monotherapies. Another
objective of the trial was to determine if valsartan is non-inferior to captopril,
defined by δ = 2.5% in the overall death rate.
Switching Objectives
Suppose that in a non-inferiority trial, the 95% lower confidence limit for μE −
μA not only lies within (−δ, +∞) to establish non-inferiority, but also lies within
(0, +∞). It is safe to claim superiority of the experimental therapy to the active
control in such a situation (without any statistical penalty).
In a superiority trial, suppose that the 95% lower confidence limit for E -
A does not lie within (0, +), indicating that the experimental therapy is not
superior to the active control. If the protocol had specified non-inferiority as a
secondary objective and specified an appropriate value of , then it is safe to
claim non-inferiority if the 95% lower confidence limit for E - Alies within (-,
+).
For a continuous outcome, the number of patients required in the active control group of an equivalence trial is

$n_A = \left(\frac{AR+1}{AR}\right)\left(t_{n_1+n_2-2,\,1-\alpha} + t_{n_1+n_2-2,\,1-\beta}\right)^2 \sigma^2 \Big/ \left(\delta - |\mu_E - \mu_A|\right)^2$
Notice the difference in the t percentiles between this formula and that for a
superiority comparison, described earlier. The difference is due to the two
one-sided tests that are performed.
(Note: the formula above simplifies to the formula on p. 189 in the FFDRG text
if AR = 1 and $|\mu_E - \mu_A| = 0$, and if Z percentiles are substituted for the t percentiles.)
For a binary outcome, the corresponding expression is

$n_A = \left(\frac{AR+1}{AR}\right)\left(z_{1-\alpha} + z_{1-\beta}\right)^2\, \bar{p}\left(1 - \bar{p}\right) \Big/ \left(\delta - |p_E - p_A|\right)^2$

where

$\bar{p} = \left(AR \cdot p_E + p_A\right)/\left(AR + 1\right)$
How does this formula compare to FFDRG p. 189? The choice of the value for
$\bar{p}$ in our text is to use the control group value, assuming that $p_E - p_A = 0$.
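As a quick check of the formula, the following data step computes $n_A$ with SAS's PROBIT (inverse normal) function. The rates, margin, and allocation ratio here are hypothetical placeholders, not values from the text.

   data ss_noninf_binary;
      ar = 1;                       /* allocation ratio nE/nA */
      alpha = 0.025;                /* one-sided, per the FDA convention above */
      beta  = 0.10;                 /* 90% power */
      pE = 0.7; pA = 0.7;           /* hypothetical success rates */
      delta = 0.1;                  /* hypothetical non-inferiority margin */
      pbar = (ar*pE + pA) / (ar + 1);
      nA = ((ar + 1)/ar) * (probit(1-alpha) + probit(1-beta))**2
           * pbar*(1 - pbar) / (delta - abs(pE - pA))**2;
      nA = ceil(nA);
      put nA=;
   run;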
For a time-to-event outcome, the zone of equivalence for the hazard ratio
between the experimental therapy and the active control, $\psi$, is defined by the
interval $(1/\psi_0, +\psi_0)$, where $\psi_0$ is chosen > 1. The number of patients who need
to experience the event to achieve $100(1-\beta)\%$ statistical power with an $\alpha$-level significance test is approximated by

$E = \left(\frac{(AR+1)^2}{AR}\right)\left(z_{1-\alpha} + z_{1-\beta}\right)^2 \Big/ \left(\log_e(\psi_0/\psi)\right)^2$
If $p_E$ and $p_A$ represent the anticipated failure rates in the two treatment groups,
then the sample sizes can be determined from $n_A = E/(AR \cdot p_E + p_A)$ and $n_E = AR \cdot n_A$.
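A minimal sketch of this calculation in SAS follows; the hazard ratio bound $\psi_0$ and the failure rates are hypothetical placeholders.

   data ss_noninf_tte;
      ar = 1;                              /* allocation ratio */
      alpha = 0.025; beta = 0.10;
      psi0 = 1.25;                         /* hypothetical bound on the hazard ratio */
      psi  = 1.0;                          /* assumed true hazard ratio */
      E  = ((ar + 1)**2 / ar) * (probit(1-alpha) + probit(1-beta))**2
           / (log(psi0/psi))**2;           /* required number of events */
      pE = 0.2; pA = 0.2;                  /* anticipated failure rates */
      nA = ceil(E / (ar*pE + pA));
      nE = ceil(ar * nA);
      put E= nA= nE=;
   run;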
For a non-inferiority trial with a continuous outcome, the sample size expression is the same:

$n_A = \left(\frac{AR+1}{AR}\right)\left(t_{n_1+n_2-2,\,1-\alpha} + t_{n_1+n_2-2,\,1-\beta}\right)^2 \sigma^2 \Big/ \left(\delta - |\mu_E - \mu_A|\right)^2$
Notice that the sample size formulae for non-inferiority trials are exactly the
same as the sample size formulae for equivalence trials. This is because of
the one-sided testing for both types of designs (even though an equivalence
trial involves two one-sided tests). Also notice that the choice of Z in the
formulas above has assumed a one-sided test or two one-sided tests, but
the requirement of regulatory agencies, and the approach in our FFDRG text,
is to use the Z value that would have been used for a two-sided hypothesis test.
In homework, be sure to state any assumptions and the approach you are
taking.
Come up with an answer to the following question by yourself before checking the solution.
What happens to the total sample size if the power is to be 0.95 and the
investigator uses 2:1 allocation?
With equal allocation, the number of patients in the active control group is:
SAS PROC POWER does not contain a feature for an equivalence trial or a
non-inferiority trial with binary outcomes. Fisher's exact test for a superiority
trial can be adapted to yield $n_E = n_A$ = 1,882 for a total of 3,764 patients. The
discrepancy is due to the superiority trial using $\bar{p}$ = 0.675 instead of 0.7.
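For reference, the adaptation described above looks something like the PROC POWER call below. The proportions 0.65 and 0.70 are an assumption on my part, chosen because they give $\bar{p}$ = 0.675 as cited in the text.

   proc power;
      twosamplefreq test=fisher
         groupproportions = (0.65 0.70)   /* assumed rates */
         alpha = 0.05
         power = 0.90
         npergroup = .;                   /* solve for n per group */
   run;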
Suppose the proportions were 0.65 and 0.75. How does the required sample
size, n, change?
Assuming constant hazard functions, the effect size with $p_E = p_A = 0.2$ is
$\psi = 1$. With $p_E = 0.25$ and $p_A = 0.2$, the zone of non-inferiority is defined by
$(1/\psi_0, +\infty)$, and the sample sizes are $n_A = E/(AR \cdot p_E + p_A)$ = 648/(0.2 + 0.2) = 1,620 and
$n_E$ = 1,620.
Since SAS PROC POWER does not contain a feature for an equivalence trial
or a non-inferiority trial with time-to-event outcomes, the results from the
logrank test for a superiority trial were adapted to yield nE = nA = 1,457. The
discrepancy in numbers between the program and the calculated n is due to
the superiority trial using pE = 0.25 instead of 0.2 in nA = E/(ARpE + pA).
Notice that the resultant sample sizes in SAS Examples 7.7-7.9 all are
relatively large. This is because the zone of equivalence or non-inferiority is
defined by a small value of $\delta$. Generally, equivalence trials and non-inferiority
trials require larger sample sizes than superiority trials.
6B.9 - Summary
In this lesson, among other things, we learned:
Eligibility criteria also define the accrual rate for a trial. Although tighter
eligibility criteria lead to a more homogeneous study cohort, they yield a slower accrual
rate, because it is more difficult for prospective patients to meet all of the specified
criteria.
Although a narrowly-defined cohort may have some external validity for others
with the same disease if the treatment appears to be beneficial, in general it
will lack external validity because the study results may not apply to patients
with slightly altered versions of the disease. Again, these are examples of the
competing demands that the researcher must keep in mind.
Epidemiologists have defined the healthy worker effect as the phenomenon
that the general health of employed individuals is better than average. For
example, employed individuals may be unsuitable controls for a case-control
study if the cases are hospitalized patients. Similarly, individuals who
volunteer for clinical trials may have more favorable outcomes than those who
refuse to participate, even if the treatment is ineffective. This selection effect is
known as the trial participant effect and it can be strong. For a randomized
trial, however, this may not be a problem unless selection effects somehow
impact treatment assignment.
Because of the possible effects of prognostic (variables that can affect the
outcome) and selection factors on differences in outcome, the eligibility criteria
for the study cohort need to be defined carefully.
1. Define very narrow eligibility criteria so that the study cohort is relatively
homogeneous, which may yield an outcome variable that has less
variability and result in a smaller sample size; however, the results may
not have external validity.
(You may notice in this section we have defined a study cohort for the trial.
This doesn't mean, however, that every clinical trial is a cohort study in the
sense of a long-term study following a defined group of patients.)
The target assumes a constant accrual of patients. Recruitment lagged behind
the target at the beginning of the study but caught up with the target by the end.
This struggle to recruit enough patients is very typical; recruitment is always a
struggle, and everyone on the research team needs to help with this process.
The reasons physicians give for failing to enroll patients in clinical trials are
the perception that the trial may compromise the physician-patient relationship
and the difficulties with informed consent. Many consent forms are
cumbersome, intimidating, and not written at an appropriate reading level. The
'experts' say that these documents should be written at an 8th grade reading
level. Using plain language is important. Also, many patients are mistrustful of
the medical establishment, although they may trust their individual physicians.
Often, ethnic minority groups express even stronger concerns about
participation in clinical trials.
Example
This study was examining a very specific biological question. The primary
objective of the trial was to establish whether increasing the dose for each ICS
yields a decrease in plasma cortisol (adrenal suppression). The researchers
were interested in looking at dose response curves. The DICE trial was not
powered to compare the dose-response curves of each ICS.
DICE is strictly an efficacy trial with very narrow eligibility criteria. Furthermore,
the protocol specified that the intent-to-treat paradigm would not be followed.
Subjects were dropped post-randomization if they received other forms of
steroids, became pregnant, or were non-compliant with dose schedules
and/or visit schedules.
On the other hand, over the past 20 years there has been great interest in the
gender and ethnic composition of cohorts in clinical trials. Part of this interest
is due to ensuring external validity of the results of the trials. For many years
Caucasian males were the only patients recruited, for the purpose of assuring
homogeneity. Both the FDA and the NIH have since broadened eligibility
requirements in their application processes. The broader eligibility requirements
help to ensure broader external validity.
The NIH typically requires one-half female participation and one-third ethnic
minority participation in CTE trials that it sponsors. Obviously, there are
exceptions to this based on the disease of interest. Required representation in
clinical trials, however, could be a hindrance to acquiring new knowledge if it
consumes too many resources.
7.4 - Summary
In this lesson, among other things, we learned:
Let's put what we have learned to use by completing the following homework
assignment:
Homework
1. reducing bias;
8.1 - Randomization
In some early clinical trials, randomization was performed by constructing two
balanced groups of patients and then randomly assigning the two groups to
the two treatment groups. This is not always practical as most trials do not
have all the patients recruited on day one of the study. Most clinical trials
today invoke a procedure in which individual patients, upon entering the study,
are randomized to treatment.
Simple Randomization
The moderate group is fairly well balanced, but the mild and severe groups are much
more imbalanced. This results in Group A receiving more of the severe cases
and Group B more of the mild cases.
8.2 - Constrained Randomization
Randomization in permuted blocks is one approach to achieve balance across
treatment groups. The randomization scheme consists of a sequence of
blocks such that each block contains a pre-specified number of treatment
assignments in random order. The purpose of this is so that the randomization
scheme is balanced at the completion of each block. For example, suppose
equal allocation is planned in a two-armed trial (groups A and B) using a
randomization scheme of permuted blocks. The target sample size is 120
patients (60 in A and 60 in B) and the investigator plans to enroll 12 subjects
per week. In this situation, blocks of size 12 are natural, so the randomization
plan looks like
Week #1 BABABAABABAB
Week #2 ABBBAAABAABB
Week #3 BBBABABAABAA
This probability rule is based on the model of NA "A" balls and NB "B" balls in
an urn or jar which are sampled without replacement. The probability of being
assigned treatment A changes according to how many patients already have
been assigned treatment A and treatment B within the block.
If there are too many strata in relation to the target sample size, then some of
the strata will be empty or sparse. This can be taken to the extreme such that
each stratum consists of only one patient, which in effect would yield a
similar result as simple randomization. Keep the number of strata to a
minimum.
This type of urn model for adaptive randomization yields tight control of
balance in the early phase of a trial. As nA and nB get larger, the scheme tends
to approach simple randomization, so the advantage of such an approach
occurs when the trial has a small target sample size.
8.5 - Minimization
Minimization is another, rather complicated type of adaptive randomization.
Minimization schemes construct measures of imbalance for each treatment
when an eligible patient is ready for randomization. The patient is assigned to
the treatment which yields the lowest imbalance score. If the imbalance
scores are all equal, then that patient is randomly assigned a treatment. This
type of adaptive randomization imposes tight control of balance, but it is more
labor-intensive to implement because the imbalance scores must be
calculated with each new patient. Some researchers have developed web-
based applications and automated 24-hour telephone services that solicit
information about the stratifiers, and a computer algorithm uses the data to
determine the randomization.
Suppose that patient #201 is ready for randomization and that this patient is
observed to have the low level of stratifier #1, the medium level of stratifier #2,
the high level of stratifier #3, and the high level of stratifier #4. Based on the
200 patients already in the trial, the number of patients with each of these
levels is totaled for each treatment group. (Notice that patients may be double
counted in this table.)
The advantage of the "play the winner" rule is that a higher proportion of
patients will be assigned to the more successful treatment. This seems to be
an ethical approach.
Thus, the "play the winner" rule is not practical for most trials. The procedure
can be modified, however, to be performed in stages. For example, if the
target sample size is 200 patients, then the trial can be put on hold after each
set of 50 patients to assess outcome and redefine the probability of treatment
assignment for the patients yet to be recruited, i.e., "play the winner" after
every 50 patients instead of every patient.
8.7 - Administration of the
Randomization Process
The RANUNI function in SAS yields random numbers from the Uniform(0,1)
distribution (a randomly selected decimal between 0 and 1). These random
numbers can be used to generate a randomization scheme. For example,
suppose that the probabilities of assignment to treatments A, B, and C are to be
0.25, 0.25, and 0.5, respectively. Let U denote the random number generated
and assign treatment as follows: A if U < 0.25, B if 0.25 ≤ U < 0.50, and C if U ≥ 0.50.
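As a sketch, that assignment rule can be implemented in a short data step (the seed and the number of patients are arbitrary):

   data simple_rand;
      do patient = 1 to 12;
         u = ranuni(20105);                /* Uniform(0,1) random number */
         if u < 0.25 then trt = 'A';       /* P(A) = 0.25 */
         else if u < 0.50 then trt = 'B';  /* P(B) = 0.25 */
         else trt = 'C';                   /* P(C) = 0.50 */
         output;
      end;
   run;

   proc print data=simple_rand; run;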
Come up with an answer to the following question by yourself before checking the solution; one possible SAS sketch is given below.
Can you generate a permuted blocks randomization scheme for a total
sample size of 32 with a block size of 4?
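One common approach, sketched below under the assumption of equal allocation to treatments A and B, is to attach a RANUNI sort key to each assignment and then permute within blocks by sorting:

   data blocks;
      do block = 1 to 8;                /* 8 blocks of size 4 gives n = 32 */
         do i = 1 to 4;
            if i <= 2 then trt = 'A';   /* two A's and two B's per block */
            else trt = 'B';
            u = ranuni(8675309);        /* random sort key; seed arbitrary */
            output;
         end;
      end;
      drop i;
   run;

   proc sort data=blocks;               /* permutes assignments within block */
      by block u;
   run;

   proc print data=blocks; run;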
Many clinical trials rely on pharmacies to package the drugs so that they are
masked to investigators and patients. For example, consider a two-armed trial
with a target sample size of 96 randomized subjects (48 within each treatment
group). The pharmacist constructs 96 drug packets and randomly assigns
numeric codes from 01 to 96 which are printed on the drug packet labels. The
pharmacist gives the investigator the masked drug packets (with their numeric
codes). When a subject is eligible for randomization, the investigator selects
the next drug packet (in numeric order). In this way the investigator is kept
from knowing which treatment is assigned to which patient.
Another example where unequal allocation may be desirable occurs when one
therapy is extremely expensive in comparison to the other therapies in the
trial. For budget reasons you may not be able to assign as many to the
expensive therapy.
If it is known that one treatment is more variable (less precise) in the outcome
response than the other treatments, then the statistical power for treatment
comparisons is maximized with unequal allocation. The allocation ratio should
be

$r = n_1/n_2 = \sigma_1/\sigma_2$

which is the ratio of the known standard deviations. Thus, the treatment that
yields less precision (larger standard deviation) should receive more patients,
an unequal allocation. Because there is more 'noise' in that arm, a larger
sample size will help to cut through the noise.
One particular scheme with experimental and standard treatments that has
received some attention is as follows. Eligible patients are randomized prior to
providing consent. If the patient is assigned to the standard therapy, then it is
offered to the patient without the need for consent. If the patient is randomized
to the experimental therapy, then the patient is asked for consent. If this
patient refuses, however, then he/she is offered the standard therapy. An
"intent-to-treat" analysis is performed based on the randomized assignment.
This approach can increase trial participation, but patients who are
randomized to the experimental treatment and refuse will dilute the treatment
difference at the time of data analysis. In addition, the "intent-to-treat" analysis
will introduce bias.
8.10 - Summary
In this lesson, among other things, we learned:
Let's put what we have learned to use by completing the following homework
assignment:
6. Definitive information is available from outside the study, making the trial
unnecessary or unethical; this is also related to the next item...
(Piantadosi, 2005)
This lesson will examine different methods and guidelines that can be used
to help decide whether or not to terminate a clinical trial in progress.
Differentiate between valid and invalid reasons for interim analyses and
early termination of a trial.
Recognize the general effects of the choice of the prior on the posterior
probability distribution from a Bayesian analysis.
9.1 - Overview
Data-dependent stopping is a general term to describe any statistical or
administrative reason for stopping a trial. Consideration of the reasons given
earlier may lead you to stop the trial at an early stage, or at least change the
protocol.
All of the available statistical methods for interim analyses have some similar
characteristics. They
3. require some structure to the problem beyond the data being observed, and
4. tend to have similar performance characteristics.
The first likelihood method proposed for this situation is called the sequential
probability ratio test (SPRT), and it is based on the likelihood function. (This
method is very rarely implemented because it is impractical, but it is important
for historical reasons.) Let's review this method in general terms here. For a
binomial response, the likelihood function is

$L(p, k) = p^k (1-p)^{n-k}$

Suppose that N is the target sample size and that after n patients there
are k successes. After each patient we will stop and analyze the data to
determine whether to continue the trial or not. Under this scenario, we stop
the trial if:
$R = \frac{L(p_0, k)}{L(p_1, k)} = \left(\frac{p_0}{p_1}\right)^k \left(\frac{1-p_0}{1-p_1}\right)^{n-k} \le R_L \quad\text{or}\quad R \ge R_U$

where $R_L$ and $R_U$ are prespecified constants. Let's not worry about the details
of the statistical calculation here. The values of $R_L$ and $R_U$ that correspond to
testing $H_0: p = p_0$ versus $H_1: p = p_1$ are $R_L = \beta/(1-\alpha)$ and $R_U = (1-\beta)/\alpha$.
A sample schematic of the SPRT in practice is shown below. Here you would
calculate R after the treatment of each patient. As patients accumulate,
you can see that R moves around as the trial proceeds. In this schematic, the
upper boundary is hit before all of the planned patients have been accrued,
so the remaining patients would not be recruited.
Here is another example...
Next, the data from the trial are observed, say we call it X, and the likelihood
function of X given $\theta$ is constructed. Finally, the posterior distribution for $\theta$
given X is constructed. In essence, the prior distribution for $\theta$ is revised into
the posterior distribution based on the data X. The data collected in the study
inform or revise the earlier assumptions.
The Bayesian statistician performs all inference for the treatment effect by
formulating probability statements based on the posterior distribution. This is a
very different approach and is not always accepted by the more traditional
frequentist oriented statisticians.
In the Bayesian approach, $\theta$ is regarded as a random variable, about which
probability statements can be made. This is the appealing aspect of the
Bayesian approach. In contrast, the frequentist approach regards $\theta$ as a fixed
but unknown quantity (called a parameter) that can be estimated from the
data.
Frequentist: "If a very large number of samples, each with the same sample
size as the original sample, were taken from the same population as the
original sample, and a 95% confidence interval constructed for each sample,
then 95% of those confidence intervals would contain the true value of $\theta$." This
is an extremely awkward and dissatisfying definition, but it technically represents
the frequentist's approach.
Bayesian: "The 95% confidence interval defines a region that covers 95% of
the possible values of $\theta$." This is much simpler and more straightforward. (As a
matter of fact, most people when they first take a statistics course believe that
this is the definition of a confidence interval.)
Similarly, skeptical prior distributions are those that quantify the belief that
large treatment effects are unlikely. Enthusiastic prior distributions are those
that quantify the belief that large treatment effects are likely. Let's not worry about
the calculations, but focus instead on the concepts here...
Next, suppose that at the time of the interim analysis (45 events have
occurred), there are 31 events in one group and 14 events in the other group,
such that the estimated hazard ratio is 2.25 (calculations not shown). These
values are incorporated into the likelihood function, which modifies the prior
distribution to yield the posterior distribution for the estimated $\log_e$ hazard ratio,
which has mean = 0.474 and standard deviation = 0.228 (calculations not
shown). Therefore we can calculate the probability that the hazard ratio $\psi$ is > 2.
From the posterior distribution we construct the following probability statement:

$\Pr[\psi \ge 2] = 1 - \Phi\left(\frac{\log_e(2) - 0.474}{0.228}\right) = 1 - \Phi(0.961) = 0.168$
Conclusion: Based on the results from the interim analysis with a skeptical
prior, there is not strong evidence that the treatment is effective, because the
posterior probability of the hazard ratio exceeding 2 is relatively small.
Therefore, there is not enough evidence here to suggest that the study be
stopped. (How large a posterior probability is large enough? A reasonable
threshold should be specified in the protocol before these values are determined.)
In contrast, suppose that before the onset of the trial the investigator is very
excited about the potential benefit of the treatment. Therefore, the investigator
wants to use an enthusiastic prior for the $\log_e$ hazard ratio, i.e., a normal
distribution with mean = $\log_e(2)$ = 0.693 and standard deviation = 0.35 (the same
standard deviation as the skeptical prior).
Suppose the interim data results are the same as those described above. This
time, the posterior distribution for the $\log_e$ hazard ratio is normal with mean =
0.762 and standard deviation = 0.228. Then the posterior probability is:

$\Pr[\psi \ge 2] = 1 - \Phi\left(\frac{\log_e(2) - 0.762}{0.228}\right) = 1 - \Phi(-0.302) = 0.619$
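Both posterior probabilities can be verified with SAS's PROBNORM (standard normal CDF) function; this sketch just re-expresses the two probability statements above.

   data posterior;
      /* skeptical prior: posterior mean 0.474, sd 0.228 */
      p_skeptical = 1 - probnorm((log(2) - 0.474)/0.228);     /* 0.168 */
      /* enthusiastic prior: posterior mean 0.762, sd 0.228 */
      p_enthusiastic = 1 - probnorm((log(2) - 0.762)/0.228);  /* about 0.62 */
      put p_skeptical= p_enthusiastic=;
   run;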
This is a drastic change in the probability, driven entirely by the assumptions that were
made ahead of time. In this case, the investigator still may not consider this to
be strong evidence that the trial should terminate, because the posterior
probability of the hazard ratio exceeding 2 does not exceed 0.90.
Nevertheless, the example demonstrates the controversy that can arise with a
Bayesian analysis when the amount of experimental data is small, i.e., the
selection of the prior distribution drives the decision-making process. For this
reason, many investigators prefer to use non-informative priors. Using the
Bayesian methods, you can make probability statements about your expected
results.
The trial is stopped and the null hypothesis rejected at the rth analysis if

$|Z_r| \ge B_r, \quad r = 1, 2, \ldots, R$

The boundary points $B_r$ are chosen such that the overall significance level does
not exceed the desired $\alpha$. Primarily three schemes have been proposed for
selecting the boundary points. These are illustrated in the following table for an
overall significance level of $\alpha$ = 0.05 and for R = 2, 3, 4, 5.
The table is constructed under the assumption that n patients are accrued at
each of the R statistical analyses so that the total sample size is N = nR.
For example, if you were to have one interim analysis and a final analysis, in
this table, that means R=2. Use the first two rows of the table to find the
critical values.
If you were to have three interim analyses and then one final analysis, then
R=4. You would use the corresponding four rows in the middle of the table to
determine critical values.
Example
Patient accrual lasted over two years and 126 patients participated. Statistical
analyses were scheduled after approximately every 25 patients. Chi-square
tests (without the continuity correction) were performed at each of the five
scheduled analyses. The Pocock approach to group sequential testing
requires a significance level of 0.0158 at each analysis. Here is a table with
the results of these analyses.
Thus, the researchers were concerned that the CVP combination appeared to
be clinically better than the CP combination (53% success versus 34%
success), yet it did not lead to a statistically significant result with Pocock's
approach. Further analyses with secondary endpoints convinced the
researchers that the CVP combination is superior to the CP combination.
Let $\tau$ denote the information fraction available during the course of a clinical
trial. For example, in a clinical trial with a target sample size, N, in which
treatment group means will be compared, the information fraction at an interim
analysis is $\tau = n/N$, where n is the sample size at the time of the interim
analysis. If your target sample size is 500 and you have taken measurements
on 400 patients, then $\tau$ = 0.8.
The alpha spending function, $\alpha(\tau)$, is an increasing function with $\alpha(0) = 0$ and
$\alpha(1) = \alpha$, the desired overall significance level. In other words, every time you
are doing an analysis you are, in a sense, "spending part of your alpha." For
the rth interim analysis, where the information fraction is $\tau_r$, $0 \le \tau_r \le 1$, $\alpha(\tau_r)$
determines the probability of any of the first r analyses leading to rejection of
the null hypothesis when the null hypothesis is true. As an example, suppose
investigators are planning a trial in which patients are examined every two
weeks over a 12-week period. The investigators would like to incorporate an
interim analysis when one-half of the subjects have completed at least one-
half of the trial. This corresponds to $\tau$ = 0.25.
The fixed sample size plan is to toss the coin 500 times and count the number of
heads, X. But do we actually need to flip the coin 500 times? Using this futility
assessment procedure, we would reject $H_0$ at the 0.025 significance level if:

$Z = \frac{X - 250}{\sqrt{(500)(0.5)(0.5)}} \ge 1.96$
You can also look at this in the other direction. Suppose that after 400 tosses
of the coin there are 200 heads. The null hypothesis will be rejected only if there
are at least 72 heads during the remaining 100 tosses. Even if the true probability
of heads is p = 0.6, this outcome is very unlikely:

$\Pr[X \ge 72 \,|\, n = 100, p = 0.6] = \Pr\left[Z \ge \frac{72 - (100)(0.6)}{\sqrt{(100)(0.6)(0.4)}}\right] = \Pr[Z \ge 2.45] = 0.007$
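This conditional calculation is easily verified in SAS; the data step below gives both the normal approximation used above and the exact binomial probability.

   data futility;
      /* chance of at least 72 heads in the remaining 100 tosses when p = 0.6 */
      z = (72 - 100*0.6) / sqrt(100*0.6*0.4);
      p_normal = 1 - probnorm(z);              /* normal approximation: 0.007 */
      p_exact  = 1 - probbnml(0.6, 100, 71);   /* exact binomial Pr[X >= 72] */
      put z= p_normal= p_exact=;
   run;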
Here are some practical issues as they relate to single center trials. Typically,
an investigator for a single-center trial needs to submit an annual report to
his/her IRB. The report should address whether the study is safe and whether
it is appropriate to continue.
4. Summary of response,
5. Summary of survival,
6. Adverse events,
Multi-Center Trials
A multi-center trial is one in which there are one or more clinical investigators
at each of a number of locations (centers). Obviously, multi-center trials are of
great importance when the disease is not common and a single investigator is
capable of recruiting only a handful of patients.
5. You need a data coordinating center (DCC) for storing, monitoring data
and organizing investigators,
6. A need develops to keep all investigators involved and motivated,
The NIH requires a Data and Safety Monitoring Board (DSMB) to monitor the
progress of a multi-center clinical trial that it sponsors. Although the FDA does
not require a pharmaceutical/biotech company to construct a DSMB for its
multi-center clinical trials, many companies are starting to use DSMBs on a
regular basis.
A DSMB typically examines the following issues when assessing the worth of
a multi-center clinical trial:
2. Are the accrual rates meeting initial projections and is the trial on its
scheduled timeline?
4. Are the treatment groups different with respect to safety and toxicity
data?
5. Are the treatment groups different with respect to efficacy data?
9.9 - Summary
In this lesson, among other things, we learned:
Differentiate between valid and invalid reasons for interim analyses and
early termination of a trial.
Recognize the general effects of the choice the prior on the posterior
probability distribution from a Bayesian analysis.
Let's explore Bayesian methods further in this week's discussion and apply
what we have learned to the assessment questions.
Missing data
Simple data imputation involves substituting one data point for each missing
value. Some substitution choices include the mean of the non-missing values
or a predicted value from a linear regression model.
Another simple data imputation method is the last observation carried forward
(LOCF) approach in longitudinal studies. With LOCF, the last observed value
for a patient is substituted for all of that patient's subsequent missing values.
The problems with simple data imputation methods are that they can yield a
very biased result and they tend to underestimate variability during the data
analysis.
2. multiple imputed data sets are created in this manner (say 10-20 data
sets), and
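In SAS, the multiple-imputation workflow is handled by PROC MI and PROC MIANALYZE. The sketch below is generic; the data set and variable names are hypothetical.

   proc mi data=trial nimpute=10 out=imputed seed=20177;
      var baseline y1 y2;     /* variables with (possibly) missing values */
   run;

Each of the 10 completed data sets in IMPUTED (indexed by _Imputation_) is then analyzed with the usual procedure, and the resulting estimates are combined with PROC MIANALYZE.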
In most clinical trials, it is common to find errors that yield ineligible patients
participating in the trial. Objective eligibility criteria are less susceptible to error
than subjective criteria. Also, patients can fail to comply with nearly every
aspect of treatment specification, such as reduced or missed doses and
improper dose scheduling.
Ineligible patients in the study can be (1) included in the analysis of the cohort
of eligible patients (pragmatic approach/intention-to-treat) or (2) excluded from
the analysis (explanatory approach).
In a randomized trial, if the eligibility criteria are objective and assessed prior
to randomization, then neither approach causes bias. The pragmatic
approach, however, increases the external validity.
10.2 - Intention-to-Treat
Intention-to-treat (ITT) is the principle that patients in a randomized clinical
trial should be analyzed according to the group to which they were assigned,
even if they did not receive or complete the assigned treatment.
Most statisticians favor the ITT principle because it yields the best properties
for the test of the null hypothesis of no treatment difference. "If randomized,
then analyzed" is the view widely held among clinical trial statisticians and
considered a critical component of the ITT Principle to avoid biases due to
post-randomization exclusions. ITT also is favored by the federal agencies
because a clinical trial is a test of treatment policy, not a test of treatment
received. After a meeting to discuss clinical trials methodology, which included
US FDA representatives, the International Conference on Harmonization
(ICH) published a document entitled "Statistical Principles for Clinical Trials
(E9)" that discusses the ITT Principle under various circumstances.
10.3 - Summary
In this lesson, among other things, we learned:
The design of a clinical trial imposes structure on the resulting data. For
example, in pharmacologic treatment mechanism (Phase I) studies, blood
samples are used to display concentration-time curves, which relate to
simple physiologic models of drug distribution and/or metabolism. As another
example, in SE (Phase II) trials of cytotoxic drugs, investigators are interested
in tumor response and toxicity of the drug or regimen. The usual study design
permits estimating the unconditional probability of response or toxicity in
patients who met the eligibility criteria.
For every trial, investigators must distinguish between those analyses, tests of
hypotheses, or other summaries of the data that are specified a priori and
justified by the design and those which are exploratory. Remember, the results
from statistical analyses of endpoints that are specified a priori in the protocol
carry more validity. Although exploratory analyses are important and might
uncover biological relationships previously unsuspected, they may not be
statistically reliable because it is not possible to account for the random nature
of exploratory analyses. Exploratory analyses are not confirmatory by
themselves but generate hypotheses for future research.
Estimates are made for each of these areas, i.e., absorption rate, distribution
rate, etc. for the drug in question.
When would the odds ratio and the relative risk be about the same?
When p1 and p2 are relatively small, for instance, when you are dealing with a
very rare event.
The odds ratio is useful and convenient for assessing risk when the response
outcome is binary, but it does have some limitations.
Notice in the table above that while the absolute risk difference is constant,
the relative risk varies greatly, as does the odds ratio. Thus, the magnitudes of
the odds ratio and relative risk are strongly influenced by the initial probability
of the condition.
When the outcome in a CTE trial is a binary response and the objective is to
compare the two groups with respect to the proportion of success, the results
can be expressed in a 2 × 2 table as

           Group #1    Group #2
Success    r1          r2
Failure    n1 - r1     n2 - r2

The estimated relative risk is $(r_1/n_1)/(r_2/n_2)$ and the estimated odds ratio is:

$\hat{\psi} = \frac{r_1/(n_1 - r_1)}{r_2/(n_2 - r_2)} = \frac{r_1(n_2 - r_2)}{r_2(n_1 - r_1)}$
There are a variety of methods for performing the statistical test of the null
hypothesis $H_0: \psi = 1$ (or $H_0: \log_e\psi = 0$), such as a z-test using a normal
approximation, a $\chi^2$ test (basically, the square of the z-test), a $\chi^2$ test with
continuity correction, and Fisher's exact test.
The normal and $\chi^2$ approximations for testing $H_0: \psi = 1$ are relatively accurate
if these conditions hold:

$\frac{n_1(r_1+r_2)}{n_1+n_2} \ge 5, \quad \frac{n_2(r_1+r_2)}{n_1+n_2} \ge 5, \quad \frac{n_1(n_1+n_2-r_1-r_2)}{n_1+n_2} \ge 5, \quad \frac{n_2(n_1+n_2-r_1-r_2)}{n_1+n_2} \ge 5$
These expressions are basically what we would have calculated for the expected
cell counts in the 2 × 2 table. The first expression, for example, is the probability of
success times the probability of being in group 1 times the number of
subjects.
Otherwise, Fisher's exact test is recommended.
If the above condition is met, then the $\log_e$-transformed estimated odds ratio
has an approximate normal distribution:

$\log_e(\hat{\psi}) \sim N\left(\mu = \log_e(\psi),\ \sigma^2 = \frac{1}{r_1} + \frac{1}{r_2} + \frac{1}{n_1-r_1} + \frac{1}{n_2-r_2}\right)$

and an approximate $100(1-\alpha)\%$ confidence interval for $\log_e(\psi)$ is

$\log_e(\hat{\psi}) \pm z_{1-\alpha/2}\sqrt{\frac{1}{r_1} + \frac{1}{r_2} + \frac{1}{n_1-r_1} + \frac{1}{n_2-r_2}}$
SAS Example 12.1: An investigator conducted a small safety and efficacy
study comparing treatment to placebo with respect to adverse reactions. The
data are as follows:
Treatment Placebo
adverse reaction 12 4
no adverse reaction 32 40
The estimated odds ratio is calculated as:

$\hat{\psi} = \frac{(12)(40)}{(32)(4)} = 3.75$

and the approximate 95% confidence interval for the $\log_e$ odds ratio is

$1.32 \pm (1.96 \times 0.62) = (0.10,\ 2.54)$
Because the approximate 95% confidence interval for $\log_e(\psi)$ does not contain 0,
i.e., the corresponding interval for $\psi$ does not contain 1.0, the null hypothesis
$H_0: \psi = 1$ is rejected at the 0.05 significance level.
Even though this data table satisfies the criteria for the $\log_e$ estimated odds ratio
to follow an approximate normal distribution, there still is a discrepancy
between the approximate results and the exact results.
From PROC FREQ in SAS, the exact 95% confidence interval for $\psi$ is
(1.00, 17.25). Because the exact 95% confidence interval for $\psi$ does contain 1.0,
$H_0: \psi = 1$ is not rejected at the 0.05 significance level based on Fisher's exact
test.
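Both the asymptotic and exact analyses can be obtained from PROC FREQ. The sketch below is one way SAS Example 12.1 could be coded; only the cell counts come from the text.

   data adverse;
      input group $ reaction $ count;
      datalines;
   Trt yes 12
   Trt no  32
   Pbo yes  4
   Pbo no  40
   ;

   proc freq data=adverse order=data;
      weight count;
      tables group*reaction / relrisk;  /* odds ratio with asymptotic CI */
      exact or;                         /* exact CI: (1.00, 17.25) */
   run;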
11.3 - Safety and Efficacy (Phase II)
Studies: The Mantel-Haenszel Test for
the Odds Ratio
Sometimes a safety and efficacy study is stratified according to some factor,
such as clinical center, disease severity, gender, etc. In such a situation, it still
may be desirable to estimate the odds ratio while accounting for strata effects.
The Mantel-Haenszel test for the odds ratio assumes that the odds ratio is
equal across all strata, although the rates, p1 and p2, may differ across strata.
This procedure calculates the odds ratio within each stratum and then
combines the strata estimates into one estimate of the common odds ratio.
[Example table (not reproduced): stratum-specific success rates $p_1$ and $p_2$]
SAS PROC FREQ yields an estimated odds ratio of 1.84, with an approximate
95% confidence interval of (1.28, 2.66).
The exact 95% confidence interval is (1.26, 2.69). The exact and asymptotic
confidence intervals are nearly identical due to the large sample size across
the six clinical centers.
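In PROC FREQ, the Mantel-Haenszel analysis is requested with the CMH option; the stratification variable is listed first. The data set and variable names here are hypothetical.

   proc freq data=strata_data;
      weight count;
      tables center*group*response / cmh;  /* 2x2 tables of group*response within each center */
   run;

The output includes the CMH test statistics and the Mantel-Haenszel estimate of the common odds ratio with its confidence limits.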
For the sake of illustration, suppose that the response is continuous and that
we want to determine if there is a trend in the K + 1 population means.
$H_0: \{\mu_0 = \mu_1 = \cdots = \mu_K\}$ versus
$H_1: \{\mu_0 \le \mu_1 \le \cdots \le \mu_K \text{ with at least one strict inequality}\}$

$H_0: \{\mu_0 = \mu_1 = \cdots = \mu_K\}$ versus
$H_1: \{\mu_0 \ge \mu_1 \ge \cdots \ge \mu_K \text{ with at least one strict inequality}\}$

$H_0: \{\mu_0 = \mu_1 = \cdots = \mu_K\}$ versus
$H_1: \{\mu_0 \le \mu_1 \le \cdots \le \mu_K \ \text{ or } \ \mu_0 \ge \mu_1 \ge \cdots \ge \mu_K \text{ with at least one strict inequality}\}$
More than likely we would use one of the one-sided tests as you probably
have a hunch about the effect that will result.
$JT = \sum_{k=0}^{K-1} \sum_{k'=k+1}^{K} MWW_{kk'}, \quad\text{where}\quad MWW_{kk'} = \sum_{i=1}^{n_k} \sum_{i'=1}^{n_{k'}} \text{sign}\left(Y_{k'i'} - Y_{ki}\right)$
The JT trend test actually is testing hypotheses about population medians, but
if the underlying probability distribution is symmetric, the population mean and
the population median are equal to one another. The JT trend test is available
in PROC FREQ of SAS.
An analogous statistic for continuous data compares all pairs of group means:

$\sum_{k=0}^{K-1}\sum_{k'=k+1}^{K}\left(\bar{Y}_{k'} - \bar{Y}_k\right)$

with pooled variance estimate

$s^2 = \frac{1}{d}\sum_{k=0}^{K}\sum_{i=1}^{n_k}\left(Y_{ki} - \bar{Y}_k\right)^2, \quad d = \sum_{k=0}^{K}\left(n_k - 1\right)$

More generally, a parametric trend test can be based on a linear contrast of the sample means, $\sum_{k=0}^{K} c_k \bar{Y}_k$, via the t statistic

$T = \left(\sum_{k=0}^{K} c_k \bar{Y}_k\right)\Big/\sqrt{s^2 \sum_{k=0}^{K} c_k^2 / n_k}$
For example, if K = 3 (placebo, low dose, mid dose, and high dose), then $c_0 = -3$,
$c_1 = -1$, $c_2 = 1$, $c_3 = 3$. Notice, however, that if there is an odd number of
groups, then the middle group has a coefficient of zero; for example, with K =
2 (placebo, low dose, and high dose), $c_0 = -1$, $c_1 = 0$, $c_2 = 1$, so the middle
group contributes nothing to the contrast. This is not ideal,
and there are better trend tests than JT and T for continuous data.
To use the actual dose values (denoted as $d_0, d_1, \ldots, d_K$) in the parametric
test, set $c_k = d_k - \text{mean}(d_0, d_1, \ldots, d_K)$, $k = 0, 1, \ldots, K$.
The JT trend test works well for binary and ordinal data, as well as being
available for continuous data.
Another trend test for binary data is the Cochran-Armitage (CA) trend test.
The difference between the JT and CA trend tests is that for the latter test, the
actual dose levels can be specified. In other words, instead of designating the
dose levels as low, mid, or high, the actual numerical dose levels can be used
in the CA trend test, such as 20 mg, 60 mg, and 180 mg.
The CA trend test, however, can yield unusual results if there is unequal
spacing among the dose levels. If the dose levels are equally spaced and the
sample sizes are equal (n0 = n1 = ... = nK), then the JT and CA trend tests yield
exactly the same results. Each of these parameters needs to be taken into
account to make sure you are applying the best test for your data.
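Both trend tests are available in PROC FREQ. In this hypothetical sketch, dose is a numeric variable, so the CA test uses the actual dose spacings as scores:

   proc freq data=doseresp;
      tables dose*response / jt;      /* Jonckheere-Terpstra trend test */
      tables dose*response / trend;   /* Cochran-Armitage; response must be binary */
   run;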
In order to construct the Kaplan-Meier survival curve, the actual failure times
need to be ordered from smallest to largest. In a sample of n patients,
denote the distinct times of failure as $t_1, t_2, \ldots, t_K$. For convenience, let $t_0 = 0$ denote
the start time and let $t_{K+1} = \infty$.
At the kth failure time, $t_k$, the number of failures, $d_k$, is noted, as well as the
number of patients who were at risk for failure immediately prior to $t_k$, denoted $n_k$.
Notice that patients who are lost to follow-up (censored) prior to time $t_k$ are not
included in $n_k$.
The algebraic formula for the Kaplan-Meier survival probability at time t is:

$\hat{S}(t) = 1, \quad t_0 \le t < t_1$

$\hat{S}(t) = \prod_{k'=1}^{k}\left(1 - \frac{d_{k'}}{n_{k'}}\right), \quad t_k \le t < t_{k+1}, \quad k = 1, 2, \ldots, K$
The calculation of $\hat{S}(t)$ utilizes conditional probability: each factor is the
probability of surviving beyond a failure time, given that the patient has survived
up to that time. $\hat{S}(t)$ estimates the probability of surviving beyond time t.
[Table (not reproduced): columns k, $t_k$ (days), $d_k$ (events), $n_k$ (at risk), and $\hat{S}(t_k)$]
Note that the probability estimate does not change until a failure event occurs.
Also, censored values do not affect the numerator, but do affect the
denominator. Thus, the Kaplan-Meier survival curve gives the appearance of a
step function when graphed.
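In SAS, the Kaplan-Meier curve is produced by PROC LIFETEST. The data set and variable names below are hypothetical; censor = 1 is assumed to flag a censored time.

   proc lifetest data=surv method=km plots=survival;
      time days*censor(1);    /* days = follow-up time; censor(1) = censored */
   run;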
The sample mean (sample standard deviation) is suitable if the data are
normally distributed or symmetric without heavy tails. The sample median
(sample inter-quartile range) is suitable for symmetric or asymmetric data. The
sample geometric mean (sample coefficient of variation) is suitable when the
data are log-normally distributed.
Usually two-sample t tests or Wilcoxon rank tests are applied to compare the
two randomized groups. In some instances, baseline measurements (prior to
randomized treatment assignment) of the primary endpoints are taken.
Suppose $Y_{i1}$ and $Y_{i2}$ denote the baseline and final measurements of the
endpoint, respectively, for the ith subject, i = 1, 2, ..., n. Instead of statistically
analyzing the $Y_{i2}$'s, there could be an increase in precision by analyzing the
change (or gain) in the response, namely, the differences $Y_{i2} - Y_{i1}$.
Suppose that the variance of each $Y_{i1}$ and $Y_{i2}$ is $\sigma^2$ and that the correlation
between $Y_{i1}$ and $Y_{i2}$ is $\rho$ (we assume that subjects are independent of each
other but that the pair of measurements within each subject is correlated).
This leads to

$\text{Var}\left(Y_{i2} - Y_{i1}\right) = \text{Var}\left(Y_{i2}\right) + \text{Var}\left(Y_{i1}\right) - 2\text{Cov}\left(Y_{i2}, Y_{i1}\right) = 2\sigma^2(1 - \rho)$
Therefore, if $\rho > 1/2$, which often is the case for repeated measurements within patients,
then $\text{Var}(Y_{i2} - Y_{i1}) < \text{Var}(Y_{i2})$. Thus, there may be more precision if
the differences $Y_{i2} - Y_{i1}$ are analyzed instead of the $Y_{i2}$ values. This happens all the time;
using the patient as his or her own control is a good thing. We are interested in the
changes that occur, so we subtract each patient's baseline measurement from the
treatment period measurement. A two-sample t test or
the Wilcoxon rank sum test can be applied to the change-from-baseline
measurements if the CTE trial consists of two randomized groups, such as
placebo and an experimental therapy.
An alternative is an analysis of covariance (ANCOVA) model with the baseline value as a covariate:

$E\left(Y_{i2}\right) = \mu_P + T_i\left(\mu_E - \mu_P\right) + \beta Y_{i1}$

where $\mu_P$ is the population mean for the placebo group, $\mu_E$ is the population
mean for the experimental treatment group, $T_i$ = 0 if the ith patient is in the
placebo group and 1 if in the experimental treatment group, and $\beta$ is the slope
for the baseline measurement.
The only difference between the two approaches is that in the change-from-
baseline analysis, $\beta$ is set equal to 1.0, whereas in the ANCOVA approach, $\beta$ is
estimated from the data and may differ from 1.0. Thus, the ANCOVA approach is
more flexible and can yield slightly more statistical power and efficiency.
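In SAS, the ANCOVA approach corresponds to a PROC GLM model with the baseline value as a covariate. The data set and variable names are hypothetical (y1 = baseline, y2 = final):

   proc glm data=trial;
      class trt;
      model y2 = trt y1 / solution;  /* the slope for y1 is estimated, not fixed at 1 */
      lsmeans trt;                   /* baseline-adjusted treatment means */
   run;
   quit;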
The assumptions for the logrank test are that (1) the censoring patterns are
the same for the two treatment groups, and (2) the hazard functions for the
two treatment groups are proportional.
For each of the K distinct failure times across the two randomized groups, at
times $t_1, t_2, \ldots, t_K$, a 2 × 2 table is constructed. For failure time $t_k$, k = 1, 2,
..., K, the table counts the numbers of failures ($d_{Pk}$, $d_{Ek}$) and the numbers at
risk ($n_{Pk}$, $n_{Ek}$) in the placebo and experimental groups.
The logrank statistic constructs an observed-minus-expected score, under the
assumption that the null hypothesis of equal event rates is true, for each of the
K tables and then sums over all tables:

$O - E = \sum_{k=1}^{K}\left(\frac{n_{Pk}d_{Ek} - n_{Ek}d_{Pk}}{n_{Pk} + n_{Ek}}\right)$

$V_L = \text{Var}(O - E) = \sum_{k=1}^{K}\frac{\left(d_{Pk} + d_{Ek}\right)\left(n_{Pk} + n_{Ek} - d_{Pk} - d_{Ek}\right) n_{Pk} n_{Ek}}{\left(n_{Pk} + n_{Ek} - 1\right)\left(n_{Pk} + n_{Ek}\right)^2}$

$Z_L = (O - E)\Big/\sqrt{V_L}$
The first step in constructing the generalized Wilcoxon statistic is to pool the
two samples of survival times (including censored values) and order them
from lowest to highest. For the ith observation in the ordered sample with
survival (or censored) time ti, construct a score, Ui, which represents the
number of survival (or censored) times less than ti minus the number of
survival (or censored) times greater than ti. The Ui are summed over the
experimental treatment group and a variance calculated, i.e.,
$U = \sum_{i=1}^{n_E} U_i \quad\text{and}\quad V_U = \text{Var}(U) = \left(\frac{n_P\, n_E}{\left(n_P + n_E\right)\left(n_P + n_E - 1\right)}\right)\sum_{i=1}^{n_P + n_E} U_i^2$

such that:

$Z_U = U\Big/\sqrt{V_U}$
ti    Group      # times < ti   # times > ti   Ui
10    Placebo    1              6              -5
12    Exp Treat  2              4              -2
17    Placebo    3              2               1
21    Placebo    4              1               3
25+   Placebo    5              0               5
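In practice, neither statistic is computed by hand; PROC LIFETEST reports both when a STRATA statement is used. A hypothetical sketch:

   proc lifetest data=surv;
      time days*censor(1);
      strata group;      /* prints the log-rank and (Gehan) Wilcoxon tests */
   run;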
Although p-values are useful for hypothesis tests that are specified a priori,
they provide poor summaries of clinical effects. In particular, they do not
convey the magnitude of a clinical effect. The size of a p-value depends on
the magnitude of the estimated treatment effect and its estimated variability
(also a function of sample size). Thus, the p-value partially reflects the size of
the trial, which has no biological interpretation. In addition, the p-value can
mask the magnitude of the treatment effect, which does have biological
importance. P-values only quantify the type I error and do not characterize the
biologically important effects in the trial. Thus, p-values should not be used to
describe the strength of evidence in a trial. Investigators have to look at the
magnitude of the treatment effect.
Original sample: 17, 25, 16, 32, 27, 19, 25, 23, 22, 30

Thus, for b = 1, ..., B, the bootstrap sample $Y_{b1}, Y_{b2}, \ldots, Y_{bN}$ is constructed by
sampling with replacement from the original sample, and the sample median
within the bth bootstrap sample, $\hat{\theta}_b$, is computed. From the B estimates of
the median we construct the estimated variance as:

$S^2 = \frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\theta}_b - \bar{\theta}\right)^2, \quad\text{where}\quad \bar{\theta} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}_b$
This gives a sense of how the medians vary over the bootstrap samples (100
samples in this illustration). The variance estimate can then be used to construct a Z
statistic for hypothesis testing, i.e.,

$Z = \left(\hat{\theta} - \theta\right)\Big/S$
Some statisticians at first were leery of this approach, which essentially uses
one sample to create many other samples, i.e., "pulling oneself
up by one's bootstraps." Over time, however, the bootstrap has been shown
to have sound statistical properties. A disadvantage of the approach is
that the random selection with replacement can result in slight
variations in results from one run to another, whereas the FDA, for instance, prefers
definitive results. Fundamentally, it is simply a non-parametric approach for estimating
the variance of a sample statistic.
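A bootstrap of the sample median above can be sketched in a few SAS steps; B = 1,000 resamples is an arbitrary choice.

   data boot;
      array y[10] _temporary_ (17 25 16 32 27 19 25 23 22 30);
      call streaminit(20177);
      do b = 1 to 1000;                        /* B bootstrap samples */
         do i = 1 to 10;                       /* resample n = 10 with replacement */
            yb = y[ceil(10 * rand('uniform'))];
            output;
         end;
      end;
   run;

   proc means data=boot noprint;
      by b;
      var yb;
      output out=medians median=medhat;        /* median of each resample */
   run;

   proc means data=medians mean std;           /* the std of medhat estimates S */
      var medhat;
   run;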
Data in sufficient quantity and detail can be made to yield some effect. A few
statistical sayings attest to this, such as "the data will confess to anything if
tortured enough." It has been well documented that increasing the number of
hypothesis tests inflates the Type I error rate. Exploratory analyses typically
fall into this category and the chances of finding statistically significant results,
when none truly exist, can be very high.
Subset analyses are a form of exploratory analyses that are very popular with
clinical trials data. For example, after performing the primary statistical
analyses, the investigators might decide to compare treatment groups within
certain subsets, such as male subjects, female subjects, minority subjects,
subjects over the age of 50, subjects with serum cholesterol above 220, etc.
Unless it is planned ahead of time, such analyses should remain exploratory.
11.9 - Summary
In this lesson, among other things, we learned:
Let's put what we have learned to use by completing the following homework
assignment:
Homework
One reason for studying prognostic factors is to learn the relative importance
of several variables that might affect, or be associated with, disease outcome.
A second reason for studying prognostic factors is to improve the design of
clinical trials. For example, if a prognostic factor is identified as strongly
predictive of disease outcome, then investigators of future clinical trials with
respect to that disease should consider using it as a stratifying variable.
The adjusted means for groups A and B, respectively, from the ANCOVA are:
12.2 - Interactions
It is important to examine treatment × covariate interactions. For example, it is
possible that the responses in the treatment groups differ for low levels of the
prognostic factor, but do not differ for high levels of the prognostic factor.
Interpretations of statistical results in the presence of treatment × covariate
interactions can become complex.
Now, in the figure above, the difference between Treatments A and B is not
constant and does depend on the value of the covariate.
Prognostic factors can be continuous measures (age, baseline cholesterol,
etc.), ordinal (age categories, baseline disease severity, etc.), binary (gender,
previous tobacco use, etc.), or categorical (ethnic group, geographic region,
institution in multi-center trials, etc.).
Most prognostic factors are measured at baseline and do not change over
time, hence they are called time-independent covariates. Time-independent
covariates are easily included in many types of statistical models.
On the other hand, some prognostic factors do change over time; these are
called time-dependent covariates. For example, consider a clinical trial in
diabetics with selected dietary intake variables measured on a regular basis
over the course of a six-month clinical trial. Dietary intake could affect the
severity of disease and exacerbate problems for diabetic patients, so a
statistical analysis of the trial might incorporate these prognostic factors as
time-dependent covariates. We have to be very careful when using time-
dependent covariates.
The schematic below represents the situation where the treatment is affecting
both the covariates and the outcome. The covariates also affect the outcome.
Adjusting for the effect of the covariate over time may account for the majority
of the treatment effect on the outcome.
Dr. George Box once stated: "All models are wrong, but some are useful."
An example is the linear model that is used for multiple regression. Let Y
denote the outcome variable and $X_1, X_2, \ldots, X_K$ denote K different regressors
(predictors) that are measured on each of n patients. Then the statistical
model for patient i, i = 1, 2, ..., n, is

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki} + \epsilon_i$
The one-way ANOVA model is

$Y_{ij} = \mu_i + \epsilon_{ij}$

and an ANCOVA model with three covariates is

$Y_{ij} = \mu_i + \beta_1 X_{1ij} + \beta_2 X_{2ij} + \beta_3 X_{3ij} + \epsilon_{ij}$
where the notation is similar to that for the one-way ANOVA with K treatment
groups, and X1ij, X2ij, X3ij denote the values of the three covariates for the
jth patient within the ith treatment group.
For example, suppose that there are four centers in a multi-center trial and
that it is desirable to model center effects. The above ANCOVA model can
be invoked with center #4 as the reference level, with $X_{1ij}$, $X_{2ij}$, and $X_{3ij}$ as
0-1 indicator variables for centers #1, #2, and #3, respectively.
Statistical software packages for multiple regression typically require the user
to recode categorical regressors/covariates in this manner (SAS PROC REG),
whereas the statistical software packages for ANOVA and ANCOVA can
recode categorical regressors/covariates for the user (the CLASS statement in
SAS PROC ANOVA and SAS PROC GLM).
If the investigator focuses on a specific region of values for the covariate, such
as high baseline cholesterol, then it may be possible to determine which
treatment is superior in this region.
Sometimes it is possible to make general conclusions if the interactions are
due to the magnitude of the effect.
Profile plots (mean outcome response versus the covariate) for each
treatment group will indicate graphically whether the interactions are
qualitative (profiles that are not parallel and cross).
2. counseling;
3. control.
SAS Example (13.2_ANCOVA.sas): The longitudinal data from the one-
year follow-up are provided but not analyzed (beyond the scope of this
course).
Here are a couple of places at the end of the program that you will want to
make note of:
Now, run the program and look at the output. Notice "66 observations read, 47
used." This means 19 patients are missing data for a term in the model, so
SAS cannot use their data. In this case, 19 are missing bilirubin4
measurements.
Considering the Type III SS in the output, do you see that baseline bilirubin
(bilirubin0) was an important covariate? The treatment groups do not differ
significantly. Nor was significant interaction observed between center and
treatment or baseline bilirubin and treatment.
$\log\left(\frac{p\left(X_{1i}, X_{2i}, \ldots, X_{Ki}\right)}{1 - p\left(X_{1i}, X_{2i}, \ldots, X_{Ki}\right)}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki}$
Notice that $\beta_0$ represents the reference log odds, i.e., when $X_{1i} = 0, X_{2i} = 0, \ldots,
X_{Ki} = 0$. Consider a simple model with one covariate (K = 1) which is binary,
e.g., $X_{1i}$ = 0 if the ith patient is in the placebo group and 1 if the ith patient is in
the treatment group. Then the log odds ratio for comparing the treatment to
the placebo group is

$\log\left(\frac{p(X_{1i}=1)}{1 - p(X_{1i}=1)} \Big/ \frac{p(X_{1i}=0)}{1 - p(X_{1i}=0)}\right) = \left(\beta_0 + \beta_1\right) - \beta_0 = \beta_1$
More generally, for a continuous covariate, the log odds ratio comparing $X_{1i} = x$ to $X_{1i} = 0$ is

$\log\left(\frac{p(X_{1i}=x)}{1 - p(X_{1i}=x)} \Big/ \frac{p(X_{1i}=0)}{1 - p(X_{1i}=0)}\right) = \left(\beta_0 + \beta_1 x\right) - \beta_0 = \beta_1 x$

so that the odds ratio is $\exp(\beta_1 x)$. This illustrates that changes in a covariate
have a multiplicative effect on the baseline risk.
For example, suppose x represents (age - 18) in a study of adults, and that
the estimated coefficient is $\hat{\beta}_1$ = 0.04 with a p-value < 0.05. Then the
estimated odds ratio is exp(0.04) = 1.041. This may not seem like a clinically
meaningful odds ratio, but remember that it represents the increase in odds
between a 19-year-old and an 18-year-old. For a 25-year-old person, the
estimated odds ratio relative to an 18-year-old is exp(0.04 × 7) = 1.323.
For the logistic regression model, each $\beta_j$, j = 1, 2, ..., K, represents the log
odds ratio for the jth covariate. An equivalent expression for the logistic
regression model in terms of the probability is

$p\left(X_{1i}, X_{2i}, \ldots, X_{Ki}\right) = \frac{1}{1 + \exp\left\{-\left(\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki}\right)\right\}}$
For an ordinal outcome y, the model is

$\log\left(\frac{\Pr[y \le c \,|\, X_{1i}, X_{2i}, \ldots, X_{Ki}]}{1 - \Pr[y \le c \,|\, X_{1i}, X_{2i}, \ldots, X_{Ki}]}\right) = \beta_{0c} + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki}, \quad c = 1, 2, \ldots, C$
The ordinal logistic regression model has C intercept terms, but only one term
for each regressor. This reduced modeling for an ordinal outcome assumes
proportional odds (beyond the scope of this course).
The proportional hazards regression model states that the log of the ratio of the
hazard function to the baseline hazard function at time t is a linear combination of
parameters and regressors, i.e.,

$\log\left(\frac{\lambda\left(t \,|\, X_{1i}, X_{2i}, \ldots, X_{Ki}\right)}{\lambda_0(t)}\right) = \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki}$

or, equivalently,

$\lambda\left(t \,|\, X_{1i}, X_{2i}, \ldots, X_{Ki}\right) = \lambda_0(t)\exp\left(\beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki}\right)$
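The proportional hazards model is fit in SAS with PROC PHREG; the RISKLIMITS option prints the hazard ratios exp(beta) with confidence limits. Names below are hypothetical:

   proc phreg data=surv;
      model days*censor(1) = trt age / risklimits;
   run;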
4. The software may not handle the problem of missing data very well.
Approaches
When a variable is entered into or removed from a model, the p-values of the
other variables will change. Consider a linear model with two potential
regressors, X1 and X2, and suppose that they are strongly correlated
(independent variables is a misnomer). Suppose that in a model
with X1 only, X1 is significant, and in a model with X2 only, X2 is significant.
When a model is constructed with both X1 and X2, however, the contribution
by X2 to the model is no longer statistically significant. Because X1 and X2 are
strongly correlated, X2 has very little predictive power when X1 already is in the
model.
12.8 - Example
SAS Example (13.5_ph regression.sas): A safety and efficacy study was
conducted in 83 patients with malignant mesothelioma, an uncommon lung
cancer that is strongly associated with asbestos exposure. Patients underwent
one of three types of surgery, namely, biopsy, limited resection, and
extrapleural pneumonectomy (EPP). Treatment assignment was
nonrandomized and based on the extent of disease at the time of diagnosis.
Thus, there can be a strong procedure selection bias.
Run the program. Do you agree that histologic subtype is the only statistically
significant covariate (p = 0.025) ?
Hopefully, missing data among the regressors/covariates are not related to the
outcome. If this is not the case, then it may not be possible to develop a
model that is unbiased. For example, if the patients with the most severe form
of the disease are the ones with missing values for the regressors/covariates,
then the resultant model that does not include these patients will be biased.
As has been discussed earlier, data imputation is one way to handle the
situation of missing values. Data imputation involves estimating the
missing values in a consistent manner and then substituting the
estimated values for the missing values. Thus, every subject will have a
complete set of regressors/covariates, and the statistical analysis can proceed
without eliminating any subjects.
Make every effort to collect complete data to avoid such problems. When data
are missing, be certain to report the numbers of patients used in each analysis
and any methods used to impute missing values.
The estimation data set is used to build the model and, hence, estimate the
parameters. The validation data set is used to validate the model by inserting
a patient's set of observed regressors into the estimated model equation and
predicting the outcome response for that subject.
If the predicted outcome is relatively close to the observed outcome for the
subjects in the validation data set, then the model is considered valid.
Another approach is called the leave-one-out method and consists of
eliminating the first patient from the data set with n subjects, estimating the
model equation based on the remaining n - 1 patients, calculating the
predicted outcome for the first patient, and then comparing the first patient's predicted and observed outcomes.
This process is performed for each of the n patients and an overall validation
statistic is constructed.
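For an ordinary linear model there is a shortcut: the leave-one-out prediction errors (PRESS residuals) can be obtained without refitting the model n times. A sketch with hypothetical data set and variable names:

  proc reg data=trial;
    model outcome = age severity;
    output out=loo press=press_r;   /* press_r = leave-one-out prediction error */
  run;
  quit;

  proc means data=loo n uss;
    var press_r;   /* uncorrected sum of squares = overall PRESS statistic */
  run;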
These validation procedures work fine for nonrandomized studies, but for
randomized clinical trials, they probably should be applied only to secondary
and exploratory statistical analyses.
This may be true, but the use of prognostic factors in ANCOVA models can
improve precision and verify biological information.
Let's put what we have learned to use by completing the following homework
assignment:
Homework
Look for homework assignment and the dropbox in the folder for this week in
ANGEL.
The U.S. NIH policy is that results of its funded research should be available
to the public.
Many medical journals now have policies of only publishing a manuscript from
a completed clinical trial if the trial has been registered at ClinicalTrials.gov.
Some journals require structured titles and abstracts because they are the
only part of many reports that some readers examine. Therefore, the abstract
becomes very important - in the medical literature the abstract is critical. A
good abstract for the report of a clinical trial includes objectives, design,
setting or types of practices, characteristics of the study population,
interventions used, primary outcome measurements, principal results, and
conclusions. Abstracts should be no longer than 250 words and usually do not
include descriptions of the statistical methods.
With respect to the latter objective, the report should recognize the potential
for strong selection bias and avoid overly-enthusiastic statements about
relative efficacy. The reports for safety and efficacy studies should consist of
the following outline:
introduction
objectives
study design
study setting
demographics
treatments
outcome measures
statistical methods
results
The motivation and assumptions for the target sample size should be
included, especially in situations where the primary results are negative
findings. The impact of various prognostic variables should be addressed with
appropriate statistical analyses to demonstrate that treatment effects are not
due entirely to them. Although the intent-to-treat principle should be followed
in randomized trials, it is helpful to report on the results of various exploratory
analyses as well.
Since the methods have direct implications for the validity of the results, top-tier journals require thorough descriptions. They also expect supplemental
reports. When the article is available online, there can be links to more
detailed descriptions, figures, graphs and tables.
There are several extensions of the original CONSORT statement which can
also be examined at the CONSORT website [14], focusing on the reporting of patient safety, equivalence and non-inferiority trials, cluster trials, and other topics.
Ioannidis JPA, Evans SJW, Gøtzsche PC, O'Neill RT, Altman DG, Schulz K, Moher D for the CONSORT Group. Better reporting of harms in randomized trials: An extension of the CONSORT statement. [15] Annals of Internal Medicine 2004; 141: 781-788.
Piaggio, G., Elbourne, D., Altman, D., Pocock, S., Evans, S. for the CONSORT Group. Reporting of Noninferiority and Equivalence Randomized Trials: An Extension of the CONSORT statement [16]. JAMA 2006; 295: 1152-1160.
Conflict of Interest
Major medical journals require that manuscript authors report any financial
support for the research presented in the article, and complete a form
describing their conflicts of interest. The information on the financial support
and conflicts usually appears at the end of the article, prior to the references.
Example: In the VALIANT trial (NEJM 2003, in Wk 5 course material), the authors state (1) that the study was "Supported by a grant from Novartis Pharmaceuticals," and (2) that some of them received financial payments from Novartis for serving as consultants, and some of them hold stock equity in Novartis.
Anyone who reads the article should examine the statements about financial support and conflicts of interest in order to judge whether the article may present a biased viewpoint.
The International Committee of Medical Journal Editors (ICMJE
http://www.icmje.org/ [17]) has developed a standardized form for authors to
provide information about their financial interests that could influence how
their work is viewed. The form is designed to be completed electronically and
stored electronically. It contains programming that allows appropriate data
display. Each author listed on the manuscript should submit a separate form
and is responsible for the accuracy and completeness of the information. The
disclosure form is a fillable pdf file
(http://www.icmje.org/coi_disclosure.pdf [18]). The complete list of journals that
require completion of the ICMJE form appears
at http://www.icmje.org/journals.html [19]
13.5 - Summary
In this lesson, among other things, we learned to:
Let's put what we have learned to use by completing the following homework
assignment:
Homework
Factorial clinical trials are experiments that test the effect of more than one
treatment using a type of design that permits an assessment of potential
interactions among the treatments.
In a factorial design there are two or more factors with multiple levels that are crossed. For example, three dose levels of drug A and two levels of drug B can be crossed to yield a total of six treatment combinations. The simplest factorial design is the 2 × 2, in which each of two treatments is either active or placebo, yielding four combinations:
Placebo A + Placebo B
Placebo A + Active B
Active A + Placebo B
Active A + Active B
For example, here you could have a placebo for each treatment: perhaps a placebo injection for A and a placebo pill for B. Such a design allows the comparison of the levels of factor A (A main effects), the comparison of the levels of factor B (B main effects), and the investigation of A × B interactions.
14.2 - Interactions
Factorial designs provide the only way to study interactions between
treatment A and treatment B. This is because the design has treatment groups
with all possible combinations of treatments.
Placebo A + Active B
Active A + Placebo B
Active A + Active B
Notice that the Placebo A + Placebo B group is not included in the design,
hence the incompleteness. The incomplete factorial design has become
popular.
Why?
Notice that the null hypothesis indicates that the AB combination therapy is
not better than at least one of the monotherapies, whereas the alternative
indicates that the AB combination is better than the A monotherapy and the B
monotherapy.
How do we do this?
The appropriate test statistic to use for this situation is called the min test. If
the data are normally distributed, construct two two-sample t statistics, one
comparing the AB combination therapy to the A monotherapy (call it tA) and
the other comparing the AB combination therapy to the B monotherapy (call
it tB).
t_A = \frac{\bar{Y}_{AB} - \bar{Y}_A}{s\sqrt{\frac{1}{n_{AB}} + \frac{1}{n_A}}}, \qquad t_B = \frac{\bar{Y}_{AB} - \bar{Y}_B}{s\sqrt{\frac{1}{n_{AB}} + \frac{1}{n_B}}}

where

\bar{Y}_A = \frac{1}{n_A}\sum_{i=1}^{n_A} Y_{A,i}, \quad \bar{Y}_B = \frac{1}{n_B}\sum_{i=1}^{n_B} Y_{B,i}, \quad \bar{Y}_{AB} = \frac{1}{n_{AB}}\sum_{i=1}^{n_{AB}} Y_{AB,i}

and

s^2 = \frac{1}{n_A + n_B + n_{AB} - 3}\left(\sum_{i=1}^{n_A}\left(Y_{A,i} - \bar{Y}_A\right)^2 + \sum_{i=1}^{n_B}\left(Y_{B,i} - \bar{Y}_B\right)^2 + \sum_{i=1}^{n_{AB}}\left(Y_{AB,i} - \bar{Y}_{AB}\right)^2\right)

The null hypothesis is rejected at significance level α if

\min(t_A, t_B) > t_{n_A + n_B + n_{AB} - 3, \, 1-\alpha}

For example, suppose

\bar{Y}_A = 20, \quad \bar{Y}_B = 21, \quad \bar{Y}_{AB} = 24, \quad n_A = n_B = n_{AB} = 50, \quad s = 10
Then t_A = 2, t_B = 1.5, and min(t_A, t_B) = 1.5, which is not greater than t_{147, 0.95} = 1.66. Thus, the null hypothesis cannot be rejected at the 0.05 significance level, i.e., the AB combination is not significantly better than both the A monotherapy and the B monotherapy. It is close, but there is not enough statistical evidence to show a significant difference.
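The arithmetic is easy to verify with a short SAS data step; this simply recomputes the worked example above.

  data mintest;
    ybar_a = 20;  ybar_b = 21;  ybar_ab = 24;
    n_a = 50;  n_b = 50;  n_ab = 50;  s = 10;
    t_a = (ybar_ab - ybar_a) / (s * sqrt(1/n_ab + 1/n_a));  /* = 2.0 */
    t_b = (ybar_ab - ybar_b) / (s * sqrt(1/n_ab + 1/n_b));  /* = 1.5 */
    t_min  = min(t_a, t_b);
    df     = n_a + n_b + n_ab - 3;                          /* = 147 */
    t_crit = tinv(0.95, df);                                /* = 1.66 */
    reject = (t_min > t_crit);                              /* = 0, do not reject */
  run;

  proc print data=mintest; run;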
14.4 - Summary
In this lesson, among other things, we learned to:
recognize the situation for which a min test is the appropriate analysis
Let's put what we have learned to use by completing the following homework
assignment:
Homework
Look for homework assignment and the dropbox in the folder for this week in
ANGEL.
The reason to consider a crossover design when planning a clinical trial is that
it could yield a more efficient comparison of treatments than a parallel design,
i.e., fewer patients might be required in the crossover design in order to attain
the same level of statistical power or precision as a parallel design. (This
become more evident later in this lesson...) Intuitively, this seems reasonable
because each patient serves as his/her own matched control. Every patient
receives both treatment A and B. Crossover designs are popular in medicine,
agriculture, manufacturing, education, and many other disciplines. A
comparison is made of the subject's response on A vs. B.
Reference
Piantadosi Steven. (2005) Crossover Designs. In: Piantadosi Steven. Clinical
Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ: John Wiley and
Sons, Inc.
The sequences should be determined a priori and the experimental units are
randomized to sequences. The most popular crossover design is the 2-
sequence, 2-period, 2-treatment crossover design, with sequences AB and
BA, sometimes called the 2 × 2 crossover design.
[Design 1] The 2 × 2 crossover design:

Sequence    Period 1    Period 2
AB          A           B
BA          B           A

Two-treatment, three-period designs include [Design 2]:

Sequence    Period 1    Period 2    Period 3
ABB         A           B           B
BAA         B           A           A

and [Design 3]:

Sequence    Period 1    Period 2    Period 3
ABA         A           B           A
BAB         B           A           B

Three-treatment, three-period designs include [Design 4]:

Sequence    Period 1    Period 2    Period 3
ABC         A           B           C
BCA         B           C           A
CAB         C           A           B

and [Design 5]:

Sequence    Period 1    Period 2    Period 3
ABC         A           B           C
BCA         B           C           A
CAB         C           A           B
ACB         A           C           B
BAC         B           A           C
CBA         C           B           A

[Design 6] Balaam's design:

Sequence    Period 1    Period 2
AB          A           B
BA          B           A
AA          A           A
BB          B           B
Balaam's design is unusual, with elements of both parallel and crossover
design. There are advantages and disadvantages to all of these designs; we
will discuss some and the implications for statistical analysis as we continue
through this lesson.
15.2 - Disadvantages
The main disadvantage of a crossover design is that carryover effects may be
aliased (confounded) with direct treatment effects, in the sense that these
effects cannot be estimated separately. You think you are estimating the effect
of treatment A but there is also a bias from the previous treatment to account
for. Significant carryover effects can bias the interpretation of data analysis, so
an investigator should proceed cautiously whenever he/she is considering the
implementation of a crossover design.
A carryover effect is defined as the effect of the treatment from the previous
time period on the response at the current time period. In other words, if a
patient receives treatment A during the first period and treatment B during the
second period, then measurements taken during the second period could be a
result of the direct effect of treatment B administered during the second
period, and/or the carryover or residual effect of treatment A administered
during the first period. These carryover effects yield statistical bias.
The rationale for this is that the previously administered treatment is washed
out of the patient and, therefore, it cannot affect the measurements taken
during the current period. This may be true, but it is possible that the
previously administered treatment may have altered the patient in some
manner, so that the patient will react differently to any treatment administered
from that time onward. An example is when a pharmaceutical treatment
causes permanent liver damage so that the patients metabolize future drugs
differently. Another example occurs if the treatments are different types of
educational tests. Then subjects may be affected permanently by what they
learned during the first period.
How long should the washout period be?
Actually, it is not the presence of carryover effects per se that leads to aliasing
with direct treatment effects in the AB|BA crossover, but rather the presence of
differential carryover effects, i.e., the carryover effect due to treatment A differs
from the carryover effect due to treatment B. If the carryover effects for A and
B are equivalent in the AB|BA crossover design, then this common carryover
effect is not aliased with the treatment difference. So, for crossover designs,
when the carryover effects are different from one another, this presents us
with a significant problem.
For example, one approach for the statistical analysis of the 2 × 2 crossover is to conduct a preliminary test for differential carryover effects. If this test is significant, then only the data from the first period are analyzed, because the first period is free of carryover effects. Essentially, you would be throwing out half of your data!
If the preliminary test for differential carryover is not significant, then the data
from both periods are analyzed in the usual manner. Recent work, however,
has revealed that this 2-stage analysis performs poorly because the
unconditional Type I error rate operates at a much higher level than desired.
We won't go into the specific details here, but part of the reason for this is that
the test for differential carryover and the test for treatment differences in the
first period are highly correlated and do not act independently.
Even worse, this two-stage approach could lead to losing one-half of the data.
If differential carryover effects are of concern, then a better approach would be
to use a study design that can account for them.
Within time period j, j = 2, ... , p, it is possible that there are carryover effects
from treatments administered during periods 1, ... , j - 1. Usually in period j we
only consider first-order carryover effects (from period j - 1) because:
Uniformity
For example, AB|BA is uniform within sequences and within periods (each sequence and each period contains one A and one B), while ABA|BAB is uniform within periods but is not uniform within sequences, because the sequences differ in their numbers of A's and B's.
Latin Squares
[Design 7]

Sequence    Period 1    Period 2    Period 3    Period 4
ABCD        A           B           C           D
BCDA        B           C           D           A
CDAB        C           D           A           B
DABC        D           A           B           C

and [Design 8]

Sequence    Period 1    Period 2    Period 3    Period 4
ABCD        A           B           C           D
BDAC        B           D           A           C
CADB        C           A           D           B
DCBA        D           C           B           A
Latin squares are uniform crossover designs, uniform both within periods and
within sequences. Although with 4 periods and 4 treatments there are 4! = (4)
(3)(2)(1) = 24 possible sequences from which to choose, the Latin square only
requires 4 sequences.
Balanced Designs
The Latin square in [Design 8] has an additional property that the Latin square
in [Design 7] does not have. Each treatment precedes every other treatment
the same number of times (once). For example, how many times is treatment A followed by treatment B? Only once. How many times is treatment B followed by treatment A? Only once. This is an advantageous property of [Design 8]. This same property does not occur in
[Design 7]. When this occurs, as in [Design 8], the crossover design is said to
be balanced with respect to first-order carryover effects.
Look back through each of the designs that we have looked at thus far and
determine whether or not it is balanced with respect to first-order carryover
effects.
Here is an example, formed by replicating the last period of the balanced Latin square in [Design 8]:

Sequence    Period 1    Period 2    Period 3    Period 4    Period 5
ABCDD       A           B           C           D           D
BDACC       B           D           A           C           C
CADBB       C           A           D           B           B
DCBAA       D           C           B           A           A
Latin squares yield uniform crossover designs, but strongly balanced designs
constructed by replicating the last period of a balanced design are not uniform
crossover designs. The following 4-sequence, 4-period, 2-treatment crossover
design is an example of a strongly balanced and uniform design.
Sequence    Period 1    Period 2    Period 3    Period 4
ABBA        A           B           B           A
BAAB        B           A           A           B
AABB        A           A           B           B
BBAA        B           B           A           A
15.4 - Statistical Bias
Why are these properties important in statistical analysis?
The approach is very simple in that the expected value of each cell in the
crossover design is expressed in terms of a direct treatment effect and the
assumed nuisance effects. Then these expected values are averaged and/or
differenced to construct the desired effects.
The expected values for the 2 × 2 crossover design are:

Sequence    Period 1        Period 2
AB          μA + ν + ρ      μB + ν − ρ + λA
BA          μB − ν + ρ      μA − ν − ρ + λB

(Here μA and μB are the direct treatment effects, ν is the sequence effect, ρ is the period effect, and λA, λB are the first-order carryover effects.)

A natural choice of an estimate of μA (or μB) is simply the average over all cells where treatment A (or B) is assigned: [12]

\hat{\mu}_A = \tfrac{1}{2}\left(\bar{Y}_{AB,1} + \bar{Y}_{BA,2}\right), \qquad \hat{\mu}_B = \tfrac{1}{2}\left(\bar{Y}_{AB,2} + \bar{Y}_{BA,1}\right)

where \bar{Y}_{AB,1} denotes the observed cell mean for sequence AB in period 1, etc.

Will this give us a good estimate of the means across the treatments? Not quite...

The mathematical expectations of these estimates are as follows: [13]

E(\hat{\mu}_A) = \tfrac{1}{2}\left[(\mu_A + \nu + \rho) + (\mu_A - \nu - \rho + \lambda_B)\right] = \mu_A + \tfrac{1}{2}\lambda_B

E(\hat{\mu}_B) = \tfrac{1}{2}\left[(\mu_B + \nu - \rho + \lambda_A) + (\mu_B - \nu + \rho)\right] = \mu_B + \tfrac{1}{2}\lambda_A

E(\hat{\mu}_A - \hat{\mu}_B) = (\mu_A - \mu_B) - \tfrac{1}{2}(\lambda_A - \lambda_B)
From [13] it is observed that the direct treatment effects and the
treatment difference are not aliased with sequence or period effects, but are
aliased with the carryover effects.
The treatment difference, however, is not aliased with carryover effects when
the carryover effects are equal, i.e., λA = λB. The results in [13] are due to the
fact that the AB|BA crossover design is uniform and balanced with respect to
first-order carryover effects. Any crossover design which is uniform and
balanced with respect to first-order carryover effects, such as the designs in
[Design 5] and [Design 8], also exhibits these results.
Example
Consider the ABB|BAA design, which is uniform within periods, not uniform
with sequences, and is strongly balanced.
Sequence    Period 1        Period 2                Period 3
ABB         μA + ν + ρ1     μB + ν + ρ2 + λA        μB + ν − ρ1 − ρ2 + λB
BAA         μB − ν + ρ1     μA − ν + ρ2 + λB        μA − ν − ρ1 − ρ2 + λA
A natural choice of an estimate of μA (or μB) is simply the average over all cells where treatment A (or B) is assigned: [15]

\hat{\mu}_A = \tfrac{1}{3}\left(\bar{Y}_{ABB,1} + \bar{Y}_{BAA,2} + \bar{Y}_{BAA,3}\right), \qquad \hat{\mu}_B = \tfrac{1}{3}\left(\bar{Y}_{ABB,2} + \bar{Y}_{ABB,3} + \bar{Y}_{BAA,1}\right)

with expectations [16]

E(\hat{\mu}_A) = \mu_A - \tfrac{1}{3}\nu + \tfrac{1}{3}(\lambda_A + \lambda_B)

E(\hat{\mu}_B) = \mu_B + \tfrac{1}{3}\nu + \tfrac{1}{3}(\lambda_A + \lambda_B)

E(\hat{\mu}_A - \hat{\mu}_B) = (\mu_A - \mu_B) - \tfrac{2}{3}\nu

From [16], the direct treatment effects are aliased with the sequence effect and the carryover effects, whereas the treatment difference is aliased only with the sequence effect. The results in [16] are due to the ABB|BAA crossover design being uniform within periods and strongly balanced with respect to first-order carryover effects.
If second-order carryover effects also are present, the cell expectations become:

Sequence    Period 1        Period 2                Period 3
ABB         μA + ν + ρ1     μB + ν + ρ2 + λA        μB + ν − ρ1 − ρ2 + λB + λ2A
BAA         μB − ν + ρ1     μA − ν + ρ2 + λB        μA − ν − ρ1 − ρ2 + λA + λ2B

and the expectation of the estimated treatment difference is: [18]

E(\hat{\mu}_A - \hat{\mu}_B) = (\mu_A - \mu_B) - \tfrac{2}{3}\nu - \tfrac{1}{3}(\lambda_{2A} - \lambda_{2B})
The ensuing remarks summarize the impact of various design features on the
aliasing of direct treatment and nuisance effects.
Complex Carryover
For example, some researchers argue that sequence effects should be null or
negligible because they represent randomization effects. Another example
occurs in bioequivalence trials where some researchers argue that carryover
effects should be null. This is because blood concentration levels of the drug
or active ingredient are monitored and any residual drug administered from an
earlier period would be detected.
The following is a listing of various crossover designs with some, all, or none
of the properties.
It would be a good idea to go through each of these designs and diagram out what each would look like, as well as the degree to which it is uniform and/or balanced. Make sure you see how these principles come into play!
During the design phase of a trial, the question may arise as to which
crossover design provides the best precision. For our purposes, we label one
design as more precise than another if it yields a smaller variance for the
estimated treatment mean difference.
In order for the resources to be equitable across designs, we assume that the
total sample size, n, is a positive integer divisible by 4. Then:
1. n/2 patients will be randomized to each sequence in the AB|BA design
The following table provides expressions for the variance of the estimated
treatment mean difference for each of the two-period, two-treatment designs:
With respect to a sample size calculation, the total sample size, n, required for a two-sided, significance level α test with 100(1 − β)% statistical power and effect size μA − μB is:

n = \left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \sigma^2 / \left(\mu_A - \mu_B\right)^2

where σ² denotes the variance expression appropriate to the chosen design.
For example, suppose σAA = σBB = 100.
The sample sizes for the three different designs are as follows:
Parallel n = 190
Balaam n = 105
Crossover n = 21
The crossover design yields a much smaller sample size because the within-
patient variances are one-fourth that of the inter-patient variances (which is
not unusual).
Remember the statistical model we assumed for continuous data from the 2 × 2 crossover trial:

Sequence    Period 1        Period 2
AB          μA + ν + ρ      μB + ν − ρ + λA
BA          μB − ν + ρ      μA − ν − ρ + λB

For a patient in the AB sequence, the Period 1 vs. Period 2 difference has expectation μAB = μA − μB + 2ρ − λA.

For a patient in the BA sequence, the Period 1 vs. Period 2 difference has expectation μBA = μB − μA + 2ρ − λB.
Therefore, we construct these differences for every patient and compare the
two sequences with respect to these differences using a two-sample t test or a
Wilcoxon rank sum test. Thus, we are testing:
H0 : μAB − μBA = 0

The expression

μAB − μBA = 2(μA − μB) − (λA − λB)

reduces to 2(μA − μB) when the carryover effects are equal (λA = λB), so the comparison of the period differences between sequences estimates twice the treatment effect.
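As a sketch, suppose a hypothetical data set has one record per patient containing the sequence label and the two period responses. The analysis is then a two-sample comparison of the within-patient differences; half of the difference between the two sequence means estimates μA − μB (under equal carryover effects).

  data diffs;
    set crossover;    /* hypothetical: sequence ('AB'/'BA'), y1, y2 */
    d = y1 - y2;      /* Period 1 minus Period 2 for each patient */
  run;

  proc ttest data=diffs;
    class sequence;   /* compares the mean difference between AB and BA */
    var d;
  run;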
For binary outcome data from the 2 × 2 crossover, with p1· the probability of success on treatment A and p·1 the probability of success on treatment B, the hypothesis of no treatment difference is:

H0 : p1· − p·1 = 0
This indicates that only the patients who display a (1,0) or (0,1) response
contribute to the treatment comparison. For instance, if they failed on both, or
were successful on both, there is no way to determine which treatment is
better. Therefore we will let:
                Failure on B    Success on B
Failure on A    n00             n01
Success on A    n10             n11

denote the frequencies of responses from the study data instead of the probabilities listed above.
McNemar's test for this situation is as follows. Given the number of patients
who displayed a treatment preference, n10 + n01 , then n10 follows a
binomial(p, n10 + n01) distribution and the null hypothesis reduces to testing:
H0 : p = 0.5
i.e., we would expect a 50-50 split in the number of patients that would be
successful with either treatment in support of the null hypothesis, looking at
only the cells where there was success with one treatment and failure with the
other. The data in cells for both success or failure with both treatment would
be ignored.
                Failure on B    Success on B
Failure on A    21              15
Success on A    7               7
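These counts can be analyzed directly in SAS: the AGREE option of PROC FREQ produces McNemar's test, and EXACT MCNEM requests the exact version based on the n10 + n01 = 22 discordant patients.

  data pref;
    input a $ b $ count;   /* outcome on treatment A, outcome on treatment B */
    datalines;
  Fail Fail 21
  Fail Succ 15
  Succ Fail  7
  Succ Succ  7
  ;
  run;

  proc freq data=pref;
    weight count;
    tables a*b / agree;   /* McNemar's test for paired binary data */
    exact mcnem;          /* exact test: n10 ~ binomial(0.5, n10 + n01) */
  run;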
The Rationale: To account for the possible period effect in the 2 × 2 crossover trial, a term for period can be included in the logistic regression analysis.
The preference data, tabulated separately for the two sequence groups, are:

Sequence AB:
                Failure on B    Success on B
Failure on A    10              7
Success on A    3               5

Sequence BA:
                Failure on B    Success on B
Failure on A    11              8
Success on A    4               2
The logistic regression analysis yielded a nonsignificant result for the
treatment comparison (exact p = 0.2266). There is still no significant statistical
difference to report.
If the time to treatment failure on A is less than that on B, then the patient is
assigned a (0,1) score and prefers B.
If the time to treatment failure on B is less than that on A, then the patient is
assigned a (1,0) score and prefers A.
If the patient does not experience treatment failure on either treatment, then
the patient is assigned a (1,1) score and displays no preference.
Pharmaceutical scientists use crossover designs for such trials in order for
each trial participant to yield a profile for both formulations. The blood
concentration time profile is a multivariate response and is a surrogate
measure of therapeutic response. The pharmaceutical company does not
need to demonstrate the safety and efficacy of the drug because that already
has been established.
Are the reference and test blood concentration time profiles similar? The
test formulation could be toxic if it yields concentration levels higher than the
reference formulation. On the other hand, the test formulation could be
ineffective if it yields concentration levels lower than the reference formulation.
Typically, pharmaceutical scientists summarize the rate and extent of drug
absorption with summary measurements of the blood concentration time
profile, such as area under the curve (AUC), maximum concentration (CMAX),
etc. These summary measurements are subjected to statistical analysis (not
the profiles) and inferences are drawn as to whether or not the formulations
are bioequivalent.
Prescribability requires that the test and reference formulations are population
bioequivalent, whereas switchability requires that the test and reference
formulations have individual bioequivalence.
The FDA-recommended values are θ1 = 0.80 and θ2 = 1.25 (i.e., the ratios 4/5 and 5/4) for responses such as AUC and CMAX, which typically follow lognormal distributions.
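In SAS, average bioequivalence can be assessed with the two one-sided tests (TOST) procedure. The sketch below assumes paired AUC measurements per subject (hypothetical data set and variable names); a full 2 × 2 crossover analysis would additionally account for period and sequence effects.

  proc ttest data=be dist=lognormal tost(0.8, 1.25);
    paired test_auc*ref_auc;   /* equivalence limits 0.80-1.25 for the ratio */
  run;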
                                    AUC       CMAX
Difference of log means (R − T)     0.0893    −0.104
Ratio of geometric means (R/T)      1.09      0.90
15.13 - Summary
In this lesson, among other things, we learned:
Understand and modify SAS programs for analysis of data from 2 × 2 crossover trials with continuous or binary data.
Let's put what we have learned to use by completing the following homework
assignment:
Homework
Look for homework assignment and the dropbox in the folder for this week in
ANGEL.
Lesson 16: Overviews and Meta-analysis
Introduction
Overviews, which are relied upon by many physicians, are important because
there usually exist multiple studies that have addressed a specific research
question. Yet these types of studies may differ with respect to:
Design
Patient population
Quality
Results
What does this process involve? There are six basic steps to an overview:
If the question is too broad, it may not be useful when applied to a particular
patient. For example, whether chemotherapy is effective in cancer is too broad
a question (the number of studies addressing this question could exceed
10,000).
If the question is too narrow, there may not be enough evidence to answer the
question. For example, the following question is too narrow: Is a particular
asthma therapy effective in Caucasian females over the age of 65 years in
Central Pennsylvania?
Conference proceedings
Theses/dissertations
Personal contacts
Unpublished reports
A "funnel plot" can be constructed to investigate the latter issue. Plot sample
size (vertical axis) versus p-value or magnitude of effect (horizontal axis).
Notice that the p-values for some of the small studies are relatively large,
yielding a "funnel" shape for the scatterplot.
Notice that none of the p-values for the small studies are large, yielding a
"band" shape for the scatterplot and the suspicion of publication bias. This is
evidence to suggest that there does exist a degree of 'publication bias'.
Ideally, the statistical analysis for a systematic review will be based on the raw
data from each eligible study. This has rarely occurred, either because the raw data were no longer available or because the authors were unwilling to share them.
However, the success of shared data in the Human Genome Project has
given impetus to increased data sharing to promote rapid scientific progress.
Since the US NIH now requires investigators receiving large new NIH grants to have a plan for data sharing (NIH Data Sharing Policy Guide [2]) and has provided more guidance [3] on how federal data are to be shared, we may anticipate more meta-analyses based on raw data.
16.5 - 5. Meta-analysis
The obvious advantage for performing a meta-analysis is that a large amount
of data, pooled across multiple studies, can provide increased precision in
addressing the research question. The disadvantage of a meta-analysis is that
the studies can be very heterogeneous in their designs, quality, and patient
populations and, therefore, it may not be valid to pool them. This issue is
something that needs to be evaluated very critically.
The estimated treatment effect (e.g., difference between the sample treatment
and control means) in the kth study, k = 1, 2, ... , K, is Yk .
The weight for the estimated treatment effect in the kth study is w_k = 1/S_k^2, where S_k^2 is the estimated variance of Y_k. The overall weighted treatment effect is

\bar{Y} = \left(\sum_{k=1}^K w_k Y_k\right) \Big/ \left(\sum_{k=1}^K w_k\right)

The 100(1 − α)% confidence interval for the overall weighted treatment effect is

\bar{Y} \pm z_{1-\alpha/2}\sqrt{1 \Big/ \sum_{k=1}^K w_k}

The statistic for testing homogeneity of the K studies, which has an approximate χ² distribution with K − 1 degrees of freedom when the studies are homogeneous, is

Q = \sum_{k=1}^K w_k \left(Y_k - \bar{Y}\right)^2
16.7 - Example
Consider the following example for the difference in sample means between
an inhaled steroid and montelukast in asthmatic children. The outcome
variable is FEV1 (L) from four clinical trials. Note that only the first study
yields a statistically significant result (p-value < 0.05).
Study    Yk       wk = 1/Sk^2
1        0.070    977
2        0.043    416
3        0.058    370
4        0.075    595

The overall weighted treatment effect is

\bar{Y} = \frac{(0.070)(977) + (0.043)(416) + (0.058)(370) + (0.075)(595)}{977 + 416 + 370 + 595} = 0.065

The estimated overall effect is 0.065 L, with 95% confidence interval [0.025, 0.105].
The statistic for testing homogeneity is Q = 0.303, which does not exceed 7.81, the 95th percentile of the χ² distribution with K − 1 = 3 degrees of freedom. Therefore, we have further evidence that the studies are homogeneous, although the small number of studies involved in this overview does not give this test very much power.
Based on the evidence presented above, we can conclude that the inhaled
steroid is significantly better than montelukast in improving lung function in
children with asthma.
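The fixed-effects computations for this example can be reproduced with a short SAS program; the Yk and wk values are those given above.

  data studies;
    input y w;   /* y = effect estimate, w = 1/s**2 */
    datalines;
  0.070 977
  0.043 416
  0.058 370
  0.075 595
  ;
  run;

  proc sql noprint;                        /* overall effect and its SE */
    select sum(w*y)/sum(w), 1/sqrt(sum(w))
      into :ybar, :se
      from studies;
  quit;

  data pooled;
    set studies end=last;
    q + w*(y - &ybar)**2;                  /* heterogeneity statistic Q */
    if last then do;
      ybar  = &ybar;
      lower = &ybar - probit(0.975)*&se;   /* 95% confidence limits */
      upper = &ybar + probit(0.975)*&se;
      output;
    end;
    keep ybar lower upper q;
  run;

  proc print data=pooled; run;  /* approx. 0.065, (0.025, 0.105), Q = 0.303 */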
The fixed-effects linear model for meta-analysis is

Y_k = θ + e_k

where Y_k is the observed effect in the kth study, θ is the pooled population parameter of interest (difference in population treatment means, natural logarithm of the population odds ratio, etc.), and e_k is the random error term for the kth study.

The random-effects linear model adds a random study effect:

Y_k = θ + t_k + e_k

where t_k represents the random effect of the kth study.
A weighted analysis will be applied, analogous to the weighted analysis for the
fixed-effects linear model, but the weights are different. The overall weighted
treatment effect is:
\bar{Y} = \left(\sum_{k=1}^K \bar{w}_k Y_k\right) \Big/ \left(\sum_{k=1}^K \bar{w}_k\right)

where the random-effects weights are

\bar{w}_k = 1 \big/ \left(S_k^2 + \hat{\tau}^2\right)

with

\hat{\tau}^2 = \max\left\{0, \; \frac{Q - (K - 1)}{\sum_{k=1}^K w_k - \left(\sum_{k=1}^K w_k^2\right) \Big/ \left(\sum_{k=1}^K w_k\right)}\right\}
and where Q is the heterogeneity statistic and wk is the weight for the kth study,
which were defined previously for the weighted analysis in the fixed-effects
linear model.
The variance of the overall weighted treatment effect is estimated by

\bar{S}^2 = 1 \Big/ \left(\sum_{k=1}^K \bar{w}_k\right)
If there exists a large amount of study heterogeneity, then \hat{\tau}^2 will be very large and will dominate the expression for the weight of the kth study, i.e.,

\bar{w}_k = 1 \big/ \left(S_k^2 + \hat{\tau}^2\right) \approx 1 \big/ \hat{\tau}^2
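Continuing the asthma example (Q = 0.303, K = 4), the DerSimonian-Laird estimate takes only a few more lines; because Q < K − 1 here, τ̂² = 0 and the random-effects analysis reduces to the fixed-effects analysis.

  data tau2;
    set studies end=last;   /* the 'studies' data set from the sketch above */
    sw  + w;                /* running sums of w and w**2 */
    sw2 + w**2;
    if last then do;
      q = 0.303;  k = 4;    /* Q and K from the fixed-effects analysis */
      tau2 = max(0, (q - (k - 1)) / (sw - sw2/sw));
      output;
    end;
    keep tau2;
  run;

  proc print data=tau2; run;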
A useful sensitivity analysis is to repeat the meta-analysis with one study removed at a time: remove the first of the K studies and conduct the meta-analysis on the remaining K − 1 studies, then repeat for each of the other studies, and examine how much the pooled estimate changes.
Here are a series of questions that we can ask ourselves as we evaluate the
value of a meta-analysis. You will have the opportunity to evaluate a meta-
analysis in the homework exercise.
2. Was the search for relevant studies detailed and exhaustive? Were the
inclusion/exclusion criteria for studies developed and applied
appropriately?
16.10 - Summary
In this lesson, among other things, we learned how to:
Let's put what we have learned to use by completing the following homework
assignment:
Homework
Look for homework assignment and the dropbox in the folder for this week in
ANGEL.
A diagnostic test is any approach used to gather clinical information for the
purpose of making a clinical decision (i.e., diagnosis). Some examples of
diagnostic tests include X-rays, biopsies, pregnancy tests, medical histories,
and results from physical examinations.
From a statistical point of view there are two points to keep in mind:
As an example, consider a woman who had no illnesses during the preceding year and who has no family history of breast cancer.
Based on the woman's age and medical history, the initial (prior) probability
estimate of breast cancer is 0.003. The physician recommends that the
woman have a mammogram, due to her age. Unfortunately, the results of the
mammogram are abnormal. This yields a modification of the woman's prior probability of breast cancer from 0.003 to 0.13 (notice the Bayesian flavor of
this approach - prior probability modified via existing data). Next, the woman is
referred to a surgeon who agrees that the physical breast exam is normal. The
surgeon consults with a radiologist and they decide that the woman should
undergo fine needle aspiration (FNA) of the abnormal breast detected by the
mammogram (diagnostic test #2). The FNA specimen reveals abnormal cells,
which again revises the probability of breast cancer, from 0.13 to 0.64. Finally,
the woman is scheduled for a breast biopsy the following week to get a
definitive diagnosis.
                 Disease                 No Disease
Test Positive    a (true positives)      b (false positives)
Test Negative    c (false negatives)     d (true negatives)
a (true-positives) = individuals with the disease, and for whom the test is
positive
b (false-positives) = individuals without the disease, but for whom the test is
positive
c (false-negatives) = individuals with the disease, but for whom the test is
negative
d (true-negatives) = individuals without the disease, and for whom the test is
negative
Sensitivity is the probability that an individual with the disease of interest has a positive test. It is estimated from the sample as a/(a+c).

Specificity is the probability that an individual without the disease of interest has a negative test. It is estimated from the sample as d/(b+d).

Accuracy is the probability that the diagnostic test yields the correct determination. It is estimated from the sample as (a+d)/(a+b+c+d).
Tests with high sensitivity are useful clinically to rule out a disease. A negative
result for a very sensitive test virtually would exclude the possibility that the
individual has the disease of interest. If a test has high sensitivity, it also
results in a low proportion of false-negatives. Sensitivity also is referred to as
"positive in disease" or "sensitive to disease".
Tests with high specificity are useful clinically to confirm the presence of a
disease. A positive result for a very specific test would give strong evidence in
favor of diagnosing the disease of interest. If a test has high specificity, it also
results in a low proportion of false-positives. Specificity also is referred to as
"negative in health" or "specific to health".
Sensitivity and specificity are, in theory, stable for all groups of patients.
                Cancer    No Cancer
FNA Positive    14        8
FNA Negative    1         91
As the output shows below, the exact 95% confidence intervals for sensitivity
and specificity are (0.680, 0.998) and (0.847, 0.965), respectively.
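These intervals can be reproduced with PROC FREQ, which reports exact (Clopper-Pearson) confidence limits for a binomial proportion: run it once among the women with cancer (sensitivity) and once among the women without cancer (specificity).

  data fna;
    input disease $ result $ count;
    datalines;
  Cancer   Pos 14
  Cancer   Neg  1
  NoCancer Pos  8
  NoCancer Neg 91
  ;
  run;

  proc freq data=fna;   /* sensitivity = Pr(positive | cancer) */
    where disease = 'Cancer';
    weight count;
    tables result / binomial(level='Pos');
  run;

  proc freq data=fna;   /* specificity = Pr(negative | no cancer) */
    where disease = 'NoCancer';
    weight count;
    tables result / binomial(level='Neg');
  run;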
17.3 - Estimating the Probability of Disease
Sensitivity and specificity describe the accuracy of a test. In a clinical setting,
we do not know who has the disease and who does not - that is why
diagnostic tests are used. We would like to be able to estimate the probability
of disease based on the outcome of one or more diagnostic tests. The
following measures address this idea.
Prevalence is the probability of having the disease, also called the prior
probability of having the disease. It is estimated from the sample as (a+c)/
(a+b+c+d).
Positive Predictive Value (PV+) is the probability of having the disease when the test result is positive. It is estimated as a/(a+b).

Negative Predictive Value (PV−) is the probability of not having the disease when the test result is negative. It is estimated as d/(c+d).
In the FNA study of 114 women with nonpalpable masses and abnormal
mammograms,
Thus, a woman's prior probability of having the disease is 0.13 and is modified
to 0.64 if she has a positive test result. A woman's prior probability of not having the disease is 0.87 and is modified to 0.99 if she has a negative test
result.
If the disease under study is rare, the investigator may decide to invoke a
case-control design for evaluating the diagnostic test, e.g., recruit 50 patients
with the disease and 50 controls. Obviously, prevalence cannot be estimated
from a case-control study because it does not represent a random sample
from the general population.
Predictive values allow us to determine the usefulness of a test and they vary
with the sensitivity and specificity of a test. If all other characteristics are held constant, then:
Predictive values vary with the prevalence of the disease in the population
being tested or the pre-test probability of disease in a given individual.
PV^+ = \frac{\text{Prevalence} \times \text{Sensitivity}}{(\text{Prevalence} \times \text{Sensitivity}) + (1 - \text{Prevalence}) \times (1 - \text{Specificity})}

PV^- = \frac{(1 - \text{Prevalence}) \times \text{Specificity}}{(1 - \text{Prevalence}) \times \text{Specificity} + \text{Prevalence} \times (1 - \text{Sensitivity})}
45-year-old man
no coronary risk factors except smoking one pack of cigarettes per day
The physician is not sure whether the patient should undergo an exercise
electrocardiogram (ECG). How useful would this test be for this patient?
Suppose it is known from the literature that the sensitivity and specificity of the
exercise ECG in coronary artery stenosis (as compared to the gold standard
of coronary arteriography) are 60% and 91%, respectively.
Then:
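The original worked numbers for this patient are not reproduced here, but the formulas above can be evaluated for any assumed pre-test probability. The prevalences in the sketch below are hypothetical, chosen only to show how strongly PV+ depends on the pre-test probability of disease.

  data ecg_pv;
    sens = 0.60;  spec = 0.91;     /* exercise ECG vs. arteriography */
    do prev = 0.05, 0.20, 0.50;    /* hypothetical pre-test probabilities */
      pv_pos = (prev*sens) / (prev*sens + (1 - prev)*(1 - spec));
      pv_neg = ((1 - prev)*spec) / ((1 - prev)*spec + prev*(1 - sens));
      output;
    end;
  run;

  proc print data=ecg_pv; run;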
The likelihood ratio of a positive test is LR+ = sensitivity/(1 − specificity), and the likelihood ratio of a negative test is LR− = (1 − sensitivity)/specificity. From the FNA study in 114 women with nonpalpable masses and abnormal mammograms, LR+ = 0.933/0.081 = 11.52 and LR− = 0.067/0.919 = 0.07. Thus, positive FNA results are 11.52 times more likely in women with cancer as compared to those without, and negative FNA results are 0.07 times as likely in women with cancer as compared to those without.
The appropriate statistical test depends on the setting. If diagnostic tests were
studied on two independent groups of patients, then two-sample tests for
binomial proportions are appropriate (chi-square, Fisher's exact test). If both
diagnostic tests were performed on each patient, then paired data result and
methods that account for the correlated binary outcomes are necessary
(McNemar's test).
Diagnostic Test #1    Disease    No Disease
Positive              82         30
Negative              18         70

Diagnostic Test #2    Disease    No Disease
Positive              140        10
Negative              60         90
The SAS program also indicates that the p-value = 0.0262 from Fisher's exact
test for testing H0 : p1 = p2 .
Thus, diagnostic test #1 has a significantly better sensitivity than diagnostic
test #2.
Suppose both diagnostic tests (test #1 and test #2) are applied to a given set
of individuals, some with the disease (by the gold standard) and some without
the disease.
                              Diagnostic Test #2
                              Positive    Negative
Diagnostic Test #1 Positive   30          35
Diagnostic Test #1 Negative   23          12
The appropriate test statistic for this situation is McNemar's test. The patients
with a (+, +) result and the patients with a ( - , - ) result do not distinguish
between the two diagnostic tests. The only information for comparing the
sensitivities of the two diagnostic tests comes from those patients with a (+, −) or (−, +) result.
The positivity criterion is the cutoff value on a numerical scale that separates
normal values from abnormal values. It determines which test results are
considered positive (indicative of disease) and negative (disease-free).
Because the distributions of test values for diseased and disease-free
individuals are likely to overlap, there will be false-positive and false-negative
results. When defining a positivity criterion, it is important to consider which
mistake is worse.
Now suppose a greater value is selected for the cutoff point (with low values of the test indicating disease). The chosen
cutoff value will yield a good sensitivity because nearly all of the diseased
individuals will have a positive result. Unfortunately, many of the healthy
individuals also will have a positive result (false positives), so this cutoff value
will yield a poor specificity.
In the following example, a high value of the diagnostic test (positive result) is
indicative of disease. The chosen cutoff value will yield a poor sensitivity
because many of the diseased individuals will have a negative result (false
negatives). On the other hand, nearly all of the healthy individuals will have a
negative result, so the chosen cutoff value will yield a good specificity.
When the consequences for missing a case are potentially grave, choose a
value for the positivity criterion that minimizes the number of false-negatives.
For example, in neonatal PKU screening, a false-negative result may delay
essential dietary intervention until mental retardation is evident. False-positive
results, on the other hand, are usually identified during follow-up testing.
When false-positive results may lead to a risky treatment, choose a value for
the positivity criterion that minimizes the number of false-positive results. For
example, false-positive results indicating certain types of cancer can lead to
chemotherapy which can suppress the patient's immune system and leave the
patient open to infection and other side effects.
The figure below depicts an ROC curve. The point in the upper left corner of the figure, (0, 1), represents a perfect test, in which sensitivity and specificity both are 1. When false-positive and false-negative results are equally problematic, there are two choices: (1) set the positivity criterion to the point on the ROC curve closest to the upper left corner (this will also be closest to the dashed line, as the cutoff in the figure indicates), or (2) set the positivity criterion to the point on the ROC curve farthest (in vertical distance) from the line of chance (the Youden index).
When false-positive results are more undesirable, set the positivity criterion to
the point farthest left on the ROC curve (increase specificity). If instead, false-
negative results are more undesirable, set the positivity criterion to a point
farther right on the ROC curve (increase sensitivity).
In the ACRN SOCS trial, the investigators wanted to determine if low values of the methacholine PC20 at baseline are predictive of significant asthma exacerbations. The methacholine PC20 is a measure of how reactive a person's airways are to an irritant (methacholine); a low value of the PC20 corresponds to a high level of airway reactivity.
17.6 - Summary
In this lesson, among other things, we learned how to:
This lesson will focus only on correlation and agreement (issues 1 and 2 listed above).
\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i, \qquad S_{XX} = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2

\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i, \qquad S_{YY} = \frac{1}{n-1}\sum_{i=1}^n (Y_i - \bar{Y})^2

S_{XY} = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})
These statistics above represent the sample mean for X, the sample variance
for X, the sample mean for Y, the sample variance for Y, and the sample
covariance between X and Y, respectively. These should be very familiar to
you.
The sample Pearson correlation coefficient (also called the sample product-
moment correlation coefficient) for measuring the association between
variables X and Y is given by the following formula:
r_p = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}}

which estimates the population parameter

\rho_p = \frac{\sigma_{XY}}{\sqrt{\sigma_{XX} \sigma_{YY}}}
The Pearson correlation is unchanged by linear transformations of the data; e.g., it is the same whether we analyze X_i and Y_i or

X_i^* = aX_i + b, \qquad Y_i^* = cY_i + d

(for constants a, c > 0).
With SAS, PROC CORR is used to calculate rp . The output from PROC
CORR includes summary statistics for both variables and the computed value
of rp . The output also contains a p-value corresponding to the test of:
H_0: \rho_p = 0 \text{ versus } H_1: \rho_p \ne 0
It should be noted that this statistical test generally is not very useful, and the
associated p-value, therefore, should not be emphasized. What is more
important is to construct a confidence interval.
The Fisher z transformation of r_p is approximately normally distributed:

z_p = \frac{1}{2}\log_e\left(\frac{1 + r_p}{1 - r_p}\right) \sim N\left(\zeta_p, \; sd = \frac{1}{\sqrt{n-3}}\right)

where

\zeta_p = \frac{1}{2}\log_e\left(\frac{1 + \rho_p}{1 - \rho_p}\right)

We will use this to get the usual confidence interval: an approximate 100(1 − α)% confidence interval for ζ_p is [z_{p,α/2}, z_{p,1−α/2}], where

z_{p,\alpha/2} = z_p - \frac{t_{n-3,1-\alpha/2}}{\sqrt{n-3}}, \qquad z_{p,1-\alpha/2} = z_p + \frac{t_{n-3,1-\alpha/2}}{\sqrt{n-3}}

Back-transforming these limits yields an approximate 100(1 − α)% confidence interval for ρ_p, namely [r_{p,α/2}, r_{p,1−α/2}], where

r_{p,\alpha/2} = \frac{\exp(2z_{p,\alpha/2}) - 1}{\exp(2z_{p,\alpha/2}) + 1}, \qquad r_{p,1-\alpha/2} = \frac{\exp(2z_{p,1-\alpha/2}) - 1}{\exp(2z_{p,1-\alpha/2}) + 1}
Again, you do not have to do this by hand. PROC CORR in SAS will do this
for you but it is important to have an idea of what is going on.
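In recent SAS releases, the FISHER option of PROC CORR requests exactly these transformation-based confidence limits, so a call like the following (hypothetical data set and variable names) is all that is needed:

  proc corr data=mydata pearson spearman fisher;
    var x y;   /* FISHER prints z-transformation confidence limits */
  run;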
Two observations (X_i, Y_i) and (X_j, Y_j) are concordant if they are in the same order with respect to each variable, that is, if

X_i < X_j \text{ and } Y_i < Y_j, \quad \text{or} \quad X_i > X_j \text{ and } Y_i > Y_j

They are discordant if the values are arranged in opposite directions, that is, if

X_i < X_j \text{ and } Y_i > Y_j, \quad \text{or} \quad X_i > X_j \text{ and } Y_i < Y_j
The total number of pairs that can be constructed for a sample size of n is
N = \binom{n}{2} = \frac{1}{2}n(n-1)

and these pairs can be partitioned as

N = P + Q + X_0 + Y_0 + (XY)_0

where P is the number of concordant pairs, Q is the number of discordant pairs, X_0 is the number of pairs tied only on X, Y_0 is the number of pairs tied only on Y, and (XY)_0 is the number of pairs tied on both.
The Kendall statistic scales the difference P − Q to range between −1 and +1; with ties, the tau-b version is computed as

t_b = \frac{P - Q}{\sqrt{(P + Q + X_0)(P + Q + Y_0)}}
The Kendall tau-b has properties similar to those of the Spearman r_s. Because the sample estimate t_b does estimate a population parameter, τ_b, many statisticians prefer the Kendall tau-b to the Spearman rank correlation coefficient.
The 95% confidence intervals are (0.5161, 0.9191) and (0.4429, 0.9029),
respectively for the Pearson and Spearman correlation coefficients. Because
the Kendall correlation typically is applied to binary or ordinal data, its 95%
confidence interval can be calculated via SAS PROC FREQ (this is not shown
in the SAS program above).
3. The correlation of two variables that both have been recorded repeatedly
over time can be misleading and spurious. Time trends should be removed
from such data before attempting to measure correlation.
5. Care should be taken when attempting to correlate two variables where one
is a part and one represents the total. For example, we would expect to find a
positive correlation between height at age ten and adult height because the
second quantity "contains" the first quantity.
7. Small correlation values do not necessarily indicate that two variables are
unassociated. For example, Pearson's rp will underestimate the association
between two variables that show a quadratic relationship. Scatterplots should
always be examined.
r_c = \frac{2 S_{XY}}{S_{XX} + S_{YY} + (\bar{X} - \bar{Y})^2}

which estimates the population parameter

\rho_c = \frac{2\sigma_{XY}}{\sigma_{XX} + \sigma_{YY} + (\mu_X - \mu_Y)^2}
Let's look at an example that will help to make this concept clearer.
SAS Example (19.2_agreement_concordanc.sas [2]) : The ACRN DICE trial
was discussed earlier in this course. In that trial, participants underwent hourly
blood draws between 08:00 PM and 08:00 AM once a week in order to
determine the cortisol area-under-the-curve (AUC). The participants hated
this! They complained about the sleep disruption every hour when the nurses
came by to draw blood, so the ACRN wanted to determine for future studies if
the cortisol AUC calculated on measurements every two hours was in good
agreement with the cortisol AUC calculated on hourly measurements. The
baseline data were used to investigate how well these two measurements
agreed. If there is good agreement, the protocol could be changed to take
blood every two hours.
Run the program to view the output. (This is higher-level SAS than you are expected to program yourself in this course.)
The SAS program yielded rc = 0.95 and a 95% confidence interval = (0.93,
0.96). The ACRN judged this to be excellent agreement, so it will use two-
hourly measurements in future studies.
What about binary or ordinal data? Cohen's kappa statistic will handle this... Suppose each of n subjects is classified by each of two raters into one of g categories, yielding a g × g table of frequencies f_ij. The observed and chance-expected proportions of agreement are
p_0 = \frac{1}{n}\sum_{i=1}^g f_{ii}

p_e = \frac{1}{n^2}\sum_{i=1}^g f_{i+} f_{+i}

where f_{i+} is the total for the ith row and f_{+i} is the total for the ith column. The kappa statistic is:

\hat{\kappa} = \frac{p_0 - p_e}{1 - p_e}
SAS PROC FREQ provides an option for constructing Cohen's kappa and
weighted kappa statistics.
The weighted kappa coefficient is 0.57 and the asymptotic 95% confidence
interval is (0.44, 0.70). This indicates that the amount of agreement between
the two radiologists is modest (and not as strong as the researchers had
hoped it would be).
Note: Updated programs for examples 19.2 and 19.3 are in the folder for this
lesson. Take a look.
18.8 - Summary
In this lesson, among other things, we learned how to:
Let's put what we have learned to use by completing the following homework
assignment:
Homework
Look for homework assignment and the dropbox in the folder for this week in
ANGEL.