Biostatistics Primer: What A Clinician Ought To Know: Subgroup Analyses


BIOSTATISTICS FOR CLINICIANS

Biostatistics Primer
What a Clinician Ought to Know: Subgroup Analyses
Helen Barraclough, MSc,* and Ramaswamy Govindan, MD†‡

Abstract: Large randomized phase III prospective studies continue to redefine the standard of therapy in medical practice. Often when studies do not meet the primary endpoint, it is common to explore possible benefits in specific subgroups of patients. In addition, these analyses may also be done, even in the case of a positive trial, to find subsets of patients where the therapy is especially effective or ineffective. These unplanned subgroup analyses are justified to maximize the information that can be obtained from a study and to generate new hypotheses. Unfortunately, however, they are too often overinterpreted or misused in the hope of resurrecting a failed study. It is important to distinguish these overinterpreted, misused, and unplanned subgroup analyses from those prespecified and well-designed subgroup analyses. This overview provides a practical guide to the interpretation of subgroup analyses.

Key Words: Biostatistics, Subgroup analysis.

(J Thorac Oncol. 2010;5: 741–746)

*Intercontinental Information Sciences, Eli Lilly and Company, Sydney, Australia; †Division of Oncology, Department of Medicine, Washington University School of Medicine; and ‡Alvin J Siteman Cancer Center at Washington University School of Medicine, St Louis, Missouri.
Disclosure: Helen Barraclough, MSc, is employed by Eli Lilly Australia and holds stock in Eli Lilly and Company.
Address for correspondence: Ramaswamy Govindan, MD, Division of Medical Oncology, Washington University School of Medicine, 660 S. Euclid, Box 8056, St Louis, MO 63110. E-mail: rgovinda@im.wustl.edu
Copyright © 2010 by the International Association for the Study of Lung Cancer
ISSN: 1556-0864/10/0505-0741
Journal of Thoracic Oncology • Volume 5, Number 5, May 2010

WHAT ARE SUBGROUP ANALYSES?
In randomized clinical trials, subgroup analyses evaluate the treatment effect (e.g., a hazard ratio [HR]) for a specific endpoint (e.g., overall survival) in subgroups of patients defined by baseline characteristics (e.g., age, gender, histology, and ethnicity). It is not recommended to base subgroups on postrandomization measures because the designation of patients to a subgroup may be affected by the study treatments.

Subgroup analyses are useful in endeavoring to obtain maximum information from a clinical trial by trying to identify subsets of patients that are more likely to benefit from the experimental treatment and, conversely, by also detecting subsets of patients that are at greater risk of being adversely affected. Subsequently, new hypotheses and trials can be generated from these findings. Ultimately, this may lead to changes in clinical practice. In addition, subgroup analyses can be useful in investigating whether overall treatment effects (e.g., increased efficacy or tolerability of the new treatment over the comparator) are consistent across subsets of patients. This is commonly referred to as “robustness checking.” For these reasons, regulatory guidelines endorse appropriate subgroup analyses to be performed.1–4

WHAT ARE THE PROBLEMS WITH SUBGROUP ANALYSES?
There are two key statistical limitations of subgroup analyses. First, they are frequently underpowered. This is because the sample size of a clinical trial is calculated to evaluate the primary objective of the study with sufficient power in all randomized patients, not in a subset of patients. Hence, the interaction test to detect whether the treatment effect observed in one level of a subgroup (e.g., males) is significantly different to that observed in another level of the subgroup (e.g., females) is often underpowered. Consequently, subgroup analyses are prone to generating “false-negative” results.

The second major limitation of subgroup analyses is that they are particularly prone to multiplicity. Multiplicity is the inflated probability of getting a “false-positive” result, i.e., incorrectly concluding that there is a significant difference between treatment arms where one does not in fact exist, when several comparisons are performed. For example, when the primary objective of a trial is analyzed, this represents one comparison of the treatment arms. A 5% probability of obtaining a false-positive result is accepted as the null hypothesis is rejected if the p value is less than 0.05.

As more comparisons of the treatment arms are made, by performing multiple subgroup analyses of the primary endpoint, there is a greater chance of one or more of these comparisons generating a significant result by chance alone. For example, if 10 comparisons of the primary endpoint were done, there is a 40% chance of at least one of these giving a false-positive result. Hence, a p value of less than 0.05 in a single comparison does not provide adequate evidence that there is a significant difference between treatment arms when multiple subgroup analyses are performed.

HOW NOT TO DO A SUBGROUP ANALYSIS
Subgroup analyses can sometimes be presented to “save” a failed study. This is when the primary objective of the trial was not met, but the new treatment was found to be significantly better than the comparator in a particular subset of patients. Many subgroups would have been analyzed to try to find the one (or a few) subset(s) of patients in which the new treatment was significantly better than the comparator. This is sometimes described as “data dredging” or a “fishing trip.” Misinterpretation of subgroup analyses can initiate future research based on unsubstantiated hypotheses and can even eventuate in suboptimal patient care.5 These detrimental consequences are extremely costly but can easily be prevented by understanding the basic principles of subgroup analyses.

HOW TO CORRECTLY CONDUCT AND INTERPRET SUBGROUP ANALYSES
To conduct and interpret a subgroup analysis appropriately, it first needs to be established whether the subgroup analysis was prespecified. This is because the purposes of prespecified and unplanned subgroup analyses are distinct. Prespecified subgroup analyses are used for hypothesis testing. In contrast, unplanned (also called exploratory, retrospective, or posthoc) subgroup analyses are used for generating new hypotheses and for “robustness checking.” It is imperative to understand that both can provide valuable information but for different reasons. Conclusive inferences and any subsequent changes in clinical practice can only be made from prespecified subgroup analyses. Hence, the remainder of this article will focus on how to appropriately perform and interpret prespecified subgroup analyses only.

To overcome the two major statistical limitations of multiplicity and reduced power described above, the following five steps outline the best way to appropriately carry out, interpret, and report prespecified subgroup analyses: (i) prespecify the subgroup analysis in the protocol and/or the statistical analysis plan (SAP), (ii) use an interaction test, (iii) estimate the treatment effect for each level of the subgroup, (iv) validate results using confirmatory evidence, and (v) report results responsibly.

Prespecify the Subgroup Analysis in the Protocol and/or the SAP
Prespecified subgroup analyses are documented before any inspection of the data, whereas unplanned subgroup analyses are not. In most cases, prespecified subgroup analyses will be recorded in the protocol. However, they can also be detailed in the SAP before unblinding of the data or before first patient visit in open-label studies. Box 1 outlines the information that should be documented when prespecifying a subgroup analysis.

BOX 1. Information to document when prespecifying a subgroup analysis.

Prespecified subgroup analyses are regarded as more credible because they were planned before any examination of the data. This provides reassurance against “data dredging.” However, both prespecified and unplanned subgroup analyses are prone to multiplicity, that is, the increased probability of a false-positive result because of testing multiple subgroups described above. Hence, simply prespecifying a subgroup analysis does not make it automatically valid: it must still be conducted, interpreted, and reported appropriately as outlined by the following steps.

Use an Interaction Test
Interaction tests are the most appropriate statistical method for conducting subgroup analyses. The concept of an interaction test can be illustrated with the following hypothetical example. In a randomized clinical trial, there are two treatment arms: treatment A (Tx A) and treatment B (Tx B), and the primary endpoint is overall survival. Gender is the baseline characteristic used to define the subgroup into two levels: males and females.
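One way to see why an interaction test in a trial like this tends to be underpowered is a rough calculation. The sketch below is only an illustration under stated assumptions, none of which come from the article: 1:1 allocation, two equal-sized subgroups, the common approximation that the variance of a log HR is about 4 divided by the number of events, and a hypothetical HR of 0.75. The function names are likewise invented for the example.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def events_for_power(hazard_ratio, alpha=0.05, power=0.80):
    """Approximate number of events needed to detect `hazard_ratio`
    in all randomized patients (1:1 allocation, two-sided test)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    return 4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2


def approx_power(log_effect, standard_error, alpha=0.05):
    """Approximate two-sided power, ignoring the negligible opposite tail."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(abs(log_effect) / standard_error - z_alpha)


target_hr = 0.75                      # hypothetical effect size (assumption)
events = events_for_power(target_hr)  # events in all randomized patients

# Assumed approximation: Var(log HR) is about 4 / events with 1:1 allocation.
se_overall = math.sqrt(4 / events)
# Each subgroup holds half the events, so Var is about 8 / events per subgroup.
# The interaction is the difference of two such estimates: Var about 16 / events.
se_interaction = math.sqrt(16 / events)

power_overall = approx_power(math.log(target_hr), se_overall)
power_interaction = approx_power(math.log(target_hr), se_interaction)

print(f"power for the overall effect:            {power_overall:.2f}")
print(f"power for an interaction of equal size:  {power_interaction:.2f}")
```

Under these assumptions the standard error of the interaction estimate is twice that of the overall treatment effect, so a trial sized for 80% power on the primary objective has only roughly 29% power to detect an interaction of the same magnitude.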

FIGURE 1. What is an interaction test? In this hypothetical example, there are two treatment (Tx) arms in the clinical trial: A and B. There are also two levels of the subgroup of patients defined by the baseline characteristic of gender: males and females. The regression lines linking the circles and squares represent the efficacy of treatment A and B, respectively, for overall survival (as estimated by the log hazard from the Cox Proportional Hazards Model). The log hazard estimates the log risk of death. Hence, the higher the regression line, the higher the risk of death. The treatment effect is illustrated by an arrow in each level of the subgroup (which in this example is the log HR[Tx A vs Tx B]). If the regression lines are parallel, there is no interaction between treatment and gender (A). Hence, the treatment effect in males is the same as in females. However, if the regression lines are not parallel (B and C), there is a statistically significant interaction between treatment and gender. Thus, the treatment effect in males is significantly different to that observed in females.

A significant interaction test shows that the treatment effect in males is not the same as in females (Figure 1). In the case of a nonsignificant interaction test, the treatment effect observed in males is not significantly different to the treatment effect observed in females. In this example, both males and females treated with treatment A had better overall survival than those patients treated with treatment B. This is shown in Figure 1A by the estimate for treatment A being lower than that for treatment B as the risk of death on treatment A is lower than on treatment B. The magnitude of the overall survival improvement observed with treatment A compared with treatment B was also the same in both males and females (as shown by the identical arrows in Figure 1A).

A significant interaction test shows that the treatment effect significantly varies across the levels of the subgroup. This can be described as either a “quantitative” or “qualitative” interaction (it may also be called heterogeneity). Figures 1B and 1C illustrate two scenarios where the interaction test was significant. In Figure 1B, both males and females assigned to treatment A experienced better overall survival than those assigned to treatment B. However, the size of the treatment effect was smaller in females than in males (as shown by the shorter arrow for females). This is an example of a “quantitative interaction.” In Figure 1C, males had better overall survival when assigned to treatment A, but females experienced worse overall survival when assigned to treatment A (because their risk of death is higher). Hence, in this example, the direction of the treatment effect in males was opposite to that observed in females (as shown by the arrows pointing in different directions). This is an example of a “qualitative interaction.”

An interaction test is usually carried out as part of a regression model. The type of regression model depends on the endpoint being analyzed. For “time-to-event” endpoints, such as overall survival and progression-free survival, a Cox Proportional Hazards Model is used, whereas for binary endpoints, such as tumor response rate, a logistic regression model is used. The Cox Proportional Hazards Model is the standard method for analyzing time-to-event endpoints in clinical trials.6 Therefore, in the case of this hypothetical example, the “treatment-by-gender” interaction test is carried out by using a Cox model containing:

• A treatment term (treatment A vs treatment B)
• A gender term (males vs females)
• A treatment-by-gender interaction term (males assigned treatment A vs all other patients)
• Plus any predefined prognostic factors based on baseline patient and disease characteristics (optional)

The interaction HR is a ratio of two HRs:

$$\text{Interaction HR} = \frac{\text{HR (treatment A vs treatment B) for males}}{\text{HR (treatment A vs treatment B) for females}}$$
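The three patterns in Figure 1 and the ratio above can be made concrete with a small sketch. The log hazards below are hypothetical numbers chosen only to mimic panels A, B, and C, and the function name is invented for the example. The sketch works on point estimates alone, so it says nothing about statistical significance, which still requires the interaction test.

```python
import math


def describe_interaction(log_hazard):
    """Return (HR in males, HR in females, interaction HR, pattern).

    `log_hazard` maps (sex, arm) to a log hazard, one value per
    combination of sex ("male", "female") and arm ("A", "B").
    """
    hr_males = math.exp(log_hazard[("male", "A")] - log_hazard[("male", "B")])
    hr_females = math.exp(log_hazard[("female", "A")] - log_hazard[("female", "B")])
    interaction_hr = hr_males / hr_females

    if math.isclose(interaction_hr, 1.0, rel_tol=1e-9):
        pattern = "none"          # parallel lines, as in Figure 1A
    elif (hr_males < 1) == (hr_females < 1):
        pattern = "quantitative"  # same direction, different size (1B)
    else:
        pattern = "qualitative"   # opposite directions (1C)
    return hr_males, hr_females, interaction_hr, pattern


# Hypothetical log hazards for each panel (assumptions, not trial data).
panels = {
    "A": {("male", "A"): -0.50, ("male", "B"): 0.00,
          ("female", "A"): -0.75, ("female", "B"): -0.25},
    "B": {("male", "A"): -0.50, ("male", "B"): 0.00,
          ("female", "A"): -0.50, ("female", "B"): -0.25},
    "C": {("male", "A"): math.log(0.80), ("male", "B"): 0.00,
          ("female", "A"): math.log(1.05), ("female", "B"): 0.00},
}

results = {name: describe_interaction(values) for name, values in panels.items()}
for name, (hr_m, hr_f, ratio, pattern) in results.items():
    print(f"panel {name}: HR males {hr_m:.2f}, HR females {hr_f:.2f}, "
          f"interaction HR {ratio:.2f}, pattern: {pattern}")
```

In panel A the two subgroup HRs are identical, so the interaction HR equals 1; in panels B and C it moves away from 1, and only the direction of the subgroup HRs separates a quantitative from a qualitative pattern.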

The interaction HR can alternatively be written as:

$$\text{Interaction HR} = \frac{\text{HR (males vs females) for patients assigned treatment A}}{\text{HR (males vs females) for patients assigned treatment B}}$$

The test for interaction has the null hypothesis that the interaction HR = 1, i.e., the treatment effect in males is the same as in females. The Cox proportional hazards model provides an estimate of the interaction HR and an associated p value.

If the p value for the interaction test is statistically significant, the null hypothesis can be rejected and a significant “treatment-by-gender interaction” can be claimed. Hence, the interaction HR differs significantly from 1. This means that the treatment effect observed in males is significantly different to the treatment effect observed in females. The size and direction of the treatment effect, i.e., HR (Tx A vs Tx B), can now be estimated for males and for females. If the interaction test result is nonsignificant, a differential treatment effect is not found, and thus, further analyses to test a predefined hypothesis are not recommended.

Estimate the Treatment Effect in Each Level of the Subgroup
An estimate of the treatment effect in males and in females can be obtained either (i) from the same Cox model described above or (ii) by removing the gender term and the “treatment-by-gender” interaction term and rerunning the model for males only and then separately for females.

Both approaches provide a HR (Tx A vs Tx B), 95% confidence intervals, and an associated p value for each level of the subgroup. These are often presented on a forest plot (Figure 2). From the estimated HRs in males and females, it can be determined whether the interaction is “quantitative” (Figure 1B) or “qualitative” (Figure 1C). If the interaction is “quantitative,” the HRs would be in the same direction, e.g., less than 1, for both males and females. In contrast, if the interaction is “qualitative” then the HRs would be in opposite directions for each level of the subgroup, e.g., a HR(Tx A vs Tx B) < 1 for males and a HR(Tx A vs Tx B) > 1 for females.

[Figure 2 plot: overall survival treatment hazard ratio (95% CI) on an axis from 0.6 to 1.4; Males (n=200), HR 0.80; Females (n=200), HR 1.05; values below 1.0 favor Tx A, values above 1.0 favor Tx B.]

FIGURE 2. Forest plot. Forest plots are commonly used to graphically present subgroup analysis results. Above is a hypothetical result corresponding to the qualitative interaction example described in Figure 1C. The diamond represents the point estimate of the HR(Tx A vs Tx B) and the horizontal lines the 95% confidence intervals.

The associated p value of the HR in each level of the subgroup should be interpreted with caution. For example, suppose the associated p value = 0.001 in males and p = 0.08 in females. These p values give the probability of observing the estimated treatment difference or a more extreme one in each level of the subgroup by chance alone, given that the null hypothesis of no treatment difference is true. A common mistake is to claim that there is a differential treatment effect because the p value associated with the HR is statistically significant in males but nonsignificant in females. This is incorrect because only the interaction test p value determines whether the HR observed in males is significantly different to the HR observed in females. This is because the interaction test takes into account: (i) the prognosis of patients in different levels of the subgroup, e.g., females may have better overall survival than males regardless of the treatment they were assigned and (ii) the intergroup variability between males and females in addition to the intragroup variability.

Validate Subgroup Results Using Confirmatory Evidence
Validation of results is a fundamental scientific principle. To confirm a subgroup result from an individual clinical trial, presence of the subgroup effect in an independent study or meta-analysis is required. Additional, but less compelling, types of confirmatory evidence that may be used to support the validity of a subgroup analysis result include a prespecified biologic rationale and the existence of the subgroup effect for related endpoints. It should be emphasized that until confirmatory evidence exists to validate a subgroup analysis result, it is hypothesis generating only and the treatment effect observed in all randomized patients is still regarded as the most appropriate estimate for patients in each level of the subgroup.

Report Results Responsibly
Subgroup results need to be reported responsibly for others to be able to interpret them appropriately. The results of the primary endpoint analysis in all randomized patients should be emphasized in the abstract and conclusions. Furthermore, the prespecified subgroup analyses should be named, and the number of prespecified and unplanned subgroup analyses that were carried out should be clearly stated. The validity of a subgroup analysis result should also be discussed in the context of current confirmatory evidence and the scientific literature.

SUMMARY
These concepts apply to any type of endpoint, such as categorical (e.g., responder or nonresponder), continuous (e.g., systolic blood pressure), or time to event data (e.g., overall survival). Box 2 summarizes the key points to aid clinicians to interpret subgroup analyses correctly.
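The common mistake of comparing subgroup p values, described under estimating the treatment effect in each level of the subgroup, can be illustrated with a rough calculation. This is not the Cox model interaction test the article describes; it is a swapped-in approximation that works from summary statistics by comparing the two subgroup log HRs with a normal (z) test, assuming the subgroup estimates are independent. The HRs and standard errors are hypothetical values chosen so that the subgroup p values land near 0.001 and 0.08.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def two_sided_p(z):
    """Two-sided p value for a standard normal test statistic."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def subgroup_p(hr, se_log_hr):
    """p value for the null hypothesis HR = 1 within one subgroup."""
    return two_sided_p(math.log(hr) / se_log_hr)


def interaction_from_summaries(hr_1, se_1, hr_2, se_2):
    """Approximate interaction HR and p value from two independent
    subgroup estimates (HR and standard error of the log HR)."""
    log_ratio = math.log(hr_1) - math.log(hr_2)
    se_ratio = math.sqrt(se_1 ** 2 + se_2 ** 2)
    return math.exp(log_ratio), two_sided_p(log_ratio / se_ratio)


# Hypothetical inputs (assumptions): (HR for Tx A vs Tx B, SE of log HR).
males = (0.70, 0.108)
females = (0.80, 0.127)

p_males = subgroup_p(*males)
p_females = subgroup_p(*females)
interaction_hr, p_interaction = interaction_from_summaries(*males, *females)

print(f"p value in males:    {p_males:.4f}")
print(f"p value in females:  {p_females:.4f}")
print(f"interaction HR:      {interaction_hr:.3f}")
print(f"interaction p value: {p_interaction:.2f}")
```

With these inputs the treatment effect is significant in males and nonsignificant in females, yet the interaction p value is well above 0.05 (roughly 0.4), so the data give no evidence that the two HRs actually differ.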

BOX 2. Key points of subgroup analyses in randomized clinical trials.
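The 40% figure quoted for 10 comparisons in the discussion of multiplicity can be reproduced with a one-line formula. The sketch below assumes that the comparisons are independent and that no true treatment difference exists in any of them. Both are simplifications, since subgroups in a real trial usually overlap, and the function name is invented for the example.

```python
def prob_at_least_one_false_positive(n_comparisons, alpha=0.05):
    """Chance that at least one of `n_comparisons` independent tests is
    significant at level `alpha` when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_comparisons


# Tabulate the inflated false-positive probability for a few test counts.
table = {k: prob_at_least_one_false_positive(k) for k in (1, 5, 10, 20)}
for k, prob in table.items():
    print(f"{k:>2} comparisons: {prob:.0%} chance of at least one false positive")
```

A single comparison keeps the accepted 5% rate, 10 comparisons give about 40%, and the probability keeps climbing as more subgroups are tested.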

ACKNOWLEDGMENTS
The authors thank Lorinda Simms, Nicolas Scheuer, and Mauro Orlando for helpful discussions and critical reading of the article.

REFERENCES
1. International Conference on Harmonization (ICH) Topic E9. Statistical Principles for Clinical Trials, 1998.
2. Committee for Proprietary Medicinal Products (CPMP). Points to Consider on Multiplicity Issues in Clinical Trials (CPMP/EWP/908/99). London: EMEA, 2002.
3. Moher D, Schulz KF, Altman DG. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. Ann Intern Med 2001;134:657–662.
4. Altman DG, Schulz KF, Moher D, et al. The revised CONSORT statement for reporting randomized trials: explanation and elaboration. Ann Intern Med 2001;134:663–694.
5. Lagakos SW. The challenge of subgroup analyses—reporting without distorting. N Engl J Med 2006;354:1667–1669.
6. Cox DR. Regression models and life-tables. J Royal Stat Soc Ser B 1972;34:187–220.
