Biostatistics Primer: What A Clinician Ought To Know: Subgroup Analyses
Biostatistics Primer
What a Clinician Ought to Know: Subgroup Analyses
Helen Barraclough, MSc,* and Ramaswamy Govindan, MD†‡
least one of these giving a false-positive result. Hence, a p value of less than 0.05 in a single comparison does not provide adequate evidence that there is a significant difference between treatment arms when multiple subgroup analyses are performed.

HOW NOT TO DO A SUBGROUP ANALYSIS
Subgroup analyses are sometimes presented to “save” a failed study. This is when the primary objective of the trial was not met, but the new treatment was found to be significantly better than the comparator in a particular subset of patients. Many subgroups would have been analyzed to try to find the one (or a few) subset(s) of patients in which the new treatment was significantly better than the comparator. This is sometimes described as “data dredging” or a “fishing trip.” Misinterpretation of subgroup analyses can initiate future research based on unsubstantiated hypotheses and can even result in suboptimal patient care.5 These detrimental consequences are extremely costly but can easily be prevented by understanding the basic principles of subgroup analyses.

HOW TO CORRECTLY CONDUCT AND INTERPRET SUBGROUP ANALYSES
To conduct and interpret a subgroup analysis appropriately, it first needs to be established whether the subgroup analysis was prespecified. This is because the purposes of prespecified and unplanned subgroup analyses are distinct. Prespecified subgroup analyses are used for hypothesis testing. In contrast, unplanned (also called exploratory, retrospective, or posthoc) subgroup analyses are used for generating new hypotheses and for “robustness checking.” It is imperative to understand that both can provide valuable information but for different reasons. Conclusive inferences and any subsequent changes in clinical practice can only be made from prespecified subgroup analyses. Hence, the remainder of this article will focus on how to appropriately perform and interpret prespecified subgroup analyses only.

To overcome the two major statistical limitations of multiplicity and reduced power described above, the following five steps outline the best way to appropriately carry out, interpret, and report prespecified subgroup analyses: (i) prespecify the subgroup analysis in the protocol and/or the statistical analysis plan (SAP), (ii) use an interaction test, (iii) estimate the treatment effect for each level of the subgroup, (iv) validate results using confirmatory evidence, and (v) report results responsibly.

Prespecify the Subgroup Analysis in the Protocol and/or the SAP
Prespecified subgroup analyses are documented before any inspection of the data, whereas unplanned subgroup analyses are not. In most cases, prespecified subgroup analyses will be recorded in the protocol. However, they can also be detailed in the SAP before unblinding of the data or before the first patient visit in open-label studies. Box 1 outlines the information that should be documented when prespecifying a subgroup analysis.
Prespecified subgroup analyses are regarded as more credible because they were planned before any examination of the data. This provides reassurance against “data dredging.” However, both prespecified and unplanned subgroup analyses are prone to multiplicity, that is, the increased probability of a false-positive result because of testing multiple subgroups described above. Hence, simply prespecifying a subgroup analysis does not make it automatically valid: it must still be conducted, interpreted, and reported appropriately as outlined by the following steps.

Use an Interaction Test
Interaction tests are the most appropriate statistical method for conducting subgroup analyses. The concept of an interaction test can be illustrated with the following hypothetical example. In a randomized clinical trial, there are two treatment arms: treatment A (Tx A) and treatment B (Tx B), and the primary endpoint is overall survival. Gender is the baseline characteristic used to define the subgroup into two levels: males and females.
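
As a brief numerical aside on the multiplicity problem noted in the previous section, the sketch below computes the chance of at least one false-positive result when several subgroups are each tested at the 0.05 level and no true treatment difference exists. It assumes the tests are independent, which is an idealization (subgroups from the same trial usually overlap or are correlated), and the function name `prob_any_false_positive` is hypothetical.

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """Chance of at least one false positive among n_tests independent
    tests, each run at significance level alpha, when every null
    hypothesis is true (an idealized assumption)."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # Show how quickly the chance of a spurious "significant" subgroup grows.
    for k in (1, 5, 10, 20):
        print(f"{k:2d} subgroup tests: {prob_any_false_positive(k):.3f}")
```

Under this assumption the chance of at least one spurious "significant" subgroup is roughly 40% with 10 tests and roughly 64% with 20, which is why a single p value below 0.05 carries little weight when many subgroups have been examined.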
742 Copyright © 2010 by the International Association for the Study of Lung Cancer
Journal of Thoracic Oncology • Volume 5, Number 5, May 2010 Biostatistics Primer
FIGURE 1. What is an interaction test? In this hypothetical example, there are two treatment (Tx) arms in the clinical trial: A and B. There are also two levels of the subgroup of patients defined by the baseline characteristic of gender: males and females. The regression lines linking the circles and squares represent the efficacy of treatment A and B, respectively, for overall survival (as estimated by the log hazard from the Cox Proportional Hazards Model). The log hazard estimates the log risk of death. Hence, the higher the regression line, the higher the risk of death. The treatment effect is illustrated by an arrow in each level of the subgroup (which in this example is the log HR[Tx A vs Tx B]). If the regression lines are parallel, there is no interaction between treatment and gender (A). Hence, the treatment effect in males is the same as in females. However, if the regression lines are not parallel (B and C), there is a statistically significant interaction between treatment and gender. Thus, the treatment effect in males is significantly different to that observed in females.
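
The geometry described in the Figure 1 caption reduces to simple arithmetic on log hazards: each arrow is the difference between the log hazard on treatment A and on treatment B within one gender. The sketch below uses made-up log hazard values for three panels (they are not taken from the article), and the names `treatment_effects` and `describe` are hypothetical. It compares point estimates only, so it shows whether the lines are parallel, not whether a difference is statistically significant; that requires the interaction test.

```python
def treatment_effects(log_hazard):
    """Return (male, female) log HR for Tx A vs Tx B.

    log_hazard maps (treatment, gender) to a log hazard; each log HR is
    the arrow in Figure 1, i.e. log hazard on A minus log hazard on B.
    """
    male = log_hazard[("A", "male")] - log_hazard[("B", "male")]
    female = log_hazard[("A", "female")] - log_hazard[("B", "female")]
    return male, female


def describe(male, female):
    """Label the pattern of the two point estimates."""
    if male == female:
        return "parallel"        # same size and direction in both genders
    if male * female > 0:
        return "quantitative"    # same direction, different size
    return "qualitative"         # effects do not point the same way


# Illustrative values only, chosen to mimic panels A, B, and C.
panels = {
    "A": {("A", "male"): -0.5, ("B", "male"): 0.0,
          ("A", "female"): -0.75, ("B", "female"): -0.25},
    "B": {("A", "male"): -0.5, ("B", "male"): 0.0,
          ("A", "female"): -0.5, ("B", "female"): -0.25},
    "C": {("A", "male"): -0.5, ("B", "male"): 0.0,
          ("A", "female"): 0.0, ("B", "female"): -0.25},
}

if __name__ == "__main__":
    for name, values in panels.items():
        male, female = treatment_effects(values)
        print(name, male, female, describe(male, female))
```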
A significant interaction test shows that the treatment effect in males is not the same as in females (Figure 1). In the case of a nonsignificant interaction test, the treatment effect observed in males is not significantly different to the treatment effect observed in females. In this example, both males and females treated with treatment A had better overall survival than those patients treated with treatment B. This is shown in Figure 1A by the estimate for treatment A being lower than that for treatment B as the risk of death on treatment A is lower than on treatment B. The magnitude of the overall survival improvement observed with treatment A compared with treatment B was also the same in both males and females (as shown by the identical arrows in Figure 1A).

A significant interaction test shows that the treatment effect significantly varies across the levels of the subgroup. This can be described as either a “quantitative” or “qualitative” interaction (it may also be called heterogeneity). Figures 1B and 1C illustrate two scenarios where the interaction test was significant. In Figure 1B, both males and females assigned to treatment A experienced better overall survival than those assigned to treatment B. However, the size of the treatment effect was smaller in females than in males (as shown by the shorter arrow for females). This is an example of a “quantitative interaction.” In Figure 1C, males had better overall survival when assigned to treatment A, but females experienced worse overall survival when assigned to treatment A (because their risk of death is higher). Hence, in this example, the direction of the treatment effect in males was opposite to that observed in females (as shown by the arrows pointing in different directions). This is an example of a “qualitative interaction.”

An interaction test is usually carried out as part of a regression model. The type of regression model depends on the endpoint being analyzed. For “time-to-event” endpoints, such as overall survival and progression-free survival, a Cox Proportional Hazards model is used, whereas for binary endpoints, such as tumor response rate, a logistic regression model is used. The Cox Proportional Hazards Model is the standard method for analyzing time-to-event endpoints in clinical trials.6 Therefore, in the case of this hypothetical example, the “treatment-by-gender” interaction test is carried out by using a Cox model containing:

• A treatment term (treatment A vs treatment B)
• A gender term (males vs females)
• A treatment-by-gender interaction term (males assigned treatment A vs all other patients)
• Plus any predefined prognostic factors based on baseline patient and disease characteristics (optional)

The interaction HR is a ratio of two HRs:

$$\text{interaction HR} = \frac{\text{HR (treatment A vs treatment B) for males}}{\text{HR (treatment A vs treatment B) for females}}$$
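
In symbols, a Cox model with the treatment, gender, and interaction terms can be sketched as follows. The coding ($x_T = 1$ for treatment A and 0 for treatment B, $x_G = 1$ for males and 0 for females) and the coefficient names are assumptions introduced for illustration; the article does not specify a parameterization, and the optional prognostic factors are omitted.

```latex
\begin{aligned}
h(t \mid x_T, x_G) &= h_0(t)\,\exp\bigl(\beta_T x_T + \beta_G x_G + \beta_{TG}\, x_T x_G\bigr) \\
\text{HR (Tx A vs Tx B) for females} &= e^{\beta_T} \\
\text{HR (Tx A vs Tx B) for males} &= e^{\beta_T + \beta_{TG}} \\
\text{interaction HR} &= \frac{e^{\beta_T + \beta_{TG}}}{e^{\beta_T}} = e^{\beta_{TG}}
\end{aligned}
```

Under this coding the product $x_T x_G$ equals 1 only for males assigned treatment A, and an interaction HR of 1 corresponds to $\beta_{TG} = 0$.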
This can be alternatively written as:

$$\text{interaction HR} = \frac{\text{HR (males vs females) for patients assigned treatment A}}{\text{HR (males vs females) for patients assigned treatment B}}$$

The test for interaction has the null hypothesis that the interaction HR = 1, i.e., the treatment effect in males is the same as in females. The Cox proportional hazards model provides an estimate of the interaction HR and an associated p value.

If the p value for the interaction test is statistically significant, the null hypothesis can be rejected and a significant “treatment-by-gender interaction” can be claimed. Hence, the interaction HR differs significantly from 1. This means that the treatment effect observed in males is significantly different to the treatment effect observed in females. The size and direction of the treatment effect, i.e., HR (Tx A vs Tx B), can now be estimated for males and for females. If the interaction test result is nonsignificant, a differential treatment effect is not found, and thus, further analyses to test a predefined hypothesis are not recommended.

Estimate the Treatment Effect in Each Level of the Subgroup
An estimate of the treatment effect in males and in females can be obtained either (i) from the same Cox model described above or (ii) by removing the gender term and the “treatment-by-gender” interaction term and rerunning the model for males only and then separately for females.
Both approaches provide an HR (Tx A vs Tx B), 95% confidence intervals, and an associated p value for each level of the subgroup. These are often presented on a forest plot (Figure 2). From the estimated HRs in males and females, it can be determined whether the interaction is “quantitative” (Figure 1B) or “qualitative” (Figure 1C). If the interaction is “quantitative,” the HRs would be in the same direction, e.g., less than 1, for both males and females. In contrast, if the interaction is “qualitative” then the HRs would be in opposite directions for each level of the subgroup, e.g., an HR(Tx A vs Tx B) < 1 for males and an HR(Tx A vs Tx B) > 1 for females.
The associated p value of the HR in each level of the subgroup should be interpreted with caution. For example, suppose the associated p value = 0.001 in males and p = 0.08 in females. These p values give the probability of observing the estimated treatment difference or a more extreme one in each level of the subgroup by chance alone, given that the null hypothesis of no treatment difference is true. A common mistake is to claim that there is a differential treatment effect because the p value associated with the HR is statistically significant in males but nonsignificant in females. This is incorrect because only the interaction test p value determines whether the HR observed in males is significantly different to the HR observed in females. This is because the interaction test takes into account: (i) the prognosis of patients in different levels of the subgroup, e.g., females may have better overall survival than males regardless of the treatment they were assigned and (ii) the intergroup variability between males and females in addition to the intragroup variability.

[Figure 2 plot: Males (n=200), hazard ratio 0.80; Females (n=200), hazard ratio 1.05. Horizontal axis: Overall Survival Treatment Hazard Ratio (95% CI), from 0.6 to 1.4; values below 1.0 favor Tx A and values above 1.0 favor Tx B.]
FIGURE 2. Forest plot. Forest plots are commonly used to graphically present subgroup analysis results. Above is a hypothetical result corresponding to the qualitative interaction example described in Figure 1C. The diamond represents the point estimate of the HR(Tx A vs Tx B) and the horizontal lines the 95% confidence intervals.

Validate Subgroup Results Using Confirmatory Evidence
Validation of results is a fundamental scientific principle. To confirm a subgroup result from an individual clinical trial, presence of the subgroup effect in an independent study or meta-analysis is required. Additional, but less compelling, types of confirmatory evidence that may be used to support the validity of a subgroup analysis result include a prespecified biologic rationale and the existence of the subgroup effect for related endpoints. It should be emphasized that until confirmatory evidence exists to validate a subgroup analysis result, it is hypothesis generating only and the treatment effect observed in all randomized patients is still regarded as the most appropriate estimate for patients in each level of the subgroup.

Report Results Responsibly
Subgroup results need to be reported responsibly for others to be able to interpret them appropriately. The results of the primary endpoint analysis in all randomized patients should be emphasized in the abstract and conclusions. Furthermore, the prespecified subgroup analyses should be named, and the number of prespecified and unplanned subgroup analyses that were carried out should be clearly stated. The validity of a subgroup analysis result should also be discussed in the context of current confirmatory evidence and the scientific literature.

SUMMARY
These concepts apply to any type of endpoint, such as categorical (e.g., responder or nonresponder), continuous (e.g., systolic blood pressure), or time-to-event data (e.g., overall survival). Box 2 summarizes the key points to aid clinicians to interpret subgroup analyses correctly.
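
The common mistake of comparing subgroup p values, described under “Estimate the Treatment Effect in Each Level of the Subgroup,” can be illustrated numerically. The sketch below is not the Cox-model interaction term described in the article; it swaps in a simpler large-sample approximation that compares two independently estimated log HRs with a z test. All HRs and standard errors are invented for illustration, and the function names are hypothetical.

```python
import math


def two_sided_p(z):
    """Two-sided p value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def subgroup_p(hr, se_log_hr):
    """p value for HR = 1 within one subgroup (Wald test on the log scale)."""
    return two_sided_p(math.log(hr) / se_log_hr)


def interaction_test(hr_male, se_male, hr_female, se_female):
    """Approximate interaction HR (male HR / female HR) and its p value.

    Assumes the two log HRs come from separate, independent subgroup
    models, so their variances add.
    """
    diff = math.log(hr_male) - math.log(hr_female)
    se_diff = math.sqrt(se_male ** 2 + se_female ** 2)
    return math.exp(diff), two_sided_p(diff / se_diff)


# Hypothetical subgroup summaries: (HR for Tx A vs Tx B, SE of log HR).
MALE = (0.60, 0.155)
FEMALE = (0.75, 0.165)

if __name__ == "__main__":
    print("male p value:", subgroup_p(*MALE))
    print("female p value:", subgroup_p(*FEMALE))
    ratio, p_int = interaction_test(*MALE, *FEMALE)
    print("interaction HR:", ratio, "interaction p value:", p_int)
```

With these invented inputs the male subgroup is significant at the 0.05 level and the female subgroup is not, yet the interaction test is far from significant, so no differential treatment effect can be claimed.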