
European Urology 75 (2019) 358–367

available at www.sciencedirect.com
journal homepage: www.europeanurology.com

Platinum Opinion – Editor’s Choice

Guidelines for Reporting of Statistics for Clinical Research in Urology

Melissa Assel a, Daniel Sjoberg a, Andrew Elders b, Xuemei Wang c, Dezheng Huo d, Albert Botchway e, Kristin Delfino e, Yunhua Fan f, Zhiguo Zhao h, Tatsuki Koyama h, Brent Hollenbeck i, Rui Qin j, Whitney Zahnd k, Emily C. Zabor a, Michael W. Kattan g, Andrew J. Vickers a,*

a Memorial Sloan Kettering Cancer Center, New York, NY, USA; b Glasgow Caledonian University, Glasgow, UK; c The University of Texas MD Anderson Cancer Center, Houston, TX, USA; d The University of Chicago, Chicago, IL, USA; e Southern Illinois University School of Medicine, Springfield, IL, USA; f University of Minnesota, Minneapolis, MN, USA; g Cleveland Clinic, Cleveland, OH, USA; h Vanderbilt University Medical Center, Nashville, TN, USA; i University of Michigan, Ann Arbor, MI, USA; j Janssen Research & Development, NJ, USA; k University of South Carolina, Columbia, SC, USA

* Corresponding author. Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, 485 Lexington Avenue, 2nd Floor, New York, NY 10017, USA. E-mail address: [email protected] (A.J. Vickers).
https://doi.org/10.1016/j.eururo.2018.12.014
0302-2838/© 2018 European Association of Urology. Published by Elsevier B.V. All rights reserved.

It is widely acknowledged that the quality of statistics in the clinical research literature is poor. This is true for urology just as it is for other medical specialties. In 2005, Scales et al [1] published a systematic evaluation of the statistics in papers appearing in a single month in four leading urology journals: European Urology, The Journal of Urology, Urology, and BJUI. They reported widespread errors, including 71% of papers with comparative statistics having at least one statistical flaw. These findings mirror many others in the literature: see, for instance, the review given by Lang and Altman [2]. The quality of statistical reporting in urology journals has no doubt improved since 2005, but remains unsatisfactory.

The four urology journals in the Scales et al [1] review have come together to publish a shared set of statistical guidelines, adapted from those in use at one of the journals, European Urology, since 2014 [3]. The guidelines will also be adopted by European Urology Focus and European Urology Oncology. Statistical reviewers at the four journals will systematically assess submitted manuscripts using the guidelines to improve statistical analysis, reporting, and interpretation. Adoption of the guidelines will, in our view, not only increase the quality of published papers in our journals, but also improve statistical knowledge in our field in general. Asking an author to follow a guideline about, say, the fallacy of accepting the null hypothesis would no doubt result in a better paper, but we hope that it would also enhance the author's understanding of hypothesis tests.

The guidelines are didactic, based on the consensus of the statistical consultants to the journals. We avoided, where possible, making specific analytic recommendations and focused instead on analyses or methods of reporting statistics that should be avoided. We intend to update the guidelines over time and hence encourage readers who question the value or rationale of a guideline to write to the authors.

1. The golden rule

1.1. Break any of the guidelines if it makes scientific sense to do so

Science varies too much to allow methodologic or reporting guidelines to apply universally.

2. Reporting of design and statistical analysis

2.1. Follow existing reporting guidelines for the type of study you are reporting, such as CONSORT for randomized trials, REMARK for marker studies, TRIPOD for prediction models, STROBE for observational studies, or AMSTAR for systematic reviews
Statisticians and methodologists have contributed extensively to a large number of reporting guidelines. The first is widely recognized to be the Consolidated Standards of Reporting Trials (CONSORT) statement on reporting of randomized trials, but there are now many other guidelines, covering a wide range of different types of study. Reporting guidelines can be downloaded from the Equator website (http://www.equator-network.org).

2.2. Describe cohort selection fully

It is insufficient to state, for instance, that "the study cohort consisted of 1144 patients treated for benign prostatic hyperplasia at our institution." The cohort needs to be defined in terms of dates (eg, "presenting March 2013 to December 2017"), inclusion criteria (eg, "IPSS > 12"), and whether patients were selected to be included (eg, for a research study) versus being a consecutive series. Exclusions should be described one by one, with the number of patients omitted for each exclusion criterion, to give the final cohort size (eg, "patients with prior surgery [n = 43], allergies to 5-ARIs [n = 12], and missing data on baseline prostate volume [n = 86] were excluded to give a final cohort for analysis of 1003 patients"). Note that the inclusion criteria can be omitted if obvious from the context (eg, no need to state "undergoing radical prostatectomy for histologically proven prostate cancer"); on the contrary, dates may need to be explained if their rationale could be questioned (eg, "March 2013, when our specialist voiding clinic was established, to December 2017").
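As a minimal illustration, an exclusion cascade of this kind can be tabulated directly from the analysis data set. The sketch below is not from the article; the column names (prior_surgery, allergy_5ari, prostate_volume) are hypothetical.

import pandas as pd

# Hypothetical analysis data set; columns are illustrative only.
df = pd.DataFrame({
    "prior_surgery": [False, True, False, False],
    "allergy_5ari": [False, False, True, False],
    "prostate_volume": [45.0, 50.0, 38.0, None],
})

steps = [
    ("prior surgery", ~df["prior_surgery"]),
    ("allergy to 5-ARIs", ~df["allergy_5ari"]),
    ("missing baseline prostate volume", df["prostate_volume"].notna()),
]
cohort = df
print(f"Assessed: n = {len(cohort)}")
for label, keep in steps:
    excluded = (~keep.loc[cohort.index]).sum()
    cohort = cohort[keep.loc[cohort.index]]
    print(f"Excluded for {label}: n = {excluded}; remaining n = {len(cohort)}")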
2.3. Describe the practical steps of randomization in randomized trials

Although this reporting guideline is part of the CONSORT statement, it is so critical and so widely misunderstood that it bears repeating. The purpose of randomization is to prevent selection bias. This can be achieved only if the consenting patients cannot guess their treatment allocation before registration in the trial or change it afterward. This safeguard is known as allocation concealment. Stating merely that "a randomization list was created by a statistician" or that "envelope randomization was used" does not ensure allocation concealment: a list could have been posted in the nurse's station for all to see; envelopes can be opened and resealed. Investigators need to specify the exact logistic steps taken to ensure allocation concealment. The best method is to use a password-protected computer database.

2.4. The statistical methods should describe the study questions and the statistical approaches used to address each question

Many statistical methods sections state only something like "Mann-Whitney was used for comparisons of continuous variables and Fisher's exact for comparisons of binary variables." This says little more than "the inference tests used were not grossly erroneous for the type of data." Instead, statistical methods sections should lay out each primary study question separately: carefully detail the analysis associated with each and describe the rationale for the analytic approach, if this is not obvious or there are reasonable alternatives. Special attention and description should be provided for rarely used statistical techniques.

2.5. The statistical methods should be described in sufficient detail to allow replication by an independent statistician given the same data set

Vague reference to "adjusting for confounders" or "nonlinear approaches" is insufficiently specific to allow replication, a cornerstone of the scientific method. All statistical analyses should be specified in the Methods section, including details such as the covariates included in a multivariable model. All variables should be clearly defined where there is room for ambiguity. For instance, avoid saying that "Gleason grade was included in the model"; state instead "Gleason grade group was included in four categories: 1, 2, 3, and 4 or 5."

3. Inference and p values

3.1. Do not accept the null hypothesis

In a court case, defendants are declared guilty or not guilty; there is no verdict of "innocent." Similarly, in a statistical test, the null hypothesis is rejected or not rejected. If the p value is 0.05 or higher, investigators should avoid conclusions such as "the drug was ineffective," "there was no difference between groups," or "response rates were unaffected." Instead, authors should use phrases such as "we did not see evidence of a drug effect," "we were unable to demonstrate a difference between groups," or simply "there was no statistically significant difference in response rates."

3.2. P values just above 5% are not a trend, and they are not moving

Avoid saying that a p value such as 0.07 shows a "trend" (which is meaningless) or "approaches statistical significance" (because the p value is not moving). Alternative language might be that "although we saw some evidence of improved response rates in patients receiving the novel procedure, differences between groups did not meet conventional levels of statistical significance."

3.3. The p values and 95% confidence intervals do not quantify the probability of a hypothesis

A p value of, say, 0.03 does not mean that there is a 3% probability that the findings are due to chance. Additionally, a 95% confidence interval (CI) should not be interpreted as a 95% certainty that the true parameter value is in the range of the 95% CI. The correct interpretation of a p value is the probability of finding the observed or more extreme results when the null hypothesis is true; the 95% CI will contain the true parameter value 95% of the time were a study to be repeated many times using different samples.
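The coverage interpretation of a CI is easy to verify by simulation. A minimal sketch (my own illustration, not from the article), drawing repeated samples from a normal distribution with a known mean:

import numpy as np

rng = np.random.default_rng(0)
true_mean, sd, n, reps = 10.0, 2.0, 50, 10_000
covered = 0
for _ in range(reps):
    x = rng.normal(true_mean, sd, n)
    se = x.std(ddof=1) / np.sqrt(n)
    covered += x.mean() - 1.96 * se <= true_mean <= x.mean() + 1.96 * se

# Approximately 95% of the intervals contain the true mean.
print(covered / reps)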

3.4. Do not use confidence intervals to test hypotheses

Investigators often interpret confidence intervals in terms of hypotheses. For instance, investigators might claim that there is a statistically significant difference between groups because the 95% CI for the odds ratio excludes 1. Such claims are problematic because confidence intervals are concerned with estimation, and not inference. Moreover, the mathematical methods used to calculate confidence intervals may be different from those used to calculate p values. It is perfectly possible to have a 95% CI that includes no difference between groups even though the p value is <0.05, or vice versa. For instance, in a study of 100 patients in two equal groups, with event rates of 70% and 50%, the p value from Fisher's exact test is 0.066 but the 95% CI for the odds ratio is 1.03–5.26. The 95% CIs for the risk difference and risk ratio also exclude no difference between groups.
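The article's example can be reproduced with scipy. The odds-ratio CI below uses the standard Woolf (log) method, one of several CI methods, so it may differ slightly in the last decimal from the figures quoted above; this is a sketch, not the authors' exact calculation.

import numpy as np
from scipy.stats import fisher_exact

# 2x2 table: rows = groups (n = 50 each), columns = event yes/no.
table = np.array([[35, 15], [25, 25]])  # event rates 70% and 50%
odds_ratio, p = fisher_exact(table)
print(f"Fisher's exact p = {p:.3f}")  # approx. 0.066

# Woolf 95% CI for the odds ratio: exp(log OR +/- 1.96 * SE(log OR)).
se = np.sqrt((1.0 / table).sum())
lo, hi = np.exp(np.log(odds_ratio) - 1.96 * se), np.exp(np.log(odds_ratio) + 1.96 * se)
print(f"OR = {odds_ratio:.2f}, 95% CI {lo:.2f}-{hi:.2f}")  # approx. 1.03-5.30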
3.5. Take care to interpret results when reporting multiple p values

The more questions you ask, the more likely you are to get a spurious answer to at least one of them. For example, if you report p values for five independent true null hypotheses, the probability that you will falsely reject at least one is not 5%, but >20%. Although formal adjustment of p values is appropriate in some specific cases, such as genomic studies, a more common approach is to simply interpret p values in the context of multiple testing. For instance, if an investigator examines the association of 10 variables with three different endpoints, thereby testing 30 separate hypotheses, a p value of 0.04 should not be interpreted in the same way as if the study tested only a single hypothesis with a p value of 0.04.
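The >20% figure follows directly from the complement rule; a quick check:

# P(at least one false positive) = 1 - (1 - alpha)^k for k independent true nulls.
alpha = 0.05
for k in (5, 30):
    print(k, round(1 - (1 - alpha) ** k, 3))
# 5 tests: 0.226 (>20%); 30 tests: 0.785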
3.6. Do not report separate p values for each of two different groups in order to address the question of whether there is a difference between groups

One scientific question means one statistical hypothesis tested by one p value. To illustrate the error of using two p values to address one question, take the case of a randomized trial of drug versus placebo to reduce voiding symptoms, with 30 patients in each group. The authors might report that symptom scores improved by 6 (standard deviation 14) points in the drug group (p = 0.03 by one-sample t test) and by 5 (standard deviation 15) points in the placebo group (p = 0.08). However, the study hypothesis concerns the difference between drug and placebo. To test a single hypothesis, a single p value is needed. A two-sample t test for these data gives a p value of 0.8—unsurprising, given that the scores in each group were virtually the same—confirming that it would be unsound to conclude that the drug was effective based on the finding that the change was significant in the drug group but not in placebo controls.
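The correct two-sample test can be run from the reported summary statistics alone; a sketch using scipy:

from scipy.stats import ttest_ind_from_stats

# Change scores: drug 6 (SD 14), placebo 5 (SD 15), n = 30 per group.
t, p = ttest_ind_from_stats(mean1=6, std1=14, nobs1=30,
                            mean2=5, std2=15, nobs2=30)
print(f"t = {t:.2f}, p = {p:.2f}")  # p approx. 0.8: no evidence of a drug effect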
3.7. Use interaction terms in place of subgroup analyses

A similar error to the use of separate tests for a single hypothesis is when an intervention is shown to have a statistically significant effect in one group of patients but not in another. A more appropriate approach is to use what is known as an interaction term in a statistical model. For instance, to determine whether a drug reduced pain scores more in women than in men, the model might be as follows:

{final pain score} = b0 + b1 × {baseline pain score} + b2 × {drug} + b3 × {sex} + b4 × {drug} × {sex}

It is sometimes appropriate to report estimates and confidence intervals within subgroups of interest, but p values should be avoided.
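In statsmodels formula notation, drug * sex expands to the two main effects plus the b4 interaction term. The data frame below is simulated purely for illustration; all column names are assumptions.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "baseline_pain": rng.normal(6, 2, n),
    "drug": rng.integers(0, 2, n),  # 1 = drug, 0 = placebo
    "sex": rng.integers(0, 2, n),   # 1 = female, 0 = male
})
# Simulate a drug effect that is larger in women (interaction = -1).
df["final_pain"] = (0.8 * df["baseline_pain"] - 1.0 * df["drug"]
                    - 1.0 * df["drug"] * df["sex"] + rng.normal(0, 1, n))

fit = smf.ols("final_pain ~ baseline_pain + drug * sex", data=df).fit()
# The 'drug:sex' coefficient tests whether the drug effect differs by sex.
print(fit.summary().tables[1])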
3.8. Tests for change over time are generally uninteresting

A common analysis is to conduct a paired t test comparing, say, erectile function in older men at baseline with erectile function after 5 yr of follow-up. The null hypothesis here is that "erectile function does not change over time," which is known to be false. Investigators are encouraged to focus on estimation rather than on inference, reporting, for example, the mean change over time along with a 95% CI.

3.9. Avoid using statistical tests to determine the type of analysis to be conducted

Numerous statistical tests are available that can be used to determine how a hypothesis test should be conducted. For instance, investigators might conduct a Shapiro-Wilk test for normality to determine whether to use a t test or a Mann-Whitney test, Cochran's Q to decide whether to use a fixed-effect or a random-effect approach in a meta-analysis, or a t test for between-group differences in a covariate to determine whether that covariate should be included in a multivariable model. The problem with these sorts of approaches is that they are often testing a null hypothesis that is known to be false. For instance, no data set perfectly follows a normal distribution. Moreover, it is often questionable whether changing the statistical approach in the light of the test is actually of benefit. Statisticians are far from unanimous as to whether Mann-Whitney is always superior to the t test when data are non-normal, whether fixed effects are invalid under study heterogeneity, or whether the criterion for adjusting for a variable should be whether it is significantly different between groups. Investigators should generally follow a prespecified analytic plan, only altering the analysis if the data unambiguously point to a better alternative.

3.10. When reporting p values, be clear about the hypothesis tested and ensure that the hypothesis is a sensible one

P values test very specific hypotheses. When reporting a p value in the Results section, state the hypothesis being tested unless this is completely clear. Take, for instance, the statement "pain scores were higher in group 1 and similar in groups 2 and 3 (p = 0.02)." It is ambiguous whether the p value of 0.02 is testing group 1 versus groups 2 and
3 combined or the hypothesis that pain score is the same in all three groups. Clarity about the hypotheses being tested can help avoid the testing of inappropriate hypotheses. For instance, a p value for differences between groups at baseline in a randomized trial tests a null hypothesis that is known to be true (informally, that any observed differences between groups are due to chance).

4. Reporting of study estimates
4.1. Use appropriate levels of precision

Reporting a p value of 0.7345 suggests that there is an appreciable difference between p values of 0.7344 and 0.7346. Reporting that 16.9% of 83 patients responded entails a precision (to the nearest 0.1%) that is nearly 200 times greater than the width of the confidence interval (10–27%). Reporting in a clinical study that the mean calorie consumption was 2069.9 suggests that calorie consumption can be measured extremely precisely by a food questionnaire. Some might argue that being overly precise is irrelevant, because the extra numbers can always be ignored. The counterargument is that investigators should think very hard about every number they report, rather than just carelessly cutting and pasting numbers from the statistical software printout. The specific guidelines for precision are as follows (a code sketch applying the p value rules appears after the list):

1. Report p values to a single significant figure unless the p value is close to 0.05 (say, 0.01–0.2), in which case, report two significant figures. Do not report "not significant" for p values of 0.05 or higher. Very low p values can be reported as p < 0.001 or similar. A p value can indeed be 1, although some investigators prefer to report this as >0.9. For instance, the following p values are reported to appropriate precision: <0.001, 0.004, 0.045, 0.13, 0.3, 1.
2. Report percentages, rates, and probabilities to two significant figures, for example, 75%, 3.4%, 0.13%.
3. Do not report p values of 0, as any experimental result has a nonzero probability.
4. Do not give decimal places if a probability or proportion is 1 (eg, a p value of 1.00 or a percentage of 100.00%). The decimal places suggest that it is possible to have, say, a p value of 1.05. There is a similar consideration for data that can take only integer values. It makes sense to state that, for instance, the mean number of pregnancies was 2.4, but not that 29% of women reported 1.0 pregnancy.
5. There is generally no need to report estimates to more than three significant figures.
6. Hazard and odds ratios are normally reported to two decimal places, although this can be avoided for high odds ratios (eg, 18.2 rather than 18.17).
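A minimal helper implementing rule 1; this is my own sketch of the stated rules, not code from the article.

def format_p(p: float) -> str:
    """Format a p value per rule 1 above."""
    if p < 0.001:
        return "<0.001"
    if p > 0.9:
        return ">0.9"
    # Two significant figures near 0.05 (0.01-0.2), one otherwise.
    digits = 2 if 0.01 <= p <= 0.2 else 1
    return f"{p:.{digits}g}"

# Prints: <0.001, 0.004, 0.045, 0.13, 0.3, >0.9
print(", ".join(format_p(p) for p in (0.0004, 0.004, 0.0447, 0.132, 0.3345, 1.0)))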
4.2. Avoid redundant statistics in cohort descriptions

Authors should be selective about the descriptive statistics reported, and ensure that each and every number provides unique information. Authors should avoid reporting descriptive statistics that can readily be derived from the data that have already been provided. For instance, there is no need to state that in a cohort, 40% were men and 60% were women; choose one or the other. Another common error is to include a column of descriptive statistics for each of two groups separately and then for the whole cohort combined. If, say, the median age is 60 in group 1 and 62 in group 2, we do not need to be told that the median age in the cohort as a whole is close to 61.
4.3. For descriptive statistics, median and quartiles are preferred over means and standard deviations (or standard errors); range should be avoided

The median and quartiles provide all sorts of useful information; for instance, 50% of patients had values above the median or between the quartiles. The range gives the values of just two patients and so is generally uninformative of the data distribution.

4.4. Report estimates for the main study questions

A clinical study typically focuses on a limited number of scientific questions. Authors should generally provide an estimate for each of these questions. In a study comparing two groups, for instance, authors should give an estimate of the difference between groups, and avoid giving only data on each group separately or simply saying that the difference was or was not significant. In a study of a prognostic factor, authors should give an estimate of the strength of the prognostic factor, such as an odds ratio or a hazard ratio, as well as reporting a p value testing the null hypothesis of no association between the prognostic factor and outcome.

4.5. Report confidence intervals for the main estimates of interest

Authors should generally report a 95% CI around the estimates relating to the key research questions, but not other estimates given in a paper. For instance, in a study comparing two surgical techniques, the authors might report adverse event rates of 10% and 15%; however, the key estimate in this case is the difference between groups, so this estimate, 5%, should be reported along with a 95% CI (eg, 1–9%). Confidence intervals should not be reported for the estimates within each group (eg, adverse event rate in group A of 10%, 95% CI 7–13%). Similarly, confidence intervals should not be given for statistics such as mean age or gender ratio.
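A normal-approximation CI for the risk difference can be computed by hand. The group sizes below (500 per arm) are hypothetical, chosen so that the interval is similar to the 1–9% quoted above; with different sample sizes the interval changes.

import numpy as np

n1, n2 = 500, 500     # hypothetical group sizes
p1, p2 = 0.15, 0.10   # adverse event rates
diff = p1 - p2
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"risk difference {diff:.0%}, 95% CI {lo:.1%} to {hi:.1%}")  # approx. 1% to 9%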
4.6. Do not treat categorical variables as continuous

Variables such as Gleason grade groups are scored 1–5, but it is not true that the difference between groups 3 and 4 is half as great as the difference between groups 2 and 4. Variables such as Gleason grade groups should be reported as categories (eg, 40% grade group 1, 20% group 2, 20% group 3, 20% groups 4 and 5) rather than as a continuous variable (eg, mean Gleason score of 2.4). Similarly, categorical variables such as Gleason should be entered into regression models not as a single variable (eg, a hazard ratio of 1.5 per 1-point increase in Gleason grade group) but as multiple categories (eg, a hazard ratio of 1.6 comparing Gleason grade group 2 with group 1 and a hazard ratio of 3.9 comparing group 3 to group 1).
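In formula-based software this amounts to declaring the variable categorical. With statsmodels, wrapping the predictor in C() produces one coefficient per grade group relative to the reference group. The data are simulated, and a logistic model is used here for simplicity rather than the Cox models in the examples above.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({"grade_group": rng.integers(1, 6, 500)})
df["recurrence"] = rng.binomial(1, 0.1 + 0.08 * (df["grade_group"] - 1))

# C() treats grade group as categorical: one odds ratio per group vs group 1.
fit = smf.logit("recurrence ~ C(grade_group)", data=df).fit(disp=False)
print(np.exp(fit.params))  # odds ratios for groups 2-5 vs group 1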
4.7. Avoid categorization of continuous variables unless there is a convincing rationale

A common approach to a variable such as age is to define patients as either old (aged ≥60 yr) or young (aged <60 yr) and then enter age into analyses as a categorical variable, reporting, for example, that "patients aged 60 and over had twice the risk of an operative complication compared with patients aged less than 60." In epidemiologic and marker studies, a common approach is to divide a variable into quartiles and report a statistic such as a hazard ratio for each quartile compared with the lowest ("reference") quartile. This is problematic because it assumes that all values of a variable within a category are the same. For instance, it is likely not the case that a patient aged 65 yr has the same risk as a patient aged 90 yr, but a very different risk from that of a patient aged 64 yr. It is generally preferable to leave variables in a continuous form, reporting, for instance, how risk changes with a 10-yr increase in age. Nonlinear terms can also be used, to avoid the assumption that the association between age and risk follows a straight line.
4.8. Do not use statistical methods to obtain cut-points for clinical practice

Various statistical methods are available to dichotomize a continuous variable. For instance, outcomes can be compared on either side of several different cut-points and the optimal cut-point chosen as the one associated with the smallest p value. Alternatively, investigators might choose a cut-point that leads to the highest value of sensitivity + specificity, that is, the point closest to the top left-hand corner of a receiver operating characteristic (ROC) curve. Such methods are inappropriate for determining clinical cut-points because they do not consider clinical consequences. The ROC approach, for instance, assumes that sensitivity and specificity are of equal value, whereas it is generally worse to miss disease than to treat unnecessarily. The smallest p value approach tests strength of evidence against the null hypothesis, which has little to do with the relative benefits and harms of a treatment or further diagnostic workup.
4.9. The association between a continuous predictor and outcome can be demonstrated graphically, particularly by using nonlinear modeling

In high-school mathematics, we often thought about the relationship between y and x by plotting a line on a graph, with a scatterplot added in some cases. This also holds true for many scientific studies. In the case of a study of age and complication rates, for instance, an investigator could plot age on the x axis against the risk of a complication on the y axis and show a regression line, perhaps with a 95% CI. Nonlinear modeling is often useful because it avoids assuming a linear relationship and allows the investigator to determine questions such as whether risk starts to increase disproportionately beyond a given age.
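One common nonlinear approach is a spline. A sketch using a patsy B-spline basis inside a statsmodels logistic model; the data are simulated and all names are illustrative.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
age = rng.uniform(40, 85, 1000)
# Simulated risk that rises disproportionately after about age 70.
risk = 1 / (1 + np.exp(-(-4 + 0.08 * np.clip(age - 70, 0, None) ** 1.5)))
df = pd.DataFrame({"age": age, "complication": rng.binomial(1, risk)})

# bs() builds a B-spline basis for age, relaxing the linearity assumption.
fit = smf.logit("complication ~ bs(age, df=4)", data=df).fit(disp=False)
grid = pd.DataFrame({"age": np.linspace(40, 85, 100)})
pred = fit.predict(grid)  # predicted risk across ages, suitable for plotting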
4.10. Do not ignore significant heterogeneity in meta-analyses

Informally speaking, heterogeneity statistics test whether variations between the results of different studies in a meta-analysis are consistent with chance or whether such variation reflects, at least in part, true differences between studies. If heterogeneity is present, authors need to do more than merely report the p value and focus on the random-effect estimate. Authors should investigate the sources of heterogeneity and try to determine the factors that lead to differences in study results, for example, by identifying common features of studies with similar findings or idiosyncratic aspects of studies with outlying results.
4.11. For time-to-event variables, report the number of events but not the proportion

Take the case of a study that reported the following: "of 60 patients accrued, 10 (17%) died." Although it is important to report the number of events, patients entered the study at different times and were followed for different periods; hence, the reported proportion of 17% is meaningless. The standard statistical approach to time-to-event variables is to calculate probabilities, such as the risk of death being 60% by 5 yr or the median survival—the time at which the probability of survival first drops below 50%—being 52 mo.
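Kaplan-Meier estimates of this kind are straightforward with the lifelines package. The durations below are simulated and the column names are illustrative.

import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "years": rng.exponential(6, 300).round(2),
    "died": rng.binomial(1, 0.6, 300),  # 0 = censored
})

kmf = KaplanMeierFitter()
kmf.fit(df["years"], event_observed=df["died"])
print(kmf.median_survival_time_)  # median survival
print(1 - kmf.predict(5.0))       # probability of death by 5 yr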
4.12. For time-to-event analyses, report median follow-up for patients without the event or the number followed without an event at a given follow-up time

It is often useful to describe how long a cohort has been followed. To illustrate the appropriate methods of doing so, take the case of a cohort of 1000 pediatric cancer patients treated in 1970 and followed to 2010. If the cure rate was only 40%, median follow-up for all patients might only be a few years; however, the median follow-up for patients who survived was 40 yr. This latter statistic gives a much better impression of how long the cohort had been followed. Now assume that in 2009, a second cohort of 2000 patients was added to the study. The median follow-up for survivors will now be around a year, which is again misleading. An alternative would be to report a statistic such as "312 patients have been followed without an event for at least 35 years."
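Both statistics fall out of simple filtering; a sketch with a hypothetical data frame in which died = 0 means event free:

import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({
    "years": rng.uniform(0, 40, 1000),
    "died": rng.binomial(1, 0.6, 1000),
})

event_free = df.loc[df["died"] == 0, "years"]
print(f"Median follow-up among event-free patients: {event_free.median():.1f} yr")
print(f"Followed without an event for at least 35 yr: {(event_free >= 35).sum()}")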

4.13. For time-to-event analyses, describe when follow-up starts and when and how patients are censored

A common error is that investigators use a censoring date that leads to an overestimate of survival. For example, when assessing metastasis-free survival, a patient without a record of metastasis should be censored on the date of the last time the patient was known to be free of metastasis (eg, negative bone scan, undetectable prostate-specific antigen [PSA]), and not at the date of last patient contact (which may not have involved assessment of metastasis). For overall survival, the date of last patient contact would be an acceptable censoring date because the patient was indeed known to be event free at that time. When assessing cause-specific endpoints, special consideration should be given to the cause of death. The endpoints "disease-specific survival" and "disease-free survival" have specific definitions, and require careful attention to methods. With disease-specific survival, authors need to consider carefully how to handle death due to other causes. One approach is to censor patients at the time of death, but this can lead to bias in certain circumstances, such as when the predictor of interest is associated with other-cause death and the probability of other-cause death is moderate or high. A competing risk analysis is appropriate in these situations. With disease-free survival, both evidence of disease (eg, disease recurrence) and death from any cause are counted as events, and so censoring at the time of other-cause death is inappropriate. If investigators are specifically interested only in the former and wish to censor deaths from other causes, they should define their endpoint as "freedom from progression."
often approximate and so may mask differences between
4.14. For time-to-event analyses, avoid reporting mean follow-up or survival time, or estimates of survival in those who had the event

All three estimates are problematic in the context of censored data.
also likely that those with T2c disease in that group will fall
4.15. For time-to-event analyses, make sure that all predictors are known at time zero or consider alternative approaches such as a landmark analysis or time-dependent covariates

In many cases, variables of interest vary over time. As a simple example, imagine that we were interested in whether PSA velocity predicted time to progression in prostate cancer patients on active surveillance. The problem is that PSA is measured at various time points after diagnosis. Unless they were being careful, investigators might use time from diagnosis in a Kaplan-Meier or Cox regression, but use PSA velocity calculated on PSA values measured at 1- and 2-yr follow-up. As another example, investigators might determine whether response to chemotherapy predicts cancer survival, but measure survival from the time of the first dose, before response is known. It is obviously invalid to use information known only "after the clock starts." There are two main approaches to this problem. A "landmark analysis" is often used when the variable of interest is generally known within a short and well-defined period of time, such as adjuvant therapy or chemotherapy response. In brief, the investigators start the clock at a fixed "landmark" (eg, 6 mo after surgery). Patients are eligible only if they are still at risk at the landmark (eg, patients who recur before 6 mo are excluded) and the status of the variable is fixed at that time (eg, a patient who receives chemotherapy at 7 mo is defined as being in the no adjuvant group). Alternatively, investigators can use a time-dependent variable approach. In brief, this "resets the clock" each time new information is available about a variable. This would be the approach most typically used for the PSA velocity and progression example.
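A minimal landmark-analysis sketch in pandas; all column names are hypothetical and times are in months.

import numpy as np
import pandas as pd

LANDMARK = 6  # months after surgery

# Hypothetical columns: months_to_event, event, months_to_adjuvant (NaN = never).
df = pd.DataFrame({
    "months_to_event": [4.0, 9.0, 30.0, 18.0],
    "event": [1, 1, 0, 1],
    "months_to_adjuvant": [np.nan, 7.0, 2.0, np.nan],
})

# Eligible only if still at risk at the landmark (the first patient is excluded).
lm = df[df["months_to_event"] > LANDMARK].copy()
# Fix covariate status at the landmark: therapy begun at 7 mo counts as "no adjuvant".
lm["adjuvant"] = lm["months_to_adjuvant"] <= LANDMARK
# Restart the clock at the landmark before fitting, eg, a Cox model.
lm["months_from_landmark"] = lm["months_to_event"] - LANDMARK
print(lm)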
4.16. When presenting Kaplan-Meier figures, present the number at risk and truncate follow-up when numbers are low

Giving the number at risk is useful for helping to understand when patients were censored. When presenting Kaplan-Meier figures, a good rule of thumb is to truncate follow-up when the number at risk in any group falls below 5 (or even 10), as the tail of a Kaplan-Meier distribution is very unstable.
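lifelines can print at-risk tables under a plot, and truncation can be handled by limiting the x axis to the last time at which at least five patients remain at risk. A sketch with simulated data:

import matplotlib.pyplot as plt
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.plotting import add_at_risk_counts

rng = np.random.default_rng(5)
kmf_a, kmf_b = KaplanMeierFitter(), KaplanMeierFitter()
kmf_a.fit(rng.exponential(6, 200), rng.binomial(1, 0.7, 200), label="group A")
kmf_b.fit(rng.exponential(8, 200), rng.binomial(1, 0.7, 200), label="group B")

fig, ax = plt.subplots()
kmf_a.plot_survival_function(ax=ax)
kmf_b.plot_survival_function(ax=ax)
add_at_risk_counts(kmf_a, kmf_b, ax=ax)  # number-at-risk table under the plot

# Truncate follow-up where the number at risk in either group falls below 5.
t_max = min(kmf_a.event_table.query("at_risk >= 5").index.max(),
            kmf_b.event_table.query("at_risk >= 5").index.max())
ax.set_xlim(0, t_max)
plt.show()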
5. Multivariable models and diagnostic tests

5.1. Multivariable, propensity, and instrumental variable analyses are not a magic wand

Some investigators assume that multivariable adjustment "removes confounding," "makes groups similar," or "mimics a randomized trial." There are two problems with such claims. First, the value of a variable recorded in a data set is often approximate and so may mask differences between groups. For instance, clinical stage might be used as a covariate in a study comparing treatments for localized prostate cancer. However, stage T2c might constitute a small nodule on each prostate lobe or, alternatively, most of the prostate consisting of a large, hard mass. The key point is that if one group has more T2c disease than the other, it is also likely that those with T2c disease in that group will fall toward the more aggressive end of the spectrum. Multivariable adjustment has the effect of making the rates of T2c in each group the same, but does not ensure that the type of T2c is identical. Second, a model adjusts for only a small number of measured covariates, which does not exclude the possibility of important differences in unmeasured (or even unmeasurable) covariates. A common assumption is that propensity methods somehow provide better adjustment for confounding than traditional multivariable methods. Except in certain rare circumstances, such as when the number of covariates is large relative to the number of events, propensity methods give extremely similar results to multivariable regression. Similarly, instrumental variables analyses depend on the availability of a good instrument, which is less common than is often assumed. In many cases, the instrument is not strongly associated with the intervention, leading to a large increase in the 95% CI or, in some cases, an underestimate of treatment effects.

5.2. Avoid stepwise selection

Investigators commonly choose which variables to include in a multivariable model by first determining which variables are statistically significant on univariable analysis; alternatively, they may include all variables in a single model and then remove those that are not significant. This type of data-dependent variable selection in regression models has several undesirable properties, increasing the risk of overfit and making many statistics, such as the 95% CI, highly questionable. The use of stepwise selection should be restricted to a limited number of circumstances, such as during the initial stages of developing a model, if there is poor knowledge of what variables might be predictive.
5.7. Discrimination is a property not of a multivariable model
5.3. Avoid reporting estimates such as odds or hazard ratios for covariates when examining the effects of interventions

In a typical observational study, an investigator might explore the effects of two different approaches to radical prostatectomy on recurrence while adjusting for covariates such as stage, grade, and PSA. It is rarely worth reporting estimates such as odds or hazard ratios for the covariates. For instance, it is well known that a high Gleason score is strongly associated with recurrence: reporting a hazard ratio of, say, 4.23 is not helpful and is a distraction from the key finding—the hazard ratio between the two types of surgery.
population sample. Authors need to consider these points
5.4. Rescale predictors to obtain interpretable estimates

Predictors sometimes have a moderate association with outcome and can take a large range of values. This can lead to uninterpretable estimates. For instance, the odds ratio for cancer per year of age might be given as 1.02 (95% CI 1.01, 1.02; p < 0.0001). It is not helpful to have the upper bound of a confidence interval be equivalent to the central estimate; a better alternative would be to report an odds ratio per 10 yr of age. This is simply achieved by creating a new variable equal to age divided by 10, to obtain an odds ratio of 1.16 (95% CI 1.10, 1.22; p < 0.0001) per 10-yr difference in age.
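The rescaling is a one-line transformation before model fitting; a sketch with simulated data and hypothetical column names (the simulated effect size is chosen to roughly match the odds ratios quoted above):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
df = pd.DataFrame({"age": rng.uniform(40, 80, 5000)})
df["cancer"] = rng.binomial(1, 1 / (1 + np.exp(-(-5 + 0.015 * df["age"]))))

df["age10"] = df["age"] / 10  # rescale: one unit = 10 yr of age
fit = smf.logit("cancer ~ age10", data=df).fit(disp=False)
print(np.exp(fit.params["age10"]))          # odds ratio per 10 yr, approx. 1.16
print(np.exp(fit.conf_int().loc["age10"]))  # an interpretable 95% CI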
including cross validation and bootstrap resampling. Note
5.5. Avoid reporting both univariate and multivariable analyses unless there is a good reason

Comparison of univariate and multivariable models can be of interest when trying to understand mechanisms. For instance, if race is a predictor of outcome on univariate analysis, but not after adjustment for income and access to care, one might conclude that poor outcome in African Americans is explained by socioeconomic factors. However, the routine reporting of estimates from both univariate and multivariable analysis is discouraged.

5.6. Avoid ranking predictors in terms of strength

It is tempting for authors to rank predictors in a model, claiming, for instance, that "the novel marker was the strongest predictor of recurrence." Most commonly, this type of claim is based on comparisons of odds or hazard ratios. Such rankings are not meaningful since, among other reasons, they depend on how variables are coded. For instance, the odds ratio for hK2, and hence whether or not it is an apparently "stronger" predictor than PSA, will depend on whether it is entered in nanograms or picograms per milliliter. Further, it is unclear how one should compare model coefficients when both categorical and continuous variables are included. Finally, the prevalence of a categorical predictor also matters: a predictor with an odds ratio of 3.5 but a prevalence of 0.1% is less important than one with a prevalence of 50% and an odds ratio of 2.0.

5.7. Discrimination is a property not of a multivariable model but rather of the predictors and the data set

Although model building is generally seen as a process of fitting coefficients, discrimination is largely a property of which predictors are available. For instance, we have excellent models for prostate cancer outcome primarily because Gleason score is very strongly associated with malignant potential. In addition, discrimination is highly dependent on how much a predictor varies in the data set. As an example, a model to predict erectile dysfunction that includes age will have much higher discrimination for a population sample of adult men than for a group of older men presenting at a urology clinic because there is greater variation in age in the population sample. Authors need to consider these points when drawing conclusions about the discrimination of models. This is also why authors should be cautious about comparing the discrimination of different multivariable models where these were assessed in different data sets.

5.8. Correction for overfit is strongly recommended for internal validation

In the same way that it is easy to predict last week's weather, a prediction model generally has very good properties when evaluated on the same data set used to create the model. This problem is generally described as overfit. Various methods are available to correct for overfit, including cross validation and bootstrap resampling. Note that such methods should include all steps of model building. For instance, if an investigator uses stepwise methods to choose which predictors should go into the model and then fits the coefficients, a typical cross-validation approach would be to (1) split the data into 10 groups, (2) use stepwise methods to select predictors using the first nine groups, (3) fit coefficients using the first nine groups, (4) apply the model to the 10th group to obtain predicted probabilities, and (5) repeat steps 2–4 until all patients in the data set have a predicted probability derived from a model fitted to a data set that did not include that patient's data. Statistics such as the area under the curve are then calculated using the predicted probabilities directly.
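scikit-learn expresses this cleanly: put every model-building step (here, univariate feature selection standing in for stepwise selection) inside a Pipeline so that it is re-run within each fold. This is a sketch of the general pattern, not the authors' code.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Selection and coefficient fitting are both repeated inside every fold (steps 2-4).
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),
    ("model", LogisticRegression(max_iter=1000)),
])
probs = cross_val_predict(pipe, X, y, cv=10, method="predict_proba")[:, 1]
print(f"cross-validated AUC: {roc_auc_score(y, probs):.2f}")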
5.9. Calibration should be reported and interpreted correctly

Calibration is a critical component of a statistical model: the main concern for any patient is whether the risk given by a model is close to his or her true risk. It is rarely worth reporting calibration for a model created and tested on the same data set, even if techniques such as cross validation are used. This is because calibration is nearly always excellent on internal validation. Where a prespecified model is tested on an independent data set, calibration should be displayed graphically in a calibration plot. The Hosmer-Lemeshow test addresses an inappropriate null hypothesis and should be avoided. Note also that calibration depends on both the model coefficients and the data set being examined. A model cannot be inherently "well calibrated." All that can be said is that predicted and observed risks are close in a specific data set, representative of a given population.
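A calibration plot compares predicted risks with observed event rates within risk groups. A sketch using scikit-learn; the predicted risks and outcomes below are simulated stand-ins for results on an independent validation set.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(7)
p_val = rng.uniform(0, 1, 2000)   # stand-in predicted risks on external data
y_val = rng.binomial(1, p_val)    # observed 0/1 outcomes

obs, pred = calibration_curve(y_val, p_val, n_bins=10)
plt.plot(pred, obs, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Predicted risk")
plt.ylabel("Observed proportion")
plt.legend()
plt.show()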
or make a recommendation for more aggressive treatment
5.10. Avoid reporting sensitivity and specificity for continuous predictors or a model

Investigators often report sensitivity and specificity at a given cut-point for a continuous predictor (such as a PSA value of 10 ng/ml), or report specificity at a given sensitivity (such as 90%). Reporting sensitivity and specificity is not of value because it is unclear how high sensitivity or specificity would have to be in order to be high enough to justify clinical use. Similarly, it is very difficult to determine which of two tests, one with higher sensitivity and the other with higher specificity, is preferable, because clinical value depends on the prevalence of disease and the relative harms of a false-positive result compared with a false-negative result. In the case of reporting specificities at fixed sensitivity, or vice versa, it is all but impossible to choose the specific sensitivity rationally. For instance, a team of investigators may state that they want to know specificity at 80% sensitivity, because they want to ensure that they catch 80% of cases. However, 80% might be too low if prevalence is high or too high if prevalence is low.
marker. Similarly, a statistically significant difference
5.11. Report the clinical consequences of using a test or a model

In place of statistical abstractions such as sensitivity and specificity, or an ROC curve, authors are encouraged to choose illustrative cut-points and then report results in terms of clinical consequences. As an example, consider a study in which a marker is measured in a group of patients undergoing biopsy. Authors could report that if a given level of the marker had been used to determine biopsy, then a certain number of biopsies would have been conducted and a certain number of cancers found and missed.
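Counting those consequences at an illustrative cut-point is a few lines of array logic. The marker values are simulated and the cut-point of 10 is arbitrary.

import numpy as np

rng = np.random.default_rng(8)
cancer = rng.binomial(1, 0.25, 1000).astype(bool)
marker = rng.lognormal(mean=np.where(cancer, 2.3, 1.8), sigma=0.6)

cutoff = 10.0
biopsy = marker >= cutoff
print(f"Biopsies performed: {biopsy.sum()} of {len(marker)}")
print(f"Cancers found:  {(biopsy & cancer).sum()} of {cancer.sum()}")
print(f"Cancers missed: {(~biopsy & cancer).sum()}")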
5.12. Interpret decision curves with careful reference to threshold probabilities

It is insufficient merely to report that, for instance, "the marker model had highest net benefit for threshold probabilities of 35–65%." Authors need to consider whether those threshold probabilities are rational. If the study reporting benefit between 35% and 65% concerned detection of high-grade prostate cancer, few, if any, urologists would demand that a patient have at least a one-in-three chance of high-grade disease before recommending biopsy. The authors would therefore need to conclude that the model was not of benefit.
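For reference, decision-curve analysis defines net benefit at threshold probability pt as TP/n − (FP/n) × pt/(1 − pt). A minimal implementation of that standard formula (simulated risks and outcomes):

import numpy as np

def net_benefit(y, risk, pt):
    """Net benefit at threshold pt: TP/n - FP/n * pt/(1 - pt)."""
    n = len(y)
    treat = risk >= pt
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    return tp / n - fp / n * pt / (1 - pt)

rng = np.random.default_rng(9)
risk = rng.uniform(0, 1, 500)
y = rng.binomial(1, risk)
for pt in (0.1, 0.2, 0.3):
    print(f"pt = {pt:.0%}: net benefit = {net_benefit(y, risk, pt):.3f}")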
6. Conclusions and interpretation

6.1. Draw a conclusion, do not just repeat the results

Conclusion sections are often simply a restatement of the results. For instance, "a statistically significant relationship was found between body mass index (BMI) and disease outcome" is not a conclusion. Authors instead need to state implications for research and/or clinical practice. For instance, a conclusion section might call for research to determine whether the association between BMI and outcome is causal or make a recommendation for more aggressive treatment of patients with a higher BMI.

6.2. Avoid using words such as "may" or "might"

A conclusion that a novel treatment "may" be of benefit would be untrue only if it had been proved that the treatment was ineffective. Indeed, that the treatment may help would have been the rationale for the study in the first place. Using words such as "may" in the conclusion is equivalent to stating, "we know no more at the end of this study than we knew at the beginning"—reason enough to reject a paper for publication.

6.3. A statistically significant p value does not imply clinical significance

A small p value means only that the null hypothesis has been rejected. This may or may not have implications for clinical practice. For instance, that a marker is a statistically significant predictor of outcome does not imply that treatment decisions should be made on the basis of that marker. Similarly, a statistically significant difference between two treatments does not necessarily mean that the former should be preferred to the latter. Authors need to justify any clinical recommendations by carefully analyzing the clinical implications of their findings.

6.4. Avoid pseudolimitations such as "small sample size" and "retrospective analysis"; consider instead sources of potential bias and the mechanism for their effect on findings

Authors commonly describe study limitations in a rather superficial way, such as "small sample size and retrospective analysis are limitations." However, a small sample size may be immaterial if the results of the study are clear. For instance, if a treatment or predictor is associated with a very large odds ratio, a large sample size might be unnecessary. Similarly, a retrospective design might be entirely appropriate, as in the case of a marker study with very long-term follow-up, and have no discernible disadvantages compared with a prospective study. Discussion of limitations should include both the likelihood and the effect size of possible bias.

6.5. Consider the impact of missing data and patient selection

It is rare that complete data are obtained from all patients in a study. A typical paper might report, for instance, that of 200 patients, eight had data missing on important baseline variables and 34 did not complete the end-of-study questionnaire, leading to a final data set of 158. Similarly, many studies include a relatively narrow subset of patients, such as 50 patients referred for imaging before surgery out of the 500 treated surgically during that time frame. In both cases, it is worth considering analyses to investigate whether patients with missing data or who were not selected for treatment were different in some way from those who were included in the analyses. Although statistical adjustment for missing data is complex and warranted only in a limited set of circumstances, basic analyses to understand the characteristics of patients with missing data are relatively straightforward and are often helpful.
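Such a basic comparison is a one-liner in pandas; the column names below are hypothetical.

import pandas as pd

# df has hypothetical columns: age, psa, endpoint (NaN when missing).
df = pd.DataFrame({
    "age": [62, 71, 58, 66, 74, 60],
    "psa": [4.1, 7.8, 3.2, 5.5, 9.1, 4.7],
    "endpoint": [1.0, None, 0.0, 1.0, None, 0.0],
})

# Compare baseline characteristics by whether the endpoint is missing.
summary = df.groupby(df["endpoint"].isna())[["age", "psa"]].median()
summary.index = ["endpoint observed", "endpoint missing"]
print(summary)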
the use of p values completely [4].
6.6. Consider the possibility and impact of ascertainment bias

Ascertainment bias occurs when an outcome depends on a test, and the propensity for a patient to be tested is associated with the predictor. PSA screening provides a classic example: prostate cancer is found by biopsy, but the main reason why men are biopsied is an elevated PSA. A study in a population subject to PSA screening will, therefore, overestimate the association between PSA and prostate cancer. Ascertainment bias can also be caused by the timing of assessments. For instance, the frequency of biopsy in prostate cancer active surveillance will depend on prior biopsy results and PSA level, and this induces an association between those predictors and time to progression.
6.7. Do not confuse outcome with response among subgroups of patients undergoing the same treatment: patients with poorer outcomes may still be good candidates for that treatment

Investigators often compare outcomes in different subgroups of patients, all receiving the same treatment. A common error is to conclude that patients with poor outcome are not good candidates for that treatment and should receive an alternative approach. This conclusion confuses differences between patients with differences between treatments. As a simple example, patients with large tumors are more likely to recur after surgery than patients with small tumors, but that cannot be taken to suggest that resection is not indicated for patients with tumors greater than a certain size. Indeed, surgery is generally more strongly indicated for patients with aggressive (but localized) disease, and such patients are unlikely to do well on surveillance.
6.8. Be cautious about causal attribution: correlation does not imply causation

It is well known that "correlation does not imply causation," but authors often slip into this error when making conclusions. The Introduction and Methods sections might insist that the purpose of the study is merely to determine whether there is an association between, say, treatment frequency and treatment response, but the conclusions may imply that, for instance, more frequent treatment would improve response rates.

7. Use and interpretation of p values

It is apparent from even the most cursory reading of the medical literature that p values are widely misused and misunderstood. One of the most common errors is accepting the null hypothesis, for instance, concluding from a p value of 0.07 that a drug is ineffective or that two surgical techniques are equivalent. This particular error is described in detail in guideline 3.1. The more general problem, which we address here, is that p values are often given excessive weight in the interpretation of a study. Indeed, studies are often classed by investigators into "positive" or "negative" based on statistical significance. Gross misuse of p values has led some to advocate banning the use of p values completely [4].

We follow the American Statistical Association statement on p values and encourage all researchers to read either the full statement [5] or the summary [6]. In particular, we emphasize that a p value is just one statistic that helps interpret a study; it does not determine our interpretations. Drawing conclusions for research or clinical practice from a clinical research study requires evaluation of the strengths and weaknesses of study methodology, results of other pertinent data published in the literature, biological plausibility, and effect size. Sound and nuanced scientific judgment cannot be replaced by just checking whether one of the many statistics in a paper is or is not <0.05.
8. Concluding remarks

These guidelines are not intended to cover all medical statistics but rather the statistical approaches most commonly used in clinical research papers in urology. It is quite possible for a paper to follow all the guidelines and yet be statistically flawed, or to break numerous guidelines and still be statistically sound. On balance, however, the analysis, reporting, and interpretation of clinical urologic research will be improved by adherence to these guidelines.

Author contributions: Andrew J. Vickers had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Study concept and design: Vickers, Assel, Sjoberg.
Acquisition of data: None.
Analysis and interpretation of data: None.
Drafting of the manuscript: Vickers, Assel, Sjoberg, Kattan.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: None.
Obtaining funding: None.
Administrative, technical, or material support: None.
Supervision: None.
Other: None.

Financial disclosures: Andrew J. Vickers certifies that all conflicts of interest, including specific financial interests and relationships and affiliations relevant to the subject matter or materials discussed in the manuscript (eg, employment/affiliation, grants or funding, consultancies, honoraria, stock ownership or options, expert testimony, royalties, or patents filed, received, or pending), are the following: None.
Funding/Support and role of the sponsor: This work was supported in part by the Sidney Kimmel Center for Prostate and Urologic Cancers, the P50-CA92629 SPORE grant from the National Cancer Institute to Dr. H. Scher, and the P30-CA008748 NIH/NCI Cancer Center Support Grant to Memorial Sloan-Kettering Cancer Center.

References

[1] Scales Jr CD, Norris RD, Peterson BL, Preminger GM, Dahm P. Clinical research and statistical methods in the urology literature. J Urol 2005;174:1374–9.
[2] Lang TA, Altman DG. Basic statistical reporting for articles published in biomedical journals: the "Statistical Analyses and Methods in the Published Literature" or the SAMPL guidelines. Int J Nurs Stud 2015;52:5–9.
[3] Vickers AJ, Sjoberg DD. Guidelines for reporting of statistics in European Urology. Eur Urol 2015;67:181–7.
[4] Woolston C. Psychology journal bans P-values. Nature 2015;519:9.
[5] Wasserstein RL, Lazar NA. The ASA's statement on p-values: context, process, and purpose. Am Stat 2016;70:129–33.
[6] American Statistical Association. https://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf.
