Concordance
1 Concordance
\begin{align}
\tau_a &= \frac{c-d}{c+d+t_x+t_y+t_{xy}} \tag{1} \\
\tau_b &= \frac{c-d}{\sqrt{(c+d+t_x)(c+d+t_y)}} \tag{2} \\
\gamma &= \frac{c-d}{c+d} \tag{3} \\
D &= \frac{c-d}{c+d+t_x} \tag{4} \\
C &= (D+1)/2 = \frac{c+t_x/2}{c+d+t_x} \tag{5}
\end{align}
where $c$, $d$, $t_x$, $t_y$, and $t_{xy}$ are the number of pairs of observations that are concordant, discordant, tied on $x$ (but not $y$), tied on $y$ (but not $x$), and tied on both, respectively.
Kendall's tau-a (1) is the most conservative; ties shrink the value towards zero.
Somers' D (4) treats ties in y as incomparable; pairs that are tied in x (but not y) score
as 1/2, as we can see from equation (5), which is the concordance statistic.
Kendall's tau-b (2) can be viewed as a version of Somers' D that is symmetric in x and y.
The first four statistics range from -1 to 1, similar to the correlation coefficient r. The concordance (5) ranges from 0 to 1, which matches the scale for a probability.
Why is C defined using Somers' D rather than one of the other three?
If y is a 0/1 variable, then C = AUROC, the area under the receiver operating characteristic curve, which is well established for binary outcomes. (Proving this simple theorem is harder than it looks, but the result is well known.)
For survival data, this choice will agree with Harrell's C. More importantly, as we will see
below, it has strong connections to standard tests for equality of survival curves.
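The binary-outcome identity is easy to check numerically. The following minimal sketch (the simulated data and the names ybin and xscore are invented for illustration) compares concordance() with the AUC computed through its Mann-Whitney form:

library(survival)
set.seed(1)
ybin   <- rbinom(100, 1, 0.4)          # a 0/1 outcome
xscore <- ybin + rnorm(100)            # a noisy predictor of that outcome
cfit   <- concordance(ybin ~ xscore)   # C for a binary response
# AUROC via the Mann-Whitney U statistic: U / (n1 * n0)
U   <- wilcox.test(xscore[ybin == 1], xscore[ybin == 0])$statistic
auc <- U / (sum(ybin == 1) * sum(ybin == 0))
c(C = unname(coef(cfit)), AUC = unname(auc))   # the two values agree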
Direct
n= 11
Concordance= 0.7818 se= 0.1255
concordant discordant tied.x tied.y tied.xy
43 12 0 0 0
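As a small worked check using the counts just above: with 43 concordant pairs, 12 discordant pairs, and no ties among the $11 \cdot 10/2 = 55$ pairs, equation (4) gives $D = (43-12)/55 = 0.564$ and equation (5) gives $C = 43/55 = 0.782$, matching the printed value; with no ties at all, $\tau_a$, $\tau_b$, and $\gamma$ all reduce to the same value as $D$.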
Logistic regression
n= 150
Concordance= 0.8258 se= 0.03279
concordant discordant tied.x tied.y tied.xy
4129 871 0 6174 1
Linear regression
n= 11
Concordance= 0.7818 se= 0.1255
concordant discordant tied.x tied.y tied.xy
43 12 0 0 0
> sqrt(summary(fit2)$r.squared) # R
[1] 0.891425
Parametric survival
n= 137
Concordance= 0.7122 se= 0.02232
concordant discordant tied.x tied.y tied.xy
6263 2527 14 39 0
Cox regression
n= 137
concordance se
fit4 0.7119 0.0224
fit5 0.7384 0.0210
fit6 0.7359 0.0212
As shown in the last example, the concordance for multiple fits can be obtained from a single call. The variance-covariance matrix for all three concordance values is available using vcov(ctest); this is used in Section 3.1 to formally test the equality of two concordance values. The above also shows that the addition of another variable to a fitted model can decrease the concordance. The larger model will have higher correlation between the linear predictor $X\beta$ and the response y, by definition, but this does not guarantee a greater association between $\mathrm{rank}(X\beta)$ and $\mathrm{rank}(y)$.
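A sketch of the call involved, using the fits fit4, fit5, and fit6 whose results appear above (the contrast-based comparison anticipates Section 3.1; the exact code used for the display above is not shown here):

ctest <- concordance(fit4, fit5, fit6)   # one call, three models
coef(ctest)                              # the three concordance values
vcov(ctest)                              # their 3 x 3 variance-covariance matrix
# a Wald-style comparison of the first two fits
contr <- c(-1, 1, 0)
dhat  <- sum(contr * coef(ctest))
se    <- sqrt(as.numeric(t(contr) %*% vcov(ctest) %*% contr))
c(difference = dhat, z = dhat/se)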
\begin{align}
c - d &= \frac{1}{2}\sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{sign}(y_i - y_j)\, \operatorname{sign}(x_i - x_j) \tag{6}\\
      &= \sum_{i=1}^{n} \; \sum_{y_j > y_i} \operatorname{sign}(x_j - x_i) \tag{7}
\end{align}
The first equation is the simple definition of concordance as a sum over all $n^2$ possible pairs, where sign is the R sign function. Equation (7) makes the obvious simplification of counting each pair only once, by taking advantage of the fact that y is sorted. The key coding insight is to store the x values as part of a balanced binary tree; an example of such a tree is shown in Figure 1.

[Figure 1: a balanced binary tree holding the values 1, 2, 6, 8, 9, 12, 14, 18, 19, 21, 23, 24, and 27; the root is 18, its children 8 and 24, their children 2, 12, 21, 27, and the leaves 1, 6, 9, 14, 19, 23.]

The basic algorithm is to:
1. Create a balanced binary tree for all n $x_i$ values. This can be done in $O(n \log_2 n)$ steps. The final tree will have a node for each unique x value. Each node contains the value, along with counts for the number of observations at that value, for left hand children, and for right hand children. Initialize all the counts to 0.
2. Walk through the observations in decreasing order of y. For each observation:
(a) Use the stored counts to find how many of the x values already added (those belonging to observations with a larger y) are greater than, less than, or equal to the current x value; these give the concordant, discordant, and tied-on-x pairs involving this observation.
(b) Add this observation to the tree. Each addition will update the count for its node, then walk up the tree updating child counts of the parent, grandparent, etc.
If there are tied y values, do all the counts for a set of ties first, and then add their x values to
the tree.
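The counting itself is done in C by the package. As a rough R illustration of the same $O(n \log n)$ idea for uncensored, untied data, the sketch below uses a Fenwick (binary indexed) tree in place of the balanced tree just described; the function countpairs() is invented here for illustration and is not part of the survival package.

countpairs <- function(x, y) {
    # walk from the largest y to the smallest; at each step the tree holds the
    # x ranks of all observations with a larger y (compare equation (7))
    n    <- length(x)
    xr   <- as.integer(rank(x))[order(y)]   # x ranks, listed in increasing order of y
    tree <- integer(n)                      # Fenwick tree of counts, all zero
    lowbit <- function(i) i - bitwAnd(i, i - 1L)
    add <- function(i) while (i <= n) { tree[i] <<- tree[i] + 1L; i <- i + lowbit(i) }
    qry <- function(i) { s <- 0L; while (i > 0L) { s <- s + tree[i]; i <- i - lowbit(i) }; s }
    conc <- disc <- 0
    for (k in n:1) {
        stored <- n - k                        # number of x values added so far
        conc <- conc + stored - qry(xr[k])     # stored values larger than the current x
        disc <- disc + qry(xr[k] - 1L)         # stored values smaller than the current x
        add(xr[k])
    }
    c(concordant = conc, discordant = disc)
}

For untied data the result should match the concordant and discordant counts printed by concordance(y ~ x).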
> concordance(fit4)
Call:
concordance.coxph(object = fit4)
n= 137
Concordance= 0.7119 se= 0.02235
concordant discordant tied.x tied.y tied.xy
6261 2529 14 39 0
> # Concordance using predictions from a Cox model
> concordance(Surv(time, status) ~ predict(fit4), data = veteran, reverse = TRUE)
Call:
concordance.formula(object = Surv(time, status) ~ predict(fit4),
data = veteran, reverse = TRUE)
n= 137
Concordance= 0.7119 se= 0.02235
concordant discordant tied.x tied.y tied.xy
6261 2529 14 39 0
Stratified models
Stratified models present a further variation: if observations i and j are in different strata, the survival curves for those strata might cross; $S(t; x_i)$ and $S(t; x_j)$ no longer have a simple ordering. A solution is to use a stratified concordance, which compares all pairs within each stratum, and then adds up the results. In the example below there is a separate count for each stratum; the final concordance is based on the column sums. The same issue, and solution, applies to stratified survreg models. (The fact that strata names are not retained as labels for the counts matrix is a deficiency in the routine.)
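A sketch of the general form of the call behind the output below; the covariates karno and age are only a guess for illustration, since the exact model used for these numbers is not shown:

sfit <- coxph(Surv(time, status) ~ karno + age + strata(celltype), data = veteran)
concordance(sfit)    # one row of counts per stratum, combined via the column sums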
n= 137
Concordance= 0.6986 se= 0.02679
concordant discordant tied.x tied.y tied.xy
squamous 357 161 0 1 0
smallcell 728 361 3 9 0
adeno 275 65 1 1 0
large 240 102 0 0 0
> table(veteran$celltype)
squamous smallcell adeno large
35 48 27 27
2.1 Time-weighted concordance
Look again at equation (7), rewriting it for survival with ti as the response in order to more
closely match standard notation for survival. Watson and Therneau [10] show that this can be
further rewritten as
\begin{align}
c - d &= \sum_{i=1}^{n} \delta_i \sum_{t_j > t_i} \operatorname{sign}(x_i - x_j) \notag \\
      &= \sum_{i=1}^{n} \delta_i \sum_{t_j \ge t_i} \operatorname{sign}(x_i - x_j) \tag{8}\\
      &= 2 \sum_i \delta_i\, n(t_i) \left[ r_i(t_i) - \bar{r} \right] \tag{9}
\end{align}
Peto and Peto [7] point out that $n(t) \approx n(0)S(t-)G(t-)$, where S is the survival distribution and G the censoring distribution. They argue that S(t-) would be a better weight, since G may have features that are irrelevant to the question being tested. For a particular dataset, Prentice [8] later showed that these concerns were indeed justified, and most software now uses the Peto-Wilcoxon variant.
Schemper et al [9] argue for a weight of S(t)/G(t) in the Cox model. When proportional hazards does not hold, the coefficient from the Cox model is an average hazard ratio, and they show that using S/G leads to a value that remains interpretable in terms of an underlying population model. The same argument would also apply to the concordance, since our goal is an assumption-free assessment of association.
Uno et al [11] recommend the use of $n/G^2$ as a weight, based on a consistency argument. If we assume that the concordance value that would be obtained after full follow-up of all subjects (no censoring) is the right one, and proportional hazards does not hold, then the standard concordance will not consistently estimate this target quantity when there is censoring.
In practice, weights need to be based on the left-continuous versions of the survival curves, $S(t-)$ and $G(t-)$, and extra care needs to be exercised in the computation of G. Consider the aml dataset as an example; the first few lines of the relevant survival curve are shown below.
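A minimal sketch of one way to look at the two curves for the aml data: G is estimated by the "reverse" Kaplan-Meier, i.e., a Kaplan-Meier with the status indicator flipped (this simple version ignores the tie-handling detail for G(t-) that is discussed in Section 4, and is not necessarily the code used for the display):

Sfit <- survfit(Surv(time, status) ~ 1, data = aml)      # survival curve S(t)
Gfit <- survfit(Surv(time, 1 - status) ~ 1, data = aml)  # censoring curve G(t)
summary(Gfit, times = sort(unique(aml$time))[1:6])       # the first few lines of G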
> cord1 <- concordance(colonfit, timewt="n", ranks=TRUE)
> cord2 <- concordance(colonfit, timewt="S", ranks=TRUE)
> cord3 <- concordance(colonfit, timewt="S/G", ranks=TRUE)
> cord4 <- concordance(colonfit, timewt="n/G2", ranks=TRUE)
> temp <- c("n(t)"= coef(cord1), S=coef(cord2), "S/G"= coef(cord3),
"n/G2"= coef(cord4))
> round(temp,5) # 4 different concordance estimates
n(t) S S/G n/G2
0.65559 0.65437 0.65357 0.65357
> # Plot the weights over time using the first 3 approaches
> matplot(cord1$ranks$time/365.25, cbind(cord1$ranks$timewt,
cord2$ranks$timewt,
cord3$ranks$timewt),
type= "l", lwd=2, col=c(1,2,4),
xlab="Years since enrollment", ylab="Weight")
> legend(1, 3000, c("n(t)", "nS(t-)", "nS(t-)/G(t-)"), lwd=2,
col=c(1,2,4), lty=1:3, bty="n")
> # Note that n/G2 and S/G are identical
> all.equal(cord3$ranks$timewt,cord4$ranks$timewt)
[1] TRUE
[Figure: the weight functions n(t), nS(t-), and nS(t-)/G(t-) plotted against years since enrollment, as produced by the matplot call above.]
[Figure 2 appears here.]
Figure 2: Survival (black) and censoring (red) curves for 8 datasets found in the survival package. The final panel shows a proportional hazards evaluation for the age variable, in a fit of age + male to the NAFLD data.
Two features of a dataset are needed for the choice of weight to make a practical difference. First, sufficient censoring that the two weights differ for a reasonable fraction of the data, that is, G(t) is low and S(t) has not flattened (deaths are still occurring). Second, per the arguments in Schemper and in Uno, the potential presence of non-proportional hazards in the fit.
Figure 2 shows survival and censoring curves for 8 different datasets found in the survival package. Based on these, the dataset with the greatest potential for an S/G difference is the NAFLD data; that dataset also has some early non-proportionality for age. The code below shows the calculation of the S/G difference. Surprisingly, the four weightings still yield very similar concordance values; the Harrell (n) and Uno (n/G2) weightings differ only in the second decimal place.
> nfit <- coxph(Surv(futime/365.25, status) ~ age + male, data = nafld1)
> ncord1 <- concordance(nfit, timewt = "n")
> ncord2 <- concordance(nfit, timewt = "S")
> ncord3 <- concordance(nfit, timewt = "S/G")
> ncord4 <- concordance(nfit, timewt = "n/G2")
> temp <- c(n = coef(ncord1), S = coef(ncord2),
"S/G" = coef(ncord3), "n/G2" = coef(ncord4))
> round(temp,6)
n S S/G n/G2
0.823254 0.821457 0.805438 0.805438
The concordance function provides a ranks=TRUE argument which can be used for further exploration of the weights. If set, the output will include a data frame that contains one row for each event, giving the time point, the observation's relative rank in the risk set $(r_i - \bar r)$, the case weight for the observation, and the time weight at that time point. The relative ranks are comparable to Schoenfeld residuals, and their weighted sum will equal $c - d$. We can plot them over time and apply a smooth, as sketched below. Figure 3, for the veteran dataset, shows a precipitous drop to 0. The veteran cancer data is perhaps an extreme of this pattern: due to the very rapid progression of disease, baseline measurements soon lose their meaning.
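A sketch of how a plot in the spirit of Figure 3 can be produced; this assumes the scaled rank is stored in a column named rank of the ranks data frame (check the returned object if the name differs):

vcord <- concordance(fit4, ranks = TRUE)   # fit4 is the veteran Cox model used earlier
rdat  <- vcord$ranks
plot(rdat$time / 365.25, rdat$rank, xlab = "Years", ylab = "Rank residual")
lines(lowess(rdat$time / 365.25, rdat$rank), col = 2, lwd = 2)
abline(h = 0, lty = 3)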
[Figure 3 appears here: rank residual versus years.]
Figure 3: Schoenfeld residuals for the scaled ranks, from a fit to the veteran dataset.
This argues for restricting the concordance computation to a limited time range. We sometimes receive push-back on this, with the argument that one should use all the data. We disagree, and think that the target of the validation is critical. Predictions become less accurate the further out we reach in time; this is true for everything from weather forecasts to the stock market, and survival models are not immune. Reaching too far into the future may return an overly pessimistic value of C.
Another reason for using an upper limit is that 1/G can become unstable as the sample size
becomes small (large jumps in the KM), or unreasonably large as G approaches 0. Most authors
suggest an upper limit for this purely technical reason.
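A sketch of imposing such a limit through the ymax argument, using fit4 (the veteran Cox model from earlier); follow-up in that dataset is recorded in days, so 730 corresponds to roughly two years:

concordance(fit4, ymax = 730)                    # restrict comparisons to the first two years
concordance(fit4, timewt = "n/G2", ymax = 730)   # the Uno weight benefits most from capping 1/G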
Argument could also be made for a lower limit, though this would be uncommon for censored
data. Many laboratory values, for instance, treat all results less than some threshold as identical.
However, the ability to implement a lower limit correctly is constrained by censoring. Say for
instance that there were values of 5, 8+, and 9, and a lower limit of 10 were chosen. The approach
used for non-censored data is to treat 5 and 9 as tied values, but this logic does not correctly
extend to the censored value. This issue and possible solutions will be discussed more fully in
the external validation vignette.
2.3 Synthetic C
As another method of addressing censoring, Gönen and Heller [3] show that if the statistical model is correct, and if proportional hazards holds, then for any pair of covariate vectors
$$ P(y_i > y_j) = \frac{1}{1 + e^{\eta_j - \eta_i}} $$
They then order the $\eta$ values from a fitted model, and take an average over all $n(n-1)/2$ ordered pairs. The authors argue that this is an estimate that is independent of censoring, and therefore preferable to Harrell's C. (The estimate can be obtained by using the royston function.)
The biggest problem with this approach is that it gives an estimate of concordance under the assumption that the model is exactly correct. Our goal, rather, is to assess how well the model performs, for our needs, knowing that it will be imperfect. The Gönen and Heller formula answers a question that we did not ask, over a time range of $(0, \infty)$ which is not of interest. The calculation is also $O(n^2)$, so it will be slow for large sample sizes.
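For illustration, the formula can be evaluated directly from the linear predictors of a fit. The short function below is a simplified version of the idea (pairs with tied values of eta count as 1/2 here) and is not the packaged routine:

ghC <- function(eta) {
    # average, over all n(n-1)/2 pairs, of the model-implied probability that
    # the pair is concordant, which is 1/(1 + exp(-|eta_i - eta_j|))
    d <- abs(outer(eta, eta, "-"))
    mean(1 / (1 + exp(-d[lower.tri(d)])))
}
ghC(predict(fit4, type = "lp"))    # fit4: the veteran Cox model from earlier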
Several practical points about the choice of weights and time range deserve mention:
1. An important issue that has not been sorted out is how to extend 1/G weighting arguments to datasets that are subject to delayed entry, e.g., when using age as the time scale instead of time since enrollment. There is, in this case, no natural estimate available for G. It is also not possible for the coxph routine to reliably tell the difference between such left truncation and simple time-dependent covariates or strata. The default action of the routine is to use the safe choice of n(t).
2. Consider setting a time (y) restriction using the ymax option, based on careful thought about the proper range of interest. This often has a larger practical effect than the choice of time weight.
3. Safety. Compared to the usual Gehan-Wilcoxon weight of n(t), the Peto-Wilcoxon variant S(t) would appear advantageous, particularly if there is differential censoring for some subjects.
4. Equality vs. efficiency. On one hand we would like to treat each data pair equally, but in our quest for ever sharper p-values we want to be efficient. The first argues for n(t) as the weight and the second for using equal weights, since the variances of each ranking term are nearly identical. This is exactly the argument between the Gehan-Wilcoxon and the log-rank tests.
Our current opinion is that the point of the concordance is to evaluate the model in a more non-parametric way, so a log-rank type of focus on ideal p-values is misplaced. This suggests using either S or S/G as the weight. Both give more prominence to the later time points as compared to the default n(t) choice, but if time limits have been thought through carefully the difference between these three will almost always be ignorable.
We most definitely disagree with Uno's unstated assumption that the C statistic one would obtain with infinite follow-up and no censoring is the proper target of estimation, and that the ordinary concordance is therefore biased. That target will never be attainable, and we would argue that it would be largely irrelevant even if it were. Proportional hazards is never true over the long term, simply because it is almost impossible to predict events that are a decade or more away, and thus the rank residuals shown above will eventually tend to 0. The starting point should always be to think through exactly what one wants to estimate. As stated by Yogi Berra, "If you don't know where you are going, you'll end up someplace else."
3 Variance
The variance of the statistic is estimated in two ways. The first is to use the variance of the equivalent Cox model score statistic. As pointed out by Watson, this estimate is both correct and efficient under $H_0: C = .5$, and so it forms a valid test of $H_0$. However, when the concordance is over .7 or so, this estimator systematically overestimates the true variance. An alternative that remains unbiased is the infinitesimal jackknife (IJ) variance
\begin{align*}
V &= \sum_{i=1}^{n} w_i U_i^2 \\
U_i &= \frac{\partial C}{\partial w_i}
\end{align*}
The concordance routine calculates an influence matrix U with one row per subject and columns that contain derivatives for the 5 individual counts: concordant, discordant, tied on x, tied on y, and tied on xy pairs. From this it is straightforward to derive the influence of each subject on the concordance, or on any of the other possible association measures such as the $\tau_a$ mentioned earlier. The IJ variance is printed by default, but the PH variance is also returned; an earlier survConcordance function only computed the PH variance.
The concordance function does not compute Kendall's $\tau_a$ or $\tau_b$, nor Goodman's gamma. However, since all of the necessary components for those values are returned, along with the IJ influence for each, it can be used as the computational engine for those measures and their variances, should someone wish to do so.
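As a sketch, Kendall's tau-a of equation (1) can be recovered from the returned counts, assuming the five printed counts are stored in the object's count component in the printed order (concordant, discordant, tied.x, tied.y, tied.xy); its IJ variance could be built in the same way from the influence matrix, though that step is not shown:

cfit   <- concordance(fit4)
counts <- cfit$count
taua   <- (counts[1] - counts[2]) / sum(counts)   # equation (1)
taua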
The variance computation accounts for the influence of each subject on both the numerator and denominator of C; this agrees with the parallel development used by Newson [6] and implemented in Stata. An alternate approach is to treat the total number of comparable pairs as an ancillary statistic, thus $\mathrm{var}(C) = \mathrm{var}(c-d)/(4m^2)$, where m is the number of comparable pairs ($n(n-1)/2$ for uncensored, untied data). Arguments about which aspects of a dataset can or should be treated as ancillary are as old as statistics, e.g., treating the margins of a 2x2 table as ancillary leads to Fisher's exact test. In this case we anticipate that the differences will be quite small, but have done no formal exploration.
n= 137
concordance se
fit4 0.7119 0.0224
fit5 0.7384 0.0210
fit6 0.7359 0.0212
The tree-based computations used in the concordance function might well address the speed issue, but have not been implemented.
Here we pursue another avenue, which is to consider a transformation-based confidence interval, in much the same way as is done for confidence intervals of a survival curve. That is, we use
$$ g^{-1}\left[\, g(C) \pm z\, \sigma\{g(C)\} \,\right] $$
for some transformation function g. For survival curves, the g functions $\log(p)$, $\log(p/(1-p))$, $\log(-\log(1-p))$ and $\arcsin(p)$ have all been found to be superior to the simple interval.
For the concordance, consider the Fisher z-transform, widely used for the correlation coefficient r:
\begin{equation}
z = \frac{1}{2} \log\left(\frac{1+r}{1-r}\right) \tag{10}
\end{equation}
Since Somers' D and r are targeted at similar concepts, we might hazard that a similar transformation of Somers' D, which also ranges from -1 to 1, would also be close to equivariant. Since $D = 2C - 1$ we have
\begin{align*}
z_c &= \frac{1}{2} \log \frac{1 + (2C-1)}{1 - (2C-1)} \\
    &= \frac{1}{2} \log \frac{C}{1-C}
\end{align*}
which we recognize, apart from the factor of 1/2, as the inverse of the logistic function, i.e., the logit link used in glm models.
We can get the standard error of $z_c$ by retrieving the individual dfbeta values and performing a transformation. The dfbeta value is defined as $d_i = C - C_{(-i)}$, where the latter is the C statistic computed with observation i omitted.
zci <- function(fit, p = 0.95) {   # function head reconstructed, not verbatim from the source
    ilogist <- function(x) log(x/(1-x));  logistic <- function(x) exp(x)/(1 + exp(x))
    temp   <- concordance(fit, influence = 1)  # influence=1 returns the per-subject dfbeta values
    old.sd <- sqrt(sum(temp$dfbeta^2))         # IJ standard error of C
    new.sd <- sqrt(sum((ilogist(temp$concordance) - ilogist(temp$concordance - temp$dfbeta))^2))
    z <- qnorm((1-p)/2)
    old.ci <- temp$concordance + c(z, -z)*old.sd
    new.ci <- logistic(ilogist(temp$concordance) + c(z, -z)* new.sd)
    rbind(old = old.ci, new= new.ci)
}
> round(zci(colonfit), 4)
[,1] [,2]
old 0.6302 0.6810
new 0.6298 0.6805
The two intervals hardly differ, which is what we would expect for a value far from 1. As a second example, create a small dataset with a concordance that is close to 1. As shown below, the z-transform shifts the CI towards zero, as it should, but also avoids the out-of-bounds endpoint.
> set.seed(1953)
> ytest <- matrix(rexp(20), ncol=2) %*% chol(matrix(c(1, .98, .98, 1), 2))
> cor(ytest)
[,1] [,2]
[1,] 1.0000000 0.9422072
[2,] 0.9422072 1.0000000
> lfit <- lm(ytest[,1] ~ ytest[,2])
> zci(lfit)
[,1] [,2]
old 0.8419721 1.0246946
new 0.7253801 0.9867027
4 Details
This section documents a few details - most readers can skip it.
The usual convention for survival data is to assume that censored values come after deaths,
even if they are recorded on the same day. This corresponds to the common case that a subject
who is censored on day 200, say, was actually seen on that day. That is, their survival is strictly
greater than 200. As a consequence, censoring weights G actually use G(t−) in the code: if 10
subjects are censored at day 100, and these are the first censorings in the study, then an event
on day 100 should not be given a larger weight. (Both the Uno and Schemper papers ignore this
detail.)
When using weights of S(t), the program actually uses a weight of nS(t-), where n is the number of observations in the dataset. The reason is that for a stratified model the weighted number of concordant, discordant and tied pairs is calculated separately for each stratum, and then added together. If one stratum were much smaller or larger than the others we want to preserve this fact in the sum.
References
[1] D. G. Altman and P. Royston. What do we mean by validating a prognostic model? Stat. in Medicine, 19:453-73, 2000.
[2] F. J. Anscombe. On estimating binomial response relations. Biometrika, 43:461-464, 1956.
[3] M. Gönen and G. Heller. Concordance probability and discriminatory power in proportional hazards regression. Biometrika, 92:965-970, 2005.
[4] E. L. Korn and R. Simon. Measures of explained variation for survival data. Stat. in Medicine, 9:487-503, 1990.
[5] R. G. Newcombe. Confidence intervals for an effect size measure based on the Mann-Whitney statistic. Part 2: asymptotic methods and evaluation. Stat. in Medicine, pages 559-73, 2006.
[6] R. Newson. Confidence intervals for rank statistics: Somers' D and extensions. Stata Journal, 6(3):309-334, 2006.
[7] R. Peto and J. Peto. Asymptotically efficient rank invariant test procedures (with discussion). J. Royal Stat. Soc. A, 135(2):185-206, 1972.
[8] R. L. Prentice and P. Marek. A qualitative discrepancy between censored data rank tests. Biometrics, 35(4):861-867, 1979.
[9] M. Schemper, S. Wakounig, and G. Heinze. The estimation of average hazard ratios by weighted Cox regression. Stat. in Medicine, 28(19):2473-2489, 2009.
[10] T. M. Therneau and D. A. Watson. The concordance statistic and the Cox model. Technical Report 85, Department of Health Science Research, Mayo Clinic, 2015.
[11] H. Uno, T. Cai, M. J. Pencina, R. B. D'Agostino, and L. J. Wei. On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat. in Medicine, 30(10):1105-1117, 2011.