This vignette is concerned with the question of whether a particular model can be successfully applied outside of the particular context (data set) on which it was developed. The most important question in validation will be to decide what “success” means in the above sentence; this will normally depend on the details of the target study.
Technically, the computations themselves have three important facets.
The time range of interest. Survival models give predictions over the entire time range of the development data set, but we are often interested in performance over a more limited time span \((0, \tau)\), or even an interval \((\tau_1, \tau_2)\).
The choice of measures
Relative accuracy (discrimination), e.g., concordance and \(R^2\)
Absolute accuracy (calibration), for which there are three natural measures for any given state \(j\)
\(E(N_j(t))\), the expected number of visits to state \(j\), up to time \(t\)
\(p_j(t)\): the probability of being in state \(j\) at time \(t\)
the expected total time spent in state \(j\), up to time \(t\), alternately labeled as the sojourn time, restricted mean time in state (RMTS) or restricted mean survival time (RMST)
How to deal with censoring
Martingale approach
Standard survival models
Inverse probability of censoring weights
Pseudovalues
This overall framework accommodates a large fraction of the methods that have been proposed. It also gives a basis for applying these ideas to both single state and multistate models, and dealing with multiple time scales. The methods are illustrated with several fit/validation data set pairs.
2 What does it mean to validate a model?
If you don’t know where you are going, you might end up someplace else. – Yogi Berra
We will separate 3 meanings of “validation”. This document is concerned with the third.
Software validation, i.e., does a given computer routine produce the correct result? This is discussed in a separate validation vignette.
Internal validation: checks that further examine the fit of a model, using the original data set. These form an extension of the familiar trio of functional form, proportional hazards, and undue leverage which have been discussed elsewhere.
External validation. Application of a model to a new data set, one which was not used in the model’s construction.
External validation tries to answer the question of whether a particular model is applicable outside of the context where it was developed. The most critical step, before computing any metrics, is to carefully think through what we mean by “applicable”. In what sense does a given model provide, or not provide, a useful prediction? Any careful answer to this will be problem specific. One consequence is that this document will contain ideas on how to think about the problem and methods, but no final checklist with a “best” solution.
This key consideration is surprisingly absent from most papers on the topic of survival model validation; two notable exceptions which greatly influenced our thinking are Korn and Simon (1990) and Altman and Royston (2000). As an example from the first, suppose that we were using \(t - \hat t\) as the metric of prediction error, the difference between the observed and predicted time to death. Should an (observed, predicted) pair of (.5 year, 1 year) be treated as the same size error as (10.5 years, 11 years)?
In a study of acute disease we might consider the first pair to be a much more consequential error than the latter. For example, in a cancer in which 5 year survival is considered to be a cure, one might consider any pair \((a, b)\) with both \(a\) and \(b\) greater than 6 to be no error at all. As another example, suppose one has a model \(M\) from a large clinical trial, for a disease with a high death rate. A new proposed use for \(M\) is to identify subjects with expected survival of \(<6\) months for referral to supportive care.
For this purpose, the performance of \(M\) with respect to separating 1, 2, and 3+ year survivors is immaterial; validation should focus on the metric of interest. Another use of the same model might be to identify a subset with expected survival of more than 1 year for enrollment in a trial of long term adjuvant therapy, e.g., a therapy that is not expected to have an influence on the earliest recurrences. We might further want to stratify treatment assignment by the expected length of disease free survival. This leads to yet another metric for success.
In general, the chosen validation metric needs to be fit for purpose. One of the most common outcomes of such a consideration, in our experience, will be to restrict the upper range of the assessment to an upper threshold \(\tau\); this could be 3 months or 10 years depending on the case at hand, and will be a consideration in each of our examples. A companion issue is that the validation data set might have very little information beyond some point; if only a handful of the validation subjects have more than 4 years of follow-up, for instance, then an assessment of predictive ability at 10 years will be a fool’s errand, whether or not that were a target of interest.
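In practice this restriction is often implemented as administrative censoring of the validation data at \(\tau\); a minimal sketch, where vdata, time, and status are placeholder names:

> tau <- 5
> vdata$time5   <- pmin(vdata$time, tau)                      # truncate follow-up at tau
> vdata$status5 <- ifelse(vdata$time > tau, 0, vdata$status)  # censor events after tau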
In general validation should not be a yes/no type of exercise, but rather an assessment of which aspects of the prediction are most reliable and which may be less so. This will particularly be true in multi-state models, where it may well be true that one transition is poorly predicted while the others are successful. In this case there will be a collection of validation results, which can help guide users with respect to their particular needs. Now, because published validation data sets are very rare, successful validation of a model for one particular case is often taken as evidence that it "will work in other contexts too". Meaningful external validation in this broader sense requires that the validation data have important differences in time, place, range of subjects or other context as compared to the model development setting.
A last note is to recognize that validation is asymmetrical. Once a prediction model has been set, those predictions are, for our purpose, fixed. One would hope that the original creator of the prediction model checked its statistical validity, i.e., that proportional hazards holds, functional forms are reasonably correct, etc.; as that would give said model a much better chance of external success.
Whether those authors did so or not is, however, immaterial to our assessment of how well a given prediction model works for the external data set(s) at hand.
3 Survival Data
Much of the thinking and machinery for model validation has been developed in the context of binomial models. Survival data is more complex, however, in three key ways.
There are at least 3 possible assessments
The expected number of visits to state \(j\) by time \(t\), \(E(N_j(t))\)
The probability of being in state \(j\) at time \(t\), \(p_j(t)\)
The expected total sojourn time in state \(j\), \(s_j(t)\)
Each of these can be assessed at multiple target times \(\tau\).
The validation data set is subject to censoring.
For a simple alive/dead survival model, targets 1 and 2 above are formally the same: the expected number of visits to the death state by time \(t\) equals the probability of death by time \(t\), and the same is true for any absorbing state in a multi-state model. Our computational approach for the two targets is however quite different. Measure 2 has been the most used, and 1 the least. Which of the measures is most appropriate, and even more so which time points \(\tau\) might be a rational focus, will depend on the application, i.e., on the situation specific definition of successful prediction.
A further choice that needs to be made is whether to use a summary based on relative prediction or absolute prediction, commonly denoted as discrimination and calibration. Discrimination measures whether the outcomes for two subjects are in the same relative order as the predicted values for that pair of subjects.
In some cases this is sufficient, for instance if the prediction will be used to stratify treatment assignment of subjects in a new trial, then only the relative ordering of their risk is required. Individual patient counseling, on the other hand, will require absolute predictions for that patient. Likewise, predicted sample size for a trial with a survival endpoint depends on the eventual number of events that will occur, a task that requires absolute predictions.
The validation data will almost certainly be censored, and how to deal with this is a central technical issue, as it always is for time to event models. There are essentially four approaches.
For assessment type 1, the total number of events, we can compare the observed events for each subject to the expected number for that subject, over each subject’s specific follow-up time. This neatly sidesteps the censoring problem. The approach is closely related to standardized mortality ratios (SMR), a measure with a long history in epidemiology.
Apply standard censored data methods to the validation data, and compare the results of the resulting model fit (Kaplan-Meier, Cox, etc.) to the target model’s predictions. I will sometimes call this the "yhat vs yhat" approach.
Create uncensored data specific to a chosen assessment time \(\tau\), then use standard methods for uncensored data. Two approaches are
Use the redistribute-to-the-right (RTTR) or inverse probability of censoring (IPC) algorithm to reassign the case weights of those observations censored prior to \(\tau\), resulting in a weighted subset whose status at \(\tau\) is known (a sketch follows this list).
Replace the response with pseudovalues at time \(\tau\).
Ignore censoring. A disastrous approach, but one that too often appears in practice. A common example would be assessment of 5 year survival by comparing those dead before 5 to those still alive at 5, while simply ignoring all those censored before 5. (Or perhaps even worse, treating them as alive at 5 years.)
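As a sketch of the RTTR step, using the rttright function from the survival package; vdata, time, and status are again placeholder names:

> wt5 <- rttright(Surv(time, status) ~ 1, vdata, times=5)   # weights at tau = 5
> dead5 <- with(vdata, ifelse(status==1 & time <= 5, 1, 0))
> # status at tau is known for every subject with non-zero weight, and the
> # weighted mean of dead5 reproduces 1 - KM(5)
> weighted.mean(dead5[wt5 > 0], wt5[wt5 > 0])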
In summary, the validation of time-to-event data has three primary dimensions: choice of the appropriate summary statistic, the choice of a range or set of comparison times \(\tau\), and choice of a method for handling censoring. The first two of these will be closely tied to application at hand, the third is more technical.
4 Data
One challenge with any presentation on external validation is the need for data set pairs, one to fit the model and another to assess it. There is a dearth of such available, but we have tried to assemble some useful cases.
4.1 Breast cancer
The rotterdam data contains information on 2982 primary breast cancer patients from the Rotterdam tumor bank.
The gbsg data set contains the patient records of 628 patients with complete data for prognostic variables, out of 720 enrolled in a 1984–1989 trial conducted by the German Breast Study Group for patients with node positive breast cancer. These data sets were introduced by Royston and Altman (2013) and have been widely used; both are included in the survival package. Figure 1 shows overall survival for the two groups. The Rotterdam study has longer follow-up.
The usual approach has been to fit a prediction model to the Rotterdam data, and assess model predictions using the GBSG data. I will use 3 models: the first uses menopausal status, size, nodes, and grade; a second adds a nonlinear age effect; and a third uses age as the time scale.
Call:
coxph(formula = Surv(ryear, rfs) ~ meno + size + grade + pspline(nodes),
data = rott2)
coef se(coef) se2 Chisq DF p
meno 1e-01 5e-02 5e-02 4e+00 1 0.03
size20-50 3e-01 6e-02 6e-02 3e+01 1 3e-07
size>50 5e-01 8e-02 8e-02 4e+01 1 8e-10
grade 3e-01 6e-02 6e-02 3e+01 1 6e-08
pspline(nodes), linear 7e-02 6e-03 5e-03 2e+02 1 <2e-16
pspline(nodes), nonlin 1e+02 3 <2e-16
Iterations: 6 outer, 17 Newton-Raphson
Theta= 1
Degrees of freedom for terms= 1 2 1 4
Likelihood ratio test=584 on 8 df, p=<2e-16
n= 2982, number of events= 1713
> # I dislike the color choices of termplot
> termplot2 <- function(fit, ...) termplot(fit, col.term=1, col.se=1, ...)
> termplot2(rfit, term=4, se=TRUE)
> abline(v=9, col="gray")
Figure 2: Spline fit for number of nodes, Rotterdam
The printout shows that the nodes effect has an important non-linear component. A plot of the estimated shape is shown in Figure 2. Risk increases with the number of nodes, up to about 8-9, and then stabilizes. We therefore recode nodes as node8 = pmin(nodes, 8). The anova below shows that this does not lose significant predictive value.
The risk appears lowest for subjects diagnosed around age 60, rising for both older and younger ages. A rough index of how important each covariate is to the risk score is to compute the standard deviation of its contribution to the overall risk score. We see below that nodes have by far the largest effect — an “average” node effect of exp(.47)= 1.6 fold — followed by size, age, and grade at 1.15–1.17.
(One could also use the MAD, which reduces the effect of node’s long right tail, but MAD is less informative for 0/1 variables.)
As an alternative analysis consider doing the model with current age as the time scale and time since enrollment as a covariate. A first question is how to model time since enrollment, which is by definition a time-dependent covariate, and will need to be added to the data set. In any study which includes informed consent, we often find the death rate in the first year after enrollment is only half that at later times. One possible reason for this is that subjects who are in extremis often do not enroll in a research study, and some trials actually include “expected survival of at least 6 months” in their eligibility criteria. Updating the on-study time and current age each day would be overly extreme and lead to a very large dataset, as the longest follow-up in the Rotterdam cohort is just over 19 years.
Below we opt to update at half year intervals. We can then fit a spline on tstart, the time since enrollment.
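A minimal sketch of that construction; the exact covariate list for the age-scale model rfit3 is our assumption, and node8 is the recoded node count from above:

> rott2$node8 <- pmin(rott2$nodes, 8)     # recoded node count, as above
> rott3 <- survSplit(Surv(ryear, rfs) ~ ., rott2, cut=seq(.5, 19, by=.5))
> rott3$age1 <- rott3$age + rott3$tstart  # current age at the start of each interval
> rott3$age2 <- rott3$age + rott3$ryear   # current age at the end of each interval
> rfit3 <- coxph(Surv(age1, age2, rfs) ~ meno + size + grade + node8 +
                 pspline(tstart), data=rott3)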
The partial likelihood for this model is actually larger than when we modeled age. The RFS rate rises by about 1.6 fold over the first 2 years after enrollment, then falls again, stabilizing after around 7 years. (The confidence intervals get crazy after 12 years so we have restricted the range of the plot.)
(Aside: the proper counterpart to a model with time-since-enrollment as the time scale and age as a covariate, is one with age as a time scale and enrollment time as a covariate. Each model controls for one of the variables by matching and for the other by modeling.)
When drawing a survival curve on the age scale, we also need to specify a starting point; e.g., expected survival for someone enrolled at age 60 would be computed as in the sketch below.
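A sketch using the start.time argument of survfit, where dummy is a hypothetical one-row data frame of covariate values for rfit3:

> sfit60 <- survfit(rfit3, newdata=dummy, start.time=60)  # condition on being alive at 60
> plot(sfit60, xlab="Age", ylab="Survival")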
The estimated effects of size and node drop off sharply in the first 3 years after enrollment, while the age effect is smallest in that period. Time-since-enrollment has the largest effect on younger ages. Nevertheless, these are the imperfect models we will validate.
We have been aggressive in the fit, finding an "optimal" transformation for number of nodes, and allowing for a general age effect. But, since such over-fitting is common in creating a risk score, we will retain this final version and expect some imperfection to be revealed in the validations.
There are some differences between the Rotterdam and GBSG covariates: GBSG has the size in mm instead of grouped, 12% of the GBSG subjects are grade 1 versus none of the Rotterdam subjects, GBSG has no patients with 0 nodes versus 48% in Rotterdam, and the Rotterdam follow-up is much longer. We have purposely ignored these, mimicking a risk score that has been built on one data set before the potential further uses of it were known.
McLernon (2023) also used the Rotterdam/GBSG pair, but fit their models to only the first 5 years of the Rotterdam data to better match the GBSG follow-up.
Indeed, in the (somewhat rare) case that the original data is available, matching the original fit to an intended use case makes sense.
> temp <- survSplit(Surv(ryear, rfs) ~ ., rott2, cut=5, episode="epoch")
> rott5 <- subset(temp, epoch==1)   # first 5 years of fu
> rfit1b <- update(rfit1, . ~ ., data=rott5)
> rfit2b <- update(rfit2, . ~ ., data=rott5)
> # rfit3b <- update(rfit3, . ~ ., data=rott5)
>
> # the model used in McLernon et al
> rott5$node3 <- cut(rott5$nodes, c(-1, 0, 2, 50), c("0", "1-2", ">3"))
> rott5$grade3 <- 1*(rott5$grade > 2)
> rfit4 <- coxph(Surv(ryear, rfs) ~ size + node3 + grade3, rott5)
4.1.3 GBSG data
The survival package makes validation computations easy if the original and validation data sets have exactly the same variables. For the GBSG data we will first need to change size to a categorical variable, then add the node8, ryear and rfs variables.
> gbsg2 <- gbsg
> gbsg2$sizec <- gbsg2$size    # sizec = continuous size in mm
> gbsg2$size <- cut(gbsg2$sizec, c(0, 20, 50, 500), c("<=20", "20-50", ">50"))
> # gbsg2$pgr3 <- with(gbsg2, pmax(30, pmin(200, pgr)))
> gbsg2$rfs <- gbsg2$status
> gbsg2$ryear <- gbsg2$rfstime/365.25
> gbsg2$node8 <- pmin(gbsg2$nodes, 8)
> gbsg2$eta1 <- predict(rfit1, newdata=gbsg2)  # risk score under first model
> gbsg2$eta2 <- predict(rfit2, newdata=gbsg2)  # under model 2
>
> gbsg3 <- survSplit(Surv(ryear, rfs) ~ ., gbsg2, cut=seq(.5, 9, by=.5))
> gbsg3$age1 <- with(gbsg3, age + tstart)
> gbsg3$age2 <- with(gbsg3, age + ryear)
> gbsg3$eta3 <- predict(rfit3, newdata=gbsg3)  # time scale model
4.2 Amyloidosis
Muchtar et al (2019) used data on 1007 patients with newly diagnosed light chain amyloidosis to assess the performance of 4 different prediction models that had appeared in the literature. Each of the models categorizes patients into 3-5 discrete stages of disease, with survival curves based on those strata. The amyloidmodel data set contains digitized curves from the papers, and amyloid the survival times and per-model risk category for each of the validation subjects.
This data is different in that a Cox model does not directly appear, only the predicted category for each subject. Cox models were used in some of the work to help motivate the categorical staging variables and their cutoffs, but they do not appear in the final scores. (The hematologists in this exercise felt that a discrete score is much easier for the practitioner to use.)
The 2004, 2012 and 2015 predictions differ substantially, particularly for the lowest risk group; there are, for instance, 70 subjects and 0 deaths in the top 2015 curve. This data spans a time when several new and effective treatments for amyloidosis became available. Any validation exercise will need to be cognizant of this.
4.3 Treatment trial in Primary Biliary Cirrhosis
The model and data underlying a risk score and predicted survival representing the untreated progression of primary biliary cirrhosis are well known; the pbc data has been part of the survival package for over 2 decades.
> pbc2 <- subset(pbc, id <= 312)
> pbc2$death <- 1*(pbc2$status == 2)
> pbcfit1 <- coxph(Surv(time, death) ~ age + edema + log(bili) + log(albumin) +
                   log(protime), pbc2, ties="breslow")
> pbc2$riskscore <- model.matrix(pbcfit1) %*% coef(pbcfit1)
> # or pbcfit1$linear.predictor + sum(pbcfit1$means * pbcfit1$coef)
> # For expected, matching the calculation used in the udca data is important
> pbcfit2 <- coxph(Surv(time, death) ~ riskscore, pbc2, ties="breslow")
The udca data set contains early results of a trial of ursodeoxycholic acid (UDCA) versus placebo for treatment of PBC, “early” in the sense that 3 years’ enrollment plus 5 years of follow-up had produced only 16 deaths. Estimating the coefficients for the 5 risk score variables, de novo within this study, while also getting an estimate of treatment effect is simply not feasible. However, we can use the PBC fit as a direct estimate of the expected death rate. The udca data includes the Mayo risk score, using the coefficients from the 312 subset.
4.4 Early enrollment predicting late enrollment
Semi-competing risks. Yet to be done.
4.5 Dementia and Death
Fits from a fixed population (Mayo Clinic Study of Aging) applied to data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Yet to be done.
5 Initial analysis
A few checks of the validation data should be done early in the process. These include a Kaplan-Meier plot of the outcome, overall or by subgroups, numerical and/or graphical summaries of each of the predictor variables found in the reference model, and a refit of the reference model using the new data. The goal of this is not validation per se, but to uncover simple data issues such as a lab test in different units, a different range of subjects, or a change in time scale.
An estimate of the censoring distribution may also be useful to understand limitations, e.g., if 95% of the validation sample is censored before 5 years then survival at \(\tau =5\) years will not be a wise choice for the validation measure.
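One standard tool is the "reverse" Kaplan-Meier, a KM computed with the censoring indicator flipped; a minimal sketch, with vdata, time, and status as placeholder names:

> cfit <- survfit(Surv(time, 1 - status) ~ 1, data=vdata)  # 'event' = censoring
> plot(cfit, xlab="Years", ylab="Probability of remaining uncensored")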
6 Concordance
6.1 Definition
The most common measure of discrimination for survival data is the concordance \(C\). For \(X\) and \(Y\) from a continuous bivariate distribution it is defined as
\[ C = P(Y_i > Y_j \mid X_i > X_j) = P(X_i > X_j \mid Y_i > Y_j)\,. \]
The two forms are numerically equivalent, just with a different viewpoint. Imagine that a local newspaper has predicted the outcome of the mayor’s race for 20 years. We could take a forward looking view and ask how often the voters followed what the paper predicted (y given x), or a backwards in time view and ask how often the paper was correct (x given y). In either case the estimate is (total correct)/(total contests). In the present case replace \(X\) with the model predictions \(\hat y_i\), and \(Y\) with \(t_i\), the observed survival times. A pair of observations is concordant if the ordering of the results agrees with the ordering of the predictions.
One numerical wrinkle has to do with ties. There have been various suggestions, which the concordance vignette lays out in detail. Here we only summarize that discussion.
For \(C\), count ties in the response time \(t\) as uninformative: such pairs count as neither concordant nor discordant. Ties in the predictor \(\hat y\) count as 1/2.
For survival data, pairs such as (5+, 7) are also treated as uninformative. We do not know if subject 1 (censored at 5) will or will not have a survival time greater than subject 2 (event at 7).
If the response is a 0/1 variable, then \(C\) = AUROC, the area under the receiver operating curve, a metric which is well established for binary outcomes. (Proving this simple theorem is harder than it looks, but the result is well known.)
The concordance has a natural interpretation as an experiment: present pairs of subjects one at a time to the physician, statistical model, or some other oracle, and count the number of correct predictions. Pairs that have the same outcome \(y_i=y_j\) are not put forward for scoring, since they do not help discriminate a good oracle from a bad one. If the oracle cannot decide (gives them a tie) then a random choice is made. This leads to a score of 1/2 for ties.
As a measure of association the concordance has two interesting properties. The first is that the prediction can be shifted by an arbitrary constant without changing \(C\), or equivalently that we do not need the intercept term of the predictor equation. A second is that, for single state survival, all 3 of our assessment targets lead to the same value: if the predicted probability of death is lower for subject A than B, then the expected number of death events will be lower for A, as will the expected number of years in the death state, and further, this order does not depend on what target time \(\tau\) might be chosen. There is thus a single concordance, which requires only the linear predictor \(X\beta\) from a coxph fit. (Multistate models, as we will see later, are more complex.)
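As a sketch, with fit the development-data coxph model and vdata the validation data (placeholder names), the computation needs nothing more than:

> eta <- predict(fit, newdata=vdata)   # the linear predictor X beta
> concordance(Surv(time, status) ~ eta, data=vdata, reverse=TRUE)
> # or, equivalently, concordance(fit, newdata=vdata)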
As shown in the concordance vignette, the concordance can also be written as
\[ C = \frac{1}{2} + \frac{\sum_i \delta_i \, w(t_i)\left(r_i(t_i) - 1/2\right)}{\sum_i \delta_i \, w(t_i)} \]
where \(\delta_i\) is 0 for censored and 1 for uncensored observations, \(r_i(t)\) is the percentile rank of \(\hat y_i\) among all those still at risk at time \(t_i\), and \(w(t)\) is a weight function. This looks very much like the score statistic from a Cox model, and indeed for a model with a single 0/1 predictor it turns out to be exactly the Cox score statistic. Rewriting the concordance in this form also properly allows for time dependent covariates and alternate time scales. Like a Cox model, at each event time all that matters is who is at risk at that time, and their covariate values at that time.
When there is a single binary predictor the case weights of \(w(t)= 1\), \(n(t)\), \(S(t-)\) and \(S(t-)/G(t-)\) correspond to the log-rank, Gehan-Wilcoxon, Peto-Wilcoxon, and Schemper tests for a treatment effect, where \(n(t)\) is the number at risk at time \(t\), \(S\) is the survival and \(G\) the censoring distribution. Many other weights have also been suggested. For the concordance, weights of \(n(t)\) and \(S(t-)/G(t-)\) correspond to the suggestions of Harrell and of Uno, respectively. At one time arguments about the relative strength and weakness of the various survival tests had a prominent role in the literature, but experience has shown that, for nearly all data sets, the choice does not matter all that much; and this once lively controversy has died away. In our limited experience, the same may be true for weighted versions of the concordance, and we suspect that the debate about Harrell vs. Uno weights will also fade with time. (For uncensored data, \(G(t)=1\) for all \(t\), and all the above weights are identical.)
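In the concordance function these choices are selected with the timewt argument; for instance, for the model and data pair of Section 4:

> charrell <- concordance(rfit1, newdata=gbsg2)                # n(t) weights, the default
> cuno     <- concordance(rfit1, newdata=gbsg2, timewt="S/G")  # Uno's weighting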
6.2 Dichotomized concordance
A variation that is important is the dichotomized concordance: replace \(t_i\) with \(I(t_i \le \tau)\) for a chosen cutpoint \(\tau\), i.e., a 0/1 indicator of death at or before time \(\tau\). Then
The redistribute-to-the-right (RTTR) algorithm can be used to reassign the case weights of all those censored before \(\tau\), for whom the value of the indicator variable is uncertain, to other cases. This results in a weighted data set where the 0/1 indicator is known, for all those with non-zero weight.
Compute a weighted concordance on the remaining observations.
Since it is based on a 0/1 outcome the resulting value of \(C\) is equal to the area under a receiver operating curve (AUROC). This approach has been labeled as the “time-dependent AUROC” by Heagerty, Lumley, and Pepe (2000). We consider this label a poor choice, however, because it invites confusion with time-dependent covariates, and will henceforth refer to this approach as a dichotomized concordance \(DC(\tau)\).
The \(DC\) and \(C\) statistics measure different things. \(DC\) tends to be a bit larger than \(C\), but with a larger variance. With respect to why it is larger, consider all pairs with \(|t_i- t_j| <c\) where \(c\) is a small constant, say 1/20 the range of \(t\). These are normally the hardest pairs to score correctly. \(C\) includes many more such pairs, \(DC\) only has the subset which are close to \(\tau\).
Another measure based on time dichotomy is an ordinary \(R^2\) using the weighted RTTR data,
\[ R^2 = \frac{D - N}{D} = 1 - \frac{\sum_i w_i\left(\delta_i - \hat p_i\right)^2}{\sum_i w_i\left(\delta_i - p_0\right)^2}, \]
where \(\delta\) is the 0/1 status indicator for observed death before \(\tau\), \(w\) are the IPC weights, \(\hat p\) is the predicted probability of death based on the model, and \(p_0\) is the result of fitting a null model with only an intercept. The null model predicted probability of an event at time \(\tau\) is simply 1 - Kaplan-Meier(\(\tau\)): one of the original arguments for the RTTR was that a weighted mean of the 0/1 status variable reprises the KM (1967). That is, a weighted logistic regression of \(\delta\) using only the intercept as a predictor is simply another way to compute \(KM(\tau)\). The numerator \(N\) is known as the Brier score. van Houwelingen and Putter (2012) show the values of \(N\) and \(D\) in parallel over a range of cutpoints (Figure 3.6), and on the same page define \((D-N)/D\) as the "relative error"; this is exactly the R-squared above. Kattan and Gerds (2018) redefine \(1 - N/D\) as the index of predictive accuracy (IPA).
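The brier function in the survival package returns both pieces; a sketch at a single cutoff, using the model and data pair of Section 4:

> bfit <- brier(rfit1, newdata=gbsg2, times=5)
> bfit$brier     # the Brier score N at tau = 5
> bfit$rsquare   # 1 - N/D, the IPA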
6.3 Breast cancer
Figure 5 shows the \(C\) statistic with \(n(t)\) (Gehan-Wilcoxon or Harrell) and \(S/G\) weights (Schemper or Uno) at 151 cutpoints \(\tau\) from 1 to 7 years, along with \(DC\) at those same points. The \(C\) statistic at \(\tau=4\), say, is a measure of how accurate predictions are up through 4 years, and can be computed by treating all pairs with both \(t_i\) and \(t_j\) greater than 4 as tied times, or, equivalently, censoring all the observations at 4. The DC at 4 treats pairs with both \(t_i\) and \(t_j\) greater than 4, and pairs with both \(t_i\) and \(t_j\) less than 4, as incomparable pairs.
> # Compute concordance and DC at multiple cutoffs
> tau2 <- seq(1, 7, length=151)   # 151 cutpoints, from 1 to 7 years
> reweight <- rttright(Surv(ryear, rfs) ~ 1, gbsg2, times=tau2)
> psmat <- 1 - pseudo(gsurv, times=tau2)
>
> Cstat <- array(0, dim=c(151, 2, 6))  # value and std, 6 measures
> for (i in 1:151) {
      # Compute each measure at each cutpoint tau
      c1 <- concordance(rfit1, newdata=gbsg2, ymax=tau2[i])
      c2 <- concordance(rfit1, newdata=gbsg2, ymax=tau2[i], timewt="S/G")
      ytau <- with(gbsg2, ifelse(rfs==1 & ryear <= tau2[i], 1, 0))
      c3 <- concordance(ytau ~ eta1, data=gbsg2, weight=reweight[,i],
                        subset=(reweight[,i] > 0))   # DC
      c4 <- concordance(rfit1, ymax=tau2[i])          # Original Rotterdam data
      c5 <- concordance(psmat[,i] ~ eta1, data=gbsg2, ymax=tau2[i])  # pseudo
      Cstat[i,,1] <- c(coef(c1), sqrt(vcov(c1)))
      Cstat[i,,2] <- c(coef(c2), sqrt(vcov(c2)))
      Cstat[i,,3] <- c(coef(c3), sqrt(vcov(c3)))
      Cstat[i,,4] <- c(coef(c4), sqrt(vcov(c4)))
      Cstat[i,,5] <- c(coef(c5), sqrt(vcov(c5)))
  }
> bfit <- brier(rfit1, newdata=gbsg2, times=tau2)
> Cstat[,1,6] <- (1 + sqrt(bfit$rsquare))/2
Figure 5: C statistics versus truncation time using predictions from the Rotterdam model, for Rotterdam and GBSG data sets.
For this particular data set pair, the discrimination of the model in the GBSG data set is not too far from its performance in the Rotterdam data used to build the model. From year 3 forward the standard \(C\) statistics (\(n(t)\) weight) differ by only .01 to .02; while the std of the GBSG concordance is .017. (The blip of better performance before year 1.5, in the validation versus the original, is an unexplained oddity.)
Concordance gets worse over time, which is not a surprise. Outcomes that are farther in the future are harder to predict in essentially all areas of life, from the weather to the stock market to the final height of a child. Survival time is no exception.
The \(S/G\) weighting gives proportionately larger weights than \(n(t)\) weighting to points later in time, and consequently leads to a lower estimate of \(C\). But the two remain close with a difference of approximately .01 at 4 years and .02 at 6 years. The \(S/G\) weight can become very large at later times, however, leading to late instability in the concordance.
The DC is larger than \(C\), but much more erratic with respect to the \(\tau\) cutpoint. The sometimes large changes in \(DC\) associated with small changes in \(\tau\) are, we suspect, a consequence of the sharp transitions in weight from \(>1\) to 0 when a censored observation is reassigned.
Figure 6 gives further insight into why \(DC > C\). The plot starts with all 6.7 million time pairs used to calculate concordance for the full Rotterdam data, with a value of 0 (not concordant), 1 (concordant) or .5 for each pair. The image colors come from a discrete average over a 40 x 40 grid, and the contour lines are predicted values from a glm fit. Correct prediction is harder for two outcomes that are nearly identical (the diagonal) and for outcomes further from the origin. If we focus on the concordance values at \(\tau=5\) in Figure 5, the concordance \(C\) at that point is an average that omits the upper right quadrant of Figure 6 (time 1 and time 2 both \(\ge 5\) years), while DC omits both the upper right and lower left quadrants, keeping only pairs with time 1 \(<5\) and time 2 \(>5\) or vice versa.
Figure 6: Image plot of the concordance for pairs of observed time values in the Rotterdam data set. The dashed contour lines are at .5 (diagonal), .55, .6, .65, .7 and .75.
Now add in the \(R^2\) and pseudovalue measures, and also plot the estimated standard deviation. For R-squared, we note that \(R=0\) for no association while \(C= 1/2\) for no association; plotting \((R+1)/2\) places them on similar scales.
Why does the DC variance grow so rapidly after 5 years? As \(\tau\) increases the RTTR algorithm has set more and more observations’ weight to zero: by time 6 only 33/387 of the censored subjects remain, and the induced 0/1 variable has counts of 296 and 36; and we know that for binomial data the precision of an estimate is proportional to the size of the smaller group. A concordance based on pseudovalues uses all the pairs: two observations censored at different times will have different pseudovalues (if at least one event occurs in the interim). We posit that the resultant increase in the number of comparable pairs may be responsible for a lower estimated variance. We do not recommend the pseudovalue \(C\) at this time, as its behavior and properties have not been fully investigated.
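For reference, the single-cutoff form of the pseudovalue concordance used in the figure is simply (gsurv is the overall KM of the GBSG data):

> ps5 <- 1 - pseudo(gsurv, times=5)   # pseudovalue estimate of P(death by 5 years)
> concordance(ps5 ~ eta1, data=gbsg2)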
The IJ variance used by the concordance function also allows computation of a variance for the difference between any two of the methods. I’m not sure that this would ever be useful, however.
The concordance lines for model 1 have been added in gray. We see that model 2 has essentially the same patterns, but higher discrimination: about .02 units at the 5 year mark. The plots do not tell us which \(\tau\) cutoff to use; that has to be decided based on the goals of the validation study.
6.3.1 Age scale
Using age as the time scale does not change the underlying computation for the concordance: for a death of observation \(i\) at age \(a_i\), the rank of subject \(i\)’s risk score \(\eta_i\) is compared to that for all others who are alive and at risk at that age. The total number of comparable subjects at risk at any given moment is lower, however, leading to far fewer comparable pairs and an increase in the standard error.
Model 1 Model 2 Model 3
Original C 0.6728 0.6833 0.6786
Original std 0.0068 0.0067 0.0072
GBSG C 0.6639 0.6816 0.6768
GBSG std 0.0162 0.0154 0.0173
What is not simple for the age scale is imposition of a time cutoff. A PH model using age as the time scale is predicting relative hazard for matched subjects of a given age; perhaps there would be a situation where it is not of interest to assess those above a certain age, or one where we wish to compare those within that age stratum who are within a fixed number of years of enrollment, but we have not encountered any such. A dichotomized concordance faces similar uncertainty: what 0/1 variable would one want to define?
An additional technical challenge is that the RTTR algorithm is no longer available, so there is no simple way to create said 0/1 variable in the face of censoring.
6.4 Amyloid
Compare the 5 different risk scores, of which a simple count of the number of affected organs is the most crude. Since several of the prediction models only extend to 60 months, we choose this as the upper limit \(\tau\). We see that, as expected, the number of affected organs has the lowest concordance with outcome, while the 2012, 2013 and 2015 modifications of the 2004 score do seem to give a slight improvement. When there are multiple terms on the right hand side, each is treated as a separate predictor, with a separate concordance assessment for each.
The reverse = TRUE argument is needed because a high score is expected to lead to shorter survival. A second question is whether any of these are a significant improvement. We can use the variance-covariance matrix of the values to address this.
It appears that all are significantly different from the simple count, and that the final two improve on the 2004 score.
6.5 UDCA treatment
For this data the concordance is not of primary interest. The concordance is very high for prediction of death, however, and moderately strong for prediction of progression.
7 Observed versus expected events
One of the more flexible, but unfortunately lesser known, validation methods is to assess the observed versus the expected number of events.
It is based on the fact that, if the model is correct, then \(\int Y_i(t)\left[dN_i(t) - d\hat\Lambda_i(t)\right]\) is a martingale, where \(\hat\Lambda_i\) is the cumulative hazard predicted by the model, also labeled the compensator of the martingale or the expected number of events. This holds whatever the model might be: a simple constant death rate, a Cox model, machine learning, etc. As importantly, the process remains a martingale under any choice of stopping time, as long as that is a predictable process, i.e., it does not look into the future. This is a variant of the well known "gambler’s ruin": no stopping strategy can change the long run outcome of a game of chance, e.g., leave the casino after winning 2 sequential bets.
There can be a separate stopping point for each subject, i.e., \(N_i(t_i) - \Lambda_i(t_i; x_i)\) is a martingale, where \(t_i\) is the observed follow-up time of subject \(i\).
One of the oldest and most familiar uses of this approach is the computation of the standardized mortality ratio (SMR).
In this case the expected number of events for each subject is computed using standard population mortality tables; the survexp.us array for instance contains United States mortality rates from 1940 to 2020 by age, sex and calendar year. Here is the calculation of expected events for a hypothetical male subject born on 1953-03-10, enrolled on study 2004-07-12, with last on-study follow-up on 2013-10-12.
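A sketch of that calculation; the use of survexp below is one of several ways to do it, and assumes the ratetable’s variables age (in days), sex, and year (a date):

> tdata <- data.frame(age = as.numeric(as.Date("2004-07-12") - as.Date("1953-03-10")),
                      sex = "male", year = as.Date("2004-07-12"))
> efit <- survexp(~ 1, data=tdata, ratetable=survexp.us,
                  times = as.numeric(as.Date("2013-10-12") - as.Date("2004-07-12")))
> -log(efit$surv)   # expected number of deaths = cumulative population hazard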
The sum of expected deaths for a cohort is then compared to the observed number of events; the ratio O/E is the SMR.
Berry (1983) shows that \(N_i\) has the same likelihood function, up to a constant, as a Poisson distribution with rate parameter \(\hat\Lambda_i\). We can thus assess the goodness of fit using simple glm fits, which provide both estimates and their standard errors. The compensator \(\Lambda_i(t_i)\) is the predicted number of events for each subject, which is obtained from a coxph model using the predict function. Below we show simple counts, then formal confidence intervals.
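A sketch of that computation for model 1 (the variable name expect is our choice; the table below covers all three models):

> gbsg2$expect <- predict(rfit1, newdata=gbsg2, type="expected")
> gfit <- glm(rfs ~ offset(log(expect)), family=poisson, data=gbsg2,
              subset= (expect > 0))
> exp(coef(gfit))      # the SMR
> exp(confint(gfit))   # its 95% confidence interval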
Model 1 Model 2 Model 3
SMR 1.11 1.22 1.09
lower CI 0.99 1.09 0.97
upper CI 1.24 1.37 1.23
Over the full follow-up time, the GBSG data has 11%, 22%, and 9% more deaths than predicted by models 1, 2 and 3, respectively. The use of subset= (expect > 0) deserves comment. When there are observations in the validation data set that are censored before the first event in the training data (there are 7 such in the GBSG data), then the Poisson log-likelihood has a term of \(0\log(0)\). Using L’Hôpital’s rule we know that \(\lim_{x \rightarrow 0} x \log(x) = 0\); thus these observations contribute 0 to the loglik and can be omitted. The glm function does not know calculus, however, and will generate NA if these observations are retained.
We can look at the excess risk at various levels of refinement: for the study as a whole, as a function of the risk score, or for specific covariate combinations. These increasingly strict checks are known as mean calibration (also calibration in the large), weak, moderate, and strong calibration. Weak calibration, sometimes known as the calibration slope, is an assessment that the risk in the validation data rises at the same rate as the predictions. Below are two simple assessments; the first prefigures material from Section 8. In both cases a slope of 1.0 lies well within the confidence interval: risk rises as we expect. When the reference model suffers from overfitting the calibration slope is often seriously less than 1; risk does not systematically vary as much as was predicted.
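A sketch of the first of these, the Cox refit that Section 8 takes up in detail; a well calibrated model has a slope near 1:

> cslope <- coxph(Surv(ryear, rfs) ~ eta1, data=gbsg2)
> summary(cslope)$coefficients   # calibration slope and its standard error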
The next level of assessment is to look at the excess as a function of the prediction, known as a calibration plot. The predict call below uses expect=1 to get the death rate relative to an ideal of 1.0, which would represent a perfect match to the predicted events. We have added a horizontal line at the overall excess. In this case, the excess does not appear to be related to the per-subject risk, e.g., it is not limited to the highest or lowest risk subjects. The hint of a rise near the left is only a hint; at no point does the 95% pointwise confidence band exclude the overall value of 1.2. A scaled density of the eta values has been added to the bottom. There are 60 subjects with \(\eta < -.5\) but only 12 events.
Figure 7: Calibration plot using observed/expected deaths
We can also divide the predicted risk score eta into bins to create a tableau reminiscent of the Hosmer-Lemeshow approach in binomial data. (When a similar categorization is applied to the original data and with a binomial outcome, the final row would be the components of the HL test. In this case the glm fit provides formal tests.)
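A sketch of that tableau, binning the risk score into quartiles (the bin boundaries are our choice; expect2 is the per-subject expected count used in the anova output below):

> gbsg2$ngrp <- cut(gbsg2$eta2, quantile(gbsg2$eta2, 0:4/4), include.lowest=TRUE)
> gfitHL <- glm(rfs ~ ngrp + offset(log(expect2)) - 1, family=poisson,
                data=gbsg2, subset=(expect2 > 0))
> exp(coef(gfitHL))   # per-bin observed/expected ratios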
If we repeat the plot for 5 year data there is no substantive change. If we use the termplot function below, the curve is centered at the grand mean so we draw the horizontal line at zero.
The astute reader will notice that the confidence intervals from the termplot call are a bit narrower. This is because the termplot result can ignore uncertainty in the intercept. It is a plot of relative excess rather than absolute excess; one could argue that doing so is asking a discrimination rather than a formal calibration question.
A next step is to look at components of the risk score.
The first fit gfit3a reveals nothing of consequence.
Because the GBSG data set has subjects who are grade 1–3 while the Rotterdam data had only 2–3, a second fit looks at the three grades separately. Grade was modeled as an integer, and the pattern of excess risk coefficients suggests that perhaps the 1-2 increment in risk is larger than the 2-3 increment. The overall ANOVA suggests that this could simply be random variation, however.
A similar check is made for the three size categories. The medium group, which has 66% of the cases, is almost identical to the overall: exp(.186) = 1.204. There is a hint of a trend, but neither the ANOVA nor a trend test is significant.
For nodes, we have broken the distribution into four approximately even groups and look at the excess per group. There is not a consistent pattern of excess: the largest excess is for 1 node, smallest for 2–3, and intermediate for 4–7 and 8+.
Last we look at age, fitting a spline to the excess.
Analysis of Deviance Table
Model 1: rfs ~ offset(log(expect2))
Model 2: rfs ~ ngrp + offset(log(expect2)) - 1
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 678 810.09
2 675 806.98 3 3.1052 0.3757
> # move boundary knots in from the edge a bit, which I prefer
> test2e <- update(gfit2, . ~ . + nsp(age, df=4, trim=.05))
> termplot2(test2e, se=TRUE)
> abline(0, 0, lty=3)
> rug(jitter(gbsg2$age))
> anova(gfit2, test2e, test="Chisq")
Analysis of Deviance Table
Model 1: rfs ~ offset(log(expect2))
Model 2: rfs ~ nsp(age, df = 4, trim = 0.05) + offset(log(expect2))
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 678 810.09
2 674 796.41 4 13.675 0.008408
> # redraw this directly on the excess risk scale
> dummy <- data.frame(age=21:80, expect2=1)
> yhat <- predict(test2e, newdata=dummy, se.fit=TRUE)
> yy <- yhat$fit + outer(yhat$se, c(0, -1.96, 1.96), '*')
> matplot(21:80, exp(yy), log='y', type='l', lty=c(1,2,2),
          lwd=c(2,1,1), col=1, xlab="Age", ylab="Excess death rate")
> abline(h=1.2, col='gray')
The age check shows a systematic bias in the predicted versus observed risk, and it is statistically significant. How do we parse this out? The quartiles of age in the GBSG study are 46 and 61, so approximately the middle half of the subjects lie in the upward trending portion of the plot. Referring back to the coxph fit, the predicted risk score drops over this interval (lower risk at 61). The coxph model predicts that the hazard drops by around 15% over this region; the gbsg data says "perhaps not so much". One way to look at this is to directly fit age in the GBSG data, holding all other coefficients at the Rotterdam fit values.
Note: some further work makes me believe that there is something wrong with the last plot above. Come back to this later.
We can also look at the accumulation of observed and expected over time. This is currently a bit clumsy in the software; a tmax argument in the predict function, similar to ymax in concordance, would make this simpler. Nevertheless:
The observed RFS events start out as less than expected for the first half year, catch up by year 1, and slowly rise from 1.1 to 1.2 times expected from year 2–7. We have not yet made the crucial decision about the time period of interest. Do we want to know about predictive behavior over the first year, first 2 years, first 5? The correct answer depends of course on the application at hand. This data set pair has been widely used for discussion of validation, and most of those papers have chosen 5 years without presenting any argument for that time point. In order to facilitate doing validation over the range of 0–5 years, add 3 more variables to the gbsg2 dataset: ryear5, rfs5, and expect5.
> tdata <- gbsg2
> tdata$rfs <- ifelse(tdata$ryear > 5, 0, tdata$rfs)
> tdata$ryear <- pmin(tdata$ryear, 5)
> gbsg2$ryear5 <- tdata$ryear
> gbsg2$rfs5 <- tdata$rfs
> gbsg2$expect5 <- predict(rfit2, tdata, type='expected')
>
> gfit5 <- glm(rfs5 ~ offset(log(expect5)), poisson, data=gbsg2,
               subset= (expect5 > 0))
> exp(coef(gfit5))   # over the shorter interval, the SMR is approx 1.20
(Intercept)
1.197491
7.1 Amyloidosis
To compute expected events under the models we need the cumulative hazard \(\Lambda(t)\). The Kaplan-Meier is not exactly \(\exp(-\Lambda)\), but as long as the KM is not near 0, -log(KM) is close. We use a simple threshold of .05 below. Since most of the predictions only go out to 5 years, it is important to only compute expected and observed deaths over that window.
> amyloid2 <- amyloid
> amyloid2$month60 <- with(amyloid, pmin(month, 60))
> amyloid2$status60 <- with(amyloid, ifelse(month > 60, 0, status))
>
> efun <- function(vtime, ttime, tsurv) {
      # vtime: fu time of the validation subject
      # ttime, tsurv: survival curves from the fitted model
      chaz <- -log(pmax(.05, tsurv))   # stop infinity
      expect <- double(length(vtime))
      z <- approx(ttime, chaz, vtime, rule=2)$y
      z
  }
>
> # Create a matrix of expected, one row per subject, one col per model
> # 2013 only has some of the curves
> amexpect <- matrix(0, nrow(amyloid), 3)
> years <- c(2004, 2012, 2015)
> for (i in 1:3) {
      tdata <- subset(amyloidmodel, study == years[i])
      vstage <- get(paste0('r', years[i]), amyloid)
      for (stage in unique(vstage, na.rm=T)) {
          indx1 <- which(vstage == stage)
          indx2 <- which(tdata$stage == stage)
          cat(i, stage, " ")
          amexpect[indx1, i] <- efun(amyloid2$month60[indx1], tdata$month[indx2],
                                     tdata$survival[indx2])
      }
  }
1 1 1 2 1 0 2 0 2 2 2 3 2 1 3 1 3 2
Warning in regularize.values(x, y, ties, missing(ties), na.rm =
na.rm): collapsing to unique 'x' values
3 3
Warning in regularize.values(x, y, ties, missing(ties), na.rm =
na.rm): collapsing to unique 'x' values
The total deaths in the amyloidosis validation data set at 5 years is 583, while the expected counts are 1487, 989 and 505 for the 2004, 2012 and 2015 predictions. The primary problem with this comparison is that it is naive. Look again at the earlier graph.
As stated earlier, there were major gains in the treatment of AL amyloidosis over this span of years. This leaves a set of models with very similar discrimination but wildly different calibration. Group the validation data set by enrollment year.
A formal test using glm runs into an issue, however, in that the 2015 reference has 0 deaths in stage 0, while the amyloid data has 51/199 deaths in that stage. This leads to 1*log(0) in the Poisson log-likelihood for such subjects: observed deaths are (statistically) impossible when the true rate is 0.
Call:
glm(formula = death ~ offset(log(expect)), family = poisson,
data = udca2, subset = (expect > 0))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3993 0.2500 -1.597 0.11
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 79.941 on 167 degrees of freedom
Residual deviance: 79.941 on 167 degrees of freedom
AIC: 113.94
Number of Fisher Scoring iterations: 6
Call:
glm(formula = death ~ trt + offset(log(expect)), family = poisson,
data = udca2, subset = (expect > 0))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.09179 0.31622 0.290 0.7716
trt -0.98988 0.51639 -1.917 0.0552
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 79.941 on 167 degrees of freedom
Residual deviance: 76.105 on 166 degrees of freedom
AIC: 112.1
Number of Fisher Scoring iterations: 6
8 Remodeling
A perhaps obvious approach to the censored data in the validation data set is to use standard censored data methods: the Kaplan-Meier, Cox model, etc.
One simple approach is to compare the predicted survival curve for the validation cohort, based on the model, with the KM of the validation cohort. For the predicted curve we need to use the proper marginal estimate, which is
\[ \bar S(t) = \frac{1}{n} \sum_{i=1}^n \hat S(t; x_i) \qquad (1) \]
where \(x_1, \ldots, x_n\) are the covariate vectors for the \(n\) validation samples and \(\hat S\) is the prediction using the target model. Equation 1 is called the direct adjusted survival, or in more recent years a \(g\)-estimate. The curve at the mean covariate,
\[ \hat S(t; \bar x) \qquad (2) \]
is not the same.
> directall <- survfit(rfit2, newdata=gbsg2)
> dim(directall)   # separate curve for all 686 subjects
data
686
> # collapse for plotting
> direct <- directall
> direct$surv <- rowMeans(direct$surv)
> # std does not collapse neatly, keep plot and summary from trying to use it
> direct$std.err <- direct$upper <- direct$lower <- NULL
>
> plot(gsurv, fun='event', lwd=c(2,1,1), xlab="Years since enrollment",
       ylab="Death")
> lines(direct, fun='event', conf.int=F, lwd=2, col=2)
> legend(0, .6, c("Observed GBSG deaths", "Predicted"), col=1:2, lwd=2, bty='n')
> temp1 <- 1 - summary(gsurv, time=4:7)$surv
> temp2 <- 1 - summary(direct, time=4:7)$surv  # predicted death at years 4:7
> round(temp1/temp2, 2)
[1] 1.13 1.14 1.17 1.26
Object directall contains 686 separate predicted survival curves, one per subject. Their average is the cohort survival. Exactly as in the SMR plot, we see that the observed deaths start out a little smaller than expected over the first few months, are about equal at 1 year, and greater than expected after 2 years. The ratios for probability of death at 4–7 years are not identical to the SMR, but similar in size.
The more common approach is to fit a Cox model with eta as the only covariate, as a way to assess the linearity of the relationship.
Call:
coxph(formula = Surv(ryear, rfs) ~ eta2, data = gbsg2)
coef exp(coef) se(coef) z p
eta2 1.01981 2.77268 0.09966 10.23 <2e-16
Likelihood ratio test=102.8 on 1 df, p=< 2.2e-16
n= 686, number of events= 299
> cfit2
Call:
coxph(formula = Surv(ryear, rfs) ~ eta2 + offset(eta2), data = gbsg2)
coef exp(coef) se(coef) z p
eta2 0.01981 1.02001 0.09966 0.199 0.842
Likelihood ratio test=0.04 on 1 df, p=0.8424
n= 686, number of events= 299
Figure 8: A check for systematic effect versus the risk score.
Fit cfit1 is often called regression calibration; a good model should have a coefficient of 1. If there was overfitting during creation of the prediction model then we might expect the actual risk to rise and fall less sharply than the predictions. The label is a bit of a misnomer since the result is on a relative scale rather than absolute, so is closer to discrimination. One could have a coefficient of 1.0 and all predicted survivals be half what we see in the data (but consistently half!).
A test for \(\beta=0\) in cfit2 is an easy way to test \(\beta=1\) in cfit1. Likewise checking for a horizontal line in Figure 8 is a visual check for systematic differences in the calibration as a function of \(\eta\). Notice that this is very similar to the calibration plot in Figure 7, with two differences. First, Figure 8 is centered at zero, which is a consequence of refitting the model using coxph: O-E residuals are now guaranteed to sum to zero within the validation (these are the martingale residuals for the fit.) Because of the new baseline hazard, the figure can only assess relative change, not absolute. Second, the confidence intervals in Figure 7 are a bit wider, since they also account for estimation of the intercept. One can argue over which of these is the more useful choice when assessing the overall shape of the curve.
The more common plot is to compute predicted survival for both the prediction model and the refit model for a chosen time \(\tau\) and range of risk scores eta, and then compare them on the same graph, what I call a \(\hat y\) vs. \(\hat y\) plot. An alternative but less common figure is to plot all times for a fixed risk score. This is a calibration measure since it is on an absolute scale.
> d2 <- survfit(cfit1, newdata=gbsg2)
> indx <- order(gbsg2$eta2)   # need to be in eta order to draw lines
> yrs1 <- findInterval(c(2,4,6), d2$time)
> yrs2 <- findInterval(c(2,4,6), direct$time)
> temp1 <- t(d2$surv[yrs1, indx])         # Refit survival
> temp2 <- t(directall$surv[yrs2, indx])  # Predicted survival
> matplot(gbsg2$eta2[indx], 1 - cbind(temp2, temp1), type='l', lwd=2,
          lty=c(1,1,1,2,2,2), col=1:3, xlab="Linear Predictor",
          ylab="Predicted Death Rates, across subjects")
> legend("topleft", outer(paste("Year", c(2,4,6)), c("Prediction", "Refit"),
                          paste), lty=c(1,1,1,2,2,2), col=1:3, lwd=2, bty='n')
> matplot(1 - temp2, 1 - temp1, type='l', lwd=2, col=1:3, lty=1,
          xlab="Model predicted death", ylab="Refit prediction")
> abline(0, 1, lty=3)
> legend(.6, .3, c("At year 2", "At year 4", "At year 6"), lwd=2, lty=1, col=1:3)
The final plot above, for a single follow-up year \(\tau\), is introduced in Austin, Harrell, and van Klaveren (2020), with a couple of changes. The first is to realize that the accuracy of the "refit" survival estimate requires that the refit model be correct, in particular the proportional hazards assumption. They recommend fitting a more flexible model to address this, and we agree. A check of PH for cfit1 shows issues, for instance; so we might add a time by eta interaction. (To do: show this.)
The KM comparison above is not subject to the modeling issues and gives a clear visual read of the lack of calibration.
However, the Cox model fits allow for more nuanced exploration, in particular how the calibration accuracy may change over values of the linear predictor or other covariates. A second difference is their use of the label "observed survival" on the y-axis of the above plot, which we do not agree with. The plot is a comparison of the validation model’s prediction to another estimate of survival. Given the presence of censoring this new \(\hat y\) may be one of the best estimates possible, using as it does our best censored data tools, but it is still an estimate.
To do: Add plot of \(S(t; \eta=\eta_0)\) with \(\hat S\) from the validation model and from the refit model, 2 curves on one graph.
8.0.1 Age scale models
Some portions of the above are unchanged by the use of age as the time scale, namely the model fits and displays of cfit1, cfit2, and cfit3 above. What does change are survival curves: we can now create predicted curves for any combination of starting age and linear predictor \(\eta\), and display their results either over \(\eta\) for a selected time \(\tau\), or over potential cutpoints \(\tau\) for selected values of \(\eta\). To do: show it.
8.1 Amyloidosis
The obvious first approach with amyloidosis is to plot the survival curve of the validation data along with a predicted curve from the prediction models. It is easiest to do by stage.
The above shows clear flaws in our prior calculation of observed versus expected, which implicitly extended all of the model curves to the right. For stages 0 and 1 this expected count will be seriously biased downward; for stage 3 it will be biased upward. The plot for 2012 shows that our counts are seriously biased there as well, perhaps more so. When using predictions with a limited time span, care is needed to ensure that like is compared to like. Nevertheless, we can see that the greatest gains in survival were for stages 0 and 1.
Since the prediction did not come from a fitted model, we do not have a linear predictor \(\eta\) as a concise summary of the difference between curves, and the survival regression methods do not apply.
8.2 UDCA
Two analysis strategies for the UDCA study were to first increase the number of events, by using other markers of liver progression, and second to make use of the PBC risk model to reduce the total number of covariates. We have seen an absolute risk version of this in the SMR section.
With only 10 events, survival curves that break subjects up by risk score would be fruitless. Look at a set of regression models.
Fit 1 Fit 2 Fit 3
coef -0.675 -0.932 -0.952
std 0.517 0.540 0.520
Since the study was randomized, one could argue that no adjustment for covariates is required. However, the first fit clearly shows a different treatment effect. The second fit adjusts for bilirubin, the most impactful portion of the risk score, and the third directly uses the coefficients from the PBC risk score. We can see that option 3 does, as we expected, somewhat decrease the std of the treatment coefficient, though it is still not quite significant. (The glm approach avoids estimating a baseline hazard, and its std was perhaps a tiny bit smaller at .516.)
       Fit 1   Fit 2   Fit 3
coef  -0.766  -0.939  -1.077
std    0.260   0.268   0.262
Again, we see an increase in the estimated effect when adjusting for confounders, with a larger increase when more than bilirubin is included, and a decrease in the std of the treatment effect. The total number of events, at 63, is still too small to seriously consider regression coefficients for all 5 elements of the risk score.
9 Binomial methods
Binomial methods are the most popular, but we have saved them for third. The overall approach is:
Choose a target time \(\tau\)
Use the RTTR (or IPW) approach to reassign the case weights of all those observations \(t_i\) in the validation data set which are censored before \(\tau\) to other observations \(j\) with \(t_j > t_i\); a sketch of this reweighting appears just after the list.
In the resulting data all observations with non-zero weight will have a known 0/1 status at time \(\tau\).
Pseudovalues are an alternative to the RTTR, one which assigns a response to both censored and non-censored subjects.
Apply well known measures for binomial data to the new data, e.g., sensitivity, specificity, logistic regression, etc.
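As a concrete reference for the reweighting step, here is a minimal sketch for a single cutoff \(\tau\); the function name and arguments are ours, and ties as well as a trailing censored observation with no one to its right are handled only crudely.

# redistribute-to-the-right weights at a single cutoff tau (a sketch);
# observations censored before tau end with weight 0
rttr <- function(time, status, tau) {
    n  <- length(time)
    wt <- rep(1/n, n)                      # start with equal weights
    for (i in order(time)) {               # process in increasing time order
        if (status[i] == 0 && time[i] < tau) {
            later <- which(time > time[i]) # everyone strictly to the right
            if (length(later) > 0) {
                wt[later] <- wt[later] + wt[i]/length(later)
                wt[i] <- 0
            }
        }
    }
    wt
}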
A simple weighted mean of the status values at \(\tau\), after redistributing the weights, turns out to be exactly equal to the Kaplan-Meier estimate at time \(\tau\) based on the unmodified validation data. Thus mean calibration at \(\tau\), on the probability scale, is quite easy to obtain: simply compare the predicted probability for the validation cohort as a whole to the KM at that point. Since the prediction model is considered fixed, the standard error of the difference is the std of the KM at that point. One should of course use one of the better std scales for the KM, rather than “plain”, to form a confidence interval. The plot of KM vs \(\hat S\) appears yet again …
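To see the identity numerically, a small check; vdata, with time and status variables, is a stand-in for the validation data, and rttr is the sketch above.

library(survival)

tau  <- 5
wt   <- rttr(vdata$time, vdata$status, tau)          # RTTR weights at tau
dead <- as.numeric(vdata$time <= tau & vdata$status == 1)
km   <- survfit(Surv(time, status) ~ 1, data = vdata)
c(rttr = weighted.mean(dead, wt),                    # weighted mean of 0/1 status
  km   = 1 - summary(km, times = tau)$surv)          # the two should agree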
If one were to fit a weighted logistic regression to the weighted data with status as the outcome and an intercept as a predictor, the predicted probability will again equal the KM, and the transformed confidence interval for \(\beta_0\) is yet another possible choice.
(We point this out not to argue for or against the mean-only model, but to show how things fit in.)
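A sketch of that intercept-only fit, continuing with the hypothetical dead and wt vectors from above; R will warn about non-integer binomial weights, which is expected here.

lfit0 <- glm(dead ~ 1, family = binomial, weights = wt)
predict(lfit0, type = "response")[1]         # again equals 1 - KM(tau)
se <- sqrt(diag(vcov(lfit0)))                # std of the intercept
plogis(coef(lfit0) + c(-1.96, 1.96) * se)    # CI for beta0, back-transformed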
One constraint is that the re-weighting approach does not extend to age scale analyses, as neither the RTTR nor pseudovalues accommodate data with delayed entry.
9.1 Breast cancer
As we have done before, start with the overall observed and predicted survival, and look at the difference between them. The overall predicted curve (directall) was computed earlier.
The raw difference between observed and predicted survival increases with time, which is what we would expect if there is a systematic shortcoming in the prediction model. Statisticians often pick a scale on which effects will be additive and approximately constant over time (though if we are honest, the transform is normally chosen purely for convenience). Both the complementary log-log and logit have some success with respect to this metric over the interval from 2.5 to 5 years, though the discontinuous weighting process adds a lot of noise: as \(\tau\) moves from left to right a censored observation’s weight slowly increases and then abruptly drops to zero.
Now do a binomial analysis with a cutoff of \(\tau = 5\) years. Normally, users default to a logistic link, but the \(\eta\) value from a Cox model is consistent with the cloglog link, so we use the latter.
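Our guess at the shape of such a fit, with assumed names: d5 for the \(\tau=5\) analysis data containing its RTTR weights wt, 0/1 death status dead, and the base-model linear predictor eta2. Under perfect calibration the eta2 coefficient would be 1 and the intercept would estimate \(\log \Lambda_0(\tau)\), since \(\text{cloglog}(p) = \log\Lambda_0(\tau) + \eta\) under the Cox model.

# hedged reconstruction of the cloglog fit used below
lfit3 <- glm(dead ~ eta2, family = binomial(link = "cloglog"),
             weights = wt, data = d5)
summary(lfit3)$coefficients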
The above are not formally calibration, since they do not use the absolute predictions. We can reprise the \(\hat y\) vs \(\hat y\) plot of the prior section, based on the cloglog fit.
> phat1 <- 1 - c(summary(directall, time=5)$surv)  # predicted death at 5 years
> phat2 <- predict(lfit3, type='response', se.fit = TRUE)
> yy <- phat2$fit + outer(phat2$se.fit, c(0, -1.96, 1.96), '*')
>
> indx <- order(d5$eta2)   # so that we can plot lines
> matplot(phat1[indx], yy[indx,], type='l', lwd=c(2,1,1), lty=c(1,2,2),
+         col=1, xlab="P(death at 5 yr), model prediction",
+         ylab="Refit binomial prediction")
> abline(0, 1, lty=3)
Using the set of \(\eta\) values found in the coxph fit to draw our curve is simply a convenience; we could as easily have chosen 50 ordered values across the range, and the curve would be the same. However, the predict functions for both coxph and glm are set up to give predictions for a new set of covariates, not for their linear combination. This is the reason we did not worry about the RTTR weights in the above.
We can also plot the ROC curve at 5 years. To do.
Another measure applicable to dichotomized time is the Brier score. Let \(y_i = I(t_i \le \tau)\) be the dichotomized response (death by \(\tau\)), \(\hat p_i(\tau)\) the predicted probability of death by \(\tau\) from the model, and \(\hat p_0(\tau)\) the null model probability based on a Kaplan-Meier of the validation data, i.e., a “fit” with no covariates. Then
\[
R^2 = 1 - \frac{N}{D} = 1 - \frac{\sum_i \left(y_i - \hat p_i(\tau)\right)^2}{\sum_i \left(y_i - \hat p_0(\tau)\right)^2}
\]
is the usual r-squared statistic, based on the 0/1 variable. The numerator \(N\) is (up to a factor of \(1/n\)) the Brier score, and \(1 - N/D\) is Kattan’s index of prediction accuracy.
As for the DC, with censored data we first reassign the weights of all observations censored before \(\tau\), and then compute weighted sums.
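A sketch of the weighted version, with the same hypothetical dead and wt as before and an assumed vector phat of model-predicted death probabilities at \(\tau\):

# weighted Brier score and Kattan's IPA at tau; the null probability p0 is
# the weighted mean of the 0/1 status, i.e., 1 - KM(tau)
brier <- function(dead, phat, wt) {
    p0 <- weighted.mean(dead, wt)
    N  <- weighted.mean((dead - phat)^2, wt)
    D  <- weighted.mean((dead - p0)^2, wt)
    c(brier = N, IPA = 1 - N/D)
}
brier(dead, phat, wt)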
10 Synthetic estimates
Not yet complete
The R squared measures of Nagelkerke, of Kent and O’Quigley, and of Royston and Sauerbrei, and the C statistic of Gönen and Heller, are computed by the royston function. These are measures that make no direct use of the response value: they exploit a relationship that holds, in some particular case, between the statistic of interest and other aspects of the fitted model (linear predictors, degrees of freedom, test statistic, etc.), and then use it to infer a value of the statistic for the data set at hand. We find their utility particularly questionable when applied to external validation, but they have sometimes been suggested in the literature.
10.1 Gönen and Heller C statistic
If proportional hazards holds exactly for a known linear predictor \(\eta_i = X_i\beta\), over all time, then it can be shown that for two prospective subjects \(i\) and \(j\)
\[
P(T_i > T_j \mid \eta_i, \eta_j) = \frac{1}{1 + e^{\eta_i - \eta_j}} = f(\eta_i - \eta_j),
\]
independent of the form of the baseline hazard (Gönen and Heller 2005). This is a very interesting result, which shows that the \(C\) statistic can be related to the average variation in the risk scores. For a fitted Cox model, the authors suggest ordering the fitted risk scores \(\hat\eta\) and computing \(C_G\) as the mean of \(f\) over all pairs with \(j > i\). An argument is made that this avoids having to deal with censoring, with an implicit assumption that the uncensored value of \(C\), integrated over all time, is the proper target of estimation. (We clearly disagree with the last.)
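In code, a minimal sketch of \(C_G\); the function name is ours, the loop is a plain \(O(n^2)\) pass, and the authors' adjustment for tied risk scores is ignored.

CGH <- function(eta) {
    eta <- sort(eta)               # order the risk scores
    n   <- length(eta)
    tot <- 0
    for (i in seq_len(n - 1))      # mean of f over all pairs j > i
        tot <- tot + sum(1/(1 + exp(eta[i] - eta[(i + 1):n])))
    tot / choose(n, 2)
}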
For fit 2 of the Rotterdam data, \(C = (.683, .670, .651)\) for the Harrell weights, the Uno weights, and the synthetic measure, respectively. The obvious use of this in validation, namely to simply plug in \(\hat\eta\) values for the new data set using the base model coefficients, is clearly unwise. The predicted concordance \(C_G\) will, for instance, be identical for two copies of the GBSG data set, the second of which has all of its time values randomly permuted.
Since \(C_G\) is essentially a measure of the spread of the predicted values \(\hat\eta\), one possible repair for the above flaw would be to refit the Cox model on the validation data, using \(\hat\eta_i\) as the predictor for subject \(i\), and then compute \(C_G\) from the rescaled scores, as sketched below. This is essentially the same idea as the regression slope.
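In sketch form, using the CGH function above and hypothetical names: vdata for the validation data, with eta built from the development-model coefficients.

library(survival)

refit <- coxph(Surv(time, status) ~ eta, data = vdata)  # recalibration fit
CGH(refit$linear.predictors)                            # C_G on rescaled scores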
11 Royston’s \(D\)
The Royston and Sauerbrei statistics \(D\) and \(R_D\) are also based on a particular measure of variation in the risk scores. First, convert the \(\hat\eta\) values into normal scores
\[
z_i = \Phi^{-1}\left( \frac{r_i - 3/8}{n + 1/4} \right),
\]
where \(r_i\) is the rank of \(\hat\eta_i\). Second, if there are tied \(\hat\eta\) values, replace those \(z\) values with their average. Third, fit a new Cox model with \(z\) as the only covariate, and use the resulting coefficient as a measure of the variation in risk scores. Last, compute the three estimates. (Since \(z\) has standard deviation 1, \(\beta z\) will have standard deviation \(\beta\).)
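As a sketch, the whole recipe as one function; time, status, and eta are assumed inputs, and average ranks stand in for the explicit tie-averaging step.

library(survival)

roystonD <- function(time, status, eta) {
    n <- length(eta)
    z <- qnorm((rank(eta, ties.method = "average") - 3/8) / (n + 1/4))
    # the published D first scales z by kappa = sqrt(8/pi); drop that scaling
    # to match the plain-z description above
    fit <- coxph(Surv(time, status) ~ I(z / sqrt(8/pi)))
    unname(coef(fit))
}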
Translation of this to the validation context is challenging. The first step is easy, i.e., to form reference values \(\hat\eta^*\) using the prior coefficients and the validation data. But a z-transform of these new values will not be the same transformation function as was used in the original data: the distribution of risk scores is quite different for the Rotterdam and GBSG data sets, for example, as shown below.
Our main problem with this approach is “what exactly does it measure?” We are interested in the predictive power of our original model; instead this gives estimates for a different risk score, one that will never actually be used in a further application of the model. Why do I care how that score behaves? Secondarily, said artificial risk score has a distribution that is extremely rare in practice. Errors may be approximately Gaussian, but raw medical data is emphatically not.
12 Synthetic R-squared
Nagelkerke, Kent and O’Quigley; yet to do.
13 Summary
Binomial estimates are familiar but have higher variance (sometimes a lot).
A goal should not be to “test” for validity, but to explore in what facets the estimate does or does not perform well: covariate patterns, risk score, time ranges, metric, etc. Then focus on the specific use case.
One of the problems in the field, in our opinion, has been the determined attempt to force survival into the Procrustean bed of binomial methods. This most often leads to a focus on a single timepoint \(\tau\), with \(p(\tau)\) as the target, and use of the RTTR to create a binomial response. This has the positive benefit of making the results look familiar, at the cost of potential statistical inefficiency and, far more importantly, of correctly answering the wrong question.
There is, after all, a reason that the Cox model is used for the analysis of censored data rather than RTTR(\(\tau\)) followed by logistic regression. I have often heard the argument that “we cannot do anything else, because the readers will not understand it”, but this is not one which I accept. Assuming a stupid audience is not a justification for sub-par methods.
14 Appendix
14.1 The redistribute to the right algorithm
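A tiny worked example, with made-up numbers: each censored observation’s current weight is split equally among all observations with strictly larger time, processing censorings in time order; the weighted death masses then reproduce the Kaplan-Meier exactly.

time   <- c(1, 2, 3, 4, 5)         # "+" marks censoring: 1, 2+, 3, 4+, 5
status <- c(1, 0, 1, 0, 1)
# initial weights: 0.2 each
# censor at 2: 0.2 splits among {3,4,5} -> 0.2, 0, 0.2667, 0.2667, 0.2667
# censor at 4: 0.2667 passes to {5}     -> 0.2, 0, 0.2667, 0,      0.5333
round(rttr(time, status, tau = 5), 4)  # the sketch from section 9
# S(1) = 1 - 0.2 = 0.8; S(3) = 0.8 - 0.2667 = 0.5333; S(5) = 0,
# matching survfit(Surv(time, status) ~ 1)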
15 References
Altman, D. G., and P. Royston. 2000. “What Do We Mean by Validating a Prognostic Model?” Statistics in Medicine 19: 453–73.
Austin, P. C., F. E. Harrell, and D. van Klaveren. 2020. “Graphical Calibration Curves and the Integrated Calibration Index for Survival Models.” Statistics in Medicine 39. https://doi.org/10.1002/sim.8570.
Berry, G. 1983. “The Analysis of Mortality by the Subject Years Method.” Biometrics 39: 173–84.
Efron, B. 1967. “The Two Sample Problem with Censored Data.” In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1:831–53.
Gönen, M., and G. Heller. 2005. “Concordance Probability and Discriminatory Power in Proportional Hazards Regression.” Biometrika 92: 965–70.
Heagerty, P. J., T. Lumley, and M. S. Pepe. 2000. “Time-Dependent ROC Curves for Censored Survival Data and a Diagnostic Marker.” Biometrics 56 (2): 337–44.
Houwelingen, H. C. van, and H. Putter. 2012. Dynamic Prediction in Clinical Survival Analysis. CRC Press.
Kattan, M. W., and T. A. Gerds. 2018. “The Index of Prediction Accuracy: An Intuitive Measure Useful for Evaluating Risk Prediction Models.” Diagnostic and Prognostic Research. https://doi.org/10.1186/s41512-018-0029-2.
Korn, E. L., and R. Simon. 1990. “Measures of Explained Variation for Survival Data.” Statistics in Medicine 9: 487–503.
McLernon, D. J., D. Giardiello, B. Van Calster, L. Wynants, N. van Geloven, M. van Smeden, T. Therneau, E. Steyerberg, and topic groups 6 and 8 of the STRATOS initiative. 2023. “Assessing Performance and Clinical Usefulness in Prediction Models with Survival Outcomes: Practical Guidance for Cox Proportional Hazards Models.” Annals of Internal Medicine 176: 105–14.
Muchtar, E., T. Therneau, D. Larson, M. Gertz, M. Lacy, F. Buadi, D. Dingli, et al. 2019. “Comparative Analysis of Staging Systems in AL Amyloidosis.” Leukemia 33: 811–13.