Clinical prediction models or risk scores are developed to estimate or predict a patient’s risk of having (diagnosis) or developing (prognosis) a particular outcome or disease. Such models have become increasingly popular in anesthesiology, critical care, and surgery.1–3 The best-known risk score is undoubtedly the Apgar score.4 Others include the Acute Physiology and Chronic Health Evaluation (APACHE) score,5 Simplified Acute Physiology Score (SAPS),6 Framingham risk score,7 Ottawa ankle rule,8 and risk scores for predicting postoperative vomiting9 and pain.10 Any prediction model tends to show optimistic predictive accuracy in the data from which it was developed. A simple Medline search using a suggested search strategy11 revealed numerous examples of prediction models showing lower accuracy when applied in new subjects across all medical domains.12–19 The decrease in accuracy varied, but was often large enough to adversely affect patient management and outcome.

Hence, it is widely recommended that any prediction model should first be validated in new subjects before application in practice.1–3,18,20,21 The new subjects may be from a later period at the same institution (temporal validation); from another institution, city, or country (geographical validation); or from another level of care, e.g., primary vs. secondary care (transmural validation).

While the number of studies developing prediction models is increasing sharply, far fewer prediction models have been validated.3 Researchers frequently use their data set simply to develop their ‘own’ prediction model, without first validating existing models. If researchers do validate existing models and discover poor performance for their data or setting, they often proceed to develop a new model by re-estimating the predictor-outcome associations or even by repeating the entire selection of important predictors. For example, there are over 60 published models aiming to predict outcome after breast cancer22 and about 25 for predicting long-term outcome in neurotrauma patients.23 This practice is problematic for several reasons. First, developing a different model per time period, hospital, country, level of care, etc., makes prediction research particularistic and non-scientific. Second, prior knowledge is not used optimally, i.e., predictive information captured in the original model is neglected. Finally, validation studies commonly include fewer patients than the corresponding development study, making the new model more subject to chance variation and, thus, potentially even less generalizable than the original model.

The principle of using prior knowledge has been recognized in etiologic and intervention research, where meta-analyses are common. Prior knowledge can, however, also be used effectively in prediction research. When a prediction model performs inadequately in another population or setting, it has been shown that the model can often be ‘updated’ (adjusted) using the new data to improve its performance in that population.24,25 Such an updated model is based on both the development and the validation data. Unfortunately, these updating methods are seldom used in applied clinical research. The various updating methods differ in their comprehensiveness, reflected by the number of variables that are adjusted or re-estimated. In the simplest updating method, only one variable of the original prediction model is adjusted, while in the more extensive methods, the effects of several predictors are adjusted or additional variables are considered.

Given the considerable increase in the number of published prediction models across all medical domains (a number that will only grow with the introduction of electronic patient records),26–28 we thought it important to re-emphasize how a simple updating method can effectively adjust a prediction model to local circumstances. The prediction model described herein was developed to predict severe postoperative pain; however, the methodology can be applied to many types of prediction models across medical domains. The following is an example of a situation where a difference in the incidence of the primary outcome leads to reduced performance.

Methods

Patients

Moderate to severe acute postoperative pain occurs frequently after surgery. Incidences of up to 50% in inpatients and 40% in outpatients (patients who undergo ambulatory surgery) have been reported.29–31 Risk-based prophylactic treatment could reduce the frequency of postoperative pain. A prediction model that preoperatively predicts the risk of severe postoperative pain was developed with multivariable logistic regression. The model was thoroughly developed from the data of 1,944 surgical patients selected in the Academic Medical Center, Amsterdam, the Netherlands (development set). That data set has been reported previously.10,32 Severe acute postoperative pain (herein named ‘postoperative pain’) was defined as a score ≥6 on a numerical rating scale (0 indicates no pain at all, and 10 indicates the most severe pain imaginable) occurring at least once within the first hour after surgery. The prediction model is presented as an original regression formula (Box 1) and as an easy-to-use score chart (Fig. 1).

Fig. 1

Score chart to predict the risk of severe acute postoperative pain for inpatients and outpatients. The scores per predictor were derived by multiplying the regression coefficients by 5 and rounding to the nearest integer. A sum score can be calculated for each patient by adding the scores that correspond to the patient’s characteristics. The total sum score can be linked to the patient’s individual risk using the box in the lower part. Consider, for example, an inpatient setting (intercept = −0.42, corresponding score = 0), a female patient (β = −0.004, score = 0) of age 64 (β = −0.009 * 64 = −0.576, score = −3), with a preoperative pain score of 7 (β = 0.11 * 7 = 0.77, score = 4), who is scheduled for a high-pain procedure (β = 1.05, score = 5) with a small expected incision size (β = 0, score = 0), who has a preoperative anxiety score of 16 (β = 0.05 * 16 = 0.8, score = 4), and a preoperative need-for-information score of 4 (β = −0.05 * 4 = −0.20, score = −1). The intercept plus the regression coefficients times the predictor values total 1.42 using the formula in Box 1, yielding a predicted risk of pain of 1/(1 + e^(−1.42)) = 80%. The total score is 9, which results in a risk of postoperative pain of 81%. Reprinted with permission from Janssen et al.32

Box 1 Original regression model to predict the risk of severe acute postoperative pain for inpatients and outpatients, where −0.42 is the intercept and the other numbers are the regression coefficients (β) of each predictor or interaction term. Female gender, types of surgery, expected incision size ≥10 cm, and ambulatory surgery equal 1 if true and 0 otherwise. Separate regression coefficients (β) for gender and type of surgery were estimated for inpatients and outpatients, as the effect of these predictors differed between in- and outpatients (interaction terms)
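For readers who want the arithmetic of the Fig. 1 legend spelled out, the following minimal sketch in Python reproduces the worked example; the variable names are ours, and only the coefficients quoted in the legend are used, not the full Box 1 model.

```python
import math

# Worked example from the Fig. 1 legend: an inpatient, female, age 64, preoperative
# pain score 7, a high-pain procedure, a small expected incision, anxiety score 16,
# and a need-for-information score of 4. Coefficients are those quoted in the legend;
# the complete model, including the interaction terms, is given in Box 1.
intercept = -0.42                                   # inpatient intercept (chart score 0)
predictor_terms = {
    "female gender":             -0.004,
    "age 64 years":              -0.009 * 64,
    "preoperative pain score 7":  0.11 * 7,
    "high-pain procedure":        1.05,
    "small expected incision":    0.0,
    "anxiety score 16":           0.05 * 16,
    "need-for-information 4":    -0.05 * 4,
}

linear_predictor = intercept + sum(predictor_terms.values())   # about 1.42
risk = 1.0 / (1.0 + math.exp(-linear_predictor))               # about 80%

# Simplified score chart: each predictor contribution times 5, rounded to the
# nearest integer (the inpatient intercept itself corresponds to a chart score of 0).
sum_score = sum(round(5 * v) for v in predictor_terms.values())  # 0 - 3 + 4 + 5 + 0 + 4 - 1 = 9
print(f"linear predictor = {linear_predictor:.2f}, risk = {risk:.1%}, sum score = {sum_score}")
```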

We studied the predictive performance of the model in 1,035 new patients (validation set), to test whether or not the model could be generalized across time and place. This data set has been presented and analyzed elsewhere.32 As a minimum of 100 events are required to detect changes in the predictive performance between two sets, our validation set was large enough for this purpose.33 The patients in our validation set were scheduled for surgery more recently (between February and December 2004) and in a different academic hospital (University Medical Center, Utrecht, the Netherlands), i.e., temporal and geographical validation. The study was approved by the institutional medical ethics committee, and all patients gave written informed consent for their participation.

Statistical analyses

We considered two aspects of the performance of the prediction model in the validation set, i.e., calibration and discrimination.

Calibration

Calibration is the agreement between the risks predicted by the model and the observed frequencies of the outcome. It can be assessed graphically with a calibration plot, in which the predicted probabilities form the independent variable and the observed frequencies the dependent variable. Ideally, the calibration line lies exactly on the 45° line, implying that the predicted risks equal the observed frequencies. However, when the incidence of the outcome is lower in the validation set, all predicted risks may be systematically overestimated. In that situation, the intercept of the prediction model (which reflects the risk of the outcome not explained by the covariates) can easily be adjusted such that the mean predicted risk equals the observed incidence in the validation set.34 This modification is called ‘updating’: the model is adjusted to the new circumstances, combining the information captured in the original model with the information (the lower outcome incidence) from the new patients.24,25
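As a rough illustration of how such a plot can be constructed, the sketch below assumes that the validation data are available as an array of observed outcomes y and an array of model-predicted risks p; grouping by deciles of predicted risk mirrors the grouping used in Fig. 2.

```python
import numpy as np
import matplotlib.pyplot as plt

def calibration_plot(y, p, n_groups=10):
    """Plot observed outcome frequency against mean predicted risk per risk decile.

    y: observed outcomes (0/1); p: predicted risks in [0, 1].
    """
    y, p = np.asarray(y, dtype=float), np.asarray(p, dtype=float)
    order = np.argsort(p)
    groups = np.array_split(order, n_groups)            # deciles of predicted risk
    mean_pred = [p[g].mean() for g in groups]
    obs_freq = [y[g].mean() for g in groups]

    plt.plot(mean_pred, obs_freq, "^", label="observed frequency per decile")
    plt.plot([0, 1], [0, 1], ":", label="perfect calibration (45° line)")
    plt.xlabel("Predicted risk")
    plt.ylabel("Observed frequency")
    plt.legend()
    plt.show()
    return mean_pred, obs_freq
```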

The correction factor for the intercept is estimated in the validation set and is based on the mean predicted risk and the incidence in the validation set.

$$ \text{Correction factor} = \ln\left( \frac{\dfrac{\text{incidence in validation set}}{1 - \text{incidence in validation set}}}{\dfrac{\text{mean predicted risk in validation set}}{1 - \text{mean predicted risk in validation set}}} \right) $$

This correction factor equals the natural logarithm of the ratio of the observed odds (based on the incidence) to the mean predicted odds. The correction factor simply needs to be added to the intercept of the original model (Box 1) when the model is applied to the new patients. Consequently, in the simplified score chart, the new intercept also needs to be converted to an easy-to-use number.
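A minimal sketch of this calculation in Python (the function names are ours):

```python
import math

def intercept_correction(incidence, mean_predicted_risk):
    """Correction factor: natural log of observed odds divided by mean predicted odds."""
    observed_odds = incidence / (1.0 - incidence)
    predicted_odds = mean_predicted_risk / (1.0 - mean_predicted_risk)
    return math.log(observed_odds / predicted_odds)

def update_intercept(original_intercept, incidence, mean_predicted_risk):
    """Add the correction factor to the intercept of the original model."""
    return original_intercept + intercept_correction(incidence, mean_predicted_risk)
```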

Discrimination

Discrimination is the ability of the model to distinguish patients with postoperative pain from patients without postoperative pain and is quantified with the area under the receiver operating characteristic (ROC) curve (AUC). The AUC ranges from 0.5 (no discrimination; same as flipping a coin) to 1.0 (perfect discrimination). The AUC of a prediction model is obtained by ranking all subjects in a data set by their predicted risk and estimating the extent to which these predicted risks differ between subjects with and without the outcome; the AUC is thus a rank-order statistic.35,36
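To make the rank-order interpretation concrete, the AUC can be computed as a Mann-Whitney-type statistic, as in the sketch below (again assuming an outcome array y and predicted risks p; in practice a library routine such as scikit-learn’s roc_auc_score gives the same value).

```python
import numpy as np

def auc_rank(y, p):
    """AUC as a rank-order statistic: the probability that a randomly chosen patient
    with the outcome has a higher predicted risk than a randomly chosen patient
    without the outcome (ties count one half)."""
    y, p = np.asarray(y), np.asarray(p, dtype=float)
    pos, neg = p[y == 1], p[y == 0]
    greater = (pos[:, None] > neg[None, :]).sum()      # concordant pairs
    ties = (pos[:, None] == neg[None, :]).sum()        # tied pairs
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```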

Results

Approximately one-third (36%) of the patients in the validation set reported severe pain, compared with 62% of the patients in the development set (Table 1). The distribution of most predictors was similar in the two data sets, although, compared with patients in the development set, patients in the validation set were slightly older (47 vs. 43 years), more often had ambulatory surgery (43% vs. 28%), and had a lower incidence of large surgical incisions (7% vs. 37%).

Table 1 Distribution of the characteristics of patients who underwent surgery in the Amsterdam cohort between April 1997 and January 1999 and patients who underwent surgery in the Utrecht cohort between February and December 2004; % (n) unless stated otherwise

Calibration

Prediction models usually show good calibration in the development set, which was also the case for the postoperative pain model (Fig. 2a). However, the model showed insufficient calibration when tested in the patients of the validation set (Fig. 2b). Predicted risks were systematically higher than observed frequencies. The question arises as to how this happened.

Fig. 2

Calibration line of the original prediction model in the development set (a), in the validation set (b), and the calibration line of the original prediction model with adjusted intercept in the validation set (c). Triangles indicate the observed frequency of severe acute postoperative pain per decile of predicted risk. The solid line shows the relation between observed outcomes and predicted risks. Ideally, this line equals the dotted line that represents perfect calibration, where the predicted risks equal the observed frequencies of severe postoperative pain. Reprinted with permission from Janssen et al.32

The systematic overestimation of risk by the prediction model in the validation set corresponds to the difference in incidence of postoperative pain between the development set (62%) and the validation set (36%). Since the incidence in the validation set was 36.1% and the mean predicted risk of its patients was 57.7%, the correction factor was:

$$ \ln\left( \frac{0.361/(1 - 0.361)}{0.577/(1 - 0.577)} \right) = \ln(0.414) = -0.89 $$

The correction factor was added to the intercept of the original model (Box 1), which yields the new intercept of −0.42 − 0.89 = −1.31 that should be used when applying the model to the patients in the validation set. Consequently, in the simplified score chart, the new intercept also needed to be converted to an easy-to-use number. As the scores in the score chart were equal to the regression coefficients multiplied by 5 and rounded to the nearest integer (Fig. 1), the new intercept was converted to a score of −7 (−1.31 * 5 = −6.55 ≈ −7). For example, for the female patient discussed in the legend of Fig. 1, the linear predictor of the adjusted prediction model summed to 0.53, yielding a risk of 1/(1 + e^(−0.53)) = 63% (vs. 80% before updating) (Fig. 1). Using the simplified score chart, the total sum score was 2, which corresponds to a risk of postoperative pain of 60%.
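Using the intercept_correction helper sketched in the Methods section above, the same arithmetic can be reproduced; small discrepancies with the values in the text reflect rounding of the incidence and mean predicted risk.

```python
import math

corr = intercept_correction(incidence=0.361, mean_predicted_risk=0.577)
# corr is about -0.88 with these rounded inputs (reported as -0.89 in the text)

new_intercept = -0.42 + corr                      # about -1.31
new_chart_intercept = round(5 * new_intercept)    # -7 in the simplified score chart

new_linear_predictor = 1.42 + corr                # example patient from the Fig. 1 legend
new_risk = 1.0 / (1.0 + math.exp(-new_linear_predictor))   # about 0.63 (was about 0.80)
```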

The calibration plot of the updated model in the validation set is shown in Fig. 2c. As expected, the updated prediction model resulted in lower predicted risks and a calibration line that was much closer to the ideal line.

Discrimination

The AUC of the original prediction model (before updating) in the validation set was 0.65 (0.57–0.73), compared to 0.71 (0.66–0.76) in the development set. Adjustment of the intercept, i.e., adding or subtracting a fixed value for each subject, did not change the ranking of the predicted risks of the subjects. Thus, the AUC of the model was unaltered, and the AUC of the updated model was also 0.65 in the validation set.

Discussion

It is important to improve the scientific approaches for evaluating the generalizability of prediction models: on the one hand, prediction models receive increasing attention in the literature and in clinical practice; on the other hand, a model’s accuracy in new subjects is often poor. Developing and improving methods for validating and updating prediction models is therefore relevant to many clinical domains dealing with the diagnosis and prognosis of patients. If a model does not perform sufficiently well in new patients at the outset, an alternative to developing a new model is to first adjust or update the previously developed prediction model(s) with the new data so as to improve calibration and/or discrimination, provided that the initial model was appropriately developed.24,25,37–39 The updated model is based on more patient data, yielding better risk estimates, and may therefore transfer more readily to other, as yet untested, populations.

In our study, the model showed disappointing calibration due to a difference in outcome incidence. It must be noted that calibration would not be adversely affected if the difference in incidence were due only to differences in patient characteristics (predictors) included in the model. For example, if the lower incidence in the validation set were the result of a larger proportion of older patients (who experience postoperative pain less frequently, as reflected by a regression coefficient of −0.009 per year) or a larger proportion of patients undergoing a type of low-pain surgery (regression coefficient of 0.50, compared with 1.72 for the highest-pain surgery), the (mean) predicted risks in the validation set would also be lower and thus closer to the observed frequencies. Accordingly, the calibration plot of the model in the validation set would be comparable to Fig. 2a. However, the mean predicted risk of the model in the validation set (57.7%) was substantially higher than the observed frequency (36.1%). Hence, the lower postoperative pain incidence in the validation set and the overestimation by the model cannot be explained by the predictors; they must therefore be the result of characteristics that were not included in the model. For example, the hospital used for the development study may have used less aggressive pain treatment than the validation hospital. Note that the presented correction factor is not applicable when the outcome frequency is extremely low (below 5%) or extremely high (above 95%). For such extreme outcome frequencies, the correction factor can be estimated by fitting a logistic regression model in the new patients with the linear predictor of the prediction model entered as an offset and no other covariates. For further elaboration, we refer to the literature.24,25
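A minimal sketch of that offset approach, assuming the statsmodels package, an outcome array y, and the linear predictor of the original model computed for each new patient:

```python
import numpy as np
import statsmodels.api as sm

def intercept_correction_by_offset(y, linear_predictor):
    """Estimate the intercept correction by fitting an intercept-only logistic model
    in the validation data, with the original linear predictor entered as an offset
    (i.e., its slope is fixed at 1 and only a new intercept term is estimated)."""
    X = np.ones((len(y), 1))                           # intercept only
    fit = sm.GLM(y, X, family=sm.families.Binomial(),
                 offset=np.asarray(linear_predictor, dtype=float)).fit()
    return fit.params[0]                               # the estimated correction factor
```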

Adjustment may vary from simply updating the intercept (or constant) of the model for differences in outcome frequency, as we illustrated above; to adjusting the relative weights of the predictors in the model, in case the associations of the predictors differ in the new population; to adding new predictors, in case an important predictor was overlooked. We showed that a simple intercept adjustment can greatly improve the performance of a prediction model in new patients. However, such adjustment only improves the calibration of the model. Usually, more rigorous adjustments are needed to enhance a model’s discrimination, for instance, adding (previously missed) important predictors to the model.24,25 It is recommended, however, that a substantially updated model again be validated in other populations. Furthermore, we stress that updating methods require the original regression formula of the prediction model, as exhibited in Box 1. This means that researchers developing prediction models should not only present the simplified risk score of their model, as frequently occurs, but should also present the underlying regression model. We refer to the literature for a more comprehensive discussion of available methods for updating prediction models.24,25
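By way of illustration of the intermediate level of adjustment mentioned above, which was not needed in our example, the relative weights of the predictors can be re-estimated jointly through a single calibration slope (often called logistic recalibration). The sketch below again assumes statsmodels and the original linear predictor computed for each validation patient.

```python
import numpy as np
import statsmodels.api as sm

def logistic_recalibration(y, linear_predictor):
    """Re-estimate a new intercept and one overall calibration slope for the original
    linear predictor in the validation data; a slope below 1 indicates that the
    original predictor effects are too extreme for the new setting."""
    X = sm.add_constant(np.asarray(linear_predictor, dtype=float))
    fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
    new_intercept, calibration_slope = fit.params
    return new_intercept, calibration_slope
```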

It would be useful if we could describe which calibration plots call for which updating methods. Unfortunately, this is not as straightforward as it may seem. For example, it is not possible, based on the calibration plot alone, to choose between the updating method in which additional predictors are considered and the updating method in which a predictor already in the model is re-estimated. When there is only a difference in incidence, as in our clinical example, the calibration plot may indicate the method of updating; however, this updating method cannot improve the discrimination. The researcher must therefore consider whether the discrimination and/or the calibration needs to be improved, whether a change in measurement may have caused the decrease in discrimination and/or calibration, and whether potential predictors are available that may improve the discrimination and/or calibration. In short, the type of updating cannot be recommended purely on the basis of the calibration plot.

Recent methodological advances in prediction research may lead to future prediction models being continuously validated and updated, quantitatively retaining all available evidence. The extent to which this process of model validation and updating must be pursued before clinical application is justified is as yet unknown and a topic for further research. Although our paper relates to the preoperative prediction of severe postoperative pain, it may serve as a practicable contribution to improving the validation and use of prediction models in other medical domains, including perioperative emergency medicine and surgery.