Original Article
Substantial effective sample sizes were required for external validation studies of predictive logistic regression models
Introduction
Predictive logistic regression models are important tools to provide estimates of patient outcome probabilities. A model that accurately predicts probabilities for patients in the development data may unfortunately not do so for new patients, even when the patients are derived from plausibly related populations, for example, patients treated more recently or patients from another center [1]. Therefore, the performance of prediction models needs to be tested in new patients (external validation) [2], [3]. A straightforward approach to study external validity is to split the development data into two parts: one part containing early treated patients to develop the model and another part containing the most recently treated patients to assess the performance. With this approach, the temporal aspect of external validity may be studied [1], [4], [5]. Similarly, the place aspect can be studied by splitting the data according to treatment centers [6], [7], [8].
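The split-sample approach described above can be sketched in a few lines; the cohort, the predictor matrix, and the 1997 cutoff year below are hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical cohort: treatment year, three predictors, and a binary outcome.
rng = np.random.default_rng(3)
year = rng.integers(1990, 2000, size=500)   # treatment year per patient
X = rng.normal(size=(500, 3))               # predictor values
y = rng.binomial(1, 0.45, size=500)         # binary outcome

# Temporal split: develop the model on earlier patients,
# assess its performance on the most recently treated patients.
cutoff = 1997
dev = year < cutoff
X_dev, y_dev = X[dev], y[dev]               # development data
X_val, y_val = X[~dev], y[~dev]             # temporal validation data
```

The same pattern applies to the place aspect of external validity, with treatment center taking the role of the cutoff year.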
Validation studies typically show a systematic deviation of the predicted probabilities or predicted probabilities that are too extreme [9], [10]. A systematic deviation of the probabilities (overall too high or too low) suggests that an important predictor variable was not included in the model [11], [12]. Probabilities that are too extreme (i.e., high predictions too high and low predictions too low) indicate that the regression coefficients of the prediction model were on average too large [13], [14], [15]. Individual regression coefficients can also be incorrect, because of differences in predictor definitions (bias) or imprecise estimation of the coefficients (imprecision). Further, a different distribution of predictor values (“case-mix”) can influence some aspects of model performance.
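Both patterns can be quantified by regressing the observed outcomes on the logit of the predicted probabilities (logistic recalibration): a nonzero intercept signals systematic deviation, and a slope below 1 signals too-extreme predictions. A minimal sketch, in which the simulated validation outcomes are generated with logits shrunk by a factor 0.7 and shifted by -0.5 (both values are assumptions for illustration):

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def fit_logistic(x, y, n_iter=25):
    """Fit y ~ intercept + slope * x by Newton-Raphson on the logistic likelihood."""
    X = np.column_stack([np.ones(len(x)), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        w = p * (1 - p)
        # Newton step: beta += (X' W X)^-1 X' (y - p)
        beta += np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
    return beta

rng = np.random.default_rng(0)
n = 5000
lp = rng.normal(0.0, 1.5, n)                  # model's linear predictor
pred_p = 1 / (1 + np.exp(-lp))                # predicted probabilities
# Simulated validation outcomes: true logits shrunk (0.7) and shifted (-0.5)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.7 * lp - 0.5))))
intercept, slope = fit_logistic(logit(pred_p), y)
```

With these assumptions the fitted calibration slope recovers roughly 0.7 and the intercept roughly -0.5, flagging both shrinkage and systematic deviation.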
Common measures to assess model performance include (1) calibration measures, which study the agreement between observed outcome frequencies and predicted probabilities, (2) discrimination measures, which study the ability of the model to distinguish between patients with different outcomes, and (3) overall performance measures, which incorporate aspects of both calibration and discrimination [16], [17]. Each measure is sensitive to different deviations. A calibration measure will likely have more power to detect systematically deviating predictions than a change in case-mix, which is expected to affect mainly the discriminative ability.
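As a concrete illustration of discrimination and overall performance, the c-statistic and the Brier score can be computed directly from predicted probabilities and observed outcomes; the simulated, well-calibrated data below are an assumption for illustration:

```python
import numpy as np

def c_statistic(p, y):
    """Probability that a randomly chosen event has a higher predicted
    probability than a randomly chosen nonevent (ties count one half)."""
    diff = p[y == 1][:, None] - p[y == 0][None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

def brier_score(p, y):
    """Mean squared difference between predicted probabilities and outcomes."""
    return float(np.mean((p - y) ** 2))

# Simulated, perfectly calibrated predictions for illustration
rng = np.random.default_rng(1)
p = rng.uniform(0.05, 0.95, 2000)
y = rng.binomial(1, p)
c = c_statistic(p, y)
b = brier_score(p, y)
```

The pairwise definition of the c-statistic makes explicit why it reacts to case-mix: a narrower spread of predicted probabilities mechanically lowers the proportion of concordant pairs.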
Little is known about adequate sample sizes for studying model performance in other populations [9], [12], [18]. In particular, samples that are too small may yield statistically nonsignificant results even when true differences exist. For binary outcomes, power is determined by the number of events (or nonevents, if these are less frequent), that is, the effective sample size. For instance, a sample of 821 patients may seem adequate, but an outcome frequency of 1.1% implies that the sample contains only nine events [19]. Such a data set provides little power to test differences in model performance. Here, we study the number of events at which relevant differences in model performance can be detected with measures of calibration and discrimination. We used a model that predicts the histology of retroperitoneal lymph nodes in patients treated with chemotherapy for metastatic testicular germ cell cancer [6], [20]. The model was validated in samples that each differed in a specific way from the development data. Power was evaluated with standard formulas for power calculations and with Monte Carlo simulations. We show that relatively many events are required to obtain reasonable power in external validation studies.
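A Monte Carlo power evaluation of this kind can be sketched as follows. Here a simple two-sided observed-versus-expected z-test for calibration-in-the-large stands in for the tests studied in the article, and the distribution of linear predictors and the logit shift of -0.5 are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def power_calibration_in_the_large(n, shift, n_sim=2000, crit=1.96):
    """Monte Carlo power of a two-sided observed-vs-expected z-test
    when the true logits deviate from the model's by `shift`."""
    rejections = 0
    for _ in range(n_sim):
        lp = rng.normal(0.0, 1.5, n)             # assumed linear predictors
        pred = 1 / (1 + np.exp(-lp))             # model predictions
        y = rng.binomial(1, 1 / (1 + np.exp(-(lp + shift))))  # shifted truth
        z = (y.sum() - pred.sum()) / np.sqrt(np.sum(pred * (1 - pred)))
        rejections += abs(z) > crit              # two-sided 5% level
    return rejections / n_sim

power_small = power_calibration_in_the_large(100, -0.5)   # few events
power_large = power_calibration_in_the_large(1000, -0.5)  # many events
```

Even for a sizable systematic shift, power at small sample sizes is modest, in line with the article's central message that validation samples need many events.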
Prediction model for metastatic testicular germ cell cancer
Resected retroperitoneal lymph nodes of patients treated with chemotherapy for metastatic testicular germ cell cancer contain purely benign tissue in about 45% of the operated patients. These patients thus undergo unnecessary surgery. A logistic regression model was constructed to predict the probability of benign tissue [6]. The model contained six predictor variables: three dichotomous variables (normal prechemotherapy levels of the serum tumor markers alpha-fetoprotein [AFP] and human chorionic gonadotropin [hCG] …
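A logistic model of this form turns a patient's predictor values into a probability via the inverse-logit transformation. The coefficients and patient values below are hypothetical, chosen only to show the mechanics; they are not the published model's estimates:

```python
import math

def predict_probability(x, beta):
    """Logistic prediction: 1 / (1 + exp(-(b0 + b1*x1 + ... + b6*x6)))."""
    lp = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1 / (1 + math.exp(-lp))

# Hypothetical coefficients (intercept + 6 predictors) and hypothetical
# patient values -- NOT the published model's estimates.
beta = [-0.5, 0.9, 0.8, 0.7, -0.03, 0.02, -0.01]
patient = [1, 1, 0, 20, 15, 40]
prob_benign = predict_probability(patient, beta)
```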
Model performance in validation samples
Fig. 1 shows the calibration curves and performance measures of the prediction model for metastatic testicular germ cell cancer in the simulated validation samples, at a very large sample size (n = 100,000). A sample from the same underlying population as the development data (Scenario 0) showed perfect calibration (slope = 1.0, intercept = 0.0), with good discrimination (c-statistic = 0.83) and good overall performance (R2 = 0.41, and Brier score = 0.17). Systematically too high predictions in the …
Discussion
We have shown that substantial sample sizes and numbers of events are required to detect relevant decreases in model performance in external validation samples. We studied several measures of calibration and discrimination. A model showing systematically too high predictions in the validation sample was best detected with a test for the intercept of the calibration line. This measure showed far more power than the Hosmer-Lemeshow goodness-of-fit statistic. The U-statistic tests the values …
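For readers who wish to reproduce such comparisons, a Hosmer-Lemeshow-type statistic can be sketched as follows; the per-group variance form in the denominator is one common variant, and the simulated logit shift of 1.0 is an assumption for illustration:

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(p, y, g=10):
    """Hosmer-Lemeshow-type chi-square over g groups of predicted risk.
    Denominator uses the sum of p*(1-p) per group (a common variant)."""
    order = np.argsort(p)
    stat = 0.0
    for idx in np.array_split(order, g):
        obs, exp = y[idx].sum(), p[idx].sum()
        stat += (obs - exp) ** 2 / np.sum(p[idx] * (1 - p[idx]))
    return stat, chi2.sf(stat, g - 2)

rng = np.random.default_rng(4)
lp = rng.normal(0.0, 1.5, 2000)
p = 1 / (1 + np.exp(-lp))
y_cal = rng.binomial(1, p)                              # well calibrated
y_mis = rng.binomial(1, 1 / (1 + np.exp(-(lp - 1.0))))  # predictions too high
stat_cal, pval_cal = hosmer_lemeshow(p, y_cal)
stat_mis, pval_mis = hosmer_lemeshow(p, y_mis)
```

A test focused on the calibration intercept targets the systematic shift directly, whereas the grouped chi-square spreads the signal over many degrees of freedom, which is consistent with the power difference reported above.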
Acknowledgments
This manuscript was substantially improved by the thoughtful comments of a reviewer. Yvonne Vergouwe was supported by The Netherlands Organization for Scientific Research. Ewout Steyerberg was supported by a fellowship from the Royal Netherlands Academy of Arts and Sciences.
References (39)
- et al. External validity of predictive models: a comparison of logistic regression, classification trees, and neural networks. J Clin Epidemiol (2003)
- et al. Validation of a prediction model and its predictors for the histology of residual masses in nonseminomatous testicular cancer. J Urol (2001)
- et al. Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates. J Clin Epidemiol (1995)
- et al. Prospective testing of two models based on clinical and oximetric variables for prediction of obstructive sleep apnea. Chest (2002)
- et al. A prognostic index for 30-day mortality after stroke. J Clin Epidemiol (2001)
- et al. Evaluation of the Leeds prognostic score for severe head injury. Lancet (1991)
- et al. Validation of a coronary prognostic index for the Chinese—a tale of three cities. Int J Cardiol (1989)
- et al. Assessing the generalizability of prognostic information. Ann Intern Med (1999)
- et al. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med (1996)
- Construction and assessment of classification rules (1997)
- Data splitting. Am Stat
- Validation of probabilistic predictions. Med Decis Making
- Prediction of residual retroperitoneal mass histology following chemotherapy for metastatic nonseminomatous germ cell tumor: multivariate analysis of individual patient data from six study groups. J Clin Oncol
- Construction, validation and updating of a prognostic model for kidney graft survival. Stat Med
- A clinical prediction rule for renal artery stenosis. Ann Intern Med
- What do we mean by validating a prognostic model? Stat Med
- Validation and updating of predictive logistic regression models: a study on sample size and shrinkage. Stat Med
- Two further applications of a model for binary regression. Biometrika
- Regression, prediction and shrinkage. J R Stat Soc B