Original Article
Substantial effective sample sizes were required for external validation studies of predictive logistic regression models

https://doi.org/10.1016/j.jclinepi.2004.06.017

Abstract

Background and Objectives

The performance of a prediction model is usually worse in external validation data than in the development data. We aimed to determine the effective sample sizes (i.e., numbers of events) at which relevant differences in model performance can be detected with adequate power.

Methods

We used a logistic regression model to predict the probability that residual masses of patients treated for metastatic testicular cancer contained only benign tissue. We performed standard power calculations and Monte Carlo simulations to estimate the numbers of events that are required to detect several types of model invalidity with 80% power at the 5% significance level.

Results

A validation sample with 111 events was required to detect that a model predicted too high probabilities, when predictions were on average 1.5 times too high on the odds scale. A decrease in discriminative ability of the model, indicated by a decrease in the c-statistic from 0.83 to 0.73, required 81 to 106 events, depending on the specific scenario.

Conclusion

We suggest a minimum of 100 events and 100 nonevents for external validation samples. Specific hypotheses may, however, require substantially higher effective sample sizes to obtain adequate power.

Introduction

Predictive logistic regression models are important tools to provide estimates of patient outcome probabilities. A model that accurately predicts probabilities for patients in the development data may unfortunately not do so for new patients, even when these patients come from plausibly related populations, for example, patients treated more recently or patients from another center [1]. Therefore, the performance of prediction models needs to be tested in new patients (external validation) [2], [3]. A straightforward approach to study external validity is to split the development data into two parts: one containing the earlier treated patients to develop the model, and another containing the most recently treated patients to assess its performance. With this approach, the temporal aspect of external validity may be studied [1], [4], [5]. Similarly, the place aspect can be studied by splitting the data according to treatment center [6], [7], [8].

Validation studies typically show a systematic deviation of the predicted probabilities or predicted probabilities that are too extreme [9], [10]. A systematic deviation of the probabilities (overall too high or too low) suggests that an important predictor variable was not included in the model [11], [12]. If the probabilities are too extreme (i.e., high predictions too high and low predictions too low), the regression coefficients of the prediction model were on average estimated too large [13], [14], [15]. Individual regression coefficients can also be incorrect, because of differences in predictor definitions (bias) or imprecise estimation of the coefficients (imprecision). Further, a different distribution of predictor values (“case-mix”) can influence some aspects of model performance.
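These two patterns are commonly quantified with the calibration intercept and calibration slope: refitting the model's linear predictor in the validation data gives a slope below 1 when predictions are too extreme, and an intercept below 0 (with the linear predictor as offset) when predictions are systematically too high. The sketch below illustrates this recalibration approach on simulated, hypothetical data; numpy and statsmodels are assumed to be available, and the code is an illustration rather than the analysis code of this study.

```python
# Minimal sketch (not the analysis code of this study): estimating the
# calibration slope and calibration intercept of external predictions by
# logistic recalibration. The validation data are simulated and hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
lp_true = rng.normal(0.0, 1.5, size=500)                  # true linear predictor
y = rng.binomial(1, 1 / (1 + np.exp(-lp_true)))           # observed outcomes
p_pred = 1 / (1 + np.exp(-(1.4 * lp_true + 0.4)))         # too extreme and too high
lp_pred = np.log(p_pred / (1 - p_pred))                   # logit of the predictions

# Calibration slope: fit logit(P(y = 1)) = a + b * logit(p_pred);
# b < 1 indicates predictions that are too extreme
slope_fit = sm.Logit(y, sm.add_constant(lp_pred)).fit(disp=0)
print("calibration slope:", slope_fit.params[1])

# Calibration-in-the-large: intercept-only model with logit(p_pred) as offset;
# a < 0 indicates predictions that are systematically too high
intercept_fit = sm.GLM(y, np.ones_like(lp_pred)[:, None],
                       family=sm.families.Binomial(), offset=lp_pred).fit()
print("calibration intercept:", intercept_fit.params[0])
```

In this simulated example, a slope below 1 and a negative intercept correspond to predictions that are too extreme and, on average, too high.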

Common measures to assess model performance include (1) calibration measures, which study the agreement between observed outcome frequencies and predicted probabilities, (2) discrimination measures, which study the ability of the model to distinguish between patients with different outcomes, and (3) overall performance measures, which incorporate aspects of both calibration and discrimination [16], [17]. Each measure has its own properties. A calibration measure will likely have more power to detect systematically deviating predictions than to detect a change in case-mix, which is expected to affect mainly the discriminative ability.
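As a minimal, self-contained illustration of these three groups of measures, the sketch below computes the c-statistic (discrimination), the Brier score, and Nagelkerke's R2 (overall performance) from an array of predicted probabilities and observed outcomes. The function and variable names are ours, not the paper's, and numpy and scikit-learn are assumed to be available.

```python
# Illustration: one measure from each group, for predicted probabilities p
# and binary outcomes y (hypothetical arrays).
import numpy as np
from sklearn.metrics import roc_auc_score

def performance(y, p):
    """c-statistic (discrimination), Brier score and Nagelkerke R2 (overall)."""
    y = np.asarray(y, float)
    p = np.clip(np.asarray(p, float), 1e-10, 1 - 1e-10)
    n = len(y)
    c = roc_auc_score(y, p)                         # area under the ROC curve
    brier = np.mean((y - p) ** 2)                   # mean squared prediction error
    # Log-likelihood of the predictions and of a null model that assigns the
    # observed event rate to every patient
    ll_model = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    p0 = y.mean()
    ll_null = np.sum(y * np.log(p0) + (1 - y) * np.log(1 - p0))
    r2_cox_snell = 1 - np.exp(2 * (ll_null - ll_model) / n)
    r2_nagelkerke = r2_cox_snell / (1 - np.exp(2 * ll_null / n))
    return {"c-statistic": c, "Brier": brier, "R2 (Nagelkerke)": r2_nagelkerke}

# Example call with simulated data:
rng = np.random.default_rng(0)
p_example = rng.uniform(0.05, 0.95, size=200)
print(performance(rng.binomial(1, p_example), p_example))
```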

Little is known about adequate sample sizes to study model performance in other populations [9], [12], [18]. In particular, samples that are too small may lead to statistically nonsignificant results even when true differences exist. For binary outcomes, the power is determined by the number of events (or nonevents, if less frequent than events), that is, the effective sample size. For instance, a sample with 821 patients may seem adequate, but an outcome frequency of 1.1% implies that the sample contains only nine events [19]. Such a data set provides little power to test differences in model performance. Here, we study the number of events at which relevant differences in model performance can be detected with measures of calibration and discrimination. We used a model that predicts the histology of retroperitoneal lymph nodes in patients treated with chemotherapy for metastatic testicular germ cell cancer [6], [20]. The model was validated in samples that differed in some way from the development data. Evaluations of power were performed with standard formulas for power calculations and with Monte Carlo simulations. We will show that relatively many events are required to obtain reasonable power in external validation studies.
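As an illustration of the simulation approach (a simplified sketch under assumed distributions, not the code used for this study), the function below estimates the power of a calibration-intercept test to detect predictions that are a constant factor too high on the odds scale, for a validation sample with a given expected number of events. The normal distribution of the linear predictor, the event rate, and the library choices (numpy, statsmodels) are all assumptions.

```python
# Simplified Monte Carlo sketch: power of the calibration-intercept test when
# every prediction is too high by a constant factor on the odds scale.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

def power_calibration_intercept(n_events=100, event_rate=0.45,
                                odds_factor=1.5, n_sim=500, alpha=0.05):
    """Fraction of simulated validation samples in which the calibration
    intercept differs significantly from zero (two-sided Wald test)."""
    n = int(round(n_events / event_rate))     # approximate total sample size
    rejections = 0
    for _ in range(n_sim):
        # Hypothetical linear predictor of the development model
        lp = rng.normal(loc=np.log(event_rate / (1 - event_rate)),
                        scale=1.5, size=n)
        # True probabilities: every prediction too high by odds_factor
        p_true = 1 / (1 + np.exp(-(lp - np.log(odds_factor))))
        y = rng.binomial(1, p_true)
        # Calibration-in-the-large: intercept-only logistic model with the
        # development linear predictor as offset
        fit = sm.GLM(y, np.ones((n, 1)), family=sm.families.Binomial(),
                     offset=lp).fit()
        rejections += fit.pvalues[0] < alpha
    return rejections / n_sim

print("estimated power:", power_calibration_intercept(n_events=100))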

Section snippets

Prediction model for metastatic testicular germ cell cancer

Resected retroperitoneal lymph nodes of patients treated with chemotherapy for metastatic testicular germ cell cancer contain purely benign tissue in about 45% of the operated patients. These patients are operated on unnecessarily. A logistic regression model was constructed to predict the probability of benign tissue [6]. The model contained six predictor variables: three dichotomous variables (normal prechemotherapy levels of the serum tumor markers alpha-fetoprotein [AFP] and human chorionic gonadotropin [hCG]…

Model performance in validation samples

Fig. 1 shows the calibration curves and performance measures of the prediction model for metastatic testicular germ cell cancer in the simulated validation samples, at a very large sample size (n = 100,000). A sample from the same underlying population as the development data (Scenario 0) showed perfect calibration (slope = 1.0, intercept = 0.0), with good discrimination (c-statistic = 0.83) and good overall performance (R2 = 0.41, and Brier score = 0.17). Systematically too high predictions in the…

Discussion

We have shown that substantial sample sizes and numbers of events are required to detect relevant decreases in model performance in external validation samples. We studied several measures for calibration and discrimination. A model showing systematically too high predictions in the validation sample could be best detected with a test for the intercept of the calibration line. This measure showed far more power than the Hosmer-Lemeshow goodness-of-fit statistic. The U-statistic tests the values…
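For reference, the following is a minimal sketch of the Hosmer-Lemeshow goodness-of-fit statistic as it could be applied to external validation data; the grouping into tenths of predicted risk and the use of numpy and scipy are assumptions for illustration, and the code is not the analysis code of this study.

```python
# Minimal sketch: Hosmer-Lemeshow statistic on external validation data, with
# hypothetical outcomes y and predicted probabilities p.
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, n_groups=10):
    """Chi-square statistic and p-value over groups of predicted risk."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    order = np.argsort(p)
    stat = 0.0
    for g in np.array_split(order, n_groups):        # (nearly) equal-sized groups
        observed, expected, n_g = y[g].sum(), p[g].sum(), len(g)
        stat += (observed - expected) ** 2 / (expected * (1 - expected / n_g))
    # In external validation data the statistic can be referred to a chi-square
    # distribution with n_groups degrees of freedom
    return stat, chi2.sf(stat, df=n_groups)
```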

Acknowledgments

This manuscript was substantially improved by the thoughtful comments of a reviewer. Yvonne Vergouwe was supported by The Netherlands Organization for Scientific Research. Ewout Steyerberg was supported by a fellowship from the Royal Netherlands Academy of Arts and Sciences.

References (39)

  • R.R. Picard et al. Data splitting. Am Stat (1990)
  • M.E. Miller et al. Validation of probabilistic predictions. Med Decis Making (1993)
  • E.W. Steyerberg et al. Prediction of residual retroperitoneal mass histology following chemotherapy for metastatic nonseminomatous germ cell tumor: multivariate analysis of individual patient data from six study groups. J Clin Oncol (1995)
  • H.C. van Houwelingen et al. Construction, validation and updating of a prognostic model for kidney graft survival. Stat Med (1995)
  • P. Krijnen et al. A clinical prediction rule for renal artery stenosis. Ann Intern Med (1998)
  • D.G. Altman et al. What do we mean by validating a prognostic model? Stat Med (2000)
  • E.W. Steyerberg et al. Validation and updating of predictive logistic regression models: a study on sample size and shrinkage. Stat Med (2004)
  • D.R. Cox. Two further applications of a model for binary regression. Biometrika (1958)
  • J.B. Copas. Regression, prediction and shrinkage. J R Stat Soc B (1983)