Objectives: To compare implications of using the logistic EuroSCORE and a locally derived model when analysing individual surgeon mortality outcomes.
Design: Retrospective analysis of prospectively collected data.
Setting: All NHS hospitals undertaking adult cardiac surgery in northwest England.
Patients: 14 637 consecutive patients, April 2002 to March 2005.
Main outcome measures: We have compared the predictive ability of the logistic EuroSCORE (uncalibrated), the logistic EuroSCORE calibrated for contemporary performance and a locally derived logistic regression model. We have used each to create risk-adjusted individual surgeon mortality funnel plots to demonstrate high mortality outcomes.
Results: There were 458 (3.1%) deaths. The expected mortality and area under the receiver operating characteristic (ROC) curve were: uncalibrated EuroSCORE, 5.8% and 0.80; calibrated EuroSCORE, 3.1% and 0.80; locally derived model, 3.1% and 0.82. The uncalibrated EuroSCORE plot showed one surgeon to have mortality above the northwest average, and no surgeon above the 95% control limit (CL). The calibrated EuroSCORE plot and the local model showed little change in surgeon ranking, but significant differences in identifying high mortality outcomes. Two of three surgeons above the 95% CL using the calibrated EuroSCORE revert to acceptable outcomes when the local model is applied, but the finding is critically dependent on the calibration coefficient.
Conclusions: The uncalibrated EuroSCORE significantly overpredicted mortality and is not recommended. Instead, the EuroSCORE should be calibrated for contemporary performance. The differences demonstrated in defining high mortality outcomes when using a model built for purpose suggest that the choice of risk model is important when analysing surgeon mortality outcomes.
Results of cardiac surgical operations are now subject to close scrutiny in the United Kingdom. Mortality rates of individual surgeons have been published by the media,1 through peer review publication,2 on a Healthcare Commission website and through hospital internet sites.3 Supporters of this initiative believe that publication will help to improve quality, enable patient choice and provide reassurance that no hospital or surgeon has excessive mortality.4
Various surgical factors are important determinants of operative mortality, including the type of operation and various patient risk factors.5 Unless these factors are taken into consideration it is possible to draw false conclusions from the comparison of mortality rates, and any initiative that examines surgical outcomes without adjusting for predicted risk may lead the surgical community to turn down high-risk patients who might otherwise benefit from surgery.6–10 It has been clearly shown that the proportion of high-risk cases performed by individual surgeons can differ markedly, meaning that the potential pitfalls of using non-risk-adjusted data are real.11
The most commonly used tool to adjust for operative risk is the EuroSCORE, which has two variants: a simple additive EuroSCORE and a more complex logistic version.12 13 The additive EuroSCORE has been shown to overpredict operative risk and shows poor predictive ability in higher-risk patients.11 14 15 The logistic EuroSCORE has also been shown to overpredict observed mortality but has better predictive ability for high-risk patients.16 The logistic EuroSCORE has been used for risk adjustment in recent analyses of surgical outcomes by the Healthcare Commission in the UK.3 This type of analysis will become increasingly important when the recommendations of the Department of Health white paper “Trust, assurance and safety”, on using clinical outcomes data for professional recertification, are implemented.17
It is not clear what influence the selection of different risk-adjustment models has on the analysis of individual surgical outcomes. To study this in more detail we have used a large regional patient database to derive the “best” statistical model possible based on local data. We have then compared surgeon-specific mortality analyses using crude mortality, the logistic EuroSCORE in its original form, the logistic EuroSCORE calibrated for contemporary performance and our locally derived purpose-built model.
The North West Quality Improvement Programme in Cardiac Interventions (NWQIP) is a regional consortium involving all four NHS hospitals performing adult cardiac surgery and percutaneous coronary interventions in the north west of England (Blackpool Victoria Hospital, Cardiothoracic Centre-Liverpool, Manchester Royal Infirmary and South Manchester University Hospital).18
Data were collected prospectively on 14 637 consecutive patients undergoing adult cardiac surgery between 1 April 2002 and 31 March 2005. Each patient had a dataset collected, which included preoperative and operative variables, to enable a predicted mortality to be calculated. Data were collected in each institution and returned to a central source for analysis. Validation of data was conducted in each centre. Mortality was defined as any in-hospital death.
Design of the study
The specific questions we addressed were:
Are there differences in the conclusions drawn when using the various models for analysing individual surgeon data?
Is there a difference in predictive ability between the models?
Are the significant patient risk factors found in the local purpose-built risk-prediction model different from the risk factors included in the logistic EuroSCORE?
Categorical data are shown as a percentage. Predicted mortality was calculated for each patient by using the logistic EuroSCORE formula.19 If a patient factor necessary to calculate the EuroSCORE was missing in the record, that factor was assumed to be absent (occurred in less than 2% of cases).
A multivariate logistic regression analysis was undertaken on the data, using the forward stepwise technique, to identify independent risk factors for in-hospital mortality.20 Candidate variables were entered into the model if their p value was less than 0.1. The predicted risks of individual patients were rank ordered and divided into 10 groups. Within each group of estimated risk, the number of in-hospital deaths predicted was compared with the number of observed in-hospital deaths and the Hosmer-Lemeshow goodness-of-fit statistic was calculated to assess the calibration of the model.20 The area under the receiver operating characteristic (ROC) curve was calculated to assess the discriminatory ability of both the local model and the EuroSCORE.21
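The two checks described above can be illustrated with a minimal sketch. It is not the SAS code used in the study: the function names and the equal-sized grouping rule are assumptions made for illustration, and the p value (a chi-square test on the number of groups minus two degrees of freedom) is left out.

```python
def auc(risks, died):
    """Area under the ROC curve: the chance that a randomly chosen death
    had a higher predicted risk than a randomly chosen survivor.
    Tied risks count as half."""
    deaths = [r for r, d in zip(risks, died) if d]
    survivors = [r for r, d in zip(risks, died) if not d]
    wins = sum((p > n) + 0.5 * (p == n) for p in deaths for n in survivors)
    return wins / (len(deaths) * len(survivors))


def hosmer_lemeshow(risks, died, groups=10):
    """Hosmer-Lemeshow statistic: patients are ordered by predicted risk,
    split into roughly equal groups, and observed deaths are compared with
    predicted deaths in each group.
    Assumes at least one patient per group and mean risks strictly
    between 0 and 1."""
    pairs = sorted(zip(risks, died), key=lambda pair: pair[0])
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(risk for risk, _ in chunk)
        observed = sum(outcome for _, outcome in chunk)
        mean_risk = expected / size
        statistic += (observed - expected) ** 2 / (size * mean_risk * (1 - mean_risk))
    return statistic
```

A small statistic relative to the chi-square reference distribution gives no evidence of poor calibration, while an area under the curve closer to 1 indicates better discrimination.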
Observed mortality was compared to expected mortality from the logistic EuroSCORE (uncalibrated), a regionally calibrated logistic EuroSCORE and the locally derived logistic model. The regionally calibrated logistic EuroSCORE was derived by dividing the observed percentage mortality by the percentage predicted by the uncalibrated logistic EuroSCORE and this was done in two ways: first, by using the figures published in our most recent analysis of data during the period 2002–4, which produces a calibrated logistic EuroSCORE set at 0.58 of the original.16 Second, we have simply divided the observed mortality by that predicted by logistic EuroSCORE in the current study—that is, 3.1% divided by 5.8% = 0.53.
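The second calibration can be reproduced with a short sketch. The function names are hypothetical, and applying the factor as a simple multiplier to each patient's predicted risk is an assumption about how the calibrated score was used.

```python
def calibration_factor(observed_pct, predicted_pct):
    """Observed percentage mortality divided by the percentage predicted
    by the uncalibrated logistic EuroSCORE."""
    return observed_pct / predicted_pct


def calibrate(predicted_risk, factor):
    """Scale an uncalibrated predicted risk by the regional factor
    (assumed to be applied as a simple multiplier)."""
    return predicted_risk * factor


# Figures from the current study period: 3.1% observed, 5.8% predicted.
factor_2002_5 = calibration_factor(3.1, 5.8)
print(round(factor_2002_5, 2))  # 0.53
```

Multiplying the uncalibrated expected mortality of 5.8% by this factor returns the observed 3.1%, which is what makes the calibrated score match contemporary regional performance on average.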
To assess the difference in risk-factor weightings between models we calculated the odds ratio from the logistic EuroSCORE coefficients and compared these to the odds ratio values of the locally derived model.
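In a logistic model the odds ratio for a risk factor is the exponential of its coefficient, so the comparison can be sketched as follows. The function names are hypothetical and no coefficient values from either model are reproduced here.

```python
import math


def odds_ratio(coefficient):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(coefficient)


def weighting_ratio(local_coefficient, euroscore_coefficient):
    """How many times larger the local model's odds ratio is than the
    EuroSCORE odds ratio for the same risk factor."""
    return odds_ratio(local_coefficient) / odds_ratio(euroscore_coefficient)
```

A ratio above 1 would indicate that the local model weights the factor more heavily than the EuroSCORE, and a ratio below 1 the reverse.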
To assess the effect of using different risk models in identifying high mortality and potential outliers we have used funnel plots with 95% and 99% control limits.22 Risk-adjusted mortality was calculated using the uncalibrated logistic EuroSCORE, the calibrated logistic EuroSCORE and the locally derived model. This method of risk adjustment was developed by Hannan et al,23 and is derived by calculating the ratio of observed to expected mortality for each surgeon. This ratio is then multiplied by the region-wide observed mortality. The result reflects the performance expected had the case-mix of the surgeon been identical to that of the region.
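The adjustment and the control limits can be sketched as below. The surgeon figures are invented for illustration, and the limits use a normal approximation to the binomial, which is an assumption: the funnel plot method cited above may use exact binomial limits instead.

```python
import math


def risk_adjusted_mortality(observed_deaths, expected_deaths, regional_rate):
    """Observed-to-expected ratio multiplied by the region-wide observed
    mortality rate."""
    return observed_deaths / expected_deaths * regional_rate


def funnel_limits(regional_rate, n_cases, z):
    """Approximate control limits around the regional rate for a surgeon
    with n_cases operations (normal approximation, an assumption)."""
    se = math.sqrt(regional_rate * (1 - regional_rate) / n_cases)
    return max(0.0, regional_rate - z * se), regional_rate + z * se


REGIONAL_RATE = 458 / 14637  # deaths / patients reported in this study

# Hypothetical surgeon: 500 cases, 20 deaths, 18.0 deaths expected by the model.
adjusted = risk_adjusted_mortality(20, 18.0, REGIONAL_RATE)
lower95, upper95 = funnel_limits(REGIONAL_RATE, 500, 1.96)
lower99, upper99 = funnel_limits(REGIONAL_RATE, 500, 2.576)
is_outlier_95 = adjusted > upper95
```

Because the limits narrow as the number of cases grows, the same risk-adjusted rate can lie inside the funnel for a low-volume surgeon and outside it for a high-volume one.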
To examine how different risk-adjustment methods affect rank ordering of surgeons we divided surgeons into quartiles based on observed mortality. We then ranked surgeons using the different risk-adjustment methods and examined the effect each method had on surgeon rank and quartile. For a closer examination of how risk adjustment using a local model would differ from using the logistic EuroSCORE, the surgeon rankings based on the local model and logistic EuroSCORE were correlated using Spearman’s rank correlation.
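The ranking comparison can be illustrated with a small pure-Python sketch. The eight mortality values are invented, and the quartile rule is one reasonable convention rather than the one necessarily used in the analysis.

```python
import math


def ranks(values):
    """Ranks starting at 1 for the smallest value; ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def quartile(rank, n):
    """Quartile (1 to 4) for a 1-based rank among n surgeons."""
    return min(4, int((rank - 1) * 4 // n) + 1)


# Hypothetical risk-adjusted mortality for eight surgeons under two models.
local_model = [0.021, 0.035, 0.028, 0.040, 0.031, 0.025, 0.045, 0.033]
euroscore = [0.023, 0.034, 0.030, 0.041, 0.029, 0.024, 0.047, 0.036]
rho = spearman(local_model, euroscore)
```

A coefficient near 1 means the two models order surgeons almost identically, even when individual surgeons move within or between quartiles.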
All analysis was performed using SAS for Windows Version 8.2.
In-hospital mortality rate and breakdown of operations
There were 458 in-hospital deaths in 14 637 patients giving a mortality of 3.1%. The operation type and in-hospital mortality rates are listed in table 1.
Independent risk factors for in-hospital mortality
The independent risk factors for mortality, along with odds ratios and confidence limits are shown in table 2. The area under the ROC curve for the locally derived multivariate prediction model was 0.82. The Hosmer-Lemeshow goodness-of-fit statistic across groups of risk was not statistically significant (p = 0.33), indicating no evidence for a lack of fit. The logistic regression equation for calculation of predicted risk of in-hospital mortality is shown at the bottom of table 2.
Performance of different risk models
The expected hospital mortality rates from the different risk models are shown in table 3. The area under the ROC curve for both the uncalibrated logistic EuroSCORE and the calibrated logistic EuroSCORE was 0.80.
Difference in risk factors between EuroSCORE and the NWQIP model
Table 4 highlights the differences in risk factors between the logistic EuroSCORE and the NWQIP model and assesses whether the weightings given are different. The EuroSCORE calculations were based on the original risk factor definitions; however, the definitions for a number of risk factors included in the NWQIP model are different. Renal dysfunction in the NWQIP model includes both patients with a serum creatinine >200 μmol/l and patients who are currently receiving dialysis. In the NWQIP model, the risk factor “critical preoperative state” is only present if the patient is in cardiogenic shock. In the EuroSCORE the risk factor “other than isolated CABG” includes any other major cardiac procedure other than, or in addition to, CABG. In the NWQIP model there are two risk factors involving surgery other than isolated CABG. The first risk factor in the model is CABG combined with a valve procedure and the second risk factor is having any other surgery not including CABG, valve repair/replacement or a combined CABG and valve procedure. This risk factor in the local model therefore includes surgery on the thoracic aorta and surgery for post-infarct septal rupture, which are included as risk factors in the EuroSCORE. Any risk factors noted as not in model (NIM) in table 4 were offered to the logistic regression model but were not identified as independent risk factors.
Figure 1 shows observed and risk-adjusted mortality funnel plots with 95% and 99% control limits. The observed in-hospital mortality funnel plot (fig 1A) showed one surgeon outside the upper 99% control limit, and two additional surgeons outside the upper 95% control limit. Two of these three surgeons return to satisfactory outcomes when the local model is used (inside the 95% control limits), and the other remains between the 99% and 95% control limits. The risk-adjusted mortality funnel plot using uncalibrated logistic EuroSCORE (fig 1B) shows all surgeons except one to have mortality below the regional average and no surgeon above the 95% control limit.
The risk-adjusted mortality funnel plot for the local model (fig 1C) and the calibrated logistic EuroSCORE 2002 to 2005 (fig 1D) both show no surgeon to be above the 99% control limit. The calibrated EuroSCORE 2002–5 plot shows three surgeons falling between the 95% and 99% control limit. When the local model is used two of these three surgeons return to satisfactory outcomes. The local model continues to identify a middle-volume surgeon at between 95% and 99% control limits and, in addition, also identifies a low-volume surgeon to be between 95% and 99% control limits. When the funnel plots of the calibrated EuroSCORE 2002 to 2005 (fig 1D) and the calibrated EuroSCORE 2002–4 (fig 1E) are compared it can be seen that two of the three outliers at 95% control limits return to “acceptable performance” because of the more lenient calibration coefficient used.
Rank quartile analysis of crude mortality data showed eight surgeons in the highest mortality quartile. Following risk adjustment two of these surgeons drop out of the highest mortality quartile and are moved to the second highest quartile with two surgeons moving in the other direction. The logistic EuroSCORE and the locally derived model defined the same eight surgeons in both the highest mortality and the lowest mortality quartiles; however there was some movement of surgeons within the quartiles. Figure 2 shows that the local model and the logistic EuroSCORE were highly correlated with respect to surgeon ranking.
Statement of principal findings
This study has shown that it is possible to derive a local risk-adjustment model built for purpose that has better predictive ability than the logistic EuroSCORE. There are a number of risk factors in the local model that are not included in the EuroSCORE and vice versa. In addition, a number of the significant risk factors have different weightings in the locally derived model than in the EuroSCORE. Analysis of surgical outcomes using funnel plot methodology has shown that there are outliers on crude mortality plots that revert to acceptable outcomes following appropriate risk adjustment. Use of an uncalibrated logistic EuroSCORE plot gives false reassurance about outcomes and should not be used. On the whole the conclusions reached when using either the calibrated logistic EuroSCORE or the purpose-built locally derived model are similar and the correlation between individual surgeon ranking is strong. However, which surgeons are defined as showing “outlying” performance is critically dependent on the exact time period used to calibrate the EuroSCORE, as there have been ongoing improvements in overall mortality with time.
Strengths and weaknesses of the study
Our study has been conducted on a large patient population undergoing surgery over three years in northwest England and includes all patients undergoing adult cardiac surgery in NHS hospitals equating to about one eighth of all cardiac surgical activity in the UK. We have shown previously that outcomes in the north west are similar to those seen nationally.16 The data have the confidence of clinicians and are locally validated at each institution but have not been externally validated, which is a weakness of the study. The size of the study sample is large and so we think our findings are robust.
Strengths and weaknesses compared to other studies
Risk stratification is important when comparing outcomes in cardiac surgery. It has been shown that the predicted mortality differs significantly between surgeons owing to the number of high-risk cases taken on by each surgeon.11 If published outcomes are compared using only crude mortality this may lead to high-risk cases being turned down to protect results.6–10 This study shows clearly that non-adjusted analyses may be misleading: the one surgeon falling outside 99% control limits on the “crude” mortality plot falls within the 99% control limits following adjustment with either the calibrated (2002–5) or locally derived model.
The first risk prediction tool to be widely used in cardiac surgery was developed in the 1980s.24 It was used to compare risk-adjusted outcomes but was subsequently shown to significantly overpredict observed mortality.5 25 The EuroSCORE was created to produce an objective model based on a wide patient population. It was developed using a European database of 19 030 consecutive patients undergoing cardiac surgery at 128 surgical centres in eight European countries throughout the 1990s.12 Data were collected on 68 preoperative and 29 objective operative risk factors and then analysed comparing the risk factors to patient mortality outcome using logistic regression.
The EuroSCORE was first used in 1999 as a simple additive model and has since been extensively studied.12 14 The studies have shown that the additive EuroSCORE overestimates mortality in low-risk patients and underestimates mortality in high-risk patients as well as underestimating mortality in combined CABG and valve procedures.11 14 15 To try to improve the performance of the additive EuroSCORE, the logistic EuroSCORE model was recommended.13 26 However recent studies have shown the logistic EuroSCORE to overpredict observed mortality in all operative groups and in both high-risk and low-risk patients.16 27
We are not aware of previous studies analysing the implication of using different models on defining outlying performance.
Meaning of the study
This study defines in more detail some of the limitations of the logistic EuroSCORE. A number of risk factors such as unstable angina and recent myocardial infarction, which are included as significant factors in the EuroSCORE, are not shown to be significant risk factors in our model, which probably reflects different models of care with increased use of routine interventional strategies for acute coronary syndromes and better overall outcomes for these patients in the modern era. Moderate left ventricular dysfunction is also included in the EuroSCORE but not in our model and this again probably indicates improved care for this group of patients either with better myocardial protection or off-pump surgery. Age as a risk factor has a lower weighting in our model than the EuroSCORE; for example, an otherwise fit 83-year-old patient for isolated first-time coronary surgery would have a predicted mortality using the original logistic EuroSCORE of 4.2%, but using the locally derived model the predicted mortality is now 1.8%. This is probably the major reason why the EuroSCORE has “drifted” and now requires recalibration.
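The EuroSCORE side of the age example can be checked with a short sketch. The intercept, the age coefficient and the age-unit rule are taken from the published logistic EuroSCORE as best recalled and should be verified against the original formula (reference 19). The patient is assumed to be male with no other risk factors. The local model's coefficients are not reproduced in this section, so the 1.8% figure is not recomputed.

```python
import math

BETA_0 = -4.789594    # intercept (assumed; verify against the published formula)
BETA_AGE = 0.0666354  # coefficient per age unit (assumed; verify likewise)


def age_units(age):
    """1 for patients under 60, then one extra unit per year from 60 onwards."""
    return 1 if age < 60 else age - 58


def predicted_mortality(age, other_terms=0.0):
    """Logistic EuroSCORE predicted mortality for a patient whose remaining
    risk-factor terms sum to other_terms (0.0 means no other risk factors)."""
    logit = BETA_0 + BETA_AGE * age_units(age) + other_terms
    return 1 / (1 + math.exp(-logit))


print(round(predicted_mortality(83) * 100, 1))  # 4.2
```

Under these assumptions the sketch reproduces the 4.2% quoted for the original logistic EuroSCORE, which illustrates how much of that prediction is driven by age alone.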
This study has shown that if you risk adjust using the calibrated EuroSCORE or the locally derived model then different surgeons may be identified as outliers at 95% control limits. Being defined as an outlier on a funnel plot does not mean that a surgeon is “performing badly”. Issues around multiple comparisons mean that you would expect at least one surgeon out of 31 to have outlying performance at 95% control limits by chance alone. However being identified as an outlier may have a significant impact on an individual surgeon and should lead to a deeper investigation of practice involving case-mix review and a more comprehensive analysis of the workplace and associated issues. In this case, using 95% control limits, two surgeons would avoid further analysis and one surgeon would require further analysis if a model built for purpose was used instead of the calibrated logistic EuroSCORE.
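The multiple-comparisons point can be made concrete with a little arithmetic. The sketch assumes that all 31 surgeons perform at the regional average, that their results are independent, and that falling outside either the upper or the lower limit counts as outlying.

```python
def expected_outliers(n_surgeons, coverage):
    """Expected number of surgeons outside the control limits by chance."""
    return n_surgeons * (1 - coverage)


def prob_at_least_one(n_surgeons, coverage):
    """Probability that at least one surgeon falls outside the limits by
    chance, assuming independent results."""
    return 1 - coverage ** n_surgeons


chance_count = expected_outliers(31, 0.95)       # about 1.55 surgeons
chance_any = prob_at_least_one(31, 0.95)         # roughly 0.8
chance_any_99 = prob_at_least_one(31, 0.99)      # noticeably lower at 99% limits
```

On these assumptions roughly one or two surgeons would be expected to lie outside 95% limits even if every surgeon were performing identically, which is why an outlying point should prompt investigation rather than a conclusion.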
Previous published work from our group clearly showed that the logistic EuroSCORE overpredicts observed mortality on a dataset collected in 2002 to 2004 and suggested a calibration factor of 0.58 was necessary to adjust the logistic EuroSCORE for contemporary performance.16 However, when assessing the calibration factor for the current study period (2002–5), we have seen that the logistic EuroSCORE requires a calibration factor of 0.53, which shows that there are ongoing improvements in the quality of surgical outcomes in the north west of England.28 These differences are small, but lead to marked differences in the surgeons defined as exhibiting outlying performance at either 95% or 99% control limits. This has potentially important implications for public reporting of outcomes or use in professional revalidation, and emphasises the need to ensure risk models are appropriately calibrated at all times to ensure accurate contemporary peer-group benchmarking.
The benefits of the locally derived model are that it is built for purpose. The advantage of the EuroSCORE is that it is widely used throughout the UK and beyond, and the tools necessary to derive its score are easily available.19 This study has shown that the EuroSCORE is a good risk adjustment model as long as it is accurately calibrated for contemporary practice but it also demonstrates that a model built for purpose would have advantages over the calibrated EuroSCORE.
If the facilities to create a local or national risk-adjustment model built for purpose were available then the application of such a model would provide a more accurate risk-adjustment model than the calibrated EuroSCORE and may lead to different conclusions about surgeons with high mortality outcomes.
Unanswered questions and future research
This study has shown clearly that crude mortality analyses, or those involving the non-calibrated EuroSCORE, may be misleading. All the analyses we have performed have been based on mortality, which occurs in only a small proportion of cases. It is likely that analysis of other outcomes such as stroke rate, re-exploration for bleeding, blood and blood product use, hospital stay, critical care stay and the incidence of hospital-acquired infection would be useful to enable surgeons to benchmark their outcomes against their peers and stimulate further improvements in quality. It is probable that these analyses would also require risk adjustment similar to that we have shown.
This study has been conducted on behalf of the North West Quality Improvement Programme in Cardiac Interventions, and the consultant surgeons involved are listed as follows: John Au, Ben Bridgewater, Colin Campbell, John Carey, John Chalmers, Walid Dhimis, Abdul Deiraniya, Andrew Duncan, Brian Fabri, Elaine Griffiths, Geir Grotte, Ragheb Hasan, Tim Hooper, Mark Jones, Daniel Keenan, Neeraj Mediratta, Russell Millner, Nick Odom, Brian Prendergast, Mark Pullan, Abbas Rashid, Franco Sogliani, Paul Waterworth, Nizar Yonan. The following surgeons have left the collaboration during the study period: Narinda Bhatnagar, Albert Fagan, Bob Lawson, Udin Nkere, Peter O’Keefe, Richard Page, Ian Weir and David Sharpe.
Competing interests: BB is a Society of Cardiothoracic Surgeons of GB and Ireland representative on the joint Society of Cardiothoracic Surgeons, Healthcare Commission, Department of Health group defining national cardiac surgical audit and is on the executive committee of the SCTS. BB, AG, GJG, BF and MJ are all members of the steering group of the Northwest Regional Quality Improvement Programme in Cardiac Interventions.
Funding: Funding for the North West Quality Improvement Programme in Cardiac Interventions collaboration has been received from all primary care trusts in the north west of England. All authors were independent of the funding.