Objective To compare the accuracy of data from hospital administration databases and a national clinical cardiac surgery database and to compare the performance of the Dutch hospital standardised mortality ratio (HSMR) method and the logistic European System for Cardiac Operative Risk Evaluation, for the purpose of benchmarking of mortality across hospitals.
Methods Information on all patients undergoing cardiac surgery between 1 January 2007 and 31 December 2010 in 10 centres was extracted from The Netherlands Association for Cardio-Thoracic Surgery database and the Hospital Discharge Registry. The number of cardiac surgery interventions was compared between both databases. The European System for Cardiac Operative Risk Evaluation and hospital standardised mortality ratio models were updated in the study population and compared using the C-statistic, calibration plots and the Brier-score.
Results The number of cardiac surgery interventions performed could not be assessed using the administrative database as the intervention code was incorrect in 1.4–26.3%, depending on the type of intervention. In 7.3% no intervention code was registered. The updated administrative model was inferior to the updated clinical model with respect to discrimination (c-statistic of 0.77 vs 0.85, p<0.001) and calibration (Brier Score of 2.8% vs 2.6%, p<0.001, maximum score 3.0%). Two average performing hospitals according to the clinical model became outliers when benchmarking was performed using the administrative model.
Conclusions In cardiac surgery, administrative data are less suitable than clinical data for the purpose of benchmarking. The use of either administrative or clinical risk-adjustment models can affect the outlier status of hospitals. Risk-adjustment models including procedure-specific clinical risk factors are recommended.
A valid comparison of outcomes between hospitals or healthcare providers (benchmarking) requires adjustment for severity of the health condition of patients and the performed interventions, often referred to as case-mix differences.1–3 For this purpose prediction models have been developed to estimate risk-adjusted outcomes across hospitals. Most of these models are based on routinely collected administrative hospital data. For example, the hospital standardised mortality ratio (HSMR), first developed by Jarman in 1999 for the UK, is a risk-adjusted mortality rate calculated using prediction models based on administrative data.4 Because administrative data are collected for other purposes, they are easily available, and thus the use of these data for benchmarking is cheap and requires relatively little extra effort.
However, administrative databases are often criticised for being inaccurate, incomplete and containing limited information.5–9 As a consequence, comparisons of risk-adjusted outcome rates between healthcare providers that are based on administrative database data might be unreliable, leading to unjustified criticism. For that reason clinical databases with corresponding clinical prediction models have been developed (eg, European System for Cardiac Operative Risk Evaluation (EuroSCORE) and Society of Thoracic Surgeons risk models in cardiac surgery) that include multiple clinical predictors for mortality.10–12 The EuroSCORE is a prediction model that was specifically designed to predict the risk of operative mortality related to cardiac surgery using 18 demographic and risk factors. The EuroSCORE can thus be used to adjust for differences in case mix in the comparison between healthcare providers. Models based on clinical risk factors are claimed to have a better predictive performance, resulting in improved risk adjustment, and to enable valid comparison of outcomes across centres.5–7,13 The downside is that clinical databases are more expensive; they comprise information that is obtained by active data collection by dedicated individuals and thus require continuous maintenance. Previous studies have not reached a conclusive answer to the question of whether clinical risk factors are necessary for adequate risk adjustment. Some concluded that administrative data are sufficient to enable benchmarking, whereas others showed a clear inferiority and insufficiency when compared with clinical data.6–8,13–19
The aim of our study was to analyse whether a risk adjustment model based on administrative data allows for adequate benchmarking in cardiac surgery. Using a nationwide cohort of cardiac surgery patients, we assessed the accuracy of an administrative database and the predictive performance of administrative models in comparison with a clinical database and the clinical EuroSCORE model.20
EuroSCORE and administrative variables of a national cohort of cardiac surgery patients in The Netherlands have been collected in two separate databases: (1) The adult national cardiac surgery database of the Netherlands Association of Thoracic Surgery (NVT) and (2) The National Hospital Discharge Registry (HDR) of The Netherlands.20–22
The adult national cardiac surgery database of the Netherlands Association of Thoracic Surgery
This clinical database has national coverage, with participation of all 16 centres performing cardiac surgery in The Netherlands.20 All patients undergoing cardiac surgery, excluding transcatheter aortic valve implantation, circulatory assist devices and pacemakers, are included in the database. Ten out of 16 cardiac centres participated in our study, in which 34 229 consecutive procedures were performed between 1 January 2007 and 31 December 2010. Procedures with incomplete data were excluded (N=218, 0.6%), resulting in 34 011 procedures for further analyses. The dataset consisted of predictors for mortality as listed in table 1, defined according to the EuroSCORE.10 The EuroSCORE was developed to estimate the operative risk of mortality related to cardiac surgery (within 30 days and/or during the same hospital admission).11 In this study, the EuroSCORE was used to estimate the risk of inhospital mortality.
The Hospital Discharge Registry
The HDR contains administrative data of all 10 hospitals included in this study. The dataset consists of patient characteristics and admission details such as age, comorbidity, sex and urgency of admission. Interventions are coded using the International Classification of Health Interventions and diagnoses using the International Classification of Diseases, 9th revision (ICD-9).23 The Dutch HSMR method is based on the HDR database and uses 50 risk-adjustment models, each for one specific group of diagnoses. The models estimate the risk of mortality for patients with a diagnosis belonging to the specific diagnosis group.21
Linkage of datasets
In order to compare the HDR and the NVT databases and the models based on them, information on cardiac surgery interventions was required from both databases. Therefore, the HDR and NVT databases were linked to identify similar records. The HDR and NVT databases contain anonymised data, meaning no directly identifying information is stored. Records from both databases were linked to the municipal registries based on date of birth, gender and zip code, and were subsequently linked to each other. The linkage was performed by Statistics Netherlands and is described in previous publications.20,24,25 The linkage of datasets is illustrated in the flow chart shown in figure 1. In total 26 178 (77%) records from the NVT database could be linked to a record in the HDR database and were used for further analyses. The predicted mortality according to the logistic EuroSCORE did not differ between the linked and the non-linked population (median 3.7%). Reasons for failed linkage were: the HDR record could not be linked to the municipal registries or no HDR record existed for the specific intervention (18.7%), the NVT record could not be linked to the municipal registries (2.7%) or no administrative model was available for the record (1.6%). The linkage of the HDR database to the municipal registries caused most linkage failure, as only four digits (out of six) of the zip code were available in the HDR database.
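The effect of a coarser zip code on key-based linkage can be illustrated with a minimal sketch. This is not the procedure used by Statistics Netherlands: it links two toy sources directly rather than through the municipal registries, and the function name, field names and records are all hypothetical. A link is accepted only when the key is unique in both sources.

```python
from collections import defaultdict


def link_records(left, right, key_fields):
    """Link records whose key is unique in both sources (hypothetical helper)."""
    def index(records):
        buckets = defaultdict(list)
        for rec in records:
            buckets[tuple(rec[f] for f in key_fields)].append(rec)
        return buckets

    left_idx, right_idx = index(left), index(right)
    links = []
    for key, left_recs in left_idx.items():
        right_recs = right_idx.get(key, [])
        # Accept only unambiguous one-to-one matches.
        if len(left_recs) == 1 and len(right_recs) == 1:
            links.append((left_recs[0]["id"], right_recs[0]["id"]))
    return links


def truncate_zip(records, n_chars=4):
    """Mimic a source that stores only the first part of the zip code."""
    return [{**rec, "zip": rec["zip"][:n_chars]} for rec in records]


# Invented records: same date of birth and sex, zip codes differing only in the suffix.
clinical = [
    {"id": "N1", "dob": "1940-05-01", "sex": "M", "zip": "3584CX"},
    {"id": "N2", "dob": "1940-05-01", "sex": "M", "zip": "3584AB"},
]
registry = [
    {"id": "R1", "dob": "1940-05-01", "sex": "M", "zip": "3584CX"},
    {"id": "R2", "dob": "1940-05-01", "sex": "M", "zip": "3584AB"},
]

keys = ["dob", "sex", "zip"]
full_links = link_records(clinical, registry, keys)
coarse_links = link_records(truncate_zip(clinical), truncate_zip(registry), keys)
```

With the full zip code both toy records link; with the truncated zip code the keys collide and neither record can be linked unambiguously.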
Comparison of data between the NVT and HDR databases: intervention and inhospital mortality
The type of intervention and the outcome inhospital mortality were compared between the registries. Because the NVT and the HDR registries use different risk factors for risk adjustment, these were not compared. The NVT database was used as the reference for the type of intervention, because this information is collected by the surgeons themselves. The HDR database was used as the reference for inhospital mortality, as the date of death is extracted directly from the up-to-date municipal registries. The comparison of inhospital mortality between the two databases was performed on patient level (as opposed to intervention level), to avoid persons being counted multiple times for mortality.
Comparison of risk-adjustment models
The administrative and clinical model
The Dutch HSMR method (models based on administrative data) and the logistic EuroSCORE (model based on clinical data) were applied in their original form to our study population, to predict the risk of inhospital mortality.10,21 These models will subsequently be called Administrative.1 and Clinical.1. Existing risk-adjustment models can be updated to a new study population. Updated models are adjusted to the characteristics of that population and are likely to show improved generalisability.26 There are several methods to update a risk-adjustment model.26 As cardiac surgery interventions are incorporated in multiple Dutch HSMR models (ie, several diagnosis groups), one model for cardiac surgery was constructed using stepwise backward selection based on Akaike's Information Criterion.27 This means that the intercept and the coefficients of all included covariates were estimated again in our study population and only relevant risk factors were included in the updated model. To update the EuroSCORE model, the intercept and the coefficients of all included covariates were also estimated again in our study population. This resulted in the updated models Administrative.2 and Clinical.2. The models can be updated even more thoroughly by inclusion of interaction terms, in order to maximise risk adjustment in our study population. Thus, first-order interaction terms between all covariates were added to the updated models, resulting in the models Administrative.3 and Clinical.3.27
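As an illustration of what updating does, the sketch below shows the simplest form of model updating: re-estimating only the intercept, so that the total expected number of deaths equals the observed number. This is deliberately simpler than the re-estimation of all coefficients described above and is not the authors' procedure; the function names and the toy data are assumptions.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def update_intercept(predicted, observed, tol=1e-10, max_iter=100):
    """Find the shift on the logit scale that makes the sum of predicted
    risks equal the observed number of events (Newton-Raphson iteration)."""
    linear_predictors = [logit(p) for p in predicted]
    target = sum(observed)
    shift = 0.0
    for _ in range(max_iter):
        probs = [expit(lp + shift) for lp in linear_predictors]
        gap = sum(probs) - target
        if abs(gap) < tol:
            break
        slope = sum(p * (1.0 - p) for p in probs)  # derivative of the sum w.r.t. shift
        shift -= gap / slope
    return shift


# Invented example: the original model expects 0.77 deaths, 1 death was observed.
original_risks = [0.02, 0.05, 0.10, 0.20, 0.40]
outcomes = [0, 0, 0, 0, 1]

shift = update_intercept(original_risks, outcomes)
updated_risks = [expit(logit(p) + shift) for p in original_risks]
```

Because only the intercept changes, the ranking of patients (and hence the c-statistic) is unchanged; only the overall level of predicted risk moves.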
Comparison of model performance
The predictive performance of a risk-adjustment model is quantified by means of calibration and discrimination. Discrimination refers to the ability of a model to differentiate between subjects with and without the outcome and depends on the variables included in the model. The discrimination of the models was quantified using the area under the ROC curve, which is equivalent to the c-statistic. The 95% CI of the c-statistic was estimated, and the difference between two c-statistics was tested, using DeLong's method.28
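The c-statistic can be read as the proportion of all (death, survivor) pairs in which the patient who died was assigned the higher predicted risk, with ties counted as half. A minimal sketch of that definition follows; the function name and data are illustrative, and the quadratic loop is meant for clarity rather than for a cohort of this size.

```python
def c_statistic(predicted, observed):
    """Concordance over all event/non-event pairs; ties count as 0.5."""
    events = [p for p, y in zip(predicted, observed) if y == 1]
    non_events = [p for p, y in zip(predicted, observed) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p_event in events:
        for p_non_event in non_events:
            if p_event > p_non_event:
                concordant += 1.0
            elif p_event == p_non_event:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Invented predicted risks and outcomes (1 = died in hospital).
risks = [0.01, 0.02, 0.05, 0.10, 0.30]
died = [0, 0, 1, 0, 1]
c = c_statistic(risks, died)  # 5 of the 6 pairs are concordant
```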
The calibration of a risk model refers to the ability of a model to predict how many patients will have the outcome. The calibration was assessed using calibration plots and the Brier Score. The Brier Score measures model accuracy on patient level as the mean of the squared differences between the predicted and the observed outcome per patient. The method by Redelmeier was used to estimate the 95% CI of the Brier Score and to test the difference between two Brier Scores.29
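The Brier Score is conventionally computed as the mean squared difference between predicted risk and the observed 0/1 outcome, so lower values indicate better predictions. A minimal sketch with invented data:

```python
def brier_score(predicted, observed):
    """Mean squared difference between predicted risk and observed outcome (0 or 1)."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return sum((p - y) ** 2 for p, y in zip(predicted, observed)) / len(predicted)


# Invented example: four patients, one in-hospital death.
risks = [0.1, 0.2, 0.7, 0.0]
died = [0, 0, 1, 0]
score = brier_score(risks, died)  # (0.01 + 0.04 + 0.09 + 0.0) / 4
```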
In this study, benchmarking is performed by calculating the standardised mortality ratio (SMR) for all hospitals. The SMR is calculated by dividing the observed mortality by the expected mortality within a hospital. SMRs of the administrative and clinical models were compared. Centres with an SMR for which the 95% CI did not include the value 1 were considered to be outliers. The 95% CIs of the SMRs were estimated using the method described by Breslow and Day.30
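The outlier rule can be sketched as follows. The text does not specify which Breslow and Day formula was used; this sketch assumes Byar's approximation to the exact Poisson limits for the observed count, which is one of the approaches commonly attributed to that reference. The function names and hospital figures are hypothetical.

```python
import math


def smr_with_ci(observed_deaths, expected_deaths, z=1.96):
    """SMR with an approximate 95% CI (Byar's approximation; an assumption here)."""
    if expected_deaths <= 0:
        raise ValueError("expected_deaths must be positive")
    smr = observed_deaths / expected_deaths
    if observed_deaths == 0:
        lower = 0.0
    else:
        o = observed_deaths
        lower = o * (1 - 1 / (9 * o) - z / (3 * math.sqrt(o))) ** 3 / expected_deaths
    o1 = observed_deaths + 1
    upper = o1 * (1 - 1 / (9 * o1) + z / (3 * math.sqrt(o1))) ** 3 / expected_deaths
    return smr, lower, upper


def is_outlier(lower, upper):
    """A centre is an outlier when its CI does not include 1."""
    return lower > 1.0 or upper < 1.0


# Invented hospitals: expected deaths would be the sum of model-predicted risks.
hospitals = {"X": (40, 20.0), "Y": (22, 20.0)}
results = {}
for name, (obs, exp) in hospitals.items():
    smr, lo, hi = smr_with_ci(obs, exp)
    results[name] = (smr, lo, hi, is_outlier(lo, hi))
```

Because the expected number of deaths comes from the risk model, a change of model changes the denominator and can move a CI across 1, which is how a hospital's outlier status can depend on the choice of model.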
All analyses were performed in R V.2.15.31
Risk factor coding
The risk factors in the linked subset from the administrative and clinical databases are presented in table 1. Mean age was 66.6 years (±10.7) and 29.5% of patients were female. A comparison of the prevalence of risk factors could not be made, as the definitions differed between the administrative and the clinical database.
Number of cardiac interventions performed (by type of intervention)
In total 14 300 (54.6%) isolated CABG procedures were performed according to the NVT database. Other frequently performed interventions were: aortic valve replacement with or without concomitant CABG (12.1% and 8.3%, respectively) and mitral valve repair with or without concomitant CABG (3.1% and 2.7%, respectively). The proportion of isolated CABG, isolated aortic valve replacement, isolated mitral valve repair and isolated mitral valve replacement, which was coded with the correct main intervention code in the HDR ranged from 64.6% to 92.2% (table 2). The intervention code in the HDR was missing in 1923 (7.3%) procedures. As a result, the number of cardiac surgery interventions could not be accurately assessed using HDR data.
Inhospital mortality in the HDR database is derived from the municipal registries, which are highly accurate. In the NVT database 42 of 762 (5.5%) patients who died during hospital stay were not coded as such; conversely, 36 of 25 005 (0.1%) survivors were incorrectly coded as having died during the same hospital admission.
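The two percentages have different denominators (deaths and survivors, respectively), which matters when they are later summarised. The arithmetic can be checked with the counts reported above; the overall rate assumes that the 762 deaths and 25 005 survivors together make up the whole patient-level cohort, which the text implies but does not state.

```python
missed_deaths, total_deaths = 42, 762          # deaths not coded as such in the NVT
false_deaths, total_survivors = 36, 25005      # survivors coded as inhospital deaths

missed_death_rate = missed_deaths / total_deaths        # share of deaths missed
false_death_rate = false_deaths / total_survivors       # share of survivors miscoded

# Assumption: deaths + survivors form the full patient-level cohort.
overall_misclassification = (missed_deaths + false_deaths) / (total_deaths + total_survivors)

print(f"{100 * missed_death_rate:.1f}% of deaths missed")
print(f"{100 * false_death_rate:.1f}% of survivors miscoded")
print(f"{100 * overall_misclassification:.1f}% of all patients misclassified")
```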
Calibration of the administrative models and the clinical models
Calibration of the risk models is shown in figure 2. The original models (Administrative.1 and Clinical.1) were poorly calibrated. Administrative.1 underestimated the risk of mortality, whereas Clinical.1 overestimated the risk of mortality. Updating improved calibration of both models, as the difference between observed and predicted mortality became smaller. However, in all model pairs the Brier Score for the administrative models remained significantly higher in comparison with the clinical models, indicating inferior calibration of the administrative model (table 3). The maximum Brier Score in these data was 3.0%. Rescaling of the Brier Score on a scale from 0% to 100% would result in a score of 93.8% for Administrative.3 and 87.8% for Clinical.3.
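The rescaling can be sketched under one common assumption: that the maximum Brier Score is the score of a non-informative model that assigns every patient the overall event rate, which equals rate × (1 − rate). The text does not state how the maximum was derived, so the published rescaled values may rest on unrounded inputs or a different definition. The function names and numbers below are illustrative only.

```python
def max_brier(event_rate):
    """Brier Score of a model predicting the overall event rate for everyone
    (assumed definition of the maximum)."""
    return event_rate * (1.0 - event_rate)


def rescaled_brier(brier, event_rate):
    """Brier Score as a fraction of the assumed maximum (1.0 = non-informative)."""
    return brier / max_brier(event_rate)


# Invented example: 10% event rate, so the assumed maximum is 0.09.
example = rescaled_brier(0.072, 0.10)  # about 0.8, i.e. 80% of the maximum
```

On this scale a value close to 100% means the model does little better than predicting the average mortality for every patient.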
Discrimination of the administrative models and the clinical models
Discrimination of the models is shown in figure 3. The c-statistics of the administrative models (0.756–0.788) are substantially lower than that of the clinical models (0.838–0.846), indicating inferior discrimination of the administrative models (p<0.001 for all three model pairs). Updating of the administrative model did not improve the discrimination (figure 3).
The effect on benchmarking
The effect of the use of administrative versus clinical models on benchmarking is shown in figure 4. The majority of SMRs calculated using the original administrative model was higher than 1, which indicates that the model underestimated the risk of mortality. For the original clinical model the opposite was found: the model overestimated the risk of mortality.
Updating of models resulted in better predictions on hospital level (SMRs closer to 1). However, a considerable difference was found between the updated administrative and the updated clinical models, for example in hospital B and hospital C (figure 4). The mean difference in SMR was 1.13 (range 0.23–2.08) for Administrative.1 versus Clinical.1, 0.12 (range 0.004–0.37) for Administrative.2 versus Clinical.2, and 0.11 (range 0.001–0.43) for Administrative.3 versus Clinical.3.
The SMRs calculated using the clinical and administrative models yielded different outliers. Hospital C and hospital J changed outlier status depending on whether the updated model Administrative.3 or Clinical.3 was used. The analyses using only isolated CABG surgery yielded results comparable to those based on all cardiac surgery data (figure 4).
This study compared (1) data accuracy in the administrative HDR database to that in the clinical cardiac surgery database of the Netherlands Association of Cardio-Thoracic Surgery (NVT) and (2) the predictive performance of administrative models to that of the clinical EuroSCORE model.
The reported intervention code in the administrative database was incorrect in up to 26%, depending on the type of surgery. As a result, the number of cardiac surgery interventions could not be accurately assessed.
After updating of the models to our data, the calibration of the administrative model was inferior to that of the clinical model. The importance of this shortcoming is marked by the identification of other outliers when used for benchmarking of hospitals.
Why models based on administrative data have inferior calibration and discrimination
When developing a risk prediction model, the first logical step is to consider which variables could be predictors for the outcome. However, administrative models are limited to the routinely collected variables, which might not necessarily be the strongest predictors. In our study, several strong predictors for mortality (shown in table 1) were not available in the administrative database. Conversely, administrative risk factors that were strongly associated with mortality had a low prevalence in our study population. This is likely to have affected the calibration and discrimination of the administrative models. Previous studies reported that much of the predictive performance of risk models is derived from a relatively small number of clinical variables and that the predictive performance of administrative models could be improved with the addition of a limited number of clinical variables.7,13,19,32,33
Why administrative data are inferior to clinical data for benchmarking purposes
The requirements of a risk-adjustment model depend on its goal. For benchmarking an adequate calibration is required: the model should adequately predict the expected mortality rate in a hospital. It can be seen as a scale that should weigh correctly. The performance of a scale mainly depends on its ability to weigh a (kilo)gram. If this feature is adequate, but the reading is offset, the scale can be reset to zero to adjust it to any new situation. Similarly, the performance of a model depends on the strength of the predictors in the model (ie, discrimination), as the model can be recalibrated to update it in time or to make it suitable for a new population. It follows that the inferior discrimination of administrative models (in comparison with clinical models) will result in inferior calibration. This study shows that this could very well affect the outlier status of a hospital.
Other issues in the use of administrative data
There are other reasons why the HDR database with routinely collected data turned out to be unsuitable for analyses of outcomes in cardiac surgery. First, for a considerable number of records in our study population the intervention code was incorrect, unspecified (eg, “cardiac surgery”) or missing. Consequently, the number of cardiac surgery interventions performed could not be reliably assessed. Previous studies have also reported discrepant counts of operations in administrative data versus clinical data.8,17,34
Inaccurate coding could be attributed to the fact that data were collected by persons who were not actively involved in the clinical care and thus were dissociated from clinical information that could be necessary for correct reporting of data.35 In addition, occasionally not all interventions and diagnoses are recorded. Also, admission and discharge dates are collected, instead of dates of intervention. This has been reported before as an important reason for variance in cardiac surgery volumes between administrative and clinical databases.17
Furthermore, the HSMR method uses administrative models for specific diagnosis codes. However, in cardiac surgery the analysis of outcomes is performed by intervention type, as risk is considered to be mainly related to the performed intervention.
Implications for practice
The use of administrative data has many advantages over the use of clinical data. The data are routinely collected and stored, making them cheaper and readily available. However, the apparent benefits should be carefully weighed against the limitations and drawbacks of administrative models, when compared with clinical models.
Public benchmarking in general can be dangerous in the sense that the general public cannot be expected to understand the limitations and the prerequisites under which the results should be interpreted. The limitations are more pronounced for administrative data. This is particularly important because benchmarking could have far-reaching consequences when known to healthcare consumers, the media, health insurance companies or governmental bodies. In this context, development of models with a high predictive performance, which might include clinical risk factors, should be strived for at all times.36 If clinical data are already collected, their availability for benchmarking should be encouraged.
On the other hand, clinical data appeared to have an evident weakness as well. In the clinical database used in this study, inhospital mortality was not recorded for 5.5% of the patients who died during hospital stay. For outcomes such as vital status and readmissions, administrative databases were highly accurate, as information was derived from municipal registries. Administrative data sources could be used to verify outcomes data, thus complementing clinical databases. In this way, the strengths of both types of data are combined in order to optimise benchmarking in healthcare.37
The findings in this study are likely to hold true for populations other than cardiac surgery patients and in other countries in the world. Most probably, other specific surgical interventions, such as for example oesophageal or hepatobiliary surgery, also require adjustment for risk factors not commonly included in administrative databases. Consequently, benchmarking in those populations will result in similar issues as encountered in this study.
First, these analyses were based on data from 10 out of 16 cardiac surgery centres in The Netherlands. In general, the population of the six hospitals not participating in this study did not differ from the study population with regard to age, sex and the median logistic EuroSCORE. However, it is unknown whether the results with regard to data accuracy are generalisable to all centres.
Second, the sensitivity of the linkage between the clinical and the administrative database was 77%. Although we did not find a difference in the overall risk profile between the linked and non-linked records, we do acknowledge that a substantial part of the total population was excluded from the analyses. We have no reason to believe that administrative models would perform any differently in the non-linked records or that data accuracy was better in the non-linked records. The conclusions of our study are thus unlikely to be affected by this limitation.
The goal of this study was to assess the accuracy of administrative data and the predictive performance of the accompanying models. As such, it was not our intention to design a new model for risk prediction in cardiac surgery. Thus, we chose to stay in line with the methods used to construct the original models and refrain from further sophisticated methods such as hierarchical modelling and shrinkage of coefficients.
The outcome in this study is inhospital mortality. Several publications have previously shown why mortality at fixed time intervals is a more appropriate measure in outcomes evaluation. We acknowledge the limitations of this outcome and we are aware that mortality is one of the several indicators that can be used to measure quality, but certainly not the only one. For the purpose of our study, we have no reason to believe this has affected our results, as the clinical and the administrative models were fitted on this outcome.
Although there are advantages to the use of administrative models for benchmarking in cardiac surgery, their calibration and discrimination (and thus performance in benchmarking) is inferior to that of clinical models. The use of either an administrative or a clinical model may affect the outlier status of hospitals. Therefore, in specific populations such as cardiac surgery, the use of prediction models including clinical risk factors is recommended.
What is already known about this subject
Administrative data are inexpensive to collect and easily accessible, but the accuracy of coding remains questionable and they are known to contain limited information on patient condition and severity of disease.
Clinical data do not have most of these problems and have a good capability of predicting outcomes after cardiac surgery.
The hospital standardised mortality ratio (HSMR) is an increasingly used method for healthcare benchmarking using administrative data.
What does this study add
The number of cardiac surgery interventions performed could not be accurately assessed in routinely collected administrative data.
For cardiac surgery, administrative models (developed according to the Dutch HSMR method) have inferior predictive performance when compared with a clinical model (European System for Cardiac Operative Risk Evaluation).
The use of either administrative or clinical risk-adjustment models can affect the outlier status of hospitals when benchmarking is performed.
Risk-adjustment models including procedure-specific clinical risk factors are recommended.
How might this impact on clinical practice
The findings in this study might stimulate healthcare providers and policy makers to use clinical data for the purpose of provider profiling.
Administrative data should be used for outcomes such as mortality and readmissions, in addition to the clinical risk factors.
The conclusions of this study help to clarify the limitations of the HSMR method in specific patient populations, such as cardiac surgery.
SS and MEP contributed equally to this study.
Contributors SS and MEP wrote the statistical analysis plan, cleaned and analysed the data, and drafted and revised the paper. RHHG wrote the statistical analysis plan, supervised the statistical analyses and made thorough critical revisions of the paper. MLB, YvdG and CJK supervised the statistical analyses and made thorough critical revisions of the paper. MIMV and LAvH monitored data collection in the Netherlands Association for Cardio-Thoracic Surgery database, provided the data and made thorough critical revisions of the paper.
Funding The Department of Cardio-Thoracic Surgery UMC Utrecht has received financial support from the Netherlands Association of Cardio-Thoracic Surgery to cover part of the first author's salary.
Competing interests All authors have completed the Unified Competing Interests form at http://www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous 3 years; no other relationships or activities that could appear to have influenced the submitted work.
Ethics approval The Hospital Discharge Registry is national statistical data, made available by National Statistics Netherlands. The data from The Netherlands Association for Cardio-Thoracic Surgery database was collected in the 10 participating centres and sent to National Statistics. National Statistics performed the linkage to the Hospital Discharge Registry as a Trusted Third Party. The data provided for this study was fully anonymised; it was not possible to identify the individuals from the information provided. Considering the aforementioned, approval from the ethics committee was not required and not obtained.
Provenance and peer review Not commissioned; externally peer reviewed.