Fair comparison of mortality data following cardiac surgery
C COLIN

Unité de Biostatistique, Département d'Information Médicale des Hospices Civils de Lyon, 162 avenue Lacassagne, 69424 Lyon Cedex 03, France; rene.ecochard{at}
Service de cardiologie, Hôpital cardiologique des Hospices Civils de Lyon, 59 boulevard Pinel, 69500 Bron, France
Unité de Méthodologie en Evaluation Médicale, Département d'Information Médicale des Hospices Civils de Lyon, Lyon, France

Outcome based quality of care monitoring is currently the subject of a lively debate, particularly in cardiac surgery. Medical, technical, and economic reasons underlie comparisons between surgeons, hospitals, or regions. This raises three major questions. What is the reference to compare with? In what form is information on surgical outcomes disclosed? How are random variations dealt with?

What is the reference to compare with?

The surgical outcomes of a given centre are compared either to a recognised standard or to other centres' results.

When outcomes are compared with a standard, predictive models are used to compute risk adjusted rates; the predicted risk is thus treated as a yardstick for acceptable practice. The Parsonnet scoring system is the most widely used for risk stratification in open heart surgery. The scores were calculated more than 10 years ago (1982–1987) in the USA, and involved over 3500 consecutive surgical procedures.1 The system has proved its validity in predicting coronary artery surgery mortality in the UK.2 However, it does not seem applicable to present European practice. Studies carried out in the UK3,4 and France5 have shown that current European mortality figures are 30–50% lower than those predicted by the Parsonnet score. In addition, statisticians have criticised the methodology used in the original paper and stated that numerical risks obtained with the Parsonnet index should not be taken literally.6

In this issue, Wynne-Jones and colleagues propose a new predictive model based on data concerning all adult patients who underwent cardiac surgery during two years within an area which covers about one eighth of the UK cases.7 The observed mortality is 51% of that predicted by the Parsonnet score and the authors propose a correction factor. This article can be seen as an attempt to establish a modern European standard. As the patient spectrum widens constantly and varies from centre to centre, and as the relation between predictors and outcomes may change or disappear over time, any scoring system should be regularly assessed and updated.8
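A correction factor of this kind amounts to the ratio of observed deaths to the deaths expected under the score. The following is a minimal sketch in Python, assuming each patient already has a predicted risk from an external score; the function names and the figures are hypothetical and are not taken from the study. A single multiplicative factor is only the simplest form of recalibration, since it assumes the score overestimates risk by the same proportion in every patient.

```python
def observed_to_expected(predicted_risks, deaths):
    """Ratio of observed deaths to the deaths expected under the score."""
    expected = sum(predicted_risks)  # expected deaths = sum of individual risks
    if expected <= 0:
        raise ValueError("expected deaths must be positive")
    return deaths / expected


def recalibrate(predicted_risk, factor):
    """Scale one predicted risk by the correction factor, capped at 1."""
    return min(1.0, predicted_risk * factor)


# Illustrative data only: 1000 patients, each with a 10% predicted risk,
# of whom 51 died.
risks = [0.10] * 1000
factor = observed_to_expected(risks, deaths=51)
print(round(factor, 2))
print(round(recalibrate(0.10, factor), 3))
```

Because the factor depends on the case mix and period in which it was estimated, it would need to be recomputed whenever the score is reassessed.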

To avoid the problems caused by the lack of validity of an external standard, the choice is frequently made to compare performance between institutions. The need to adjust for patients' initial status has been strongly argued, but this seems to be a difficult task: physicians may look for and record more details of the medical examination at large hospitals or specialised clinics than at small healthcare centres. In such a situation, adjusted comparisons are not valid, because their validity depends on risk factors being collected homogeneously across institutions. If this condition is met, adjusted comparisons between institutions will be a good alternative to the predicted risk approach.

In what form is information on surgical outcomes disclosed?

Comparisons can be disclosed in the form of league tables or of prospective monitoring charts.

Most analyses of outcome after cardiac surgery take the form of retrospective investigations and report mortality rates. Average surgical performances over time are delivered to the surgeons or the public as league tables. This choice was recently made in France where, for three consecutive years (1997 to 1999), the lay press has been ranking health institutions according to various criteria and in selected areas, including chiefly heart surgery.

An alternative option is to monitor surgical outcomes prospectively, using a plot that shows the difference between the cumulative expected mortality and the deaths that actually occurred.3,4,9,10 Because it shows changes over time, this approach can be used by individual surgeons to monitor their own performance or the progress made in mastering a new technique.10
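The quantity plotted in such a chart can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure of any of the cited papers: the function name is hypothetical, the risks are assumed to come from whatever predictive model is in use, and the five operations are made-up numbers.

```python
from itertools import accumulate


def cumulative_excess_survival(predicted_risks, died):
    """Running total of (expected - observed) deaths, one point per operation.

    predicted_risks: predicted probability of death for each operation, in
    chronological order. died: matching sequence of 0/1 outcomes.
    The curve rises when patients survive and falls when they die.
    """
    if len(predicted_risks) != len(died):
        raise ValueError("inputs must have the same length")
    steps = (risk - outcome for risk, outcome in zip(predicted_risks, died))
    return list(accumulate(steps))


# Illustrative series of five operations (made-up numbers).
risks = [0.05, 0.10, 0.02, 0.20, 0.08]
outcomes = [0, 0, 0, 1, 0]
curve = cumulative_excess_survival(risks, outcomes)
print([round(point, 2) for point in curve])
```

A curve that drifts upwards indicates fewer deaths than predicted; a sustained downward drift indicates more. The curve alone does not say whether a drift exceeds what chance would produce.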

The choice between league tables and prospective monitoring depends on the context, and both methods are useful if the results are appropriately delivered. However, hasty judgements on quality of care based on indicators or ranks, even when more or less appropriately adjusted, must be regarded with extreme caution. Whenever comparisons are to be made between institutions, the following recommendations should be followed: keep the results in their own context, use several indicators rather than just one, and explain the statistical uncertainty.11

How are random variations dealt with?

Random variations can be sufficient to explain occasionally higher or lower death rates in some institutions. This is to be taken into account for both league tables and prospective monitoring charts.

In this issue, Sherlaw-Johnson and colleagues propose a method for assessing the natural variation in outcome.12 Their “nested prediction intervals” seem to be easy to generate. A surgeon can use any standard spreadsheet software to generate his own prediction intervals and judge whether poor performance could have occurred by chance. Nevertheless, the power to detect poor performance is limited. Using another statistical method, Poloniecki and colleagues showed that to detect a doubling of the death rate with a 9 in 10 chance, 16 expected deaths are required, which is about 160 operations!4 Thus, as stated by Sherlaw-Johnson and colleagues, “a surgeon should certainly not be complacent if mortality figures are poor, but manage to stay within the 95% limits”. Their approach is useful rather to “detect a change, particularly deterioration, as a trigger for further investigation.”4
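The order of magnitude of this requirement can be explored with a simple calculation. The sketch below uses a one-sided exact Poisson test at the 2.5% level, which is an assumption of this illustration and not necessarily the method used by Poloniecki and colleagues; the function names are hypothetical. It computes the chance of flagging a surgeon whose true death rate is twice the expected rate, as a function of the number of expected deaths.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k events for a Poisson distribution."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def upper_tail(k, mean):
    """P(X >= k) for a Poisson distribution with the given mean."""
    return 1.0 - sum(poisson_pmf(i, mean) for i in range(k))


def power_to_detect(expected_deaths, ratio=2.0, alpha=0.025):
    """Power of a one-sided exact Poisson test against `ratio` times the
    expected number of deaths."""
    # Smallest death count that would be flagged at level alpha.
    threshold = 0
    while upper_tail(threshold, expected_deaths) > alpha:
        threshold += 1
    return upper_tail(threshold, expected_deaths * ratio)


for expected in (4, 8, 16, 32):
    print(expected, round(power_to_detect(expected), 2))
```

Under these assumptions the power rises with the number of expected deaths, and translating expected deaths into operations depends on the expected mortality (the figure of 160 operations for 16 expected deaths implies about 10%).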

The lack of power to detect differences between mortality rates has also been noted for league tables. Indeed, the ranks of healthcare centres are particularly sensitive to sampling variability and may change widely from year to year owing to random variability.13 Furthermore, the fewer the patients and the more numerous the extrinsic factors, the wider this random variability. Therefore, a moderately active healthcare centre may have a high mortality rate (say 12%) one year but 0% the year before or after, simply because of random variability.
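A short binomial calculation gives a feel for the size of these swings. The centre below is hypothetical: 50 operations a year and a constant true mortality of 4% are assumptions chosen for illustration, not figures from any of the cited studies.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k deaths in n operations with true risk p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k, n, p):
    """Probability of k or more deaths in n operations."""
    return 1.0 - sum(binomial_pmf(i, n, p) for i in range(k))


# Hypothetical centre: 50 operations a year, true mortality 4%.
n_ops, true_rate = 50, 0.04
p_no_deaths = binomial_pmf(0, n_ops, true_rate)
p_twelve_percent = prob_at_least(6, n_ops, true_rate)  # 6/50 = 12%
print(round(p_no_deaths, 3), round(p_twelve_percent, 3))
```

The two printed values are the probabilities of a year with no deaths and of a year with an observed rate of 12% or more, with no change at all in the underlying quality of care.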

To illustrate this point, we reanalysed the data on mortality rates following coronary artery bypass graft surgery in 32 institutions in France.14 We recalculated hospital ranks along with their potential errors (credible intervals) according to a statistical method intended to evaluate the uncertainty associated with ranks.13 Figure 1 shows that Mulhouse hospital, which ranks 16th (26th in the magazine14), might have been ranked third or 29th a year earlier or later by pure chance. We also see that very few hospitals can be singled out as being less or more efficient than any other because the credible intervals are generally too wide, despite the substantial number of cases per centre.

Figure 1

Hospital mortality rate ranks () following coronary artery bypass grafts, and their 95% credible intervals (—) according to our calculations.
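Credible intervals for ranks of this kind can be approximated by simulation. The sketch below is a simplified stand-in, not the method of reference 13: it assumes an independent uniform prior on each centre's mortality rate, ignores case mix, and uses made-up data for five centres. The function name and parameters are hypothetical.

```python
import random


def rank_intervals(deaths, operations, draws=2000, seed=0):
    """Approximate 95% credible intervals for mortality ranks.

    Each centre's true rate gets an independent Beta(1 + deaths,
    1 + survivors) posterior (uniform prior, an assumption of this sketch).
    Rank 1 is the lowest sampled mortality rate.
    """
    rng = random.Random(seed)
    n = len(deaths)
    sampled_ranks = [[] for _ in range(n)]
    for _ in range(draws):
        rates = [rng.betavariate(1 + d, 1 + m - d)
                 for d, m in zip(deaths, operations)]
        order = sorted(range(n), key=rates.__getitem__)
        for rank, centre in enumerate(order, start=1):
            sampled_ranks[centre].append(rank)
    intervals = []
    for ranks in sampled_ranks:
        ranks.sort()
        low = ranks[int(0.025 * draws)]
        high = ranks[int(0.975 * draws) - 1]
        intervals.append((low, high))
    return intervals


# Made-up data for five centres, not the French figures.
deaths = [3, 5, 6, 8, 12]
operations = [200, 210, 190, 205, 195]
for centre, (low, high) in enumerate(rank_intervals(deaths, operations)):
    print(f"centre {centre}: rank between {low} and {high}")
```

Each draw produces one plausible ordering of the centres; the spread of a centre's rank across draws is what the interval summarises.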


Any surgeon's or centre's performance that departs from a recognised standard should prompt questioning long before it approaches confidence limits; it is unacceptable to operate 160 times before becoming aware of a loss of performance. However, comparative results should not be made public before significant deviations are observed. Otherwise, lay opinion may quickly and unfairly judge a given performance as inferior or, more bluntly, bad. Thus, we recommend that the results of comparisons with a standard or between centres be considered as incentives for further improvement rather than tools for final judgement.