Introduction

Type 2 diabetes is a strong risk factor for cardiovascular disease (CVD), with some studies suggesting that it confers an equivalent risk to having had a myocardial infarction [1, 2]. Multifactorial interventions, such as the Steno-2 study, have been effective in reducing the risk of non-fatal [3] and fatal CVD [4] among diabetic patients through therapy targeting hyperglycaemia, hypertension and hypercholesterolaemia. Despite this evidence of effectiveness, many countries use a rationing approach to the prescription of cardiovascular risk reduction treatment, with national guidelines suggesting that patients should have their risk of CVD calculated, to ensure therapy is targeted to patients at highest absolute risk. Multivariate risk scores have, therefore, been used to predict CVD risk in individuals with diabetes.

There are a large number of scores for the general population, but few that are specific to people with diabetes. Whether general population scores can be used accurately in a subgroup of individuals with diabetes is unclear [5, 6]. The most commonly used score is that originally developed in 5,573 men and women participating in the Framingham study in the early 1970s, which, in general, performs well in North America [7], but less well in other populations [7–9]. We aimed to review the published evidence on the performance of CVD risk scores in diabetic populations. First, we examine the overall rationale for using cardiovascular risk scores in patients with diabetes. Second, we provide results from a systematic review of the published literature on CVD risk scores that have been developed or evaluated in individuals with diabetes. Finally, we explore methodological issues surrounding the development, validation and comparison of risk scores.

Why estimate CVD risk in individuals with diabetes?

There are several reasons why it may be important to quantify the risk of developing CVD in patients with diabetes. The clear identification of the rationale for developing a risk score is critical to how its validity is assessed. For example, if the purpose of a score is to rank individuals and groups according to absolute risk for the purpose of targeting therapy to those at greatest risk, then it is the ranking that is important and not necessarily the absolute risk estimate. If, on the other hand, the principal justification is to provide prognostic information or accurate estimation of the likely absolute benefit from a therapeutic intervention, then a precise computation of absolute risk is important. Finally, if the main reason for calculating risk as part of a preventive strategy is to motivate patients to change their behaviour and adhere to medical treatments, it may be important to calculate modifiable risk rather than use a score dominated by fixed variables. In the following sections we describe the origin of CVD risk scores and illustrate how purpose, construction and validation are often disconnected.

CVD risk scores have been widely used in the UK and elsewhere for over 10 years and were introduced at a time when the cost of cholesterol-lowering drugs was an important political issue [10]. National recommendations began appearing in the mid-1990s, when specified annual CVD risk estimates were used to determine thresholds for prescribing therapy. These thresholds were decided on the basis of the number needed to treat to prevent one CHD event, the cost-effectiveness of treatment and, predominantly, the proportion of the population requiring treatment and the total cost of treatment [11]. As such, the threshold for intervention based on the ranking of absolute risk was largely a financial decision. Although results from the Heart Protection Study [12] and the Collaborative Atorvastatin Diabetes Study [13] suggest that statin therapy is effective in patients with diabetes at high CVD risk, irrespective of their initial cholesterol concentrations, not all health systems can afford such a policy. The ranking of absolute risk is clearly important for making collective decisions about therapy, but, beyond statin prescribing, does the calculation of CVD risk aid clinical decision-making?

An estimate of absolute risk can inform the potential for absolute risk reduction, providing patients with an idea of expected benefit from a therapy or intervention. This is important for individual, rather than collective, decision-making. However, further research is needed to understand the process by which the clinician and patient interact once cardiovascular risk has been assessed [9]. While it is assumed that telling individuals their CVD risk is a motivating tool, there is only weak evidence that this assumption is valid. In a systematic review of randomised controlled trials that evaluated the effectiveness of risk-scoring methods, no strong evidence was found that a CVD risk assessment by a clinician improved CVD-related health outcomes [9]. Practitioners need to provide information about what patients can do to reduce this risk; they also need to acknowledge the possibility of false reassurance or a fatalistic response, both of which can in theory lead to an increase in population risk. Qualitative work has demonstrated that very few patients with type 2 diabetes understand the direct link between having diabetes and their CVD risk [14]. Although they are aware of CVD, they are more likely to attribute it to external or immutable factors such as stress and heredity, rather than modifiable risk factors such as high cholesterol, hyperglycaemia and smoking [14]. This phenomenon links in with the idea of using risk scores as preventive or motivating tools. Most of the predictive value of a CVD risk score comes from the inclusion of two unchangeable risk factors, age and sex. It is probably difficult to persuade patients to change their behaviour on the basis of CVD risk scores that are mostly driven by risk factors patients cannot change. Thus risk scores that only incorporate modifiable risk factors are more likely to be useful for preventive strategies in patients with diabetes.

Systematic review of CVD risk assessment tools

Search strategy

A comprehensive literature search for studies of CVD risk assessment tools was performed using MEDLINE, Web of Science and Cochrane Reviews from database inception up to 30 June 2008. The search strategy focused on four key elements: CVD, type 2 diabetes, risk assessment/score/prediction and specific names of known risk scores (see Appendix). We also screened the reference lists of papers identified from the initial electronic search. No language restriction was applied; articles were translated when necessary.

Selection criteria

We included studies reporting CVD risk assessment tools or scores that: (1) were derived from prospective cohort studies or randomised trials; (2) were derived in the general population and evaluated in individuals with diabetes, or developed in a diabetic population; and (3) reported a measure of performance of the risk score for predicting CVD. We restricted the review to studies that reported the following diseases as a primary outcome:

  • Fatal or non-fatal CVD

  • Fatal or non-fatal CHD

  • Fatal or non-fatal cerebrovascular disease or stroke

We excluded studies that derived cardiovascular risk scores for the general population but did not evaluate them in individuals with diabetes. We excluded studies that derived risk prediction tools other than score-type tools, such as those using carotid ultrasonography and myocardial perfusion scintigraphy. If scores and their evaluation were reported in several different papers, we included the score only once by selecting the paper that reported the most information on predictive ability.

Data extraction

Two reviewers (P. Chamnan, R. K. Simmons) independently reviewed the results from the primary search of titles, followed by the abstract and full paper searches (Fig. 1). Where reviewers disagreed, consensus was reached through discussion. The two reviewers used a standardised form to extract data on the performance of the risk scores. This included the name of the risk score and study, the country and setting, details on derivation and validation populations, follow-up for derivation and validation cohorts, definition of diabetes and CVD, risk factors included in the scores, and measures of predictive ability, including discrimination, calibration, sensitivity, specificity, and positive and negative predictive value. We also extracted data from original studies if articles identified through the initial search did not contain information on the development or validation of risk scores (Tables 1 and 2).

Fig. 1 Flow of identification of included studies

Table 1 Summary of the derivation of CVD/CHD risk scores primarily developed in individuals with diabetes
Table 2 Summary of the derivation of CVD/CHD risk scores primarily developed in the general population

Results

Our electronic search retrieved 2,113 potentially relevant papers (Fig. 1). After reviewing titles, abstracts, full texts and citation lists, 13 articles reporting the predictive performance of 17 different CVD risk scores met the inclusion criteria. We provide a summary of the derivation of the risk scores in Tables 1 and 2, and a summary describing the evaluation of the scores in Table 3. One paper was translated.

Table 3 Performance of CVD risk scores evaluated in individuals with diabetes

Development of risk scores

Out of 17 different risk scores, 15 were developed in predominantly white populations (USA and Europe) and two were developed in Chinese populations (Hong Kong). Cohort size ranged from 1,500 [15] to 205,178 individuals [16] and follow-up time from 4.7 [17] to 25 years [16]. The age of men and women included in development cohorts ranged from 18 to 84 years; patients with previous CVD were usually excluded. Eight risk scores were originally developed in a cohort of individuals with diabetes (Table 1), while the other nine were developed in a general population and subsequently evaluated in a cohort of individuals with diabetes (Table 2). The majority of risk scores (n = 10) provided estimates of risk for CHD. Two risk scores estimated risk of non-fatal or fatal CVD outcomes, while two risk scores predicted CVD deaths. The remaining three scores estimated risk of non-fatal or fatal stroke. The majority of risk scores incorporated classic CVD risk factors, such as age, sex, smoking, blood pressure and total cholesterol. Most risk scores derived from the general population contained a dichotomous variable for diabetes (yes/no) (n = 8) and did not take account of diabetes-specific risk factors, such as duration of diabetes or glycaemia. Conversely, risk scores developed in individuals with diabetes often included age at diagnosis, duration of diabetes and/or a measure of glycaemic control, such as HbA1c or fasting plasma glucose (FPG).

Evaluation of risk scores

Sixteen risk scores were evaluated in 13 different validation cohorts. Only the Atherosclerosis Risk in Communities risk score was not evaluated in an external validation cohort. The majority of validation cohorts were based in Europe (n = 10) and varied in size from 112 to 5,823 individuals with diabetes (Table 3). The 10-year cumulative incidence of CVD varied considerably, from 5% in a Chinese cohort [18] to 45% in a British cohort [19]. The age of individuals ranged from 18 to 75 years at baseline and median follow-up time from 4 to 10 years. All studies validated risk scores in patients with type 2 diabetes; one study additionally evaluated a risk score in patients with type 1 as well as type 2 diabetes [20]. Individuals with diabetes were recruited from the general population (n = 7) or specialist diabetes clinics (n = 6) and were identified through (1) computerised databases or registries in five cohorts; (2) clinical records in seven cohorts; and (3) the placebo arm of a trial in one cohort. Only three studies included a clear definition of diabetes, e.g. whether individuals were diagnosed by FPG or an OGTT according to WHO criteria (Table 3). Two articles failed to include any information on how individuals with diabetes were identified or diagnosed. Similarly, four articles did not include a clear definition of which CVD endpoint was used, with definitions varying between studies, e.g. clinical diagnosis of fatal and non-fatal CVD, CHD determined using coronary angiography or fatal CVD. Clinical records were used to retrieve data on CVD endpoints in most validation cohorts (n = 11), and endpoints were confirmed by expert reviewers in two studies. Diagnosis of CVD was made using coronary angiography in one validation study in Greece (Table 3). The majority of validation studies compared the performance of different CVD risk scores in a single population (n = 7). Among these, four studies compared the predictive ability of risk scores developed in individuals with diabetes and those developed in a general population.

Performance of the risk scores

Table 3 summarises the measures of predictive ability assessed in each validation study. Few studies reported complete measures of predictive performance, including discrimination, calibration and global model fit. The majority of studies reported a measure of discrimination (area under the receiver operating characteristic curve [aROC], also known as a c-statistic). Risk scores predicting CVD or CHD outcomes showed moderate to good discriminatory power in validation cohorts (aROC range 0.61 to 0.80). Similarly, risk scores predicting stroke reported aROCs from 0.59 to 0.79.

Three risk scores were evaluated using a Hosmer–Lemeshow χ² statistic as a measure of calibration in two validation cohorts, while eight studies reported predicted and observed event rates, the predicted-to-observed rate ratio, or whether the scores over- or underestimated CVD risk. Two validation studies did not report any measure of calibration. The UK Prospective Diabetes Study (UKPDS) risk engine showed poor calibration in both British [19, 21] and non-British validation cohorts [18, 22], while the Swedish National Diabetes Register [23] and Hong Kong Diabetes Registry [18] reported good calibration in Swedish and Chinese populations, respectively. Most risk scores developed in the general population underestimated CVD risk in diabetic patients. We found underestimations ranging from 11% for fatal CVD risk using the Diabetes Epidemiology: Collaborative Analysis of Diagnostic Criteria in Europe (DECODE) score [24] to 64% for fatal and non-fatal CVD risk using the Prospective Cardiovascular Münster (PROCAM) score [19]. Coleman et al. validated the Framingham, Systematic Coronary Risk Evaluation (SCORE) and DECODE risk engines in a population of 3,898 individuals with diabetes and showed that they did not provide reliable fatal CVD and CHD risk estimates [24]. Conversely, in populations with low CVD risk, such as Mediterranean countries, most of the risk scores overestimated CVD risk, even those developed in individuals without diabetes [22].

Studies comparing the predictive ability of a CVD risk score developed in a diabetic cohort and a score developed in the general population reported inconsistent results. In a study of 339 patients with diabetes, Protopsaltis et al. showed that the Framingham risk equations were more accurate than the UKPDS risk engine for predicting coronary artery disease risk (aROC 0.65 and 0.61, respectively) [25]. With similar sensitivity, Framingham risk equations had higher specificity and positive and negative predictive values than the UKPDS risk engine (65%, 43% and 75% vs 56%, 37% and 73%, respectively). However, the study population was small and the diagnosis of coronary artery disease was established by means of coronary angiography, which includes individuals with sub-clinical CHD, an endpoint not included in the UKPDS definition. By contrast, in a British community-based cohort of 428 individuals with newly diagnosed type 2 diabetes [21], the Framingham risk equations appeared to underestimate cardiovascular and coronary disease events by 33% and 32%, respectively. The UKPDS risk engine had lower levels of underestimation at 13%, suggesting that this diabetes-specific risk score performed better than a risk score developed in the general population. However, both the Framingham equations and UKPDS risk engine showed modest discriminatory ability and poor calibration (aROC of 0.66 and 0.67, Hosmer–Lemeshow χ² of 19.8 [p = 0.011] and 17.1 [p = 0.029], respectively), making it difficult to reach any firm conclusions.

Sensitivity, specificity and both positive and negative predictive values varied considerably between risk scores and validation cohorts, although most studies reported moderate to good sensitivity and specificity (Table 3). Risk scores for CHD and stroke developed in the Hong Kong Diabetes Registry had good sensitivity and specificity when tested in Chinese validation cohorts [18, 26], while the Framingham risk score and the UKPDS risk engine also had good sensitivity, but relatively low specificity when tested in European populations [21, 22].

Discussion

This systematic review has shown that the predictive ability of CVD risk scores, which were developed mainly in white populations, varies considerably between different populations. There is little evidence to suggest that using risk scores developed in individuals with diabetes will help to estimate CVD risk among diabetic patients more accurately than using those developed in the general population. The inconsistency in methods used to evaluate CVD risk scores makes it difficult to compare or summarise the predictive ability of different risk scores.

Our review supports previous research showing that CVD risk scores developed in the general population are likely to underestimate CVD risk in individuals with diabetes [9]. In theory, one way of addressing this underestimation would be to use risk prediction tools solely derived from diabetic populations [10]. However, our results suggest that diabetes-specific risk engines need to be validated in other populations before they are widely adopted and can replace Framingham-based methods of risk assessment. The issue of predicting risk for people with diabetes is likely to be increasingly complex, as a greater proportion of patients are treated with CVD risk factor modifying therapy. The development of risk prediction scores in populations with long-standing and treated diabetes, possibly incorporating information about the degree of glycaemic control, would be an important issue for future research. At the same time, the prediction of absolute future CVD risk in a clinical situation is likely to be based on single baseline measures of cardiovascular risk factors and not on any time-averaged measures collected during repeat visits. While the latter has a role in assessing the aetiological association between risk factors and outcome, repeated measures have limited utility in the practical clinical situation of risk prediction.

Most diabetes-specific risk scores incorporate measures of glycaemic control such as FPG and HbA1c, while most scores developed in the general population include a binary variable indicating whether an individual has diabetes or not. Hence, an important question in the general population is whether adding measures of glycaemic control improves the predictive ability of existing risk scores incorporating traditional CVD risk factors. In populations of people with diabetes, studies comparing the predictive performance of the Framingham risk score and the UKPDS risk engine have shown conflicting results [21, 25]. Part of the uncertainty about the additional predictive contribution of hyperglycaemia in this context is explained by the fact that, in comparison with age, cholesterol and smoking, hyperglycaemia is a relatively weak CVD risk factor, thus making it harder to show significantly improved prediction [27]. The differences between studies could also be explained by changes in the distribution of risk factors for CVD and their treatment over time, which could differ within and between populations. This has implications for the generalisability of a risk score, even within the population in which it was developed, since by necessity the development stage of a risk score has to take place in a different temporal period to the practical application of the score in a clinical setting. It is conceivable that temporal trends in risk factors and the way they are treated could have major impacts on the predictive accuracy of risk scores that were developed using historical data but are applied to predict what may happen in the future. Differences between populations in the underlying distributions of risk factors and their treatment would limit the potential generalisability of risk scores to populations other than that in which they were derived.

It is also important to recognise that the choice of validation population will have an influence on the performance of a risk score in estimating CVD risk. Studies that develop risk scores in one half of a cohort and then validate them in the other half are likely to report better predictive abilities. This is true of scores developed from the Hong Kong Diabetes Registry [18, 26] and the Swedish National Diabetes Register [23]. Conversely, validating risk scores in different populations and ethnic groups is likely to result in relatively poorer prediction. For example, the UKPDS risk engine had moderate discrimination and poor calibration when evaluated in a Chinese diabetic population [18]. This underlines the fact that the accuracy of a risk score largely relies on the background risk of a specific population to which it is applied. It may be more useful to develop or recalibrate population-specific risk prediction tools, rather than trying to find a universal risk score that will work in all populations.

This review used a comprehensive search without language restrictions. To the best of our knowledge, it is the first systematic review to assess the ability of risk scores to estimate cardiovascular risk in individuals with diabetes. We extracted data on widely used measures of predictive accuracy, including discrimination, calibration, sensitivity, specificity and positive and negative predictive values. In addition to differences in the choice of validation population, studies differed in inclusion criteria, follow-up, ascertainment methods, and definitions of diabetes and CVD endpoints. This makes it difficult to compare the predictive ability between different risk scores and between validation populations. The performance of CVD risk scores varied considerably, and there is no conclusive evidence of a difference in the predictive performance of risk scores developed in individuals with diabetes and of those developed in the general population. A variety of statistical approaches was used to describe and compare the predictive performance of the different risk scores; our review suggests that a more systematic and standardised approach is needed.

Methodological issues

The estimation of CVD risk is a dynamic research field and there are a number of unresolved methodological issues concerning the development, validation and comparison of risk scores. It is clear from the systematic review that a number of statistical techniques can be employed to assess the performance of predictive models. However, these measures are not uniformly calculated across all studies and it is sometimes difficult to obtain a comprehensive picture of the usefulness of a risk score without taking all these measures into account. Risk scores are frequently derived using a logistic regression model, which models the log odds of having the event during the specified follow-up period as a function of the candidate risk factors. The estimated coefficients from the logistic regression model are then used to define the risk score. Although this risk estimate is continuous, many of the methods of comparing prediction assume that a threshold will be used to categorise people into two groups on the basis of the risk score.
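To make this concrete, the sketch below (in Python, using synthetic data and hypothetical variable names such as sbp and total_chol) shows how a risk score of this kind can be derived: a logistic regression is fitted, its coefficients define the score, and the resulting continuous predicted risk is then often dichotomised at a treatment threshold. It is a minimal illustration under stated assumptions, not any of the published scores reviewed here.

```python
# Minimal sketch of deriving a risk score by logistic regression.
# All data, variable names and coefficients are synthetic and illustrative.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "age": rng.normal(60, 10, n),
    "sbp": rng.normal(140, 20, n),          # systolic blood pressure (mmHg)
    "total_chol": rng.normal(5.5, 1.0, n),  # total cholesterol (mmol/l)
    "smoker": rng.integers(0, 2, n),
})
# Simulate an event indicator for the follow-up period from a known linear predictor
lp = -11 + 0.08 * df["age"] + 0.02 * df["sbp"] + 0.3 * df["total_chol"] + 0.6 * df["smoker"]
df["event"] = rng.binomial(1, 1 / (1 + np.exp(-lp)))

# Model the log odds of the event as a function of the risk factors;
# the fitted coefficients define the risk score
X = sm.add_constant(df[["age", "sbp", "total_chol", "smoker"]])
model = sm.Logit(df["event"], X).fit(disp=0)
df["predicted_risk"] = model.predict(X)

# The continuous risk estimate is often dichotomised at a treatment threshold,
# e.g. a 20% predicted risk over the follow-up period
df["high_risk"] = (df["predicted_risk"] >= 0.20).astype(int)
print(model.params)
```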

Measures of discrimination

Discrimination is the ability of the prediction model to correctly separate individuals into those who will and those who will not have the event of interest. The commonly used receiver operating characteristic curve (ROC) is a plot of sensitivity on the vertical axis against (1−specificity) on the horizontal axis for every possible cut-off value of the continuous risk score. The aROC is the area under the ROC curve and is equal to the probability that a randomly selected individual with the event has a higher value of the risk score than a randomly selected individual without the event. A completely uninformative score would have a c-statistic of 0.5, while a perfect score would have a c-statistic of 1. The discrimination of two possible models can be compared by comparing the values of their aROCs. Net reclassification improvement (NRI) [28] is used to compare two models A and B, which share all risk factors except for one new marker included in model B. After classification of the predicted probabilities from the two models into risk categories, the NRI is the proportion of times model B correctly moves an individual with the event into a higher risk category or an individual without the event into a lower risk category, minus the proportion of times the new model incorrectly moves an individual with the event into a lower risk category or an individual without the event into a higher risk category. Integrated discrimination improvement (IDI) [28] is calculated by subtracting, for models A and B, the average probability of the event in the individuals without the event from the average probability of the event in the individuals with the event, and then calculating the difference in these two quantities. The higher the values of NRI and IDI (often expressed as percentages), the greater the improvement in performance of the new model; both statistics are zero when there is no improvement and can take negative values if the new model performs worse. It is possible to test whether each statistic is significantly different from 0.
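As a rough illustration of how these quantities can be computed, the Python sketch below calculates the aROC with scikit-learn and implements the standard event/non-event decomposition of the NRI and the IDI by hand; the risk-category cut-offs (10% and 20%) and all inputs are arbitrary examples rather than values used in any of the studies reviewed.

```python
# Illustrative calculation of aROC, NRI and IDI for two models A and B.
# y is the observed event indicator; p_a and p_b are predicted probabilities
# from the two models; the risk-category cut-offs are arbitrary examples.
import numpy as np
from sklearn.metrics import roc_auc_score

def nri(y, p_a, p_b, cut_points=(0.10, 0.20)):
    """Net reclassification improvement across the given risk categories."""
    bins = [0.0, *cut_points, 1.0]
    cat_a, cat_b = np.digitize(p_a, bins), np.digitize(p_b, bins)
    events, non_events = (y == 1), (y == 0)
    up, down = cat_b > cat_a, cat_b < cat_a
    # proportions of upward/downward movement, calculated separately in
    # individuals with and without the event
    nri_events = up[events].mean() - down[events].mean()
    nri_non_events = down[non_events].mean() - up[non_events].mean()
    return nri_events + nri_non_events

def idi(y, p_a, p_b):
    """Integrated discrimination improvement."""
    # difference in mean predicted risk between those with and without the
    # event, for each model, then the difference between the two models
    slope_b = p_b[y == 1].mean() - p_b[y == 0].mean()
    slope_a = p_a[y == 1].mean() - p_a[y == 0].mean()
    return slope_b - slope_a

# y, p_a and p_b would come from fitted models such as the one sketched above:
# print(roc_auc_score(y, p_b) - roc_auc_score(y, p_a), nri(y, p_a, p_b), idi(y, p_a, p_b))
```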

Calibration of risk scores

Calibration is the extent to which predicted risk from a model equals observed risk in the data. The Hosmer–Lemeshow goodness-of-fit test is calculated by ordering the predicted probabilities of the event into, say, ten near equal-sized groups. The observed and expected numbers of individuals with the event within each group are then compared using a χ² statistic (with nine degrees of freedom if there are ten groups). A statistically significant result implies that the model may be poorly calibrated.
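A minimal sketch of this calculation is given below (Python, with y the observed event indicator and p the predicted probabilities); the degrees of freedom follow the convention described above, although other conventions (e.g. the number of groups minus two when applied to the development dataset) are also in common use.

```python
# Sketch of the Hosmer-Lemeshow goodness-of-fit statistic with ten risk groups.
import numpy as np
from scipy import stats

def hosmer_lemeshow(y, p, n_groups=10):
    """Return the Hosmer-Lemeshow chi-squared statistic and its p value."""
    order = np.argsort(p)
    y, p = np.asarray(y)[order], np.asarray(p)[order]
    groups = np.array_split(np.arange(len(p)), n_groups)  # near equal-sized groups of ranked risk
    chi2 = 0.0
    for g in groups:
        n = len(g)
        observed_events, expected_events = y[g].sum(), p[g].sum()
        # compare observed and expected counts of events and of non-events
        chi2 += (observed_events - expected_events) ** 2 / expected_events
        chi2 += ((n - observed_events) - (n - expected_events)) ** 2 / (n - expected_events)
    dof = n_groups - 1  # nine degrees of freedom for ten groups, as described above
    return chi2, stats.chi2.sf(chi2, dof)
```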

Global measures of model fit

The likelihood ratio statistic is a global measure of model fit. The Bayesian information criterion (BIC) and Akaike’s information criterion (AIC) are two measures that combine both fit (as measured by the likelihood ratio statistic) and model complexity, in terms of number of variables (BIC and AIC) and also (for BIC) number of observations. These measures can be used in the same dataset for model selection; comparing two models, the model with the lower value of either BIC or AIC would be preferred. However, as Ware [29] has demonstrated, a new marker could be an important risk factor that significantly improves the fit of the model, but it may have almost no impact on model discrimination as measured by aROC.
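The sketch below illustrates this kind of comparison for two nested logistic models using statsmodels, which reports the log-likelihood, AIC and BIC directly; the synthetic data and the 'new marker' variable are purely hypothetical.

```python
# Comparing the global fit of two nested logistic models using the likelihood
# ratio statistic, AIC and BIC. All data are synthetic and illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
age = rng.normal(60, 10, n)
marker = rng.normal(0, 1, n)                   # hypothetical new biomarker
lp = -6 + 0.08 * age + 0.4 * marker
y = rng.binomial(1, 1 / (1 + np.exp(-lp)))

X_a = sm.add_constant(np.column_stack([age]))          # model A: established factor only
X_b = sm.add_constant(np.column_stack([age, marker]))  # model B: plus the new marker

model_a = sm.Logit(y, X_a).fit(disp=0)
model_b = sm.Logit(y, X_b).fit(disp=0)

lr_stat = 2 * (model_b.llf - model_a.llf)       # likelihood ratio statistic
print("LR statistic:", lr_stat)
print("AIC (A, B):", model_a.aic, model_b.aic)  # lower value preferred
print("BIC (A, B):", model_a.bic, model_b.bic)  # BIC penalty also grows with sample size
```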

CVD risk scores are often compared using the aROC, with a significant increase in the aROC being taken as evidence that the discrimination of a risk score has improved. More recently, Cook and others [30, 31] have argued that the aROC is an insensitive measure based purely on ranks and that we should also calculate a measure of calibration (e.g. the Hosmer–Lemeshow statistic) and global fit (e.g. BIC) in order to comprehensively assess the utility of a risk score. They argue that the critical issue for clinical application is ‘the proportion of patients reclassified using a new risk algorithm and whether the magnitude of this reclassification is large enough to alter physician behaviour with regard to prevention’ [31, 32]. Pepe et al. [33] recently suggested calculating the NRI and IDI measures when comparing two scores, in order to assess whether any reclassification was in the right or wrong direction.

An example from our previous research illustrates some of these analytical approaches [34]. Using data from EPIC-Norfolk, a UK population-based prospective cohort [35], we computed two novel risk scores by fitting Cox proportional hazards regression models with CHD as the outcome. In model A, we used the original Framingham risk score variables, i.e. age, total cholesterol, HDL-cholesterol, systolic blood pressure, smoking status and diabetes. In model B, we replaced the diabetes variable with HbA1c to examine whether the addition of this continuous variable improved the prediction of CHD. While the results on discrimination indicated that model B was better than model A at distinguishing between individuals with and without the disease, results from the NRI statistic suggested that this difference did not significantly improve reclassification. This conflicting information highlights the challenges of comparing CVD risk scores and demonstrates that despite statistical advances, a number of questions remain unanswered.
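For readers wishing to reproduce this type of analysis, the sketch below (Python with the lifelines package, using synthetic data and hypothetical column names rather than the EPIC-Norfolk data) fits two Cox proportional hazards models that differ only in whether diabetes status or HbA1c is included, and compares their discrimination via the concordance index; reclassification measures such as the NRI could then be computed from predicted risks at a fixed time horizon, as in the earlier sketch.

```python
# Sketch (not the authors' EPIC-Norfolk analysis): two Cox models differing by
# one variable, compared on discrimination. Data and column names are synthetic.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(2)
n = 1500
df = pd.DataFrame({
    "age": rng.normal(60, 9, n),
    "total_chol": rng.normal(6.0, 1.1, n),
    "hdl_chol": rng.normal(1.4, 0.4, n),
    "sbp": rng.normal(135, 18, n),
    "smoker": rng.integers(0, 2, n),
    "diabetes": rng.integers(0, 2, n),
    "hba1c": rng.normal(5.8, 0.9, n),
})
risk = 0.05 * (df["age"] - 60) + 0.3 * df["smoker"] + 0.2 * (df["hba1c"] - 5.8)
time_to_event = rng.exponential(scale=15 * np.exp(-risk))  # latent event time (years)
df["chd_event"] = (time_to_event <= 10).astype(int)        # event within follow-up
df["time"] = np.minimum(time_to_event, 10)                 # administrative censoring at 10 years

covs_a = ["age", "total_chol", "hdl_chol", "sbp", "smoker", "diabetes"]  # model A
covs_b = ["age", "total_chol", "hdl_chol", "sbp", "smoker", "hba1c"]     # model B

cph_a = CoxPHFitter().fit(df[covs_a + ["time", "chd_event"]], duration_col="time", event_col="chd_event")
cph_b = CoxPHFitter().fit(df[covs_b + ["time", "chd_event"]], duration_col="time", event_col="chd_event")

# Harrell's concordance index is the survival-data analogue of the aROC
print("c-index, model A:", cph_a.concordance_index_)
print("c-index, model B:", cph_b.concordance_index_)
```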

Most of the risk scores in this systematic review were developed using models for censored survival time data, such as the Cox proportional hazards model or the accelerated failure time model. However, the most common method for deriving the aROC is based on a logistic regression model, which ignores the time-dependent nature of the data. New approaches to assessing the performance of a risk score estimated from survival data in the presence of censoring have been proposed [36, 37]. These new performance measures have been shown to be unbiased, unlike the aROC from logistic regression, which tends to be underestimated and to have a larger standard deviation [37].
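By way of illustration, Harrell's concordance index is one widely used discrimination measure that uses the censoring information directly (it is not necessarily one of the specific approaches proposed in [36, 37]); the sketch below computes it with the lifelines package on synthetic data.

```python
# Sketch: concordance index for a risk score under administrative censoring.
# All data are synthetic; the score is oriented so that higher means higher risk.
import numpy as np
from lifelines.utils import concordance_index

rng = np.random.default_rng(3)
n = 1000
risk_score = rng.normal(0, 1, n)                             # higher = higher predicted risk
latent_time = rng.exponential(scale=12 * np.exp(-0.5 * risk_score))
time = np.minimum(latent_time, 10.0)                         # censor follow-up at 10 years
event = (latent_time <= 10.0).astype(int)

# concordance_index treats higher scores as predicting longer survival,
# so the risk score is negated before it is passed in
cindex = concordance_index(time, -risk_score, event_observed=event)
print("Harrell's c-index:", cindex)
```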

Conclusions

The computation of CVD risk is likely to remain an important part of the process of prioritising therapy for individuals and populations. The degree to which such risk scores can be improved is questionable, as attempts to add novel risk factors to existing CVD scores and thereby improve their predictive ability have not been very successful [38, 39], largely because a risk factor must be very strongly associated with a disorder for it to be useful for prediction [40]. Few novel biomarkers have demonstrated an ability to predict risk over and above information available from global assessment tools such as Framingham [38, 41]. Furthermore, there is rarely evidence that reductions in any of these novel markers will lower cardiovascular risk [42]. Genetic risk information currently adds little to prediction, but may become increasingly important in the future. However, there is little evidence that the provision of genetic risk information is associated with behaviour change [43].

Thus while there is clearly some scope for improving prediction, it is likely that improvements will be marginal. Indeed, it may be more profitable to focus on ensuring that tools currently available for risk prediction are applied more broadly and routinely throughout clinical practice in order to address the gap between the promise of CVD prevention and its reality [39]. When attempting to reduce CVD risk, the precision of the instrument may be less important than how it is used. As such, there is still a need for further research into provider and patient perceptions of CVD risk. There is considerable uncertainty about how best to present risk information and about whether the presentation of risk information is associated with lifestyle change or the degree of medication adherence [44–46]. The downsides of presenting risk information also need to be considered. For example, it is possible that showing how little CVD risk is reduced by lowering blood glucose alone may discourage people from adhering to hypoglycaemic medication.

Cardiovascular risk scores are useful tools in the management of individuals with diabetes, particularly when the score has been developed in a population of similar individuals. Scores that rank risk well are appropriate for identifying those at highest risk, to whom therapy can then be targeted. Conversely, the process of predicting risk accurately in order to provide prognostic information is better aided by risk scores that accurately quantify absolute risk. Finally, we see a potentially important role for scores capable of quantifying that element of risk that is modifiable, a strategy that could help motivate patients to change. While improvement in the predictive ability of risk scores might still be obtained, the public health utility of their application depends on a far wider range of issues.