Abstract
Objective Machine learning (ML) can facilitate prediction of major adverse cardiovascular events (MACEs) in repaired tetralogy of Fallot (rTOF). We sought to determine the incremental value of ML above expert clinical judgement for risk prediction in rTOF.
Methods Adult congenital heart disease (ACHD) clinicians (≥10 years of experience) participated (one cardiac surgeon and four cardiologists (two paediatric and two adult cardiology trained) with expertise in heart failure (HF), electrophysiology, imaging and intervention). Clinicians identified 10 high-yield variables for 5-year MACE prediction (defined as a composite of mortality, resuscitated sudden death, sustained ventricular tachycardia and HF). Risk for MACE (low, moderate or high) was assigned by clinicians blinded to outcome for adults with rTOF identified from an institutional database (n=25 patient reviews conducted by five independent observers). A validated ML model identified 10 variables for risk prediction in the same population.
Results Prediction by ML was similar to the aggregate score of all experts (area under the curve (AUC) 0.85 (95% CI 0.58 to 0.96) vs 0.92 (0.72 to 0.98), p=0.315). Experts with ≥20 years of experience had superior discriminative capacity compared with <20 years (AUC 0.98 (95% CI 0.86 to 0.99) vs 0.80 (0.56 to 0.93), p=0.027). In those with <20 years of experience, ML provided incremental value such that the combined (clinical+ML) AUC approached ≥20 years (AUC 0.85 (95% CI 0.61 to 0.95), p=0.055).
Conclusions Robust prediction of 5-year MACE in rTOF was achieved using either ML or a multidisciplinary team of ACHD experts. Risk prediction of some clinicians was enhanced by incorporation of ML suggesting that there may be incremental value for ML in select circumstances.
- Tetralogy of Fallot
- Magnetic Resonance Imaging
- Quality of Health Care
- Risk Factors
Data availability statement
Data are available upon reasonable request. The data that support the findings of this study are available from the corresponding author upon reasonable request.
WHAT IS ALREADY KNOWN ON THIS TOPIC
Artificial intelligence applications in cardiovascular medicine are rapidly expanding given the capacity for deep learning and the inherent value of machine learning (ML) for mastery of large yet intricate datasets. Despite the tremendous potential for ML to transform contemporary clinical care, the practical aspects of ML applications in the clinical setting, alongside or in place of expert clinical judgement for risk prediction in cardiovascular medicine, have not been well defined.
WHAT THIS STUDY ADDS
Our data suggest that an augmented approach to risk prediction, combining ML with expert clinical judgement, can result in improved risk prediction for clinicians at the individual level with the greatest improvement observed in individuals with the least clinical experience (<20 years in clinical practice). The discriminatory capacity of an aggregate of five adult congenital heart disease experts with complementary clinical interests exceeded the risk prediction of any single clinician, although the aggregate risk score was not further augmented by the ML model. The highest discriminatory capacity for risk was observed in the clinicians with the greatest clinical experience (≥20 years in clinical practice).
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
With direct comparison of our previously validated ML algorithm to expert clinical judgement for risk prediction in adults with repaired tetralogy of Fallot, the present study expands our knowledge of the practical aspects of ML applications in the clinical setting. Our data support a cooperative approach between human and machine which is a promising paradigm through which ML can be effectively incorporated into routine clinical care.
Introduction
Artificial intelligence (AI) applications in cardiovascular medicine are rapidly expanding given the capacity for deep learning (DL) to facilitate workflow and the inherent value of machine learning (ML) for mastery of large yet intricate datasets.1 One of the most firmly established indications for AI in cardiovascular medicine is in the domain of image analysis, with studies demonstrating the non-inferiority of ML methodology as compared with conventional image processing by expert clinicians.2 3 A promising but relatively unexplored dimension of ML is the potential for prediction of adverse outcomes in those with cardiovascular disease.1 4 Furthermore, the incremental value of ML beyond expert clinical judgement remains largely undefined. Finally, despite a striking uptick in ML publications in recent years, closer scrutiny reveals a relative paucity of studies in patients with congenital heart disease (CHD) as opposed to acquired heart disease.1
In this study, we compare ML methodology with expert clinical judgement for risk prediction in adults with repaired tetralogy of Fallot (rTOF), the most common form of cyanotic CHD. Identification of patients with rTOF at increased risk of adverse clinical outcomes, such as sudden cardiac death or malignant arrhythmia, has been the focus of intensifying research efforts.5–8 We recently reported that an ML model could predict major adverse cardiovascular events (MACEs) in rTOF using variables extracted from the clinical health record which matched or exceeded previously published risk scores.8–11 The present study compares the predictive accuracy of our validated ML model against a panel of adult CHD (ACHD) experts with a view to obtaining a deeper understanding of the role for AI in ACHD clinical care.
Methods
ML model
Details of our validated ML model (the AI model based on Toronto patient outcome data (AiTOR)) have been published.9 Briefly, our AiTOR model was trained on an rTOF dataset (n=235 patients) using a supervised random forest algorithm to identify 10 high-performance variables from our institutional electronic health record which enabled accurate prediction of 5-year risk of MACE, defined as a composite of mortality, resuscitated sudden death, sustained ventricular tachycardia (VT) and admission for heart failure (HF) management. Variables in the AiTOR model included demographic features (n=2), cardiovascular magnetic resonance (CMR) measurements (n=5) and cardiopulmonary exercise (CP) test parameters (n=3). Variable strength was ranked and displayed using a feature map (SHapley Additive exPlanations) which graphically depicts the relative contribution of each variable to risk prediction (figure 1). In the present study, a population distinct from the previously reported derivation cohort is used to compare risk prediction for MACE between a panel of expert clinicians and the established AiTOR model.
Study participants
To represent the breadth of experience in ACHD clinical practice, five clinician experts (defined as ≥10 years of consulting experience in delivery of specialised ACHD clinical care) with complementary skillsets were invited to participate. We included one ACHD surgeon and four cardiologists (two with paediatric cardiology and two with adult cardiology background training) with subspecialty interests spanning cardiac imaging, electrophysiology (EP), interventional cardiology and HF. With blinding to AiTOR variable selection, each clinician reviewed a list of more than 50 variables encompassing demographic features, surgical history, comorbidities, laboratory values, EP data, cardiac imaging measurements and CP exercise values.9 Clinicians were then asked to select the 10 highest-yield variables for 5-year MACE prediction in rTOF (composite of mortality, resuscitated sudden death, sustained VT and HF admission, as defined for the AiTOR model).9 The clinicians were asked to classify individual patient risk based on data from a pre-existing institutional database using only their 10 selected variables, blinded to the variables selected by the other experts and to the clinical outcomes of the study population. Clinicians were stratified by duration of ACHD practice (≥20 or <20 years).
Study population
To identify the population used for risk prediction in the present study, we used an established institutional database of adults with rTOF, distinct from the derivation dataset.9 Briefly, the observation period for each patient was set to begin on the date of the oldest CMR study with complete volumetric and functional data.12 13 Clinical data within 2 years of the index CMR study were recorded. To predict 5-year MACE in rTOF, only events within the first 5 years were entered into analysis and patients with <5 years of follow-up were excluded. From this retrospective database, 25 representative, randomly selected patients with comprehensive clinical datasets were identified for review by each of the five clinicians. The MACE frequency was 24% (n=6 of 25).
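The follow-up and event-window rules described above can be sketched in Python. This is an illustrative sketch only, not the authors' code: the `Patient` record, its field names and the `build_cohort` function are hypothetical. The text does not say how a patient whose event occurred before 5 years of follow-up had accrued was handled; the sketch assumes such patients are retained, and that assumption is marked in the code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

HORIZON_YEARS = 5.0  # prediction window stated in the text


@dataclass
class Patient:
    patient_id: str                # hypothetical identifier
    followup_years: float          # years observed after the index CMR study
    event_year: Optional[float]    # years from index CMR to first MACE, None if none


def build_cohort(patients: List[Patient]) -> List[Tuple[str, int]]:
    """Return (patient_id, mace_within_5y) pairs for eligible patients."""
    cohort = []
    for p in patients:
        # Only events inside the first 5 years count as an outcome.
        had_event = p.event_year is not None and p.event_year <= HORIZON_YEARS
        # ASSUMPTION: a patient with an event inside the window is kept even
        # if total follow-up is shorter than 5 years; event-free patients
        # need at least 5 years of follow-up to be included.
        if not had_event and p.followup_years < HORIZON_YEARS:
            continue
        cohort.append((p.patient_id, int(had_event)))
    return cohort


if __name__ == "__main__":
    demo = [
        Patient("a", 7.0, None),   # event-free, enough follow-up
        Patient("b", 3.0, None),   # excluded: <5 years of follow-up
        Patient("c", 8.0, 6.5),    # event after the window, counts as no MACE
        Patient("d", 6.0, 4.0),    # event inside the window
    ]
    print(build_cohort(demo))
```

Treating an event at 6.5 years as "no MACE" follows directly from the stated rule that only events within the first 5 years were analysed.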
MACE prediction
Following review of patient data based on their preselected set of 10 high-yield variables but without knowledge of the AiTOR model results, clinicians independently assigned the probability of 5-year MACE for each patient according to three predetermined categories: average risk, lower than average risk and higher than average risk with respect to our published institutional 5-year MACE prevalence of 12%.9 Individual patient risk estimates were assigned according to a weighted score as follows: 0 for low risk, 0.5 for average risk and 1 for high risk, as designated by each clinician. Concurrently, an ML-derived risk score was assigned to the same patient with a continuous scale ranging between 0 and 1 using our previously validated AiTOR algorithm. The aggregate score of five clinicians was expressed as a combined average of risk scores for each patient to represent cumulative risk stratification of the entire multidisciplinary ACHD team. An augmented score was calculated based on the combined contribution of ML and clinician for risk prediction at the patient level.
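As a rough illustration of the scoring scheme, the sketch below maps each clinician's category to the stated weights (0, 0.5 and 1) and averages across clinicians to form the aggregate score. The text does not specify how the ML and clinician scores were combined into the augmented score, so the equal-weight average used in `augmented_score` is an assumption, and all function and variable names are hypothetical.

```python
from statistics import mean
from typing import Dict, List

# Weights stated in the text: low = 0, average = 0.5, high = 1.
CATEGORY_SCORE = {"low": 0.0, "average": 0.5, "high": 1.0}


def clinician_scores(ratings: List[str]) -> List[float]:
    """Convert one clinician's per-patient categories to weighted scores."""
    return [CATEGORY_SCORE[r] for r in ratings]


def aggregate_score(ratings_by_clinician: Dict[str, List[str]]) -> List[float]:
    """Per-patient mean of the weighted scores across all clinicians."""
    per_clinician = [clinician_scores(r) for r in ratings_by_clinician.values()]
    return [mean(patient) for patient in zip(*per_clinician)]


def augmented_score(clinician: List[float], ml: List[float]) -> List[float]:
    """Combine clinician and ML scores per patient.

    ASSUMPTION: equal-weight average; the combination rule is not
    described in the text.
    """
    return [(c + m) / 2 for c, m in zip(clinician, ml)]


if __name__ == "__main__":
    ratings = {"A": ["low", "high"], "B": ["average", "high"]}
    print(aggregate_score(ratings))
    print(augmented_score(clinician_scores(ratings["A"]), [0.2, 0.6]))
```

Because the ML score is continuous on 0 to 1 and the clinician score takes only three values, any averaging rule yields a finer-grained score than the clinician rating alone.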
Statistical analysis
The Shapiro-Wilk test for normality defined data distribution. Continuous variables were expressed as mean (SD) for normally distributed data and median (IQR) for non-normally distributed data. Categorical data were reported using frequency (per cent). Comparisons between groups were made using the Student’s t-test or Wilcoxon rank-sum test for continuous variables and the χ² test or Fisher’s exact test for discrete variables, as appropriate. Accuracy of risk prediction was assessed using receiver operating characteristic curves created using bootstrap methodology with 1000 subsamplings. The area under the curve (AUC) with 95% CIs was calculated and comparisons were made using a two-sided DeLong approach. Sensitivity, specificity, and positive and negative predictive values were explored and threshold values were selected according to the Youden index. The interobserver agreement for risk prediction among the five clinical experts was evaluated using the intraclass correlation coefficient (ICC). A p value of <0.05 was considered statistically significant. Analyses were conducted using JMP Pro V.16 statistical software package (SAS Institute, North Carolina, USA). The ML models were developed in Python V.3.9 using scikit-learn for random forest algorithms. We did not include the public or patients in our study design.
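For readers unfamiliar with these metrics, the following standard-library Python sketch shows one way to compute an AUC, a percentile bootstrap CI from 1000 resamples and a Youden-index threshold. It is not the JMP procedure used in the study: the percentile method, the fixed seed and all function names are assumptions, and the DeLong comparison is not implemented here.

```python
import random
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as P(score of an event case > score of a non-event); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels: Sequence[int], scores: Sequence[float],
                 n_boot: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the AUC (ASSUMPTION: percentile method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # both classes are needed to define an AUC
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def youden_threshold(labels: Sequence[int],
                     scores: Sequence[float]) -> Tuple[float, float, float]:
    """Return (threshold, sensitivity, specificity) maximising J = sens + spec - 1."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = (-1.0, 0.0, 0.0, 0.0)  # (J, threshold, sens, spec)
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if j > best[0]:
            best = (j, t, sens, spec)
    return best[1], best[2], best[3]


if __name__ == "__main__":
    y = [0, 0, 1, 1]
    s = [0.1, 0.4, 0.35, 0.8]
    print(auc(y, s), bootstrap_ci(y, s), youden_threshold(y, s))
```

With only six events in 25 patients, bootstrap intervals are necessarily wide, which is consistent with the broad CIs reported in the Results.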
Results
Study participants
The majority of ACHD clinicians were male (80%, n=4). All were in full-time clinical practice at a quaternary care ACHD institution. Median number of years spent in ACHD clinical practice was 24 (range 11–34) years and the majority were in clinical practice ≥20 years (60%, n=3).
Variable selection
Out of 57 potential patient characteristics (online supplemental table 1), the five clinicians selected 22 distinct variables as being the highest yield for risk stratification (demographic n=5, comorbidities n=1, surgical n=5, CMR n=5, EP study n=4 and CP study n=2) (online supplemental figure 1). Several variables were selected by multiple clinicians. The most frequently selected variables were QRS duration on ECG, history of VT and CMR-derived right ventricular ejection fraction (RVEF) and left ventricular ejection fraction (LVEF) (figure 1A). The top 10 variables selected by clinicians and the 10 variables identified by AiTOR with the greatest predictive value are shown (figure 1A,B). Variables selected by clinicians stratified by ACHD subspecialty are demonstrated (figure 1C).
Patient characteristics
The characteristics of the study population (n=25) are detailed (table 1) and are compared with the features of the total population of patients in our institutional rTOF database (n=411). The vast majority of variables were not statistically different between the study population and the larger patient population. The characteristics of patients with and without MACE stratified by variables selected by the ML model versus expert clinician are demonstrated (online supplemental table 2).
MACE prediction: expert clinical judgement versus ML
The AUC values for risk prediction, determined by the AiTOR ML model as compared with clinical experts, individually and by aggregate, are shown along with sensitivity and specificity thresholds (figure 2 and tables 2 and 3). There was moderate inter-reader agreement between the five clinicians (ICC 0.54 (0.36 to 0.72)). The discriminatory capacity of the AiTOR model was high (AUC 0.86 (95% CI 0.58 to 0.96)) and did not differ statistically from the individual AUC values for four out of the five clinicians. The AUC derived from the aggregate of all five clinicians (AUC 0.92 (95% CI 0.72 to 0.98)) exceeded the AUC values of the individual clinicians as well as that of the AiTOR model, although differences were not found to be statistically significant in the majority of comparisons tested (table 3A). We observed a minor improvement in AUC between an aggregate of experts and AiTOR of 0.06 (95% CI 0.02 to 0.14), indicating that AiTOR approaches the accuracy of an expert panel. An augmented approach to risk prediction, where the AiTOR model and clinician scores were combined, resulted in enhanced risk prediction at the individual level (table 3B). Experts with ≥20 years’ experience had superior discriminative capacity compared with <20 years (AUC 0.98 (CI 0.86 to 0.99) vs 0.80 (0.5 to 0.93), p=0.027). In those with <20 years’ experience, ML provided incremental value such that the combined (clinician+ML) AUC approached ≥20 years (AUC 0.85 (95% CI 0.61 to 0.95), p=0.055) (table 4). The aggregate of clinicians correctly assigned all patients with MACE as high risk (sensitivity 100%, specificity 68%), while the AiTOR model mistakenly classified one patient who experienced MACE as being low risk (sensitivity 83%, specificity 79%) (figure 2).
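The reported sensitivity and specificity values can be reconciled with the cohort size by simple arithmetic. In the sketch below, the cohort size (25) and number of MACE cases (6) come from the text, whereas the true-positive and true-negative counts are inferred from the reported percentages rather than taken from the study tables; the function name is hypothetical.

```python
N_PATIENTS = 25                      # stated in the text
N_MACE = 6                           # stated in the text
N_NO_MACE = N_PATIENTS - N_MACE      # 19 patients without MACE


def sens_spec(true_pos: int, true_neg: int) -> tuple:
    """Return (sensitivity %, specificity %) rounded to whole per cent."""
    sensitivity = 100 * true_pos / N_MACE
    specificity = 100 * true_neg / N_NO_MACE
    return round(sensitivity), round(specificity)


# Counts are INFERRED from the reported percentages, not taken from the tables.
print(sens_spec(6, 13))   # aggregate of clinicians -> (100, 68)
print(sens_spec(5, 15))   # AiTOR model             -> (83, 79)
```

Under these inferred counts, the single MACE case missed by AiTOR accounts for the drop from 100% to 83% sensitivity, and the higher specificity corresponds to two fewer false positives among the 19 patients without MACE.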
Discussion
This study compared risk prediction for MACE in rTOF by a validated ML model against the clinical judgement of ACHD experts. Main study findings include the following: (1) an augmented approach to risk prediction, combining ML with expert clinical judgement, resulted in improved risk prediction for clinicians at the individual level with the greatest improvement observed in individuals with the least clinical experience; (2) discriminatory capacity of an aggregate of five ACHD experts with complementary clinical interests exceeded risk prediction of any single clinician, although their aggregate risk score was not further augmented by the ML model; (3) highest discriminatory capacity for risk was observed in the clinicians with the greatest experience (≥20 years in clinical practice).
Opportunities for AI in CHD
Although there is increasing recognition that AI can contribute to CHD care, applications in clinical practice remain under-realised.14 Opportunities for enhanced diagnosis, prognosis and management for patients with CHD as guided by ML exist, but are only slowly being recognised due to a paucity of supportive data. At present, much of the existing AI literature in CHD is centred upon cardiac imaging applications. In one of the earliest studies of ML in ACHD, Diller and colleagues demonstrated that complex CHD, such as systemic right ventricular lesions, can be accurately identified echocardiographically using a DL algorithm.15 More recently, studies have successfully applied DL techniques to enable quantitative assessment of ventricular volumes, function and mass in rTOF.16 17
However, there are some emerging data which link cardiac imaging findings to adverse outcomes suggesting that ML techniques can facilitate prognostic evaluation in rTOF. In 2018, Samad and colleagues proposed an ML algorithm which could predict ventricular deterioration in rTOF using CMR.18 In 2020, Diller and colleagues described a DL approach to CMR image analysis whereby automated two-dimensional image analysis was associated with a composite clinical outcome in rTOF.19 In 2023, our group demonstrated that an ML algorithm could accurately predict MACE in rTOF through incorporation of clinical and imaging variables extracted directly from the electronic health record.9
Although a topic of considerable relevance, less is known about how ML models can be, and should be, incorporated into the clinical setting for patient management. Specifically, direct comparisons of ML algorithms with expert judgement for day-to-day ACHD clinical management are largely absent from the published literature. Our study findings point towards a cooperative approach between humans and machines which may be a promising paradigm through which ML can be effectively incorporated into routine ACHD clinical care.4
ML can augment risk prediction in rTOF
Although the discriminatory capacity for risk prediction using the ML algorithm was strong (AUC 0.86, CI 0.58 to 0.96), this did not differ statistically from the majority of clinical experts (the ML-derived AUC was superior to only one clinician). Notably, adding the ML algorithm to expert clinical judgement in a hybrid score resulted in an augmented AUC for each of the clinical experts at the individual level (table 3). It is not surprising that this coordinated effort results in enhanced risk prediction as compared with the individual clinician as it capitalises on the distinct yet complementary approaches to clinical problem-solving that can be used when determining risk. While a clinician will typically use a traditional, rules-based algorithm for evaluating risk based on previous experience and knowledge of the published literature, an ML model will establish risk in an open-ended, unbiased fashion based solely on the dataset available for study.
The highest discriminatory capacity was observed in the subset of clinicians with the greatest clinical experience (≥20 years) (AUC 0.98 (CI 0.86 to 0.99)). Yet, when the ML algorithm was incorporated into risk prediction of the clinicians with the least experience (<20 years), their AUC was augmented such that their performance approached that of the more experienced clinicians (AUC 0.85 (CI 0.61 to 0.95)). Worthy of mention is the strong aggregate score incorporating risk prediction from all five ACHD clinicians which was not substantially augmented despite incorporation of the ML algorithm (AUC 0.92 (CI 0.72 to 0.98) vs AUC 0.89 (0.68 to 0.97), p=0.413). This observation underscores the importance of a collaborative, integrated, multidisciplinary approach to management in ACHD directed by a team of ACHD clinical experts with different subspecialties and a range of complementary clinical skillsets.12 13
Taken together, we suggest that machines are unlikely to replace humans for clinical management and risk stratification for MACE, specifically for identification of high-risk patients with ACHD. Veteran clinicians and aggregates of clinical experts outperformed our AiTOR ML algorithm (notably, our AiTOR model still outperformed several established risk scores, as previously described).9 At the individual level, there is potential for augmented risk prediction through incorporation of our ML model. Importantly, despite variability in the discriminatory capacity of individual clinicians, all clinicians correctly identified the six patients with MACE as being average or high risk, whereas the AiTOR ML model incorrectly classified a patient with MACE as being low risk.
ML enhanced variable selection for risk prediction in rTOF
The ML algorithm provided some novel insights pertaining to high-yield variable selection for robust risk prediction in rTOF. The first observation relates to model performance using a short list of 10 variables. We previously observed that the AiTOR model had similar performance if 10 variables or if a full complement of 57 variables were used, and in the present study, we demonstrate that clinical experts can achieve similar high discriminatory performance using 10 variables only. The second finding relates to the types of variables selected. The only overlapping variables between the high-frequency variables selected by clinical experts and the most important features selected by the ML algorithm were age (at repair and at CMR, respectively) and systolic ventricular function on CMR (RVEF and LVEF). Although the ML model identified right ventricular end-systolic and right ventricular end-diastolic volumes as being two of the strongest predictors of outcome in addition to three exercise parameters (in keeping with contemporary literature),20–22 it is noteworthy that volumetric measurements were not selected by any of the clinicians and exercise data were only infrequently selected (raising the question of whether risk prediction of the individual clinician would have been further enhanced by incorporating variables preselected by the ML model) (online supplemental figure 1). Although the clinicians commonly selected historically established, traditional variables such as QRS duration on ECG and history of VT,5 10 these were not identified as being high performance by the ML model. This discrepancy highlights the interplay between the unique clinical characteristics of a patient population within a given institution and site-specific interpretation of risk by clinicians. 
Finally, it stands to reason that ML can refine collective decision-making by adding an additional, objective dimension to variable selection which is unencumbered by the potential biases of individual clinicians.
Study limitations
In addition to the shortcomings inherent in a retrospective study, there are several other limitations worthy of mention. First, the approach to risk prediction that we imposed on the clinical experts, namely identification of 10 high-yield variables, was not an accurate reflection of usual clinical practice which typically incorporates a patient interview and a physical examination thereby adding important dimensions to risk assessment not captured in this study. Furthermore, the handling of missing data differs between an ML algorithm which can impute data into missing fields as opposed to a clinician who might arrange for additional testing in an effort to achieve a more complete clinical dataset. In this study, we excluded datapoints of potential importance with a high degree of missingness (such as brain natriuretic peptide and late gadolinium enhancement data). Additionally, our exploratory study was modest in size (both in terms of clinician participation and patient population) and could be underpowered for detection of differences between ML and clinical experts. Moreover, our population represents quaternary care at a single site (possibly reflecting centre-specific management bias). Therefore, incorporation of additional centres would be required to augment study power and generalisability. Finally, this study focused only on clinical data; however, risk assessment would undoubtedly be enhanced by incorporation of additional dimensions, including genomic, metabolomic and environmental data.23
Conclusions
Robust prediction of 5-year MACE in rTOF can be achieved using ML or a multidisciplinary team of ACHD experts. Risk prediction of some clinicians was enhanced by incorporation of ML suggesting that there may be incremental value for ML in select clinical circumstances. Our findings point toward a coordinated approach to risk prediction in rTOF using both clinical expert judgement and ML modelling.
Ethics statements
Patient consent for publication
Ethics approval
This study involves human participants and was approved by the University Health Network research ethics board (study number 20-5873). Due to the study’s retrospective nature, the institutional research ethics board granted a waiver of consent.
Footnotes
Twitter @EOechslin, @michael.gritti, @drrachelwald
Contributors RMW is responsible for the overall content as the guarantor. AI, CM and RMW substantially contributed to the study conceptualisation. AI, CM, SLR, DJB, EO, LB, KN, MML, MNG, KH, GRK and RMW contributed to data analysis and interpretation. AI drafted the original manuscript. RMW supervised the conduct of this study. All authors critically reviewed and revised the manuscript draft and approved the final version for submission.
Funding This research was funded by the Canadian Institutes of Health Research (MOP-119353) to RW.
Competing interests None declared.
Patient and public involvement Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.
Provenance and peer review Not commissioned; externally peer reviewed.