Systolic blood pressure, chronic obstructive pulmonary disease and cardiovascular risk

Objective: In individuals with complex underlying health problems, the association between systolic blood pressure (SBP) and cardiovascular disease is less well recognised. The association between SBP and risk of cardiovascular events in patients with chronic obstructive pulmonary disease (COPD) was investigated.
Methods and analysis: In this cohort study, 39 602 individuals with a diagnosis of COPD aged 55–90 years between 1990 and 2009 were identified from validated electronic health records (EHR) in the UK. The association between SBP and risk of cardiovascular end points (composite of ischaemic heart disease, heart failure, stroke and cardiovascular death) was analysed using a deep learning approach.
Results: In the selected cohort (46.5% women, median age 69 years), 10 987 cardiovascular events were observed over a median follow-up period of 3.9 years. The association between SBP and risk of cardiovascular end points was found to be monotonic; the lowest SBP exposure group of <120 mm Hg presented the nadir of risk. With respect to reference SBP (between 120 and 129 mm Hg), adjusted risk ratios for the primary outcome were 0.99 (95% CI 0.93 to 1.05) for SBP of <120 mm Hg, 1.02 (0.97 to 1.07) for SBP between 130 and 139 mm Hg, 1.07 (1.01 to 1.12) for SBP between 140 and 149 mm Hg, 1.11 (1.05 to 1.17) for SBP between 150 and 159 mm Hg and 1.16 (1.10 to 1.22) for SBP ≥160 mm Hg.
Conclusion: Using deep learning for modelling EHR, we identified a monotonic association between SBP and risk of cardiovascular events in patients with COPD.

A visualisation of the study design can be found in Supplementary Figure S1. This visualisation demonstrates the index date, the exposure period where repeat measurements of systolic blood pressure (SBP) are averaged to serve as exposure status, and the follow-up period, which starts 12 months after index date.
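As a rough illustration of how averaged repeat measurements could be turned into an exposure status, the sketch below averages the SBP readings from the exposure period and maps the mean onto the six SBP groups used in the analysis. The function name `sbp_category` is hypothetical, and the handling of non-integer means (half-open 10 mm Hg bands, so that 129.5 falls in 120-129) is an assumption rather than something stated in the study.

```python
def sbp_category(measurements):
    """Average repeat SBP readings (mm Hg) and map the mean to an exposure group.

    Assumption: bands are half-open, e.g. [120, 130) is reported as "120-129".
    """
    if not measurements:
        raise ValueError("at least one SBP measurement is required")
    mean_sbp = sum(measurements) / len(measurements)
    if mean_sbp < 120:
        return "<120"
    if mean_sbp >= 160:
        return ">=160"
    lower = int(mean_sbp // 10) * 10  # e.g. 134.7 -> 130
    return f"{lower}-{lower + 9}"


# Three hypothetical readings taken during the exposure period.
category = sbp_category([128, 135, 141])
```

Here the mean of the three readings lies between 130 and 140 mm Hg, so the patient would be assigned to the 130-139 mm Hg group.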

Introduction to deep learning and Bidirectional Electronic Health Records Transformer
Deep learning (DL) modelling is a subclass of machine learning (ML), which is in turn a subclass of artificial intelligence (AI) modelling. DL is a more recent paradigm that utilises artificial neural networks to progressively extract more latent and richer features from input data for a given task.
BEHRT, one such DL model, is a Transformer model that has been shown in prior work to represent complex multimodal EHR better than previous DL models, such as recurrent and convolutional neural networks, and better than conventional statistical models [3][4][5]. The flexible BEHRT model allows multiple facets of complex EHR data to be included: the encounter itself (e.g., a diagnosis), time information of the encounter (i.e., both age and calendar year), and other attributes such as visit information. While all of these sources of information might provide useful features for adjustment in association estimation or risk prediction tasks, such nuanced data are hard to represent in previous approaches. BEHRT's flexible architecture allows for encoding this complex arrangement of data, and additionally is able to demonstrate state-of-the-art predictive performance on a host of tasks on EHR data [3][4][5].
Supplemental material. BMJ Publishing Group Limited (BMJ) disclaims all liability and responsibility arising from any reliance placed on this supplemental material which has been supplied by the author(s).

Targeted Bidirectional Electronic Health Records Transformer
We implemented the Targeted Bidirectional Electronic Health Records Transformer (T-BEHRT) for risk ratio (RR) estimation of the association between SBP and cardiovascular outcomes.
In order to include medical history variables in the T-BEHRT model, we conducted some processing of derived CPRD variables. First, the diagnostic records from primary care coded in the Read code format were mapped to the ICD-10 format for consistency with the secondary care coding format (ICD-10). This mapping process yielded a total of 1,497 codes [1]. Second, we mapped the medication codes in the CPRD "product code" format to 386 codes in the BNF coding format [2]. Third, we extracted smoking status (current, former, never a smoker) of a particular patient as the last known status in the 12 months before baseline. Fourth, we extracted patient sex for incorporation as a static variable in the T-BEHRT modelling framework.
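The "last known status in the 12 months before baseline" rule can be sketched as follows. This is only an illustration: the function name `last_known_smoking_status`, the record layout, the approximation of 12 months as 365 days, and the choice to exclude records dated on the baseline itself are all assumptions, not details taken from the study.

```python
from datetime import date, timedelta


def last_known_smoking_status(records, baseline, window_days=365):
    """Return the most recent smoking status recorded in the window before baseline.

    records: iterable of (record_date, status) tuples, with status one of
    "current", "former" or "never". Returns None if no record is in the window.
    """
    window_start = baseline - timedelta(days=window_days)
    in_window = [(d, s) for d, s in records if window_start <= d < baseline]
    if not in_window:
        return None
    # The latest record inside the window wins.
    return max(in_window, key=lambda record: record[0])[1]


# Hypothetical smoking records for one patient.
records = [
    (date(2003, 1, 10), "current"),   # too old, outside the window
    (date(2004, 11, 2), "current"),
    (date(2005, 3, 15), "former"),
    (date(2005, 9, 1), "former"),     # after baseline, ignored
]
status = last_known_smoking_status(records, baseline=date(2005, 6, 1))
```

With these records the status would be "former", because the March 2005 entry is the latest one inside the window.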

Third and lastly, semi-parametric "doubly-robust" estimators have found success in mitigating bias and yielding more accurate estimates of causal effects. T-BEHRT modelling is powerful when combined with doubly-robust estimation to further reduce bias. To be able to conduct the doubly-robust estimation, the T-BEHRT DL neural network first uses a one-layer neural network to predict the propensity score (i.e., the probability of being treated with a particular exposure); next, outcome prediction is conducted with two-layer neural networks.
After the DL components are used for prediction, the propensity score and outcome estimates are passed to the cross-validated targeted maximum likelihood estimation (CV-TMLE) algorithm, a doubly-robust estimator that updates the risk estimates using the propensity score estimates [7]. Propensity scores greater than 0.97 or less than 0.03 were trimmed before the RR was calculated [3].
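A minimal sketch of the trimming step is given below. It assumes that "trimming" means excluding patients whose estimated propensity score falls outside the interval from 0.03 to 0.97, with the boundary values retained; clipping the scores to those bounds is another common reading, and the text does not say which was used. The function name and the dictionary layout are hypothetical.

```python
def trim_by_propensity(rows, lower=0.03, upper=0.97):
    """Keep only rows whose propensity score lies within [lower, upper].

    Assumption: trimming excludes patients rather than clipping their scores.
    """
    return [row for row in rows if lower <= row["propensity"] <= upper]


# Hypothetical pooled test-set predictions for five patients.
pooled = [
    {"patient_id": 1, "propensity": 0.01},
    {"patient_id": 2, "propensity": 0.03},
    {"patient_id": 3, "propensity": 0.40},
    {"patient_id": 4, "propensity": 0.97},
    {"patient_id": 5, "propensity": 0.99},
]
kept_ids = [row["patient_id"] for row in trim_by_propensity(pooled)]
```

Patients 1 and 5 are dropped here because their scores lie outside the bounds.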

Risk ratio estimation for T-BEHRT model
The SBP category of 120-129 mm Hg was considered as the reference exposure group in our study; RR was estimated in comparison to this reference category. For a given comparison to the reference group (e.g. 150-159 mm Hg compared to the reference), the T-BEHRT model was first trained to predict exposure category (propensity score) and outcome, with k-fold cross-validation (k=10) implemented for training and testing [3]. Risk estimates and propensity score predictions across the 10 test sets were pooled and, by utilising a "doubly-robust" post-hoc estimator, Cross Validated Targeted Maximum Likelihood Estimation (CV-TMLE), the risk estimates were further corrected for selection biases, and the RR and 95% confidence intervals were derived [7]. The term "T-BEHRT" and associated model in this paper refers to the estimation framework consisting of (1) estimating risk of outcome and propensity score with DL modelling and (2) updating initial estimates with CV-TMLE in order to estimate RR and 95% CI.
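To make the update step more concrete, the following is a generic textbook-style sketch of a targeted maximum likelihood update for a risk ratio, applied to predictions that are assumed to have already been pooled across the test folds. It is not the authors' implementation: the function names, the toy numbers, and the use of two separate fluctuation parameters (one per exposure arm) are assumptions, and the confidence interval calculation is omitted.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_epsilon(y, offset, h, n_iter=50, tol=1e-10):
    """Newton steps for a no-intercept logistic regression of y on h with a fixed offset."""
    eps = 0.0
    for _ in range(n_iter):
        p = [expit(o + eps * hi) for o, hi in zip(offset, h)]
        score = sum(hi * (yi - pi) for hi, yi, pi in zip(h, y, p))
        info = sum(hi * hi * pi * (1.0 - pi) for hi, pi in zip(h, p))
        if info == 0.0:
            break
        step = score / info
        eps += step
        if abs(step) < tol:
            break
    return eps


def tmle_risk_ratio(a, y, g, q1, q0):
    """a: 1 for the comparison SBP group, 0 for the reference group; y: outcome (0/1);
    g: estimated P(a = 1 | covariates); q1, q0: initial predicted risk under each exposure."""
    exposed = [i for i, ai in enumerate(a) if ai == 1]
    reference = [i for i, ai in enumerate(a) if ai == 0]
    # Each arm gets its own fluctuation parameter, fitted on the patients observed in that arm.
    eps1 = fit_epsilon([y[i] for i in exposed], [logit(q1[i]) for i in exposed],
                       [1.0 / g[i] for i in exposed])
    eps0 = fit_epsilon([y[i] for i in reference], [logit(q0[i]) for i in reference],
                       [1.0 / (1.0 - g[i]) for i in reference])
    # Update the initial risk predictions for every patient under both exposures.
    q1_star = [expit(logit(q) + eps1 / gi) for q, gi in zip(q1, g)]
    q0_star = [expit(logit(q) + eps0 / (1.0 - gi)) for q, gi in zip(q0, g)]
    rr = (sum(q1_star) / len(q1_star)) / (sum(q0_star) / len(q0_star))
    return rr, q1_star, q0_star


# Toy pooled predictions for eight hypothetical patients (illustrative values only).
a = [1, 1, 1, 1, 0, 0, 0, 0]
y = [1, 0, 1, 0, 0, 1, 0, 0]
g = [0.6, 0.5, 0.7, 0.4, 0.5, 0.3, 0.6, 0.4]
q1 = [0.30, 0.20, 0.35, 0.25, 0.28, 0.22, 0.31, 0.26]
q0 = [0.25, 0.15, 0.30, 0.20, 0.22, 0.18, 0.27, 0.21]
rr, q1_star, q0_star = tmle_risk_ratio(a, y, g, q1, q0)
```

After the update, the inverse-propensity-weighted residuals sum to approximately zero within each exposure arm; this is the property that links the updated risk estimates to the propensity score estimates.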

Risk ratio estimation for logistic regression model
Logistic regression (LR) modelling was used as the conventional approach in this work. The RR was estimated with the direct standardisation method [8]. As an example, to estimate the effect of 150-159 mm Hg on cardiovascular outcomes with respect to the reference exposure, the trained LR model predicted risk with exposure for all patients set to the categorical variable representing 150-159 mm Hg and predicted risk with exposure similarly set to the reference group. The RR was derived as the ratio of the averages of these two sets of predictions. For theoretical guarantees, we implemented k-fold cross-validation (k=10) for causal estimation [9]. RR was calculated as the average of RR estimations on the 10 individual test