Ethics statement
The TRUE-HF study (NCT05008692) was conducted under approved protocols from the University Health Network (UHN) Research Ethics Board (Toronto, Canada; REB no. 20-5205). Written informed consent was obtained from all participants before enrollment. Details of the study protocol are available21 and a summary is provided herein. The statistical analysis plan is included in Supplementary information.
Retrospective analyses of the All of Us Research Program used de-identified data accessed through its controlled-access platform. The program operates under a centralized institutional review board and all participants provided written informed consent for data collection and secondary research use. Analyses complied with program policies and applicable ethical and regulatory requirements.
Study participants of the TRUE-HF observational study
The TRUE-HF study (NCT05008692) enrolled outpatients with HF receiving care at the UHN (Fig. 1). Eligible participants were aged ≥18 years and could adequately comprehend English independently or with a caregiver's help. Research coordinators provided study information at the time of informed consent and contacted patients for follow-up.
During the initial enrollment visit (study entry), we guided patients through setting up their Apple Watch and taught them how to use it. The appropriate electronic case report form (eCRF) and electrocardiogram (ECG) applications were downloaded and made accessible on iPhone. Apple Watch ECG has only been validated for patients aged >22 years; therefore, only participants older than 22 years were asked to download the Apple ECG app. Participants interacted with an eCRF iOS mobile application. The data gathered in the application were not used to treat the patient. All patients underwent CPET, comprehensive bloodwork, clinical examination and a supervised 6MWT during study-entry and end-of-study clinic visits.
All demographic and clinical measurements recorded during the study-entry and end-of-study clinic visits were captured in Research Electronic Data Capture, ensuring data transcription for the study cohort. In collaboration with Apple developers, an eCRF iOS mobile application was developed using Swift to communicate study information, conduct daily surveys and gather HealthKit wearable data from patients securely and in de-identified form for the duration of the study. The wearable-derived data were not used to inform clinical decision-making.
Daily surveys captured unplanned healthcare utilization events during the free-living observation period. Patients were instructed to complete daily surveys assessing the following symptoms: increasing shortness of breath, leg swelling, palpitations, chest pain, light-headedness and fainting. In addition, the daily surveys assessed whether a patient required any of the following in the last 24 h: changes to medication, intravenous furosemide, unscheduled health visit, emergency room visit and/or hospital admission. We also conducted monthly fitness assessments, in which patients were instructed to complete an unsupervised 6MWT and Tecumseh cube test. Instructional videos could be used asynchronously to support these assessments. We excluded patients with nonadherence, defined as wearing their Apple Watch <1 d throughout the 90-d free-living period.
In deriving prediction models, there is no hypothesis test to guide sample size calculation. Therefore, we aimed to calculate sample size requirements for various CI widths around clinically acceptable measures of model discrimination. The study's sample size was determined using a classification-based approach to predict a >10% decline in pVO2, a threshold previously associated with worse outcomes in patients with HF23,24,25,26. With an assumed AUROC of 0.70, a sample size of n = 200 with completed CPETs was estimated to provide a robust lower bound of the CI for the study objective, split into model development (n = 150) and held-out test (n = 50) sets. To prevent analytical bias, model evaluation on the held-out test set was prespecified and conducted only after the study was completed and all participants had exited follow-up.
External validation cohort: All of Us
The external validation cohort was constructed using data from the NIH All of Us Research Program v8 to validate the unplanned healthcare utilization experiments. Initially, 1,664 participants within the All of Us dataset had a documented diagnosis of HF and were available for analysis using wearable data from Fitbit devices. Among these, 400 individuals were excluded because of incomplete or missing EHR data necessary for event adjudication and accurate cohort characterization (Extended Data Fig. 2).
To align the clinical severity of the All of Us cohort with the TRUE-HF population, we restricted the cohort to patients with documented prior unplanned healthcare utilization. We defined unplanned events as inpatient hospitalizations (excluding planned procedures) or intravenous furosemide administration40. For each participant, the study-entry date was set as the later of the following two dates: the discharge date of their qualifying unplanned healthcare event or the first day of available wearable sensor data collection. To ensure adequate data availability, we excluded participants with <30 d of wearable data after study entry. Participants were then filtered to ensure adequate wearable data coverage, defined as at least 40% daily measurement coverage during the observational period after study entry, resulting in the exclusion of additional participants.
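The study-entry rule and the two wearable-data filters described above can be sketched as follows. This is a minimal illustration rather than the study code: the function names are hypothetical, and "daily measurement coverage" is assumed here to mean the fraction of observed days that contain any wearable data, which the text does not spell out.

```python
from datetime import date

def study_entry_date(discharge: date, first_wearable_day: date) -> date:
    """Study entry is the later of the discharge date and the first wearable day."""
    return max(discharge, first_wearable_day)

def passes_wearable_filters(days_available: int, days_with_data: int,
                            min_days: int = 30, min_coverage: float = 0.40) -> bool:
    """Apply the <30-d exclusion, then the 40% coverage requirement.

    Assumption: coverage = days with any wearable data / days available
    after study entry.
    """
    if days_available < min_days:
        return False
    return days_with_data / days_available >= min_coverage
```

For example, a participant discharged on 1 March whose wearable data begin on 1 February would enter on 1 March under this rule.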
Finally, we defined the observation endpoint ('end-of-study' visit) as either the occurrence of a second unplanned healthcare utilization event within 120 d or the completion of a follow-up period that matched the TRUE-HF cohort median duration (approximately 94.5 d), with a maximum of 120 d. Participants who did not meet either endpoint criterion were excluded, resulting in a final external validation cohort comprising 193 individuals. Patient demographics are summarized in Extended Data Table 2.
Wearable data and feature engineering
The following data were collected, with informed consent from study participants, from Apple Watch through HealthKit during the approximately 90-d free-living period (that is, excluding days of baseline and end-of-study follow-up clinic visits): step count, exercise time, distance traveled, stand time, active energy burned, basal energy burned, heart rate, heart rate variability and O2 saturation. These variables were chosen because they were the most frequently and consistently recorded during the interim analysis of the monitoring period; details of each are outlined in Apple HealthKit41.
A standardized summarization protocol addressed the varying temporal resolutions across data types. First, abnormal data record errors were removed using an outlier approach: data with values >3 s.d. from the population mean for each data type were removed. Next, we constructed a representation of the wearable data that could integrate large-scale data for downstream use. Specifically, we normalized the disparate data streams by synthesizing 90-min aggregated metrics (mean, median, minimum, maximum and s.d.) of the HealthKit variables (defined above); the sum was used instead of the mean to more accurately capture the overall exercise quality of these HealthKit data types during the 90-min time window. To address gaps in the 90-min summary data, we employed intrapatient forward-filling imputation, which conservatively estimates variable trajectories during periods of sensor nonrecording and preserves the autoregressive integrity of the time series.
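The three steps above (outlier removal, 90-min summarization and forward filling) can be illustrated with a small standard-library sketch. It is not the published preprocessing code: the function names are hypothetical, samples are assumed to arrive as (minute offset, value) pairs for a single patient and data type, and the s.d. is computed as a population s.d.

```python
from statistics import mean, median, pstdev

WINDOW_MIN = 90  # summary window length in minutes

def remove_outliers(values, pop_mean, pop_sd, k=3.0):
    """Drop values more than k s.d. from the population mean for this data type."""
    return [v for v in values if abs(v - pop_mean) <= k * pop_sd]

def summarize_windows(samples, n_windows):
    """samples: (minute_offset, value) pairs. Returns one summary dict per
    90-min window, or None where the sensor recorded nothing."""
    buckets = [[] for _ in range(n_windows)]
    for minute, value in samples:
        idx = int(minute // WINDOW_MIN)
        if 0 <= idx < n_windows:
            buckets[idx].append(value)
    return [
        None if not b else {
            "mean": mean(b), "median": median(b), "min": min(b),
            "max": max(b), "sd": pstdev(b), "sum": sum(b),
        }
        for b in buckets
    ]

def forward_fill(summaries):
    """Intrapatient forward fill: carry the last observed window into gaps."""
    filled, last = [], None
    for s in summaries:
        if s is not None:
            last = s
        filled.append(last)
    return filled
```

Windows before the first recorded value stay empty in this sketch, because forward filling never borrows from the future.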
TRUE-HF framework details
Using wearable data together with patient-specific clinical information such as sex, race, age, prescription dosages, weight and height, our model predicts an individual patient's cardiopulmonary fitness and changes in their fitness over time. All nine wearable-derived features presented in Fig. 1b were included as model inputs.
Our new method leveraged a contextualized DL model to retain and analyze temporal trends across 30 d of patient-wearable data, providing near-continuous daily monitoring through next-day predictions. To achieve this, our model incorporated three distinct components: (1) it contextualized temporal representations of the data using a bottom-up approach, extending from 90-min intervals to full-day aggregation; (2) it integrated patient-specific clinical information directly into the wearable data features, allowing for adaptive feature calculations; and (3) it explicitly considered temporal constraints, recognizing that daily activities are influenced by preceding days, and used this to make ongoing predictions for each day.
The TRUE-HF model (Extended Data Fig. 1) processed 30 d of wearable data, starting with sequential 90-min summaries and assembling them into larger time windows, allowing the model to learn temporal relationships at different temporal resolutions while improving processing efficiency. This approach is embodied in the model architecture, a bottom-up variant of the transformer model, which optimizes the feature map and reduces the temporal resolution through pooling37,42,43.
HealthKit data are first tokenized by one-dimensional convolutions to collapse features along the temporal domain44. The input is then processed by a TRUE-HF block consisting of a transformer layer followed by a pooling layer. The transformer layer learns relationships within the temporal resolution of each TRUE-HF block. Subsequently, the pooling layer aggregates pairs of consecutive time points (that is, from 90 min to 180 min). Using four consecutive TRUE-HF blocks, we effectively analyzed time resolutions of 90 min, 180 min, 360 min, 720 min and 1,440 min (daily resolution). The final prediction layer then aggregated information across the previous 30 d, inclusive, to predict the current day's measurement.
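The resolution schedule can be verified with simple arithmetic: 30 d of 90-min summaries is 480 time points, and four pairwise poolings halve this to 30, one per day. The sketch below assumes one token per 90-min summary after tokenization and uses mean pooling purely for illustration; the text does not specify the pooling operator, and both function names are hypothetical.

```python
def resolution_schedule(base_minutes=90, n_days=30, n_blocks=4):
    """Return (window length in minutes, number of time points) at the input
    and after each block's pooling layer."""
    tokens = n_days * 24 * 60 // base_minutes
    schedule = [(base_minutes, tokens)]
    for _ in range(n_blocks):
        base_minutes *= 2  # each pooling layer merges two consecutive points
        tokens //= 2
        schedule.append((base_minutes, tokens))
    return schedule

def pool_pairs(sequence):
    """Average consecutive pairs (mean pooling is an assumption)."""
    return [(sequence[i] + sequence[i + 1]) / 2
            for i in range(0, len(sequence) - 1, 2)]

print(resolution_schedule())
```

Under these assumptions the schedule is (90, 480), (180, 240), (360, 120), (720, 60) and (1440, 30), matching the five resolutions listed above.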
To enhance the model's understanding of wearable data, TRUE-HF incorporates patient-specific clinical information (described above), enabling it to learn different operations based on input attributes rather than treating all inputs equally. We achieved this by augmenting TRUE-HF with a feature-wise linear modulation block that modulates activations in the neural networks based on clinical details45. The model used only demographic information from the baseline clinic visit to maintain temporal causality.
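Feature-wise linear modulation scales and shifts each activation channel using parameters computed from a conditioning input, here the clinical details. The following is a minimal sketch of that operation with plain lists; the single affine layers that produce the scale (gamma) and shift (beta), and all names and weights, are illustrative assumptions rather than the TRUE-HF configuration.

```python
def linear(weights, bias, inputs):
    """Plain affine map: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def film(activations, clinical, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise modulation: gamma(clinical) * activation + beta(clinical)."""
    gamma = linear(w_gamma, b_gamma, clinical)
    beta = linear(w_beta, b_beta, clinical)
    return [g * a + b for a, g, b in zip(activations, gamma, beta)]

# Toy example: two activation channels conditioned on two clinical inputs.
out = film(
    activations=[1.0, 2.0],
    clinical=[1.0, 0.0],
    w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 0.5],
    w_beta=[[0.0, 0.0], [1.0, 0.0]], b_beta=[0.0, 0.0],
)
```

Because gamma and beta depend on the clinical vector, two patients with identical wearable activations can receive different modulated features.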
Finally, by incorporating a causal self-attention mechanism, we explicitly constrained temporal learning to move forward only (autoregression) in the TRUE-HF framework42,46,47. This mechanism introduces autoregressive properties into temporal learning. It ensures that predictions for any given day are influenced only by data from that day and preceding days. We leveraged causal attention to enable additional semi-supervised training, as described below47.
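The effect of causal masking on attention weights can be shown in a few lines. This is a generic sketch of the standard technique, not the TRUE-HF implementation: positions may attend only to themselves and earlier positions, so masked entries receive zero weight before normalization.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def causal_attention_weights(scores):
    """Row-wise softmax over raw attention scores, with future positions masked."""
    n = len(scores)
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        exps = [math.exp(scores[i][j]) if mask[i][j] else 0.0 for j in range(n)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With uniform scores, time point 0 attends only to itself, time point 1
# splits its weight over points 0 and 1, and so on.
w = causal_attention_weights([[0.0] * 3 for _ in range(3)])
```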
Model training
All iterative model training was conducted exclusively within the first 154 patients, whereas the final 63 patients were used exclusively for held-out testing. We excluded the days of the study-entry and end-of-study clinic visits from training.
To mitigate the lack of daily clinical CPET and 6MWT outcome labels, we applied semi-supervised learning targeting linearly approximated values of each test across the study48,49,50.
In our study, explicit daily labels were absent, and only baseline and end-of-study clinic visits provided clinical status for our target outcomes (CPET pVO2 or clinical 6MWT). To establish a reasonable approximation of our target outcome (pVO2 or clinical 6MWT) for each of these 30-d windows, we used linear interpolation of clinical outcomes recorded at the initial study-entry assessment and follow-up visits, yielding daily outcomes (Supplementary Fig. 2).
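Linear interpolation between two clinic measurements gives one approximate target per day. A minimal sketch, assuming exactly two visits separated by a known number of days; the function name and the example values are hypothetical.

```python
def interpolate_daily_labels(entry_value, end_value, n_days):
    """Return approximate daily targets for days 0..n_days inclusive,
    on the straight line between the two clinic measurements."""
    if n_days <= 0:
        raise ValueError("n_days must be positive")
    slope = (end_value - entry_value) / n_days
    return [entry_value + slope * day for day in range(n_days + 1)]

# Hypothetical patient: pVO2 of 20.0 at study entry and 17.0 at a visit 90 d later.
labels = interpolate_daily_labels(20.0, 17.0, 90)
```

In this example the interpolated target on day 30 is approximately 19.0, one-third of the way between the two measured values.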
The final TRUE-HF model was an ensemble of ten models, each trained using the same TRUE-HF framework but with different random seeds. The average prediction from these models was used to derive TRUE-HF predictions. The TRUE-HF model uses exclusively wearable data and clinical data to predict future states, never past states.
Extending the TRUE-HF model to All of Us
The All of Us dataset provided per-minute wearable measurements exclusively for HeartRate and StepCount, whereas other important features of the original TRUE-HF model (ActiveEnergyBurned, BasalEnergyBurned, Distance, AppleStandPlusTime, AppleExerciseTime, OxygenSaturation and HeartRateVariabilitySDNN) were unavailable.
To address these differences, we employed a knowledge-distillation approach, specifically a teacher–student training strategy51,52. In this approach, predictions generated by a more comprehensive teacher model (TRUE-HF model) served as training targets (pseudo-labels) for a streamlined student model that accommodated the reduced feature set. Given the substantial feature gap between the original TRUE-HF model and the All of Us-compatible variant, we introduced a 'teacher-assistant' model to facilitate knowledge transfer and mitigate performance degradation53.
This teacher-assistant model retained all original wearable features but used a reduced clinical feature set aligned with the All of Us cohort. For each training batch, pseudo-labels were generated from a randomly selected member of the original TRUE-HF ten-model ensemble. Subsequently, the ensemble of trained teacher-assistant models provided pseudo-labels to train the final All of Us-compatible TRUE-HF-RS model, which relies exclusively on HeartRate, StepCount and the reduced clinical feature set.
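The per-batch pseudo-labeling step can be sketched as below. This is a schematic only: teachers are stood in for by plain callables, the function name is hypothetical, and the loss and optimizer that would consume the pseudo-labels are omitted.

```python
import random

def pseudo_labels(batch, teachers, rng):
    """Pick one ensemble member at random for this batch and use its
    predictions as training targets for the student."""
    teacher = rng.choice(teachers)
    return [teacher(x) for x in batch]

# Toy stand-ins for two ensemble members (real teachers would be trained models).
teachers = [lambda x: x + 1.0, lambda x: x + 2.0]
targets = pseudo_labels([1.0, 2.0], teachers, random.Random(0))
```

Drawing a different teacher for each batch exposes the student to the spread of the ensemble rather than to a single member's predictions.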
All TRUE-HF and TRUE-HF-RS models were trained exclusively on the training set of the TRUE-HF cohort (n = 154). The All of Us Research Program was used exclusively for external validation.
Model validation and outcomes
We compared our TRUE-HF pVO2 and TRUE-HF 6MWTD against clinically measured CPET pVO2 and 6MWTD, as well as against Apple VO2Max and sixMinuteWalkTestDistance, respectively. To predict the CPET value from the end-of-study clinic visit, we used wearable data collected over the 30 d preceding it. This ensured that all model inputs reflected free-living conditions, unaffected by structured CPET or tests conducted on the visit day. Apple VO2Max and Apple sixMinuteWalkTestDistance were collected through our iOS mobile application. Note that certain conditions or medications that limit heart rate may cause the Apple VO2Max algorithm to overestimate, as communicated in Apple's user interface19,54.
To further assess our predictions' accuracy in detecting pVO2 changes over time, we measured the model's ability to detect declines at the end-of-study clinic visit relative to the study-entry clinical measurements. An end-of-study drop in CPET pVO2 was defined as a ≥10% reduction from the study-entry visit to the end-of-study clinic visit; this threshold was chosen because a ≥6% to 10% decrease in pVO2 is associated with an increased risk of medium-to-long-term hospitalization or death in patients with HF23,24,25,26. To classify patients as having a ≥10% drop in pVO2, we calculated the percentage difference between the last model prediction (that is, the TRUE-HF prediction the day before the clinic visit) and study-entry CPET pVO2. The same evaluation strategy was used for the TRUE-HF 6MWTD model, where we examined correlation measures and classified a decline in 6MWTD (10% reduction in distance walked).
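The decline classification reduces to a percentage difference against the study-entry measurement. A minimal sketch with hypothetical function names and example values:

```python
def percent_drop(baseline, later):
    """Percentage reduction from baseline; positive values indicate a decline."""
    return 100.0 * (baseline - later) / baseline

def is_meaningful_decline(baseline, later, threshold_pct=10.0):
    """Flag a decline of at least threshold_pct relative to baseline."""
    return percent_drop(baseline, later) >= threshold_pct

# Hypothetical values: study-entry CPET pVO2 of 20.0 and a last model
# prediction of 18.0 give exactly a 10% drop, which meets the threshold.
flag = is_meaningful_decline(20.0, 18.0)
```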
Our secondary objective was to evaluate the association between TRUE-HF-predicted declines in daily pVO2 and unplanned healthcare utilization during the 90-d follow-up period. This association was compared with those of traditional static risk factors measured at baseline and of clinical models (MAGGIC, SHFM and PREDICT-HF). We defined unplanned healthcare utilization as hospitalization, unscheduled clinical visits or urgent intravenous furosemide treatment occurring between the study-entry and the end-of-study visits. The first prediction made by our TRUE-HF pVO2 model required 30 d of data. Hence, this objective was evaluated only among patients free from unplanned healthcare utilization in the first 31 d of our study.
Explainability and feature analysis
We examined the impact of removing structured monthly exercise sessions. All wearable data collected during these sessions were masked by excluding measurements within a 90-min window around the start and end times. A 30-d window was chosen before final analyses (described in the statistical analysis plan). We also evaluated how input window length affected accuracy by comparing shorter windows (10 d or 20 d) with the 30-d window using retrained models and zero-shot inference. Saliency analyses on the combined TRUE-HF model quantified feature importance, averaging saliency values daily for visualization. To assess feature contributions, we trained and compared (1) a fully connected neural network with clinical baseline variables and (2) a model using only wearable data.
Statistical analysis
The analyses and methods were prespecified before data evaluation and conducted solely on the held-out test set (n = 63), consisting of the last 63 patients who successfully completed an end-of-study clinic visit. We followed the STROBE and MI-CLAIM reporting guidelines55,56.
The primary analysis compared observed CPET pVO2 values with our model's predictions. We used Spearman's coefficient to measure the rank-based association between observed and predicted values. Pearson's r was employed to quantify linear correlation. Together, these two correlation measures capture complementary aspects of the model's predictive fidelity. The m.a.e. was chosen for its interpretability in the same units as the outcome. AUROC was used to evaluate the diagnostic accuracy of correctly predicting a meaningful drop in outcome measurements.
To estimate AUROC CIs, we employed 1,000 stratified bootstrap resamples. Differences between models were assessed with DeLong's test, and a two-sided P < 0.05 was considered significant.
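A stratified bootstrap resamples cases and controls separately, so every replicate keeps the original class balance. The sketch below is a standard-library Python illustration with a percentile interval; it is not the pROC code used in the study, the function names are hypothetical, and the percentile method is an assumption.

```python
import random

def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for AUROC, resampling positives and negatives separately."""
    rng = random.Random(seed)
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    stats = []
    for _ in range(n_boot):
        boot_pos = rng.choices(pos, k=len(pos))  # resample with replacement
        boot_neg = rng.choices(neg, k=len(neg))
        stats.append(auroc([1] * len(boot_pos) + [0] * len(boot_neg),
                           boot_pos + boot_neg))
    stats.sort()
    lower = stats[round((alpha / 2) * (n_boot - 1))]
    upper = stats[round((1 - alpha / 2) * (n_boot - 1))]
    return lower, upper
```

Resampling within each class guarantees that no replicate is missing positives or negatives, which matters when events are rare in a small test set.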
The secondary analysis used time-varying, extended, Simon and Makuch's Kaplan–Meier cumulative risk curves to assess the association between an observed drop of ≥10% in a patient's TRUE-HF daily pVO2 and unplanned healthcare utilization57. In this analysis, the cohort was stratified into two groups: (1) patients with a ≥10% reduction in daily pVO2; and (2) patients with <10% reduction in daily pVO2. We constructed cumulative risk curves for unplanned healthcare utilization that accounted for the time-varying nature of our independent variable, that is, the continuous change in the daily pVO2 covariate, facilitating a longitudinal comparison of event-free survival between these groups. An extended Cox's proportional hazards model quantified the HR and the strength of association between time-varying occurrence of pVO2 drops (scaled in 10% drops) in TRUE-HF daily pVO2 and subsequent clinical events58. Sensitivity analyses involving minimal covariate adjustment were conducted. Because of the limited sample size, these analyses used single covariate adjustments rather than comprehensive multivariate adjustments.
For the secondary analysis, the AUROC was computed on each patient's maximum percentage drop in model prediction, measured from the first model prediction to the daily predictions preceding any event or end-of-study clinic visit (censoring). DeLong's one-tailed test was conducted to evaluate the a priori hypothesis that continuous monitoring methods would surpass static measures in discriminative performance. Multiple testing correction was applied using the Benjamini–Hochberg method (target false discovery rate = 0.05). Landmark-based, time-dependent AUROC analysis59 was conducted during external validation to assess the model's ability to predict outcomes based on the largest percentage drop identified up to each landmark time t. This analysis was repeated at 10-d intervals, each with a 30-d outcome window.
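Two of the computations above are simple enough to sketch directly: the per-patient maximum percentage drop used as the AUROC score, and the Benjamini–Hochberg step-up procedure. This is an illustrative Python version (the study reports using R for the correction); the function names are hypothetical, and the prediction list is assumed to be already truncated at the event or censoring day.

```python
def max_percent_drop(predictions):
    """Largest percentage drop from the first prediction to any prediction
    in the (already truncated) series; 0.0 if the value never falls."""
    first = predictions[0]
    return max(100.0 * (first - p) / first for p in predictions)

def benjamini_hochberg(p_values, fdr=0.05):
    """Return a reject flag per hypothesis using the BH step-up rule:
    reject the k smallest p-values, where k is the largest rank with
    p_(k) <= (k / m) * fdr."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            max_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[i] = True
    return rejected
```

For a hypothetical series of daily predictions 20.0, 19.0, 16.0 and 18.0, the score is the 20% drop reached on the third day, even though the value partly recovers afterwards.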
Qualitative results for the trends observed with TRUE-HF predictions were created using LOWESS-smoothed trajectories, generated by forward-filling censored or missing data across the entire study window using the last observed value before censoring or event.
Data preprocessing and model development were conducted in Python using Pandas (v1.5.3), NumPy (v1.21.3) and PyTorch (v2.0.0). Correlation analyses, m.a.e. calculations and Simon and Makuch's Kaplan–Meier estimations were conducted with SciPy (v1.7.1) and lifelines (v0.28.0). AUROC, DeLong's test, the Benjamini–Hochberg correction and Cox's proportional hazards models were implemented in R using the pROC (v1.18.5), survival (v3.6-4) and timeROC (v0.4) packages. TRUE-HF model architecture definitions, wearable data preprocessing and inference workflows have been made publicly available at https://github.com/mcintoshML/TRUEHF.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.



