Ethics statement
Data collection and analysis commenced following approval and waiver of consent by the Institutional Review Board of Stanford University (Protocol #60342, March 2021). Additional data from the University of Pennsylvania were sourced after retrospective collection was deemed IRB exempt by the University of Pennsylvania Health System (Protocol #852332, November 2022).
Computational hardware and software
MRI DICOM files were pre-processed on siloed HIPAA-compliant n2 instances on the Stanford Nero–Google Cloud platform. Specifically, we used an 8-core virtual machine with 52 GB of memory and 6 TB of attached solid-state storage. Data from the UK Biobank were pre-processed on the Stanford Sherlock High Performance Computing Cluster, using 24 CPU cores (Intel Xeon Gold 5118, 2.30 GHz). Anonymized reports were tokenized on a local encrypted desktop using 48 CPU cores (AMD Threadripper, Lambda Computers). All models were trained on the Stanford Sherlock High Performance Computing Cluster using servers with 4× Nvidia A100 GPUs, each with either 40 GB or 80 GB VRAM, and 64 CPU cores (AMD Epyc). External validation on data from the University of Pennsylvania was performed on the Penn CUBIC High Performance Computing Cluster on a single Nvidia A40 GPU with 10 CPU cores. Additional external tests took place on the Penn Advanced Research Computing Center (PARCC) Betty cluster, on a single Nvidia Blackwell B200 GPU with 10 CPU cores. Hyperparameter optimization experiments were run on servers with a variety of GPU resources (Nvidia V100, 32 GB VRAM; Nvidia H100, 80 GB VRAM; Nvidia A100, 40 GB/80 GB VRAM; Nvidia P100, 32 GB VRAM; Nvidia Blackwell RTX 6000 MaxQ, 96 GB VRAM). We used the PyTorch deep-learning library (v.1.11.0) and the PyTorch Lightning framework (v.1.8.6)70. Major Python packages used in this work include numpy (v.1.21.2), pydicom (v.2.0.0), transformers (v.4.4.2) and stanza (v.1.5.0).
Datasets
Specifics of the pre-processing pipelines for both the MRI scans and the free-text reports are detailed in the ‘Dataset pre-processing’ section of Supplementary Information. Briefly, from each unique MRI study, relevant scans were extracted (4CH, 3CH, 2CH and SAX cine sequences) as 4D arrays and stored within a single hdf5 file. Free text from the reports was segmented into individual sentences using the stanza natural language processing pipeline, tokenized using the standard BERT auto-tokenizer, and the resulting anonymized numeric arrays were stored in a single indexed json file27,71. Across the pre-training datasets, fine-tuning datasets, external test datasets and the UK Biobank, we included 65,492 individuals with ~550,156 unique videos across different view planes and cross-sections.
Clinical CMR dataset
The complete clinical CMR dataset comprised 19,122 unique individuals. Cardiac MRI scans were sourced from 17,088 individual patients from a consortium of academic hospital systems based in the United States (Stanford Healthcare, UCSF, MedStar). Cine MRI scans were procured through Bunkerhill Health (San Francisco, CA) as de-identified DICOM files, and associated radiology reports were sourced as a single csv file (IRB Protocol #60342, March 2021). The full pre-training dataset consisted of 293,110 unique 4CH, 3CH, 2CH and SAX videos. The scans were performed as part of routine clinical practice and reports were generated by board-certified physicians with specific expertise in cardiac MRI. Sequences were acquired on a range of scanners including those manufactured by Siemens (Siemens Healthcare), General Electric (GE) and Philips (Philips Healthcare), resulting in substantial variance in the number of frames per slice, imaging resolution and reconstruction methods (Supplementary Table 2). Demographics, where available, are detailed in Supplementary Table 1. The data were first separated into pre-training and downstream datasets in an approximate 75:25 split at the patient level. For the pre-training split, we further divided the data into training and validation sets with an approximate 66:33 split. Similarly, for the downstream split (intended to be used as a labelled fine-tuning dataset for clinical tasks of interest) we further divided the data into training, validation and testing datasets with an approximate 50:25:25 split. We did not selectively exclude patients from this dataset; however, a fraction of the DICOM files were obtained as duplicates or were corrupted and were subsequently discarded. Supplementary Fig. 1 details the data splits and enumerates the excluded studies at each stage.
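The splits described above operate at the patient level, so that no patient contributes scans to more than one split. A minimal sketch of this idea in plain Python (the function name, seed and use of `random` are illustrative assumptions, not the paper's code):

```python
import random

def patient_level_split(patient_ids, fractions=(0.75, 0.25), seed=42):
    """Split patient IDs into disjoint groups by the given fractions.

    Splitting at the patient level (rather than the scan level) prevents
    scans from the same patient leaking across pre-training and
    downstream splits. Fractions and seed here are illustrative.
    """
    ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    splits, start = [], 0
    for frac in fractions[:-1]:
        end = start + round(frac * len(ids))
        splits.append(ids[start:end])
        start = end
    splits.append(ids[start:])  # remainder goes to the final split
    return splits

# Example: an approximate 75:25 pre-training vs downstream split.
pretrain, downstream = patient_level_split(range(1000))
```

All scan-level files are then assigned to whichever split their patient landed in.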
Cardiac MRI scans from an additional 2,033 individual patients were secured from the University of Pennsylvania Health System (IRB exempt, Protocol #852332, November 2022). These scans were performed as part of routine clinical practice and acquired on scanners manufactured by Siemens and GE. Data from the University of Pennsylvania were used solely for external testing. While rule-based automated data labelling methods have been used in the past, these have been superseded by large language models72. Building on our previous work exploring the zero-shot capabilities of large language models for clinical text, we used a publicly available large language model (MedGemma, 27-billion-parameter variant) to parse free-text reports generated as part of routine clinical practice into pre-defined ‘disease labels’ for the disease diagnosis tasks73,74. Specific prompts, parameters, performance comparisons vs human annotators, and a selection of random non-curated reports with critique of the deep-learning-predicted labels are detailed in Supplementary Fig. 5.
UK Biobank cardiac MRI cohort
Cine bSSFP cardiac MRI sequences from 45,623 participants were sourced from the UK Biobank (Project ID: 71226). SAX sequences were available for 11,005 participants and comprise stacks of 8–10 individual slices. One slice was available for each of the 4CH, 3CH and 2CH scans. This amounted to a total of 257,046 unique videos available for analysis. Sequences in the UK Biobank were acquired on a clinical 1.5 Tesla scanner using a standardized protocol (MAGNETOM Aera, Syngo Platform VD13A, Siemens Healthcare)56. As part of this protocol, the overwhelming majority of ventricular volumes and functional metrics were calculated through automated contouring of the ventricular endocardium and epicardium without manual expert quality control56,75. For fine-tuning and transfer-learning experiments to estimate LVEF%, we split the UK Biobank dataset at the participant level into an approximate 80:10:10 split comprising training (n = 31,693), validation (n = 3,938) and hold-out test datasets (n = 3,938).
ACDC dataset
The ACDC dataset is a publicly available cardiac MRI dataset of 100 patients from the University Hospital of Dijon, France28. Each SAX sequence was paired with patient-level non-overlapping labels (n = 20 each) for hypertrophic cardiomyopathy, previous myocardial infarction, dilated cardiomyopathy, abnormal right ventricles and normal controls. The scans were acquired on either a 1.5 Tesla (Siemens Aera, Siemens Healthcare) or 3.0 Tesla (Siemens Trio Tim, Siemens Healthcare) scanner with a conventional SSFP sequence under breath hold with gating.
Kaggle Data Challenge dataset
The 2015 Kaggle Data Science Bowl released data from 700 patients compiled by the National Institutes of Health and the Children's National Medical Center, and was, at the time, an order of magnitude larger than any cardiac MRI dataset previously described. Patients were recruited from the United States and scans were performed in the Washington DC area. While demographic splits for the dataset are not available, the original data were sourced from multiple hospital systems across a wide range of age groups, containing both normal and diseased hearts. The competition closed on 14 March 2016, but data from 697 cases remain publicly available in DICOM format39. 2CH, 4CH and SAX cine sequences were available for use, along with expert annotations for left ventricular end-systolic and end-diastolic volumes. The entirety of the available dataset was used for external validation as is, without any quality control.
Neural network architectures
We tested vision encoder architectures including 3D residual convolutional networks and video vision transformers. We settled on using an implementation of a multiscale vision transformer (mViT) with 36.3 million trainable parameters as our video encoder after experiments showed superior generalization and embedding quality26. Vision transformers have recently emerged as a performant alternative to convolutional neural networks, especially in the setting of large-scale self-supervised pre-training76,77. Vision transformers retain the skip connections seen in traditional convolutional networks, but are also able to attend to local and global features of an image in earlier stages78. The mViT architecture is a vision transformer designed specifically for video data, which forgoes the successive layers of convolutional operations seen in typical convolutional neural networks for a single convolutional layer that divides the input video into a linear sequence of overlapping cubes. These linear elements are processed by 16 layers of stacked transformer modules, allowing the network to effectively attend to distant input features. Specific to the mViT architecture is a sequential series of pooling and scaling operations that enables the network to attend to simple visual features at high resolution in early layers, followed by complex high-dimensional relationships at a coarser resolution in deeper layers. Consequently, compared with other extensions of 2D-image transformers to the video domain, mViT by design has a stronger temporal inductive bias.
While more computationally expensive than comparable convolutional networks, mViT is more efficient than comparable vision transformers, requiring remarkably less pre-training data to achieve state-of-the-art results on typical action recognition datasets. Finally, compared with traditional convolutional neural networks, mViT has shown superior performance on large video action recognition datasets despite having fewer trainable parameters26.
We elected to use a pre-trained BERT model for our text encoder27. Unlike earlier language models, BERT is trained using a ‘bidirectional’ approach, in which the model learns the structure and context of human language by attending to sentences in both the left-to-right and right-to-left directions. Specific details of the pre-training methods for BERT are given in the original paper27. We used a 12-layer variant of BERT base, with 12 attention heads and a hidden dimension of size 768, for a total of 110 million trainable parameters. We tested a combination of different pre-trained weights including those from the original publication, weights fine-tuned on the MIMIC dataset, and weights from a model trained on biomedical abstracts from PubMed with a custom vocabulary of 30,522 tokens79,80 (Supplementary Fig. 3).
Pre-training framework
We built on previous attempts at learning visual representations using naturally occurring pairings of 2D medical imaging and textual data, extending these concepts to the spatiotemporal video-like nature of cardiac MRI scans14,15,16,17,19,20. Two parallel encoders were trained: one for processing the MRI cine sequences and the other for processing the subsampled text from paired radiology reports. Self-supervised transformer networks in particular have shown superior performance on downstream tasks compared with traditional supervised methods76,81,82. We used an implementation of mViT with Kinetics-400-initialized weights for the vision encoder, and a pre-trained BERT model for the text report encoder. Specifically, we used weights from BERT pre-trained on abstracts of biomedical publications on PubMed with a custom vocabulary79. Data from 8,513 patients (9,427 scans and paired reports) were used for training, and a separate set from 4,194 patients (4,646 scans and paired reports) was used for validation.
We employed randomized sequential data augmentation schemes (AugMix) to stochastically sample and layer a series of chained transformations including but not limited to resizing, solarization, shear, translation and random rotation of videos in the spatial dimensions, all while preserving the same augmentations along the temporal dimension for temporal consistency83. Uniform temporal subsampling greatly improved downstream performance and generalizability. We augmented the radiology reports by randomly sampling 5 sentences from the full report for each scan per training step. The output of each encoder was passed through a one-layer linear projection head to yield a pair of 512-dimensional embeddings. These low-dimensional, 512-dimensional embeddings are a compressed numeric representation of the information contained within the input MRI scan and paired text report.
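The text-side augmentation above (a fresh 5-sentence subset of the report at each training step) can be sketched as follows; the function name and the choice to preserve sentence order are our assumptions for illustration:

```python
import random

def subsample_report(sentences, k=5, seed=None):
    """Randomly sample up to k sentences from a segmented report.

    Each training step draws a different subset, so the text encoder
    sees varied views of the same report. We keep the sampled
    sentences in their original order (an assumption on our part).
    """
    rng = random.Random(seed)
    if len(sentences) <= k:
        return list(sentences)
    idx = sorted(rng.sample(range(len(sentences)), k))
    return [sentences[i] for i in idx]

report = [f"sentence {i}" for i in range(12)]
subset = subsample_report(report, k=5, seed=0)
```

In the actual pipeline, the sampled sentences are then tokenized and passed to the BERT encoder.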
Previous work has also shown the importance of large batch sizes for effective contrastive representation learning81. To test this, we pre-trained models with batch sizes of 16, 32 and 128 video–text pairs. For the UK Biobank LVEF prediction task, we found that fine-tuning from the larger-batch-size pre-trained models led to improved downstream results (Supplementary Fig. 2). While computational budgets did not allow for a detailed hyperparameter search with the larger batch sizes, we note that the downstream benefits did not appear to be clinically significant for this specific task. Nevertheless, this remains an area for future exploration.
Vision-only self-supervised methods can be challenging to apply where scans from multiple visually distinct view planes exist for the same patient. We therefore focused our efforts on text-to-video approaches, given the success of text-supervised visual representation learning across radiology and action recognition14,16,84,85. We considered approaches such as Contrastive Language-Image Pre-training (CLIP); however, these are limited by a short context length suited to captions rather than the longer, largely unstructured paragraphs typical of cardiac MRI reports85. Similar to the work of ref. 16, we elected to use an asymmetric bidirectional implementation of the InfoNCE loss to maximize the mutual information between each MRI video–text report pair16,22. The contrastive losses used are essentially the log-loss of an n-way classifier that predicts the correct pairing of MRI scan and report (where n = batch size). The first loss function is a video-to-text contrastive loss for the ith pair, where vi represents a video embedding and ui represents a text embedding of the ith video–text pair. N here represents the number of video–text pairs in a complete batch being evaluated.
$${l}_{i}^{\left({\bf{v}}\to {\bf{u}}\right)}=-\log \frac{\exp \left(\left\langle {{\bf{v}}}_{i},{{\bf{u}}}_{i}\right\rangle /\tau \right)}{{\sum }_{k=1}^{N}\exp \left(\left\langle {{\bf{v}}}_{i},{{\bf{u}}}_{k}\right\rangle /\tau \right)}$$
(1)
The second loss function is a similarly structured text-to-video contrastive loss. The tunable temperature parameter \(\tau\) controls the strength of penalties on hard negative pairs sampled during training86.
The final loss was defined as a weighted combination of the two losses averaged over all positive video–text pairs in each batch of data. The scalar weight is given by λ.
$${\mathscr{L}}=\frac{1}{N}{\sum }_{i=1}^{N}\left(\lambda {l}_{i}^{\left({\bf{v}}\to {\bf{u}}\right)}+\left(1-\lambda \right){l}_{i}^{\left({\bf{u}}\to {\bf{v}}\right)}\right)$$
(2)
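Equations (1) and (2) can be sketched in plain Python as a toy illustration (list-based vectors and loop-based softmax; the actual training code operates on batched PyTorch tensors, and the default `tau` and `lam` values below are ours, not the paper's):

```python
import math

def info_nce(v, u, tau=0.07, lam=0.5):
    """Bidirectional InfoNCE over a batch of paired embeddings.

    v, u: lists of N equal-length embedding vectors (videos, texts).
    tau is the temperature; lam weights the video-to-text direction
    (the lambda in equation (2)).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    n = len(v)

    def direction(a, b):
        # Mean over i of -log softmax_i(<a_i, b_k> / tau), i.e. an
        # n-way classification loss picking the true partner b_i.
        total = 0.0
        for i in range(n):
            logits = [dot(a[i], b[k]) / tau for k in range(n)]
            log_z = math.log(sum(math.exp(x) for x in logits))
            total += -(logits[i] - log_z)
        return total / n

    return lam * direction(v, u) + (1 - lam) * direction(u, v)

# Toy example: perfectly matched one-hot pairs give a near-zero loss.
v = [[1.0, 0.0], [0.0, 1.0]]
loss = info_nce(v, v, tau=0.1)
```

Mismatched pairs drive the loss up, which is what pushes each video embedding toward its own report and away from the others in the batch.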
We additionally implemented a ‘flooding’ regularization approach to prevent the training loss (\({\mathscr{L}}\)) from approaching zero87. We set the flood level (scalar value given by b) to a training loss of 0.05 to allow for better generalization. The final loss (\(\widetilde{{\mathscr{L}}}\)) is thus given by:
$$\widetilde{{\mathscr{L}}}=\left|{\mathscr{L}}-b\right|+b$$
(3)
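Equation (3) is a one-liner in code; the sketch below shows its effect. Above the flood level the loss is unchanged, and below it the loss is reflected back above b, so gradient descent oscillates around b instead of driving the training loss to zero:

```python
def flood(loss, b=0.05):
    """'Flooding' regularization (equation (3)).

    Above the flood level b the loss passes through unchanged; below
    it, the sign of the gradient flips, keeping training loss near b.
    b=0.05 matches the flood level reported in the text.
    """
    return abs(loss - b) + b
```

For example, a raw loss of 0.30 is unchanged, while a raw loss of 0.02 is mapped back up to 0.08.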
The specific pre-trained weights and vocabulary used for initializing the text encoder, batch size, augmentation scheme, InfoNCE temperature parameter and flood regularization were critical for model convergence88. The final model was pre-trained with a batch size of 32 per GPU for 600 epochs. The first 6 layers of the BERT text encoder were frozen, and the full network was trained with a learning rate of 4.8 × 10−5 using the AdamW optimizer with weight decay set to 1 × 10−6 and eps set to 1 × 10−8. We decayed the learning rate by a factor of 0.1 at 300 epochs. Checkpoints were saved every 10 epochs during the pre-training process and the last checkpoint was used for fine-tuning on downstream clinical tasks. The total time taken for pre-training was 13 days and 14 h (4 × 80 GB Nvidia A100 GPUs). The ability of the vision transformer encoder to cluster different disease conditions without any additional explicit supervised training was visualized using the uniform manifold approximation and projection (UMAP) algorithm initialized with default values89.
Multi-instance self-attention and downstream evaluation
A gated multiview self-attention network was trained to assign an attention value (ak) to each MRI view embedding produced by the primary vision encoder13,31. For each embedding within a bag of k embeddings, a high score after softmax activation (near 1) indicates that a particular MRI view plane is highly informative for the downstream diagnostic task. Conversely, a low score (near 0) indicates that the MRI view plane has little to no diagnostic value. For classification tasks, each input embedding was additionally passed through a LayerNorm function before the forward pass into the self-attention blocks (Supplementary Fig. 6)90 (wT, attention scoring vector; V, view-level weight parameters; U, view-level weight parameters; hj, low-dimensional embeddings; ⊙, element-wise product; tanh, tanh activation function; sigm, sigmoid activation function; N, total number of MRI view embeddings for a particular study).
$${a}_{k}=\frac{\exp \left\{{{\bf{w}}}^{\top }\left(\tanh \left({\bf{V}}{{\bf{h}}}_{k}^{\top }\right)\odot \mathrm{sigm}\left({\bf{U}}{{\bf{h}}}_{k}^{\top }\right)\right)\right\}}{{\sum }_{j=1}^{N}\exp \left\{{{\bf{w}}}^{\top }\left(\tanh \left({\bf{V}}{{\bf{h}}}_{j}^{\top }\right)\odot \mathrm{sigm}\left({\bf{U}}{{\bf{h}}}_{j}^{\top }\right)\right)\right\}}$$
(4)
We made use of an attention pooling mechanism to average the embeddings from all MRI views weighted by their predicted attention scores, returning a single 512-dimensional embedding. This embedding can be treated as a ‘feature representation’ of the complete MRI study for a particular downstream task of interest. For each downstream classification task of interest, we used a binary classification head with a sigmoid activation function, as disease labels are generally not mutually exclusive in the setting of cardiovascular disorders. For downstream tasks involving regression of a numeric variable, we replaced the binary classification head with a single output neuron with a linear activation function.
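The gated attention of equation (4) followed by the attention pooling described above can be sketched in pure Python (the paper's implementation is in PyTorch; the toy dimensions and variable names here are ours):

```python
import math

def gated_attention_pool(embeddings, w, V, U):
    """Gated multi-instance attention (equation (4)) plus
    attention-weighted pooling of per-view embeddings.

    embeddings: list of N view embeddings (length-d lists).
    V, U: gate weight matrices (m x d); w: length-m scoring vector.
    Returns (softmax attention scores, pooled study-level embedding).
    """
    def matvec(M, x):
        return [sum(m * xj for m, xj in zip(row, x)) for row in M]

    def sigm(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Unnormalized attention logit for each view embedding.
    logits = []
    for h in embeddings:
        gated = [math.tanh(a) * sigm(b)
                 for a, b in zip(matvec(V, h), matvec(U, h))]
        logits.append(sum(wi * g for wi, g in zip(w, gated)))

    # Softmax over views yields the attention scores a_k.
    z = sum(math.exp(l) for l in logits)
    scores = [math.exp(l) / z for l in logits]

    # Attention-weighted average -> single study-level embedding.
    d = len(embeddings[0])
    pooled = [sum(a * h[j] for a, h in zip(scores, embeddings))
              for j in range(d)]
    return scores, pooled

views = [[1.0, 0.0], [0.0, 1.0]]  # two toy 2-dim view embeddings
scores, study_embedding = gated_attention_pool(
    views, w=[1.0], V=[[1.0, 0.0]], U=[[1.0, 0.0]])
```

The tanh branch captures signed feature responses while the sigmoid branch acts as a learned gate; their product lets uninformative views be suppressed before the softmax.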
LVEF regression task
We tested two modes of training for LVEF% prediction: (1) ‘fine tuning’, where the last linear layer of the vision encoder and the classifier head are trainable, and (2) ‘transfer learning’, where the vision encoder, linear layer and classifier heads are all trainable. ‘Fine tuning’ allows a degree of flexibility in the way embeddings are generated but keeps the vision encoder frozen to make use of the learned representations. In the ‘transfer learning’ setting, the network starts from the learned representations; however, since the entire network is unfrozen, it is possible to ‘overwrite’ these parameters with each update of the training process. For these experiments, we initialized the vision encoder with the contrastive pre-trained weights (ours) or Kinetics-400 weights (baseline), onto which we attached the regression head as described above.
We fine-tuned our pre-trained checkpoints with 32-bit precision using the AdamW optimizer, with the learning rate set to 1 × 10−4 and the default value of 0.01 for weight decay. We explored different augmentation schemes and achieved superior validation performance with AugMix on limited hyperparameter sweeps with 10% of the training data83. For all experiments involving fine-tuning with subsets of the available data, we used a manual seed value for random subsampling to ensure reproducibility of results. We made use of all available 4CH, 2CH, 3CH views and a random subsample of 50% of SAX views per study, with no manual screening for quality control. We elected to train our regression models with a Huber loss function, and we used mean squared errors and mean absolute errors as performance metrics91. We additionally calculated the AUROC for diagnosing heart failure on the basis of an LVEF cut-off of 40%. We trained models for a maximum of 100,000 steps on GPUs with at least 16 GB of VRAM each. For the experiments described in Fig. 3a,d, configuration files were generated for each experimental setup and were trained in parallel across numerous GPUs on Stanford Sherlock.
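The Huber loss used for LVEF% regression is quadratic for small residuals and linear for large ones, which reduces the influence of outlying labels. A minimal per-sample sketch (the `delta=1.0` default is the common convention and an assumption on our part; the paper does not state its value here):

```python
def huber(pred, target, delta=1.0):
    """Per-sample Huber loss.

    Quadratic within +/- delta of the target, linear beyond it, so a
    few badly mislabelled LVEF% values cannot dominate the gradient
    the way they would under a pure squared-error loss.
    """
    r = abs(pred - target)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)
```

For example, a 0.5-point residual is penalized quadratically (0.125) while a 5-point residual is penalized only linearly (4.5 rather than 12.5 under squared error).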
Disease classification task
We define both ‘fine tuning’ and ‘transfer learning’ as above, and used the same network architecture initialized with Kinetics-400 weights as our baseline. We fine-tuned our pre-trained checkpoints with the same overall settings as described above for the regression tasks, apart from the use of a weight decay value of 5 × 10−4 and the addition of a LayerNorm function for the embeddings before the forward pass through the multi-instance self-attention modules to aid convergence. We empirically used AugMix as our data augmentation strategy, given the successes noted above. We made use of all available 4CH, 2CH, 3CH and SAX views per study with no quality control or screening. We used a binary cross-entropy loss function with a sigmoid activation, weighted by a scalar multiplier equal to the proportion of positive vs negative classes for each disease (calculated using the internal training set prevalences). We used the AUROC as a performance metric, given the considerable class imbalance between positive and negative classes92. For each disease label of interest, we trained models for 24 epochs on GPUs with at least 24 GB of VRAM. For the experiments described in Fig. 4a,b, configuration files were generated for each experimental setup and were trained in parallel across numerous GPUs on Stanford Sherlock. External test data were evaluated on the Penn CBICA cluster on a single Nvidia A40 GPU with 40 GB VRAM, and on the PARCC Betty cluster on a single Nvidia Blackwell B200 GPU with 180 GB VRAM. In addition to the losses and metrics, we saved the predicted probabilities and relative self-attention scores for each view for downstream processing and statistical analyses.
Statistical analyses
We used the torchmetrics (v.1.0.1) package to calculate MSE and MAE for regression tasks, and AUROC values for classification tasks within the training and validation loops. We additionally manually calculated AUROCs as empirical curves in the sensitivity and specificity domain, computed from the predicted probabilities generated by our models93. To compare the performance of fine-tuned classifier models (that is, contrastive pre-trained vs baseline), we calculated non-parametric confidence intervals on the AUROC using DeLong's method (paired)94, following which P values were computed for the mean difference between AUROC curves. Additional analyses were performed to calculate the accuracy for each diagnostic label at different thresholds (optimizing for Youden's statistic, a sensitivity of 0.90 or a specificity of 0.90). Differences between predicted LVEF% values and ground truth were assessed using Bland–Altman plots. Statistical analyses were performed and graphs were plotted using R (v.4.1.0); major packages used included pROC (v.1.17.0), ggplot2 (v.3.3.5) and blandr (v.0.5.1). The online test-set leaderboard webapp was created using shiny (v.1.8.1).
Attention visualizations
For every input scan, we output the raw self-attention tensors from each head of each layer of the MRI vision encoder during evaluation and processed them to yield 65 separate attention heat maps. As described earlier, the spatiotemporal resolution is reduced at each successive stage of the mViT architecture; the self-attention tensors were reduced from an initial spatiotemporal resolution of 8 × 56 × 56 at the first layer to 8 × 7 × 7 at the last few layers. We kept only the attention values from the output patches for the purposes of visualization, and spatiotemporally interpolated these tensors back to a size of 16 × 224 × 224 via nearest-neighbour resampling. These arrays were exported to mp4 files using imageio and the ffmpeg library (Supplementary Figs. 17 and 18). Apart from the self-attention heat maps for each input video, we also computed the raw self-attention values from the multi-instance classifier head for relevant downstream tasks. After each scan was passed through the vision encoder, the resultant embedding was assigned a learned raw self-attention score within the multi-instance self-attention modules. We calculated the relative differences in self-attention scores across different view planes for each disease label. These relative self-attention values were visualized as 2D heat maps as shown in Fig. 4c. The multi-instance classifier head self-attention scores showed that the network learns to differentially prioritize view planes for different clinical tasks.
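The nearest-neighbour resampling step above simply repeats each coarse attention value over the block of output pixels it covers. A single-frame sketch (the paper applies the same idea spatiotemporally; the function name and toy sizes are ours):

```python
def nearest_neighbour_resize(grid, out_h, out_w):
    """Nearest-neighbour upsampling of a 2D attention map.

    Maps coarse per-patch attention (e.g. 7 x 7 in the deepest mViT
    stage) back toward the input resolution (e.g. 224 x 224) so it
    can be overlaid on the original video frame.
    """
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]

attn = [[0.1, 0.9], [0.4, 0.6]]            # 2 x 2 coarse attention map
up = nearest_neighbour_resize(attn, 4, 4)  # 4 x 4 upsampled map
```

Each coarse cell becomes a uniform 2 × 2 block in the output, preserving the relative attention pattern without inventing intermediate values.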
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.



