Ethics statement
The UKB study received approval from the North-West Research Ethics Committee (reference number 06/MRE08/65), following the Declaration of Helsinki guidelines. Every participant provided written informed consent and retained the right to withdraw from the study at any time47. This project utilized data from UKB under authorized project numbers 53144, 49978, and 11350. All retinal images included in this paper are shared with UKB’s explicit permission.
Cohort characteristics
UKB is a large-scale prospective population-based study involving 502,355 participants. Initial assessments were conducted at 22 different study centers between January 2006 and October 2010. Participants underwent comprehensive evaluations covering demographics, lifestyle, and clinical measurements; they also provided DNA samples through blood tests and underwent various physical assessments. A notable subset of these volunteers completed an ophthalmic evaluation (23%, n = 117,279). Among them, 67,664 had spectral domain OCT scans between 2006 and 2010 (referred to as ‘instance 0’), while an additional 17,090 were imaged for the first time during a repeat assessment period from 2012 to 2013 (referred to as ‘instance 1’)48. The UKB data is categorized into data fields and categories, with specific category and field ID numbers provided in brackets to ensure reproducibility. To compare the characteristics of participants with and without imaging, statistical tests such as the Kruskal–Wallis test and χ2 test were applied.
Imaging quality control
OCT scans deemed to be of low quality were excluded if their image quality score fell below 40 (reference 15). The scoring system used was the default one provided by the Topcon OCT device. Color fundus photographs (CFPs) were excluded if they received a ‘reject’ quality grade according to a deep learning algorithm called ‘Automorph’; only ‘usable’ and ‘good’ images were kept49. Automorph was developed and trained using the EyePACS-Q dataset, where expert graders evaluated images based on illumination, presence of artifacts, and the ability to diagnose ocular diseases; complete details can be found in the cited publication.
Image preprocessing
For OCT scans, proprietary FDS files were converted into a readable format using the oct-converter Python package50. The 64th OCT slice was selected for further analysis. CFPs were processed through cropping and segmentation using Automorph, and the resulting binary images of the blood vessels were used to fine-tune the autoencoder’s loss function described later. Imaging data was divided into training, testing, and validation sets in a ratio close to 70:15:15 for deep learning model development. Participants’ images were grouped so that each person’s paired scans fell within the same dataset split.
Autoencoder model architecture and training
Image transformations
During the training of the autoencoder, several preprocessing transformations were applied to both OCTs and CFPs. To minimize computational load, all images were resized to 224 × 224 pixels using bicubic interpolation. To enhance the model’s ability to generalize, additional transformations were included during training, such as random horizontal flips, random rotations within a range of 0–15 degrees, and random adjustments to brightness, contrast, and saturation (each ranging from 0.9 to 1.1).
Model architecture
Our adversarial autoencoder (AAE) was designed to generate compressed representations of retinal images (both CFPs and OCT scans). The structure of the AAE aligns its aggregated posterior distribution, q(z), with a chosen prior distribution, p(z). We adopted a Gaussian prior, N(0,1), resulting in latent vectors (z) that closely approximate a multivariate Gaussian distribution51. This characteristic supports many downstream analyses which assume normally distributed variables. Although the training of our AAE did not differentiate between left and right eyes, distinct AAEs were trained separately for OCTs and CFPs. The encoder comprised multiple layers, each containing a 2D convolution, batch normalization, Leaky ReLU activation, a multi-scale residual block (with kernel sizes of 3 and 5), and an efficient multi-scale attention module52,53. At the bottleneck, a linear layer projected the image into a 256-dimensional vector. The decoder mirrored this setup, starting with a linear layer to upsample the vector, followed by transposed convolutions, batch normalization, Leaky ReLU, multi-scale residual blocks, and attention modules52,53. The adversarial discriminator consisted of three linear layers with Leaky ReLU activations.
Loss functions
For training both the OCT-AAE and CFP-AAE, we employed mean absolute error (commonly known as L1 loss) along with adversarial loss. Due to the generally low contrast and narrow background color range of CFPs, capturing fine vascular details can be challenging. Early versions of our model struggled to reconstruct blood vessel structures adequately because they made up only a small portion of the overall pixel count and thus had minimal impact on the total L1 loss (see Supplementary Fig. 90 for an example). Using Automorph, we processed the CFP images to create binary masks highlighting the vascular regions49. On average, blood vessels accounted for just 4.07% of all pixels in CFPs. As a solution, we adjusted the L1 loss to give 12.29 times greater importance to pixels containing blood vessels, enabling these structures to contribute approximately half of the L1 Loss. Additionally, we incorporated a perceptual loss function tailored for the CFP-AAE. Research indicates that the 4th ReLU activation of VGG16 (without batch normalization) delivers superior results in autoencoding tasks54. Therefore, we chose this configuration for our perceptual loss computation. Both original and reconstructed images were normalized using ImageNet parameters before being fed into VGG16 (without batch normalization)55. The perceptual loss was then computed as the mean squared error between the feature maps of the original and reconstructed images. To further standardize the latent vectors, we implemented an adversarial loss component. This was achieved by training a discriminator network to differentiate between the latent vectors produced by the AAE and those from a standard normal distribution. To stabilize training, we applied one-sided label smoothing56. The relative contribution of each component to the total loss was optimized through a Bayesian hyperparameter sweep. Even with weighted loss, terminal arterioles and venules were not fully preserved during reconstruction. However, expanding the latent space dimensionality (e.g., testing 128-dimensions yielded SSIM = 0.876 vs. 256-dimensions = 0.891) showed improvement in SSIM scores. Despite this trend, we chose not to further increase latent space size to maintain the statistical and computational efficiency needed for subsequent analyses.
Image quality metrics
To evaluate image quality, we used Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and L1 loss57.
Hyperparameter search
We
A Bayesian hyperparameter search was carried out using the Weights and Biases platform (version 0.19.8)58, with the SSIM metric on the test dataset serving as the objective function for the optimization. For both the CFP and OCT AAEs, the sweep was used to tune the discriminator learning rate, the autoencoder learning rate, the number of convolutional layers, the optimizer weight decay, and the adversarial loss weighting. In the CFP sweep, the perceptual loss weighting was also included as an additional search parameter. Each hyperparameter configuration was evaluated over 35 training epochs. The best-performing hyperparameters for the CFP-AAE consisted of 5 convolutional layers, a weight decay of 0.01 (using the AdamW optimizer), an autoencoder learning rate of 0.0005, a discriminator learning rate of 0.0001, a perceptual loss weight of 0.2, and a discriminator loss weight of 0.00159. For the OCT-AAE, the optimal configuration comprised 4 convolutional layers, a weight decay of 0.001 (AdamW optimizer), an autoencoder learning rate of 0.0001, a discriminator learning rate of 0.00005, and a discriminator loss weight of 0.0001.
Additional AAE experimental details
All experiments were implemented in PyTorch60. A learning rate warm-up schedule was applied over the first 15 epochs. The batch size was set to 32, and a random seed of 42 was used for reproducibility. The CFP model was trained for 400 epochs, while the OCT model was trained for 250 epochs. For sensitivity analysis, a second OCT and CFP model was trained with a latent dimension of 128, using the same hyperparameters as the main analysis. Training was performed on 4 Tesla T4 graphics processing units (GPUs). Two ophthalmologists (T.H.J. and P.I.S.) reviewed randomly selected reconstructed images to confirm that critical anatomical structures were faithfully preserved.
OCT- and CFP-AAE phenotypic association tests
Latent vectors derived from the left eye (both OCT and CFP modalities) of UKB study participants were used to reduce the dimensionality of downstream analyses. Consistent with prior studies, the left eye was chosen because its images are consistently of higher quality than those of the right eye11,15. The total number of participants varied depending on the specific association test and is reported in the text where feasible, or in the supplementary materials where participant counts differed by phenotype. Both instance 0 and instance 1 ophthalmic images were included in these analyses (unless otherwise specified), and for participants imaged at both time points, only the first set of images was used.
Pearson correlation analyses
The ‘positive control’ Pearson correlation analyses encompassed all traits within the UKB eye measures category (category 100013), excluding right eye measures.
The experimental analyses included a combination of anatomical and functional cardiovascular and neurological traits. Specifically, all traits from the following UKB categories were included: brain MRI (category 1014), cognitive function (category 1005), heart MRI (category 1015), carotid ultrasound (category 101), and selected traits from the physical measures category (category 1006). This yielded a total of 50 cardiovascular traits and 892 neurological traits. The large number of neurological traits reflects the extensive set of derived neuroimaging features available in the UKB. To improve interpretability across thematic groupings in plots, HDBSCAN clustering was applied61. Traits that overlapped across a large number of study participants were grouped into clusters, while the remaining traits were assigned to a ‘not clustered’ category—though they were still included in subsequent analyses. Data were scaled using the interquartile range via ‘RobustScaler’ in SciKit Learn62. The clusterable features comprised 35 ophthalmic traits (n = 62,729), 889 neurological features (n = 12,219), and 37 cardiovascular traits (n = 11,295). A leaf clustering approach was used. Given the smaller number of cardiovascular and ophthalmic traits, a minimum cluster size of 3 was set for these, while a minimum cluster size of 10 was used for neurological traits due to their greater number.
For each analysis, outlier values for cardiovascular, neurological, or ophthalmic measures were excluded—specifically, data points exceeding 2.5 standard deviations from the mean. Pearson correlation coefficients and associated P values were then computed using the SciPy Python package63.
Cox proportional hazards and Welch’s t-test analyses
Disease outcomes were obtained from the ‘First Occurrences’ UKB catalog (category 1712). The UKB first occurrences were generated by mapping Read code information from ‘Primary Care data’ (category 3000), ICD-9 and ICD-10 codes from ‘Hospital inpatient data’ (category 2000), ICD-10 codes from ‘Death Register records’ (fields 40001 and 40002), and ‘self-reported medical condition codes’ (field 20002) reported at baseline or subsequent UKB assessment center visits. The UKB estimates that hospital inpatient and death data are largely complete through 31 May 2022, beyond which data may be incomplete64. Accordingly, 31 May 2022 was used as the censoring date for all time-to-event analyses. For the Cox proportional hazards models, participants who had experienced a given disease event prior to retinal imaging were excluded. For the t-test, the goal was to assess whether latent features differed significantly based on the presence or absence of disease at the time of imaging, and therefore participants with disease events at the time of scan were included. Disease outcomes comprised (1) a positive control set of diseases identifiable on retinal imaging as determined by two ophthalmologists (T.H.J. and P.I.S.), including disorders of the retina, choroid, optic nerve, and lens (19 diseases); (2) all available neurodegenerative diseases (10 diseases); and (3) all available cardiovascular disorders (23 diseases). The full list of diseases examined is provided in Supplementary Table 31.
Cox proportional hazards modeling was performed using the lifelines Python package65. The primary outcome of interest was the association between left eye Ret-AAE embeddings and time-to-event data for ophthalmic, cardiovascular, and neurodegenerative diseases. Covariates included genetic sex and age at the time of scan. t-tests were conducted using the SciPy package in Python63. Welch’s t-test was selected over the standard Student’s t-test, as it is more robust when the two populations have unequal sample sizes or variances66. Results with fewer than 30 samples per group were excluded from the final analysis in both approaches. Baseline associations between Ret-AAE embeddings and Huntington’s disease, spinal muscular atrophy and related syndromes, Alzheimer’s disease, and vascular dementia could not be tested due to insufficient cases (<30) at baseline imaging. Huntington's disease was additionally excluded from Cox proportional hazards analysis for the same reason.
CCA
Canonical correlation analysis (CCA) was employed to investigate the relationship between CFP–OCT-derived embeddings and 251 metabolic biomarkers measured by Nightingale Health (category 220)67. Because of participant overlap between instance 0 and instance 1 metabolic sampling, only instance 0 data were used.
In this analysis, participants lacking imaging data or metabolite samples were excluded. To minimize the impact of extreme outliers, Winsorization was applied to the top and bottom 1% of values68. The analysis employed the SciKit-learn CCA model, with both embeddings and metabolomic data scaled using the CCA’s built-in scaler. Statistical significance of each canonical variate was evaluated via permutation testing: the metabolomic matrix was randomly shuffled 1,000 times while keeping the embedding matrices fixed, generating a null distribution of correlations69,70. The P value was derived by comparing the original dataset’s average CCA correlation against this null distribution. The number of modes tested was selected by examining a Scree plot of the first 10 modes and truncating at the point where the regression coefficient dropped substantially.
SPLS
Alongside CCA, we applied a canonical SPLS (Sparse Partial Least Squares) analysis using the mixOmics R package (version 6.32.0)71. Unlike standard partial least squares, SPLS selects only the most relevant features based on sparsity constraints—making it better suited for highly correlated data like metabolomics. This approach served as a robust validation of our CCA findings. Data preprocessing mirrored that used for CCA. Lasso penalization parameters were tuned via fivefold cross-validation repeated five times, optimizing for maximum correlation between predicted and observed components. For CFP SPLS, we used 2 components: 30 embedding variables and 25 metabolites in component 1, and 10 embedding variables and 10 metabolites in component 2. OCT SPLS used 1 component with 10 embedding variables and 10 metabolite variables. All SPLS analyses were run in canonical mode, and associated visualizations were generated using mixOmics.
Genetic analyses
Choice of GWAS method
We selected REGENIE version 4.1 for genome-wide association studies (GWAS)72. Step 1 utilized UK Biobank microarray data (field 22418), while Step 2 used imputed genetic data (field 22828).
Phenotypes and covariate data
The phenotypes analyzed were latent features extracted from left-eye OCT and color fundus photography (CFP) images for each participant. This resulted in 256 GWAS per participant per imaging modality (512 total GWAS per participant). Covariates included sex (field 31), age at scan date (field 22006), and spherical equivalent refractive error. The latter was calculated using the most reliable autorefraction measurement (field 5276): spherical equivalent = spherical error (field 5085) + 0.5 × cylindrical error (field 5086)15.
Quality control for genetic analyses
Discovery cohort analyses included participants from instance 0 (initial assessment, 2006–2010); replication cohort analyses used those from instance 1 (first repeat visit, 2012–2013). Overlapping individuals were excluded to ensure independence. Only participants of genetically confirmed European ancestry—determined via principal components (field 22006)—were included.
Participant-level quality control excluded individuals with mismatched genetic and reported sex (fields 22001 and 31), sex chromosome aneuploidy (field 22019), close genetic relatives in UKB (field 2202), extreme heterozygosity or missingness outliers (field 22027), or >10% missing genotype data73.
Variant-level filtering was performed using plink274. For microarray data (REGENIE Step 1), criteria included: minor allele frequency ≥ 0.01, minor allele count ≥ 20, ≤10% missingness, and Hardy–Weinberg equilibrium P ≥ 1 × 10−15. For imputed data (Step 2), additional filters included: INFO score ≥ 0.8 post-GWAS, removal of duplicate variants, and Hardy–Weinberg P ≥ 1 × 10−6.
GCTA COJO
We refined significant associations using GCTA-COJO12. Variants on different chromosomes or separated by more than 1,500 kb were assumed independent; otherwise, default GCTA-COJO settings were applied.
Gene set analysis
Gene set enrichment was conducted with MAGMA24, using canonical pathway collections from MSigDB75: BioCarta (292 sets), KEGG MEDICUS (658), PID (196), Reactome (1,787), WikiPathways (885), and legacy KEGG (186). Variants were mapped to genes using dbSNP v135 and NCBI 37.3 annotations. Linkage disequilibrium (LD) reference data came from the European subset of the 1000 Genomes Project76.
Global genetic correlation analysis
We assessed genome-wide genetic correlations between OCT/CFP latent features and cardiovascular or neurological diseases using LD score regression (LDSC)77. Summary statistics for these diseases were obtained from public GWAS studies in GRCh37 that included necessary metadata78–83. Only latent features with mean χ² > 1.02 (indicating polygenicity) were included. Following established protocols77, variants were filtered for presence in HapMap3, MAF > 0.01, INFO > 0.9, and exclusion of strand-ambiguous or duplicate SNPs. LD scores were based on the 1000 Genomes European panel76. The LDSC intercept was left unconstrained. Heritability estimates for polygenic latent features were also derived via LDSC, though without formal significance testing.
LDSC results are not provided here, as this method was not part of the analysis pipeline.
Grad-CAM
To make our learned embeddings easier to interpret, we used Grad-CAM84. In line with earlier studies, we found that visualizations from the earliest convolutional layers lacked meaningful semantic content, while those from the deepest layers were too coarse to pinpoint specific image features85. Therefore, we centered our main analyses on the third convolutional layer for both CFP and OCT imaging modalities, as it offered the best trade-off between spatial detail and semantic richness. When Grad-CAM did not produce a clear or interpretable map, we switched to Layer-CAM or examined nearby convolutional layers instead; such cases are explicitly noted in the corresponding figure legends86. Saliency maps are shown only for embeddings discussed in the text. Whenever feasible, we describe embeddings in terms of their anatomical location to aid understanding. However, due to the abstract nature of embeddings, this is not always possible.
Latent feature traversals
Beyond Grad-CAM, we conducted systematic latent feature traversals using the trained AAE87. For each latent dimension (embedding) zi, we first selected a representative reference image whose value in that dimension was closest to the mean (approximately zero). We then varied only zi by ±4 standard deviations while keeping all other dimensions fixed. These modified latent vectors were passed through the AAE decoder to generate reconstructed images, allowing us to visualize how isolated changes along each latent axis affect the output.
Because the resulting changes can be subtle, we supplemented visual inspection with quantitative difference maps. For CFPs (three-channel RGB images), we calculated the average change in pixel intensity per color channel compared to the reference reconstruction. For OCT images (single-channel), we computed the absolute difference in intensity at each pixel. These maps highlight the image regions most responsive to changes in each latent dimension and help clarify what visual features the model has learned to encode.
A key strength of autoencoding is its ability to detect image patterns that may not be obvious to human observers. As a result, many latent traversals produce diffuse, hard-to-interpret changes in pixel intensity with limited biological or clinical meaning. Only those figures containing human-interpretable features are included in the supplementary material (Supplementary Figs. 14–68).
Annotations for both latent traversals and Grad-CAM saliency maps were independently reviewed and confirmed by two ophthalmologists (T.H.J. and P.I.S.).
Controlling for type 1 error risk
We applied family-wise error rate control using Bonferroni correction to manage the risk of false positives. To define what constitutes a “family” of tests, we examined whether autoencoder embeddings were strongly correlated with one another. As shown in Supplementary Figs. 91 and 92, the embeddings exhibited minimal meaningful correlation. Therefore, each embedding was treated as its own independent family, and multiple testing correction was applied based on the number of tests performed within each embedding. Importantly, this is not a study-wide correction—we did not adjust across all embeddings, as doing so would be overly conservative and lead to excessive type 2 errors, especially given the large scale and hypothesis-free nature of this work. Instead, we evaluate our findings in light of convergent evidence across multiple analyses, report raw P values to support reader interpretation, assess results based on biological plausibility, and, where possible, validate key findings in an independent UK Biobank cohort. While this approach provides a more balanced and nuanced interpretation, the risk of type 1 error remains a consideration, and the main findings would benefit from external experimental validation when suitable multi-omic datasets become available. Bonferroni-corrected significance thresholds for all other analyses are listed in the relevant supplementary table legends.
Statistics and reproducibility
All methods are detailed in the appropriate sections of the paper. Sample sizes varied depending on the analysis and are reported alongside each result where feasible, or otherwise provided in the Supplementary Data. No formal power calculations were used to predetermine sample sizes; instead, all available data that met analysis-specific quality control criteria were included. All analyses were carried out using reproducible computational pipelines, as outlined in the Methods section.
Reporting summary
Additional details about the research design can be found in the Nature Portfolio Reporting Summary associated with this article.



