Dataset description
Dataset overview
We used two publicly available and one private digital pathology datasets of NSCLC stained with haematoxylin and eosin in the binary lung cancer subtyping task: (1) 941 WSIs from TCGA, comprising 467 LUAD and 474 LUSC cases, (2) 1,306 WSIs from the CPTAC, comprising 644 LUAD and 662 LUSC cases, and (3) 3,123 WSIs from Queen Mary Hospital, comprising 2,645 LUAD and 478 LUSC cases. For simplicity, we refer to these three datasets as TCGA, CPTAC and QMH-NSCLC, respectively. QMH-NSCLC was collected under the approval of Institutional Review Board of The University of Hong Kong/Hospital Authority Hong Kong West Cluster (HKU/HA HKW IRB) (UW22-168) with a waiver of informed consent due to the retrospective nature of the study and the use of de-identified data.
Supplementary Tables 1 and 2 summarize the basic characteristics of these three datasets. In the TCGA dataset, each patient is represented by a single slide, whereas in the CPTAC and QMH-NSCLC datasets, each patient is associated with an average of 3.1 and 2.3 slides, respectively. To evaluate TRUECAM’s ability in detecting OOD inputs, we constructed a clinically meaningful dataset consisting of 218 cancer-adjacent normal tissues (defined as solid tissue normal in TCGA) from the TCGA-NSCLC project52, which we refer to as OOD in this paper. Because non-cancerous lung tissues share many biological and morphological characteristics with NSCLC slides, distinguishing them poses a particularly challenging and clinically relevant task for OOD detection.
In addition to the binary NSCLC subtyping task, we used five additional datasets totalling 14,932 WSIs, which encompass diverse subtyping tasks across different organs and varying number of cancer subtypes to further evaluate TRUECAM’s generalizability and scalability (Supplementary Table 2). These datasets include the following: (1) Brain tumour subtyping from TCGA (referred to as TCGA-BRTS)—1,225 WSIs for a three-class brain tumour subtyping task across 617 patients, (2) BRACS—547 WSIs for a three-class breast cancer subtyping task across 189 patients53, (3) TCGA-BRCA—1,049 WSIs for breast cancer subtyping across 984 patients54, (4) TCGA-RCC—925 WSIs for three-class kidney cancer subtyping across 895 patients38 and (5) TCGA-OT—11,186 WSIs from 9,149 patients for a 46-class pan-cancer slide-level classification task defined based on the OncoTree classification system31.
Image preprocessing
Regardless of distinct model input requirements, all WSIs were segmented and split into appropriate tiles using a standard preprocessing pipeline implemented in Slideflow (version 2.1.0)55. Directly feeding large-gigapixel WSIs into a neural network is computationally impractical, so image tiles of manageable sizes were extracted to align with the model’s specifications. For example, foundation models like UNI, CONCH and Prov-GigaPath accept image tiles of size 256 × 256 pixels at ×20 magnification; as we follow the Inception-v3 model developed in ref. 16, it requires image tiles of 299 × 299 pixels at ×10 magnification; unlike UNI, CONCH and Prov-GigaPath, the latest foundation model TITAN accepts image tiles with a size of 512 × 512 pixels at ×20 magnification. The preprocessing pipeline also includes several standard steps to create deep learning-ready datasets. Otsu’s thresholding56 was applied to differentiate the foreground (tissue) from the background (empty slide), followed by greyspace filtering to remove background tiles. All extracted tiles were then stain-normalized and saved in TFRecord format for model training, inference and performance evaluation. No pathologist-annotated regions of interest were used; instead, we relied solely on slide-level diagnoses and assigned a uniform weak label to all tiles extracted from a given WSI when needed for model training. Regarding the five additional datasets, we adopted the same data preprocessing approach as previous studies, such as tissue segmentation and tiling28,29,31.
Dataset configuration
In the set-up for the specialized model based on Inception-v3, we randomly partitioned the TCGA dataset, allocating 65% of the patients’ data for training, 15% for validation and 20% for calibration and testing. We repeated the random partitioning procedure 20 times, producing 20 independently trained models for evaluation. Within the 20% allocated for calibration and testing, data from 100 patients were used for calibration to establish CP (referred to as the TCGA calibration dataset), while data from 89 patients was set aside for testing (referred to as the TCGA testing dataset). To assess the variability of CP performance, the 20% split was randomized 500 times, generating distinct calibration and testing datasets for each model. As a result, we report model performance based on 20 models × 500 evaluations per model. The CPTAC dataset was fully used for external validation to assess the effectiveness of TRUECAM in scenarios that need transferability, with data from 100 patients designated for calibration to establish CP (referred to as the external CPTAC calibration dataset) and the remaining 316 patients reserved for testing (referred to as the external CPTAC testing dataset). Similarly, the CPTAC dataset was randomly partitioned 500 times to evaluate the variability of CP performance.
In the scenario of using foundation models for NSCLC subtyping, which involves training a separate model using latent tile representations generated by foundation models, the TCGA and CPTAC datasets were used independently for the entire model training and evaluation process. Each dataset was partitioned into three subsets: 65% for training, 15% for validation and 20% for calibration and testing. Consistent with the strategy used for Inception-v3 evaluation, this partitioning and model training process was repeated 20 times, with each involving 500 random splits to establish and evaluate CP.
For all non-lung cancer subtyping datasets except TCGA-OT, we applied the same procedure as for TCGA (described above), partitioning each dataset into training (65%), validation (15%) and testing (20%). This random split was repeated five times, resulting in five independently trained models for evaluation. We report the mean with 95% confidence interval (95% CI) of performance metrics across these five folds. For TCGA-OT, we followed the data split configuration from ref. 31 and used non-parametric bootstrapping with 1,000 samples to estimate the mean and standard deviation of model performance. For the additional NSCLC subtyping dataset QMH-NSCLC, we used the same 65%/15%/20% split for training, validation and calibration/testing but performed the partitioning only once and applied non-parametric bootstrapping with 150 samples to estimate the mean and standard deviation of model performance.
SNGP
We leveraged SNGP 57,58 to provide a principled estimation of data uncertainty in deep learning models for NSCLC subtyping. SNGP introduces two important modifications to a regular neural network: (1) incorporating spectral normalization into the hidden layers and (2) replacing the conventional dense output layer with a Gaussian process. Incorporating spectral normalization ensures that the transformation from the input x in the original input space to the latent representation in the penultimate layer \(h\left(\right)\) preserves the relative distances. In other words, the distance \({\parallel h(_)-h(_)\parallel }_{{\rm}}\) between representations of any two image data points \(\left({{\boldsymbol}}_{1},{{\boldsymbol{x}}}_{2}\right)\) in the latent space meaningfully reflects their distance \({\left\Vert {{\boldsymbol{x}}}_{1}-{{\boldsymbol{x}}}_{2}\right\Vert }_{{\rm{X}}}\) in the original data space, relative to other data pairs. In practice, adding spectral normalization addresses a problem in training neural networks known as feature collapse59. This issue occurs when OOD data or data from different classes, which are geometrically distant from remaining data in the input space, are unexpectedly mapped to nearby points in the latent space, thus leading to unreasonable uncertainty estimations. Mathematically, the spectral normalization operation is equivalent to requiring the mapping function h(⋅) to satisfy the following bi-Lipschitz condition for any pair of inputs x1 and x258,60:
$${L}_{1}\times ||{{\boldsymbol{x}}}_{1}-{{\boldsymbol{x}}}_{2}|{|}_{{\rm{X}}}\le ||h({{\boldsymbol{x}}}_{1})-h({{\boldsymbol{x}}}_{2})|{|}_{{\rm{H}}}\le {L}_{2}\times ||{{\boldsymbol{x}}}_{1}-{{\boldsymbol{x}}}_{2}|{|}_{{\rm{X}}},$$
(1)
where L1 and L2 are the positive lower and upper Lipschitz bounds (0 < L1 < 1 < L2) of feature extractor h(⋅), and ∣∣⋅∣∣X and ∣∣⋅∣∣H correspond to the distance metrics chosen for the original and latent space, respectively. The lower Lipschitz bound L1 serves to preserve distances in the latent space and the upper Lipschitz bound L2 helps enforce the smoothness and robustness of the neural network by limiting over-sensitive transformations to perturbations in the input space of x.
In SNGP, the bi-Lipschitz constraint is enforced by applying spectral normalization to neural network weights \({\left\{{{\boldsymbol{W}}}_{l}\right\}}_{l=1}^{L-1}\) (ref. 61), where l represents the lth hidden layer of the neural network. During each training step, the spectral normalization method first estimates the spectral norm \(\widehat{\lambda }\approx {\left\Vert {{\boldsymbol{W}}}_{l}\right\Vert }_{2}\) using the power iteration method62 and then normalizes the neural network weights as follows:
$${\widehat{{\boldsymbol{W}}}}_{l}=\left\{\begin{array}{l}c{{\boldsymbol{W}}}_{l}/\widehat{\lambda },\,\,\mathrm{if}\,\,c < \widehat{\lambda }\\ {{\boldsymbol{W}}}_{l},\,\,\mathrm{otherwise}\end{array}\right.,$$
(2)
where c is a hyperparameter used to adjust the exact spectral norm upper bound on \({\left\Vert {{\boldsymbol{W}}}_{l}\right\Vert }_{2}\) (that is, \({\left\Vert {{\boldsymbol{W}}}_{l}\right\Vert }_{2}\le c\)).
SNGP then estimates data uncertainty by leveraging the preserved distances in the latent space, achieved by replacing the dense output layer with a Gaussian process. Consider a dataset consisting of N training points \({\mathcal{D}}={\left\{{{\boldsymbol{x}}}_{i},{y}_{i}\right\}}_{i=1}^{N}\), and let hi(xi) denote the representation of xi in the second-to-last layer of neural network. The output \({{\boldsymbol{f}}}_{N\times 1}={\left[f({h}_{1}),f({h}_{2}),\cdots \,,f({h}_{N})\right]}^{T}\) of Gaussian process follows a multivariate Gaussian distribution:
$${{\boldsymbol{f}}}_{N\times 1} \sim {\boldsymbol{N}}\left({{\boldsymbol{0}}}_{N\times 1},{{\boldsymbol{K}}}_{N\times N}\right),\,\mathrm{where}\,\,\,{{\boldsymbol{K}}}_{i,j}=\exp \left(\frac{-{\left\Vert {h}_{i}-{h}_{j}\right\Vert }_{2}^{2}}{2}\right).$$
(3)
Conditional on the learned latent representation hi, SNGP approximates Gaussian process using the Laplace approximation with the random Fourier feature (RFF) expansion63. Specifically, SNGP approximates the Gaussian process prior in equation (3) with a low-rank approximation to the kernel matrix K = ΦΦT using random features64:
$${{\boldsymbol{f}}}_{N\times 1} \sim {\boldsymbol{N}}\left({{\boldsymbol{0}}}_{N\times 1},\Phi {\Phi }_{N\times N}^{T}\right),\,\mathrm{where}\,\,\,{\Phi }_{i}=\sqrt{\frac{2}{{D}_{L}}}\cos \left(-{{\boldsymbol{W}}}_{L}{h}_{i}+{{\boldsymbol{b}}}_{L}\right),$$
(4)
where Φi is the last layer of the neural network with a dimension DL, WL is a fixed weight matrix randomly generated from \({\mathcal{N}}(0,1)\) and bL is a fixed bias vector randomly generated from a uniform distribution \({\mathcal{U}}(0,2\pi )\). As WL and bL are fixed, the RFF approximation to the Gaussian process prior for the kth logit of the classification problem (in equation (3)) can be reformulated as follows:
$${g}_{k}\left({h}_{i}\right)=\sqrt{\frac{2}{{D}_{L}}}\cos {\left(-{{\boldsymbol{W}}}_{L}{h}_{i}+{{\boldsymbol{b}}}_{L}\right)}^{T}\times {\beta }_{k},\,k=1,\cdots \,,K,$$
(5)
where \({\beta }_{k} \sim {\mathcal{N}}\left(0,{\boldsymbol{I}}\right)\), K denotes the number of classes in the classification problem, and \({\boldsymbol{\beta }}={\left\{{\beta }_{k}\right\}}_{k=1}^{K}\) are the only learnable parameters in the output layer.
Overall, the trainable parameters consist of β along with the weights and biases \({\boldsymbol{\alpha }}={\left\{{{\boldsymbol{W}}}_{l},{{\boldsymbol{b}}}_{l}\right\}}_{l=1}^{L-1}\) in the first L − 1 layers of the neural network. The RFF approximation to Gaussian process reduces the high-dimensional Gaussian process to a standard Bayesian linear model while also producing a closed-form posterior that is end-to-end trainable alongside the rest of the neural network using stochastic gradient descent. After inferring the MAP estimate of \(\widehat{{\beta }_{k}}\), for an unobserved point x*, the first L − 1 layers of SNGP serve as a feature extractor to derive its latent representation \({\Phi }^{* }=\sqrt{\frac{2}{{D}_{L}}}\cos \left(-{{\boldsymbol{W}}}_{L}h({{\boldsymbol{x}}}^{* })+{{\boldsymbol{b}}}_{L}\right)\). Next, the mean and variance associated with the kth logit in the classification problem are computed as follows:
$${\mu }_{k}\left({{\boldsymbol{x}}}^{* }\right)={\left[{\Phi }^{* }\right]}^{T}\widehat{{\beta }_{k}},\,{\sigma }_{k}\left({{\boldsymbol{x}}}^{* }\right)={\left[{\Phi }^{* }\right]}^{T}{\Sigma }_{k}{\Phi }^{* }.$$
(6)
Finally, the predictive distribution for x* is calculated as follows:
$$p\left(\left.{y}^{* }\right|{{\boldsymbol{x}}}^{* }\right)=\mathop{\int }\limits_{s \sim {\mathcal{N}}\left({\mu }_{k}\left({{\boldsymbol{x}}}^{* }\right),{\sigma }_{k}\left({{\boldsymbol{x}}}^{* }\right)\right)}\gamma (s),$$
(7)
where γ denotes the softmax activation function.
CP
CP is a distribution-free approach for constructing prediction sets for arbitrary prediction or classification algorithm. It provides statistical guarantees of a bounded error rate for model outputs, rigorously aligned with the error level specified by developers or service providers under data exchangeability assumption65. Consider a classification problem comprising K classes. Let \({\mathcal{D}}={\left\{{{\boldsymbol{x}}}_{i},{y}_{i}\right\}}_{i=1}^{T}\) be independently and identically distributed (i.i.d.) data points sampled from distribution \({\mathcal{X}}\times {\mathcal{Y}}\), where \({{\boldsymbol{x}}}_{i} \sim {\mathcal{X}}\), \({y}_{i} \sim {\mathcal{Y}}\), yi ∈ {1, ⋯ , K}, and \(\alpha \in \left[0,1\right]\) denotes a specified error level (also referred to as significance level). Then the objective of CP is to construct a prediction set for each data point, denoted as
$$\Gamma :{\mathcal{X}}\to \left\{\,{\rm{subset\; of}}\,\,\,\left\{1,2,\cdots \,,K\right\}\right\},$$
(8)
such that for a new i.i.d. data point \(({{\boldsymbol{x}}}^{* },{y}^{* }) \sim P\left({\mathcal{X}},{\mathcal{Y}}\right)\), the probability that the prediction set fails to contain the truth label is α bounded. This is formally expressed as follows66:
$$\left.1-\alpha \le {{{\mathbb{P}}}}\left(\,{y}^{* }\in \Gamma ({{\boldsymbol{x}}}^{* })| {{\boldsymbol{x}}}^{* }\right)\right)\le 1-\alpha +\frac{1}{T+1},$$
(9)
where the probability is taken over the data \({({{\boldsymbol{x}}}_{i},{y}_{i})}_{i=1}^{T}\cup ({{\boldsymbol{x}}}^{* },{y}^{* })\).
By design, CP guarantees that the prediction set Γ(x*) contains the true label y* with a probability of at least 1 − α. The bounded error rate is achieved by generating a set of likely class labels for the inputs with insufficient certainty in a systematic way, in contrast to the single-value (in regression) or single-label (in classification) outputs produced by regular machine learning models.
Using two separate and disjoint datasets for model training and calibration, the development of an inductive conformal predictor consists of three key steps:
-
1.
Define a nonconformity score. Conformity reflects how similar or conformal a new data point is to the training data67 and is typically quantified using a nonconformity score. In classification tasks, the nonconformity score is oftentimes defined as \(s({{\boldsymbol{x}}}_{i},{y}_{i})=1-\widehat{f}{\left({{\boldsymbol{x}}}_{i}\right)}_{{y}_{i}}\), where \(\widehat{f}{\left({{\boldsymbol{x}}}_{i}\right)}_{{y}_{i}}\) denotes the probability assigned by the trained model \(\widehat{f}\) to the true class yi for the input xi. Thus, the lower the model’s confidence in correctly classifying xi as yi, the higher the corresponding nonconformity score. We used this nonconformity score throughout the study.
-
2.
Determine the nonconformity threshold. We then computed the nonconformity score \(s\left({{\boldsymbol{x}}}_{i},{y}_{i}\right)\) for each data point xi (i = T + 1, ⋯ , T + R) in the calibration set and obtain a set of nonconformity scores: {s(xT + 1, yT + 1), s(xT + 2, yT + 2), ⋯ , s(xT + R, yT + R)}. We ranked these scores in ascending order and determine the threshold \(\widehat{q}\) as the \(\frac{\lceil (R+1)(1-\alpha )\rceil }{R}\) quantile of these ordered scores. This threshold ensures that the resulting prediction sets have a coverage rate of at least 1 − α on new i.i.d. datasets.
-
3.
Generate prediction sets. For a new data point x*, the prediction set is formed by including all class labels k ∈ {1, 2, …, K} whose corresponding scores \(s\left({{\boldsymbol{x}}}^{* },k\right)\) are smaller than or equal to \(\widehat{q}\). Mathematically, we have the following:
$$\Gamma ({{\boldsymbol{x}}}^{* })=\left\{k:s\left({{\boldsymbol{x}}}^{* },k\right)\le \widehat{q}\right\},\,k=1,\ldots ,K.$$
(10)
Unlike regular UQ methods, CP converts the traditional heuristic concept of uncertainty, often represented by softmax outputs, into a statistically meaningful measure in the form of a prediction set. By following the steps described above, CP ensures to generate statistically valid prediction sets that contain the true label with a pre-specified coverage of 1 − α. This approach offers a standardized framework for constructing reliable confidence intervals that enables machine learning systems to deliver robust outputs while upholding responsible and trustworthy scientific inferences.
Deep learning models
To assess the effectiveness of TRUECAM, we evaluated two types of AI architecture: task-specific models (referred to as specialized models) and models based on pathology foundation models (referred to as foundation models). We trained Inception-v3 from scratch, a widely used convolutional neural network architecture for image inference26,27, to serve as a representative classifier for specialized medical imaging tasks. All Inception-v3 models were trained at the tile level, with slide-level classifications obtained by averaging tile-level outputs across each WSI 16. For patients with multiple WSIs, the final patient-level diagnosis was derived by averaging the slide-level classifications across all WSIs for that patient. In this context, we used TCGA for model training and internal validation, while we used CPTAC as an independent external validation dataset to assess TRUECAM’s effectiveness in scenarios requiring model transferability. We further investigated four digital pathology foundation models that exemplify the recent advancements in large, general-purpose models in the field: (1) UNI28, (2) CONCH29, (3) Prov-GigaPath30 and (4) TITAN31. We paired these models with multiple cutting-edge MIL mechanisms (that is, ABMIL68, CLAM38 and TransMIL69) where applicable. As foundation models are pretrained on diverse datasets, offering strong knowledge transferability as an image encoder, and downstream task-specific classifiers are typically lightweight and easy to train, our evaluation focused solely on the setting where each healthcare organization trains and evaluates its own site-specific model using its data. All considered datasets were used to mimic this setting such that each dataset represents a distinct healthcare organization and independently supports the required model training and validation. Our evaluation primarily focused on UNI and CONCH, with additional results provided for Prov-GigaPath and TITAN to showcase the generalizability of TRUECAM’s advantages.
Specialized models based on Inception-v3
We trained three deep learning models based on the Inception-v3 architecture26,27. The first is the original deterministic version, which does not include UQ. The other two models do incorporate UQ techniques: (1) MC Dropout11 and (2) SNGP. Inception-v3 is a 48-layer deep convolutional neural network designed to capture spatial features at varying scales by using parallel convolution operations with different filter sizes within the same layer. To address the vanishing gradient problem and provide additional supervision signals, Inception-v3 integrates an auxiliary classifier to propagate label information to earlier network layers. In addition, Inception-v3 uses bottleneck layers to reduce dimensionality and computational costs. These design features collectively enhance performance and resource efficiency, rendering Inception-v3 widely adopted for a diverse range of general computer vision and medical imaging applications. Supplementary Table 3 provides parameter and training details.
MC Dropout was used as the primary baseline approach for UQ11. By assuming that the prior of neural network weights follows a standard normal distribution, MC Dropout is mathematically equivalent to sampling from a variational posterior consisting of two independent Gaussians with fixed covariance. To implement MC Dropout in Inception-v3, we followed the design adopted by ref. 16. Specifically, we appended two fully connected layers, each containing 1,024 neurons, to the end of the neural network and applied dropout with a rate of 0.1 after each fully connected layer. During inference, each image tile was forward-passed through the dropout-enabled network five times. The standard deviation of the model’s predictions across these five passes was calculated as the measure of tile-level uncertainty. The tile-level predictions and their corresponding uncertainty were then aggregated to produce slide- and patient-level predictions, along with the associated uncertainty estimates.
In addition, we included VBLL as another baseline for UQ. We adopted a post-training approach to optimize a variational last layer atop features extracted from a pretrained neural network. Both the discriminative and generative classifiers were implemented as a two-layer dense neural network with a hidden dimension of 128 and diagonal parameterization. All other hyperparameters for VBLL training followed the configuration described in ref. 32. Performance results for VBLL, MC Dropout and SNGP in the NSCLC subtyping task are summarized in Supplementary Table 8.
We trained specialized models from scratch using a 24 GB NVIDIA RTX 4090 GPU with a batch size of 64, where image tiles were resized to 299 × 299 pixels. We adopted a learning rate scheduler with an initial learning rate of 0.0003 and a decay rate of 0.98 every 512 steps. Each model was trained for four epochs using the Adam optimizer with the first and second moment exponential decay rate setting to 0.9 and 0.999, respectively. During model training, we monitored model performance on the validation dataset and saved the best-performing model for evaluation on the testing data. Standard data augmentation techniques, including random horizontal and vertical flips, 90° rotations, JPEG compression with a 50% probability (quality range, 50–100) and Gaussian blur with a 10% probability (sigma, 0.5–2.0), were used to augment the training process of each model.
Foundation models
TRUECAM seamlessly integrates with pathology foundation models for trustworthy cancer subtyping. In this study, we explored integrating TRUECAM with two types of foundation model: tile-level (or patch-level) models, including UNI28 and CONCH29, and slide-level models, including Prov-GigaPath30 and TITAN31, both using the vision transformer-based architecture as image encoders. All of them are general-purpose models systematically pretrained on diverse sources of histopathology images using self-supervised learning paradigms. Among them, CONCH, Prov-GigaPath and TITAN are visual-language models designed to align image and text modalities within their representation space, whereas UNI is exclusively image-pretrained. These foundation models show exceptional adaptability and knowledge transferability through image encoding, making them highly effective for a wide range of downstream tasks in histopathology, including image classification, region-of-interest retrieval and cell type segmentation.
To adapt tile-level foundation models for cancer subtyping, we used the image encoders of UNI and CONCH to extract tile embedding (that is, latent representations), concatenated the resulting representations for each slide and applied the widely used MIL-based mechanisms, including ABMIL, CLAM38 and TransMIL69, for weakly supervised classification using slide-level diagnostic labels. The ABMIL model architecture and training configuration followed those established in UNI28 and CONCH29. Specifically, ABMIL includes a fully connected layer with a rectified linear unit activation function to transform tile-level features into 512-dimensional embedding, followed by an intermediate layer with size 384. Finally, another fully connected layer maps the attention-pooled slide-level representations to logits and class probabilities through softmax normalization. Dropout was applied at multiple stages: a rate of 0.1 for the input features and 0.25 after each intermediate layer. In addition, we used the CLAM-SB (single-attention branch) and TransMIL models, with parameter configurations following ref. 38 and ref. 69, respectively. To incorporate SNGP, we added spectral normalization to all fully connected layers and replaced the final layer with a RFF-approximated Gaussian process. All MIL-based models were trained via a maximum of 20 epochs using the AdamW optimizer70 with a cosine learning rate scheduler, an initial learning rate of 1 × 10−4 and cross-entropy loss.
For slide-level foundation models, we fine-tuned the slide encoder of Prov-GigaPath and performed linear probing on the slide representations of TITAN for downstream classification tasks as suggested by refs. 30,31. For both models, we appended a Gaussian process layer to the slide-level representations without applying spectral normalization, as the representations extracted from Prov-GigaPath and TITAN inherently preserve distances effectively, due to their dedicated self-supervised training strategy.
EAT
Regardless of the number of subtypes S in a given subtyping task, ambiguous tiles in a WSI were identified and eliminated using an entropy-based metric. While the underlying concept of entropy is shared, specialized models (for example, Inception-v3) and foundation models (for example, UNI, CONCH, Prov-GigaPath and TITAN) differ slightly in their strategies to leverage it.
For the setting of specialized models (for example, Inception-v3), we performed k-means clustering for each tested k (k > S) using the latent representations of all training-set tiles obtained from SNGP. For each resultant cluster i (i = {1, 2, ⋯ , k}), we define the cluster label assignment entropy as \({\mathbb{H}}(k,i)=-{\sum }_{j=1}^{S}p(k,i,j)\log p(k,i,j)\), where p(k, i, j) denotes the fraction of tiles labelled as subtype j in cluster i. This measure quantifies the degree of subtype heterogeneity within a cluster, with higher values indicating greater ambiguity. The set of clusters were then ranked in descending order based on \({\mathbb{H}}(k,i)\), and the top m clusters were designated as ambiguous clusters. A new specialized model was then trained on tiles that are not from these ambiguous clusters. In the inference stage, tiles from a new WSI that were assigned to ambiguous clusters were eliminated and the remaining tiles were processed by the newly trained model for inference (Fig. 1c). The values for k and m were determined based on the validation performance of the retrained model.
In the foundation model setting, estimating data ambiguity for EAT requires a tile-level classifier aligned with the target subtyping task. As the tile-level features extracted from foundation model encoders are not compatible with the input of the specialized model, we used AutoGluon (version 1.1.1)71 to train a lightweight proxy classifier from scratch. AutoGluon treats the extracted tile-level representations as tabular data and searches the design space, which is composed of multiple state-of-the-art models (for example, LightGBM, NeuralNet, XGBoost and ExtraTrees) for cancer subtyping. It then automatically ensembles the weights of these base models to maximize validation performance. The tile label assignment entropy of a tile x is estimated as \({\mathbb{H}}(x)=-{\sum }_{j=1}^{S}p(j,x)\log p(j,x)\), where p(j, x) denotes the probability of assigning x to subtype j by the proxy classifier. Those tiles with their label assignment entropy greater than a threshold were excluded from inference. This threshold was determined based on the validation performance of the MIL model that was used.
UQ
To derive data uncertainty in the context of SNGP, once the model is properly trained, the RFFs for x* can be calculated as \({\Phi }^{* }=\sqrt{\frac{2}{{D}_{L}}}\cos \left(-{{\boldsymbol{W}}}_{L}{h}_{i}({{\boldsymbol{x}}}^{* })+{{\boldsymbol{b}}}_{L}\right)\). With this DL-dimensional latent representation of x*, the tile-level uncertainty estimates can be derived as \(\sqrt{\tau {\left[{{\boldsymbol{\Phi }}}^{* }\right]}^{T}{({{\boldsymbol{\Phi }}}_{N\times {D}_{L}}^{T}{{\boldsymbol{\Phi }}}_{N\times {D}_{L}}+\tau {\boldsymbol{I}})}^{-1}{{\boldsymbol{\Phi }}}^{* }}\), where τ denotes the ridge factor. SNGP requires only a single forward pass of an image tile through the neural network to extract the corresponding RFF features for uncertainty estimation. By contrast, MC Dropout requires multiple forward passes of the same image tile to quantify prediction uncertainty. This makes SNGP more computationally efficient during inference, which is particularly important for large-scale, safety-critical applications with stringent real-time inference requirements like cancer diagnostics.
It is important to note that SNGP and CP differ in their approaches for UQ. Beyond the regular classification probability of x* produced by a deep learning model, SNGP generates an estimated standard deviation σ(x*) as the uncertainty measure for x*. By contrast, CP does not produce any explicit uncertainty measure. Instead, it operates within the class probability space established by the deep learning model and constructs a prediction set Γ(x*) for x* by including labels with the corresponding prediction probabilities beyond a threshold. Thus, SNGP and CP fundamentally differ in how they quantify uncertainty.
For OOD detection and DSC, two scores were considered: (1) a probability-based OOD score and (2) an uncertainty-based OOD score. The probability-based OOD score aggregates tile-level class probabilities and measures the degree of OOD as \(1-\frac{1}{N}{\sum }_{i=1}^{N}{\max }_{k}\,p({\widehat{y}}_{k}| {{\boldsymbol{x}}}_{i})\), where \(p({\widehat{y}}_{k}| {{\boldsymbol{x}}}_{i})\) denotes the probability of tile i being classified as \({\widehat{y}}_{k}\). Unlike the probability-based OOD score, the uncertainty-based OOD score quantifies tile-level prediction uncertainty using the standard deviation estimated by MC Dropout or SNGP. It then uses the first δ tiles with the lowest uncertainty for each patient to represent the patient-level prediction uncertainty. Clearly, the uncertainty-based OOD score offers additional insight beyond what deterministic models can provide, effectively highlighting the inherent decision limitations of deep learning models, as shown by its superior performance in identifying OOD images in the foundation model-based approach.
CRC
CRC extends regular CP to accommodate a broader range of loss functions that assess different aspects of prediction quality beyond miscoverage, such as accuracy, calibration and other task-specific performance metrics72. More concretely, we define the prediction set for x by model \(\widehat{f}\) as \({\Gamma }_{\rho }\left({\boldsymbol{x}}\right)=\{k:\widehat{f}{\left({\boldsymbol{x}}\right)}_{k}\ge 1-\rho \}\), where k ∈ {1, ⋯ , K} denotes all possible class labels and ρ is a threshold that controls the conservativeness of the prediction set. Then, for any bounded loss function l(⋅) that is non-increasing as the prediction set \({\Gamma }_{\rho }\left({\boldsymbol{x}}\right)\) gets larger, CRC provides a statistical guarantee on the loss in the following form:
$${\mathbb{E}}\left[l\left({\Gamma }_{\rho }\left({\boldsymbol{x}}\right),y\right)\right]\le \alpha .$$
(11)
Compared to regular CP, CRC extends CP to control the expected value of any monotone loss function instead of only coverage. The goal of CRC is to determine the lowest possible \(\widehat{\rho }\) based on the calibration dataset, such that the risk control defined in equation (11) holds for unseen data. In the deployment of pathology AI where In-D and OOD data coexist, the observed coverage over the calibration data guides the identification of \(\widehat{\rho }\) via binary search. The threshold \(\widehat{\rho }\) can be iteratively adjusted until the target coverage 1 − α is satisfied in the presence of OOD data. After the optimal \(\widehat{\rho }\) is identified, CRC then applies it to the unseen data such that their coverage is 1 − α guaranteed.
Model fairness evaluation
For model fairness evaluation, we examined the performance gap between the best- and worst-performing subgroups based on sex and race73. Two performance dimensions were assessed: (1) accuracy of model predictions and (2) size of CP prediction sets. Group-level average values of these metrics were calculated to represent the performance of each corresponding subgroup. As the gap value is always non-negative, a smaller gap indicates a fairer model. Sex and race information is available in both the TCGA and CPTAC datasets. TCGA includes 569 male and 372 female patients, while CPTAC comprises 288 male and 127 female patients, with one patient’s sex not reported. For race, categories with fewer than 20 patients are grouped as ‘Others’. Specifically, in TCGA, patients are grouped into Others (n = 98), White (n = 678) and Not reported (n = 165). In CPTAC, patients are grouped into Asian (n = 123), Others (n = 49) and White (n = 244). Demographic information is summarized in Supplementary Table 1.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.



