Dataset description
Dataset overview
For the binary lung cancer subtyping task, we utilized two public and one private digital pathology datasets of NSCLC stained with haematoxylin and eosin: (1) 941 whole-slide images (WSIs) from TCGA, including 467 LUAD and 474 LUSC cases, (2) 1,306 WSIs from CPTAC, including 644 LUAD and 662 LUSC cases, and (3) 3,123 WSIs from Queen Mary Hospital, including 2,645 LUAD and 478 LUSC cases. For brevity, we call these three datasets TCGA, CPTAC, and QMH-NSCLC, respectively. The QMH-NSCLC collection was approved by the Institutional Review Board of The University of Hong Kong/Hospital Authority Hong Kong West Cluster (HKU/HA HKW IRB) (UW22-168), with informed consent waived owing to the retrospective design and the use of de-identified data.
Supplementary Tables 1 and 2 outline the core features of these three datasets. In the TCGA dataset, each patient is represented by one slide, while in the CPTAC and QMH-NSCLC datasets, each patient has an average of 3.1 and 2.3 slides, respectively. To test TRUECAM’s capacity to detect out-of-distribution (OOD) inputs, we built a clinically relevant dataset of 218 cancer-adjacent normal tissues (labeled as solid tissue normal in TCGA) drawn from the TCGA-NSCLC project52, which we refer to as OOD throughout this paper. Since non-cancerous lung tissues share numerous biological and morphological traits with NSCLC slides, telling them apart represents a particularly demanding and clinically meaningful challenge for OOD detection.
Beyond the binary NSCLC subtyping task, we employed five additional datasets totaling 14,932 WSIs, covering diverse subtyping tasks across different organs and varying numbers of cancer subtypes, to further probe TRUECAM’s generalizability and scalability (Supplementary Table 2). These datasets are as follows: (1) Brain tumor subtyping from TCGA (called TCGA-BRTS)—1,225 WSIs for a three-class brain tumor subtyping task across 617 patients, (2) BRACS—547 WSIs for a three-class breast cancer subtyping task across 189 patients53, (3) TCGA-BRCA—1,049 WSIs for breast cancer subtyping across 984 patients54, (4) TCGA-RCC—925 WSIs for three-class kidney cancer subtyping across 895 patients38, and (5) TCGA-OT—11,186 WSIs from 9,149 patients for a 46-class pan-cancer slide-level classification task defined according to the OncoTree classification system31.
Image preprocessing
Irrespective of differing model input requirements, all WSIs were segmented and divided into suitable tiles through a standard preprocessing pipeline implemented in Slideflow (version 2.1.0)55. Feeding entire gigapixel WSIs directly into a neural network is computationally infeasible, so image tiles of manageable dimensions were extracted to match each model’s specifications. For instance, foundation models such as UNI, CONCH, and Prov-GigaPath take image tiles of 256 × 256 pixels at ×20 magnification; following the Inception-v3 model developed in ref. 16, we used image tiles of 299 × 299 pixels at ×10 magnification; unlike UNI, CONCH, and Prov-GigaPath, the more recent foundation model TITAN accepts image tiles of 512 × 512 pixels at ×20 magnification. The preprocessing pipeline also includes several routine steps to produce deep-learning-ready datasets. Otsu’s thresholding56 was applied to separate the foreground (tissue) from the background (empty slide), followed by greyspace filtering to discard background tiles. All extracted tiles were then stain-normalized and stored in TFRecord format for model training, inference, and performance evaluation. No pathologist-annotated regions of interest were used; instead, we relied exclusively on slide-level diagnoses and assigned a uniform weak label to all tiles taken from a given WSI when needed for model training. For the five additional datasets, we followed the same data preprocessing approach used in prior studies, including tissue segmentation and tiling28,29,31.
Dataset configuration
In the setup for the specialized model based on Inception-v3, we randomly split the TCGA dataset, assigning 65% of the patients’ data for training, 15% for validation, and 20% for calibration and testing. We repeated this random partitioning 20 times, yielding 20 independently trained models for evaluation. Within the 20% reserved for calibration and testing, data from 100 patients were used for calibration to establish CP (referred to as the TCGA calibration dataset), while data from 89 patients was held out for testing (referred to as the TCGA testing dataset). To gauge the variability of CP performance, the 20% split was randomized 500 times, producing distinct calibration and testing datasets for each model. Consequently, we report model performance based on 20 models × 500 evaluations per model. The CPTAC dataset was used entirely for external validation to assess how well TRUECAM performs in scenarios requiring transferability, with data from 100 patients set aside for calibration to establish CP (referred to as the external CPTAC calibration dataset) and the remaining 316 patients reserved for testing (referred to as the external CPTAC testing dataset). Similarly, the CPTAC dataset was randomly partitioned 500 times to evaluate the variability of CP performance.
In the scenario involving foundation models for NSCLC subtyping, which entails training a separate model using latent tile representations produced by foundation models, the TCGA and CPTAC datasets were each used independently for the full model training and evaluation process. Each dataset was divided into three subsets: 65% for training, 15% for validation, and 20% for calibration and testing. Consistent with the strategy used for Inception-v3 evaluation, this partitioning and model training process was repeated 20 times, with each iteration involving 500 random splits to establish and evaluate CP.
For all non-lung cancer subtyping datasets except TCGA-OT, we applied the same procedure as for TCGA (described above), dividing each dataset into training (65%), validation (15%), and testing (20%). This random split was repeated five times, producing five independently trained models for evaluation. We report the mean along with the 95% confidence interval (95% CI) of performance metrics across these five folds. For TCGA-OT, we followed the data split configuration from ref. 31 and used non-parametric bootstrapping with 1,000 samples to estimate the mean and standard deviation of model performance. For the additional NSCLC subtyping dataset QMH-NSCLC, we used the same 65%/15%/20% split for training, validation, and calibration/testing but performed the partitioning only once and applied non-parametric bootstrapping with 150 samples to estimate the mean and standard deviation of model performance.
SNGP
We employed SNGP57,58 to deliver a principled estimation of data uncertainty in deep learning models for NSCLC subtyping. SNGP introduces two key modifications to a standard neural network: (1) adding spectral normalization into the hidden layers and (2) substituting the conventional dense output layer with a Gaussian process. Adding spectral normalization ensures that the mapping from the input x in the original input space to the latent representation in the penultimate layer (hleft(right)) preserves relative distances. In other
In practice, the distance ({parallel h(_)-h(_)parallel }_{{rm}}) between the representations of any two image data points (left({{boldsymbol}}_{1},{{boldsymbol{x}}}_{2}right)) within the latent space effectively mirrors their separation ({leftVert {{boldsymbol{x}}}_{1}-{{boldsymbol{x}}}_{2}rightVert }_{{rm{X}}}) in the original input space, when compared relative to other pairs of data. To address a common training issue known as feature collapse59, spectral normalization is introduced. This problem arises when out-of-distribution data or samples from distinct categories—which are widely separated in the input space—get mapped too closely together in the latent representation, resulting in unreliable uncertainty estimates. From a mathematical standpoint, applying spectral normalization is equivalent to imposing a bi-Lipschitz constraint on the mapping function h(⋅), which must hold for any pair of inputs x1 and x258,60:
$${L}_{1}times ||{{boldsymbol{x}}}_{1}-{{boldsymbol{x}}}_{2}|_{{rm{X}}}le ||h({{boldsymbol{x}}}_{1})-h({{boldsymbol{x}}}_{2})|_{{rm{H}}}le {L}_{2}times ||{{boldsymbol{x}}}_{1}-{{boldsymbol{x}}}_{2}|_{{rm{X}}},$$
(1)
Here, L1 and L2 represent the positive lower and upper Lipschitz constants (with 0 < L1 < 1 < L2) for the feature extractor h(⋅), while ∣∣⋅∣∣X and ∣∣⋅∣∣H denote the distance measures used in the original space and the latent space, respectively. The lower bound L1 ensures that distances are maintained in the latent space, whereas the upper bound L2 promotes network smoothness and robustness by preventing excessive sensitivity to small input perturbations of x.
Within the SNGP framework, this bi-Lipschitz requirement is enforced by applying spectral normalization to the network weights ({left{{{boldsymbol{W}}}_{l}right}}_{l=1}^{L-1}) (ref. 61), where l indexes the lth hidden layer. At each training step, the procedure first approximates the spectral norm (widehat{lambda }approx {leftVert {{boldsymbol{W}}}_{l}rightVert }_{2}) via the power iteration method62, and then rescales the weights according to:
$${widehat{{boldsymbol{W}}}}_{l}=left{begin{array}{l}c{{boldsymbol{W}}}_{l}/widehat{lambda },,,mathrm{if},,c < widehat{lambda } {{boldsymbol{W}}}_{l},,,mathrm{otherwise}end{array}right.,$$
(2)
where c is a hyperparameter that controls the specific upper limit placed on the spectral norm ({leftVert {{boldsymbol{W}}}_{l}rightVert }_{2}) (that is, ({leftVert {{boldsymbol{W}}}_{l}rightVert }_{2}le c)).
SNGP then quantifies predictive uncertainty by capitalizing on the distance-preserving latent space, which is achieved by swapping the standard dense output layer for a Gaussian process. Suppose we have a dataset of N training samples ({mathcal{D}}={left{{{boldsymbol{x}}}_{i},{y}_{i}right}}_{i=1}^{N}), and let hi(xi) represent the activation of xi at the penultimate layer. The output vector ({{boldsymbol{f}}}_{Ntimes 1}={left[f({h}_{1}),f({h}_{2}),cdots ,,f({h}_{N})right]}^{T}) of the Gaussian process is governed by a multivariate Gaussian distribution:
$${{boldsymbol{f}}}_{Ntimes 1} sim {boldsymbol{N}}left({{boldsymbol{0}}}_{Ntimes 1},{{boldsymbol{K}}}_{Ntimes N}right),,mathrm{where},,,{{boldsymbol{K}}}_{i,j}=exp left(frac{-{leftVert {h}_{i}-{h}_{j}rightVert }_{2}^{2}}{2}right).$$
(3)
Conditioned on the learned latent features hi, SNGP employs the Laplace approximation alongside the random Fourier feature (RFF) expansion63 to approximate the Gaussian process. Concretely, the prior in equation (3) is approximated by factorizing the kernel matrix K = ΦΦT into a product involving random features64:
$${{boldsymbol{f}}}_{Ntimes 1} sim {boldsymbol{N}}left({{boldsymbol{0}}}_{Ntimes 1},Phi {Phi }_{Ntimes N}^{T}right),,mathrm{where},,,{Phi }_{i}=sqrt{frac{2}{{D}_{L}}}cos left(-{{boldsymbol{W}}}_{L}{h}_{i}+{{boldsymbol{b}}}_{L}right),$$
(4)
where Φi corresponds to the activations of the final layer with dimensionality DL, WL is a fixed weight matrix drawn from ({mathcal{N}}(0,1)), and bL is a fixed bias vector sampled from a uniform distribution ({mathcal{U}}(0,2pi )). Because WL and bL remain fixed, the RFF-based approximation to the Gaussian process prior for the kth logit in the classification task (from equation (3)) can be rewritten as:
$${g}_{k}left({h}_{i}right)=sqrt{frac{2}{{D}_{L}}}cos {left(-{{boldsymbol{W}}}_{L}{h}_{i}+{{boldsymbol{b}}}_{L}right)}^{T}times {beta }_{k},,k=1,cdots ,,K,$$
(5)
where ({beta }_{k} sim {mathcal{N}}left(0,{boldsymbol{I}}right)), K is the total number of classes, and ({boldsymbol{beta }}={left{{beta }_{k}right}}_{k=1}^{K}) are the sole trainable parameters in the output layer.
Altogether, the learnable parameters include β along with the weights and biases ({boldsymbol{alpha }}={left{{{boldsymbol{W}}}_{l},{{boldsymbol{b}}}_{l}right}}_{l=1}^{L-1}) from the first L − 1 layers. The RFF approximation effectively reduces the high-dimensional Gaussian process to a conventional Bayesian linear model, yielding a closed-form posterior that can be trained end-to-end with the rest of the network via stochastic gradient descent. Once the MAP estimate (widehat{{beta }_{k}}) is obtained, for a new test point x*, the first L − 1 layers of SNGP act as a feature extractor to produce its latent representation ({Phi }^{* }=sqrt{frac{2}{{D}_{L}}}cos left(-{{boldsymbol{W}}}_{L}h({{boldsymbol{x}}}^{* })+{{boldsymbol{b}}}_{L}right)). Subsequently, the mean and variance for the kth logit are computed as follows:
$${mu }_{k}left({{boldsymbol{x}}}^{* }right)={left[{Phi }^{* }right]}^{T}widehat{{beta }_{k}},,{sigma }_{k}left({{boldsymbol{x}}}^{* }right)={left[{Phi }^{* }right]}^{T}{Sigma }_{k}{Phi }^{* }.$$
(6)
Finally, the predictive distribution for x* is derived as follows:
$$pleft(left.{y}^{* }right|{{boldsymbol{x}}}^{* }right)=mathop{int }limits_{s sim {mathcal{N}}left({mu }_{k}left({{boldsymbol{x}}}^{* }right),{sigma }_{k}left({{boldsymbol{x}}}^{* }right)right)}gamma (s),$$
(7)
where γ represents the softmax activation function.
CP
CP is a distribution-agnostic methodology for generating prediction sets for any prediction or classification algorithm. It delivers formal statistical assurances that the error rate of model outputs remains bounded, strictly adhering to the error tolerance defined by developers or service providers under the assumption of data exchangeability65. Consider a classification task involving K classes. Let ({mathcal{D}}={left{{{boldsymbol{x}}}_{i},{y}_{i}right}}_{i=1}^{T}) be independently and identically distributed (i.i.d.) data points drawn from a distribution ({mathcal{X}}times {mathcal{Y}}), where ({{boldsymbol{x}}}_{i} sim {mathcal{X}}), ({y}_{i} sim {mathcal{Y}}), yi ∈ {1, ⋯ , K}, and (alpha in left[0,1right]) represents a predetermined error level (also called the significance level). Then, the goal of CP is to produce a prediction set for each input, expressed as
$$Gamma :{mathcal{X}}to left{,{rm;subset; of}},,,left{1,2,cdots ,,Kright}right},$$
(8)
such that for a new i.i.d. data point (({{boldsymbol{x}}}^{* },{y}^{* }) sim Pleft({mathcal{X}},{mathcal{Y}}right)), the probability that the prediction set fails to include the ground-truth label is controlled at level α. More precisely, this is stated as follows66:
$$left.1-alpha le {{{mathbb{P}}}}left(,{y}^{* }in Gamma ({{boldsymbol{x}}}^{* })| {{boldsymbol{x}}}^{* }right)right)le 1-alpha +frac{1}{T+1},$$
(9)
where the probability is evaluated over the data ({({{boldsymbol{x}}}_{i},{y}_{i})}_{i=1}^{T}cup ({{boldsymbol{x}}}^{* },{y}^{* })).
By construction, CP ensures that the prediction set Γ(x*) contains the true label y* with a probability of at least 1 − α. The bounded error rate is attained by systematically assembling a set of plausible class labels for inputs about which the model lacks sufficient certainty, in contrast to the single-value (in regression) or single-class (in classification) predictions produced by conventional machine learning models.
Using two separate, non-overlapping datasets—one for model training and one for calibration—the construction of an inductive conformal predictor involves three essential steps:
-
1.
Define a nonconformity score. Conformity captures how closely a new data point aligns with the training data67 and is commonly measured via a nonconformity score. For classification tasks, the nonconformity score is frequently defined as (s({{boldsymbol{x}}}_{i},{y}_{i})=1-widehat{f}{left({{boldsymbol{x}}}_{i}right)}_{{y}_{i}}), where (widehat{f}{left({{boldsymbol{x}}}_{i}right)}_{{y}_{i}}) denotes the probability that the trained model (widehat{f}) assigns to the correct class yi for input xi. Thus, the lower the model’s confidence in correctly classifying xi as yi, the higher the corresponding nonconformity score. We adopted this nonconformity score throughout our study.
-
2.
Determine the nonconformity threshold. We then calculated the nonconformity score (sleft({{boldsymbol{x}}}_{i},{y}_{i}right)) for each calibration data point xi (i = T + 1, ⋯ , T + R), yielding a collection of scores: {s(xT + 1, yT + 1), s(xT + 2, yT + 2), ⋯ , s(xT + R, yT + R)}. Arranging these scores in ascending order, we set the threshold (widehat{q}) at the (frac{lceil (R+1)(1-alpha )rceil }{R}) quantile of this ordered sequence. This threshold guarantees that the resulting prediction sets achieve a coverage rate of at least 1 − α on new, i.i.d. datasets.
-
3.
Generate prediction sets. For a new input x*, the prediction set is composed by including every class label k ∈ {1, 2, …, K} whose corresponding score (sleft({{boldsymbol{x}}}^{* },kright)) is less than or equal to (widehat{q}). Formally, this is expressed as:
$$Gamma ({{boldsymbol{x}}}^{* })=left{k:sleft({{boldsymbol{x}}}^{* },kright)le widehat{q}right},,k=1,ldots ,K.$$
(10)
Unlike standard UQ approaches, CP transforms the traditionally heuristic notion of uncertainty—commonly encoded through softmax outputs—into a statistically rigorous measure cast as a prediction set. By following the procedure outlined above, CP ensures the generation of statistically valid prediction sets that capture the true label with a pre-specified coverage of 1 − α. This approach provides a principled framework for constructing reliable confidence sets, enabling machine learning systems to produce robust outputs while supporting responsible and trustworthy scientific inference.
Deep learning models
To evaluate the performance of TRUECAM, we assessed two categories of AI architectures: task-specific models (termed specialized models) and models built upon pathology foundation models (termed foundation models). We trained Inception-v3 from the ground up—a widely adopted convolutional neural network architecture for image-based inference26,27—to serve as a benchmark classifier for specialized medical imaging tasks. All Inception-v3 models were trained at the tile level, with slide-level predictions obtained by averaging tile-level outputs across each whole-slide image (WSI)16. For patients with multiple WSIs, the final patient-level diagnosis was derived by averaging the slide-level predictions over all slides belonging to that patient. In this setup, we leveraged TCGA for model training and internal validation, while CPTAC served as an independent external validation dataset to evaluate TRUECAM’s effectiveness in transfer-learning scenarios. We further examined four digital pathology foundation models that exemplify recent advances in large-scale, general-purpose models in pathology:We integrated four foundation models into our framework: (1) UNI28, (2) CONCH29, (3) Prov-GigaPath30, and (4) TITAN31. These were combined with several state-of-the-art MIL approaches (specifically ABMIL68, CLAM38, and TransMIL69) wherever applicable. Since foundation models are pretrained on broad datasets and serve as highly transferable image encoders, while downstream task-specific classifiers are generally lightweight and straightforward to optimize, our evaluation concentrated exclusively on the scenario in which each healthcare institution trains and assesses its own site-local model using its proprietary data. All datasets were selected to replicate this setup, with each dataset standing in for a separate healthcare organization that independently handles the necessary model training and validation. Our assessment centered primarily on UNI and CONCH, with supplementary results for Prov-GigaPath and TITAN to demonstrate the broader applicability of TRUECAM’s benefits.
Specialized models based on Inception-v3
We built three deep learning models on top of the Inception-v3 architecture26,27. The first is the standard deterministic variant, which does not include any uncertainty quantification. The remaining two models do incorporate UQ methods: (1) MC Dropout11 and (2) SNGP. Inception-v3 is a 48-layer deep convolutional neural network engineered to extract spatial features at multiple scales by employing parallel convolutions with varying filter sizes within each layer. To mitigate the vanishing gradient issue and supply extra supervision signals, Inception-v3 incorporates an auxiliary classifier that relays label information back to earlier layers. Furthermore, Inception-v3 leverages bottleneck layers to compress dimensionality and lower computational overhead. These architectural choices collectively boost performance and computational efficiency, making Inception-v3 a popular choice across a wide spectrum of general computer vision and medical imaging tasks. Supplementary Table 3 lists the parameter counts and training configurations.
MC Dropout served as the principal baseline method for UQ11. Under the assumption that neural network weight priors follow a standard normal distribution, MC Dropout is mathematically equivalent to drawing samples from a variational posterior composed of two independent Gaussians with fixed covariance. To apply MC Dropout within Inception-v3, we followed the procedure outlined in ref. 16. Concretely, we attached two fully connected layers—each with 1,024 units—to the network’s output and applied dropout at a rate of 0.1 after each one. At inference time, each image tile was passed through the dropout-enabled network five times. The standard deviation across these five prediction passes was taken as the tile-level uncertainty score. Both the tile-level predictions and their uncertainty values were subsequently aggregated to yield slide-level and patient-level predictions along with corresponding uncertainty estimates.
We also included VBLL as an additional UQ baseline. We took a post-training strategy to learn a variational last layer on top of features from a pretrained network. Both the discriminative and generative classifiers were built as two-layer dense neural networks with a hidden size of 128 and diagonal parameterization. All remaining hyperparameters for VBLL training followed the setup described in ref. 32. Performance outcomes for VBLL, MC Dropout, and SNGP on the NSCLC subtyping task are reported in Supplementary Table 8.
We trained specialized models from the ground up using a 24 GB NVIDIA RTX 4090 GPU with a batch size of 64, resizing image tiles to 299 × 299 pixels. We employed a learning rate scheduler starting at 0.0003 with a decay factor of 0.98 applied every 512 steps. Each model was trained for four epochs using the Adam optimizer with first and second moment exponential decay rates set to 0.9 and 0.999, respectively. Throughout training, we tracked validation performance and checkpointed the best model for testing. Standard data augmentation methods—random horizontal and vertical flips, 90° rotations, JPEG compression applied with 50% probability (quality range 50–100), and Gaussian blur applied with 10% probability (sigma 0.5–2.0)—were used to enrich the training pipeline for every model.
Foundation models
TRUECAM integrates effortlessly with pathology foundation models to enable trustworthy cancer subtyping. In this work, we examined TRUECAM’s compatibility with two categories of foundation model: tile-level (patch-level) models—namely UNI28 and CONCH29—and slide-level models—namely Prov-GigaPath30 and TITAN31—both leveraging vision transformer architectures as image encoders. All are general-purpose models that underwent systematic pretraining on diverse histopathology image collections via self-supervised learning. Among them, CONCH, Prov-GigaPath, and TITAN are visual-language models engineered to align image and text modalities within a shared representation space, while UNI is pretrained on images alone. These foundation models exhibit remarkable adaptability and knowledge transfer capacity through their encoded representations, proving highly effective for a broad array of downstream histopathology tasks such as image classification, region-of-interest retrieval, and cell type segmentation.
To repurpose tile-level foundation models for cancer subtyping, we used the image encoders of UNI and CONCH to generate tile embeddings (latent representations), concatenated the per-slide embeddings, and then applied popular MIL-based mechanisms—including ABMIL, CLAM38, and TransMIL69—for weakly supervised classification using slide-level diagnostic labels. The ABMIL architecture and training setup followed those defined in UNI28 and CONCH29. Specifically, ABMIL comprises a fully connected layer with a rectified linear unit activation to project tile-level features into 512-dimensional embeddings, followed by an intermediate layer of size 384. A final fully connected layer then maps the attention-pooled slide-level representations to logits and class probabilities via softmax normalization. Dropout was applied at several points: a rate of 0.1 on the input features and 0.25 after each intermediate layer. Additionally, we adopted the CLAM-SB (single-attention branch) and TransMIL models, with parameter settings following ref. 38 and ref. 69, respectively. To incorporate SNGP, we added spectral normalization to all fully connected layers and substituted the final layer with an RFF-approximated Gaussian process. All MIL-based models were trained for up to 20 epochs using the AdamW optimizer70 with a cosine learning rate scheduler, an initial learning rate of 1 × 10−4, and cross-entropy loss.
For slide-level foundation models, we fine-tuned the slide encoder of Prov-GigaPath and performed linear probing on the slide representations produced by TITAN for downstream classification tasks, as recommended by refs. 30,31. For both models, we appended a Gaussian process layer to the slide-level representations without applying spectral normalization, since the representations from Prov-GigaPath and TITAN inherently maintain meaningful distances owing to their targeted self-supervised training strategies.
EAT
Irrespective of the number of subtypes S involved in a given subtyping task, ambiguousHere is the rewritten article in HTML format:
Image tiles within whole-slide images (WSIs) that introduced noise were detected and filtered out using an entropy-driven evaluation method. Although the core principle of entropy remains the same across approaches, task-specific architectures (such as Inception‑v3) and foundation models (including UNI, CONCH, Prov‑GigaPath, and TITAN) adopt distinct strategies for applying this measure.
Regarding task-specific models (for instance, Inception‑v3), we applied k-means clustering at every tested cluster count k (k > S) to the latent representations of all training tiles produced by SNGP. For any resulting cluster i (i ∈ {1, 2, …, k}), the label assignment entropy of that cluster is defined as (mathbb{H}(k,i)=-sum_{j=1}^{S}p(k,i,j)log p(k,i,j)), where p(k, i, j) represents the proportion of tiles in cluster i that are assigned label j. This entropy value captures how much subtype mixing exists inside each cluster, with higher values signaling greater heterogeneity. All clusters were then ordered from highest to lowest by (mathbb{H}(k,i)), and the top m clusters were flagged as ambiguous. A new task-specific model was subsequently trained on tiles drawn exclusively from the non-ambiguous clusters. During inference, any tiles from a newly unseen WSI that fell into excluded clusters were omitted, and only the remaining tiles were processed by the retrained model (see Fig. 1c). The optimal values for k and m were chosen according to the retrained model’s performance on the validation set.
Within the foundation‑model framework, estimating data‑level ambiguity for EAT calls for a tile‑level classifier that matches the target subtyping task. Since the tile representations produced by foundation‑model encoders do not directly fit the input format expected by the specialized model, we trained a lightweight proxy classifier from the ground up using AutoGluon (v1.1.1)71. AutoGluon accepts the extracted tile‑level embeddings as tabular data and systematically explores a search space encompassing multiple modern learners (such as LightGBM, NeuralNet, XGBoost, and ExtraTrees) for cancer subtyping. It then automatically selects and ensembles the strongest models to maximize validation performance. For a given tile x, the label assignment entropy is computed as (mathbb{H}(x)=-sum_{j=1}^{S}p(j,x)log p(j,x)), where p(j, x) denotes the probability that proxy classifier assigns tile x to subtype j. All tiles whose entropy exceeded a fixed cutoff were left out of inference, and this threshold was set based on the validation results of the MIL model in use.
UQ
When leveraging SNGP to estimate data uncertainty, once the training phase is completed the random Fourier features (RFFs) for an input x* can be obtained as (Phi^{*} = sqrt{frac{2}{D_{L}}} cos!big( – mathbf{W}_{L} h_{i}(mathbf{x}^{*}) + mathbf{b}_{L} big)). From this DL-dimensional latent representation of x*, the per‑tile uncertainty can be approximated by (sqrt{tau , [Phi^{*}]^{!T} big( Phi_{N times D_{L}}^{T} Phi_{N times D_{L}} + tau mathbf{I} big)^{-1} Phi^{*}}), where τ is the ridge regularization parameter. SNGP performs just a single forward pass through the network to extract RFF features for quantifying uncertainty. In contrast, Monte‑Carlo Dropout (MC Dropout) must perform multiple passes through the same image tile to obtain the uncertainty estimate. Consequently, SNGP has far lower computational cost, which is especially relevant for large‑scale, safety‑critical applications that impose strict real‑time constraints—such as routine cancer diagnostics.
It is worth highlighting that SNGP and Conformal Prediction (CP) tackle uncertainty quantification (UQ) through fundamentally different philosophies. Alongside the standard classification probability for x*, SNGP explicitly outputs an estimated standard deviation σ(x*) as its uncertainty metric. CP, however, does not produce any standalone uncertainty score; it works within the class‑probability space and constructs a prediction set Γ(x*) by collecting all labels whose predicted probabilities exceed a certain threshold. Therefore, SNGP and CP diverge in both mechanism and interpretation of uncertainty.
For both out‑of‑distribution (OOD) detection and difficulty‑sensitive calibration, two complementary scores were evaluated: (1) a probability‑based OOD score and (2) an uncertainty‑based OOD score. The probability‑based score sums the maximum class probabilities across all tiles and is formulated as (1 – frac{1}{N} sum_{i=1}^{N} max_{k} p(hat{y}_{k} mid mathbf{x}_{i})), where (p(hat{y}_{k} mid mathbf{x}_{i})) denotes the probability of tile i belonging to class (hat{y}_{k}). Differing from the probability approach, the uncertainty‑based score relies on per‑tile prediction uncertainty, measured by the standard deviation output from MC Dropout or SNGP, and then selects the first δ tiles with the lowest uncertainty for each patient to represent the overall patient‑level confidence. Clearly, the uncertainty‑based perspective offers a layer of insight that purely deterministic models cannot reveal, effectively exposing the inherent decision‑making boundaries of deep networks, as demonstrated by its stronger performance on OOD detection in the foundation‑model pipeline.
CRC
Classification‑based Risk Control (CRC) extends traditional CP by accommodating a wider family of loss functions that evaluate different quality facets of prediction—including accuracy, calibration error, and any other domain‑specific criteria—rather than vanilla miscoverage alone72. Formally, given a predictive model (hat{f}), the prediction set for input x is defined as (Gamma_{rho}(mathbf{x}) = { k : hat{f}(mathbf{x})_{k} ge 1 – rho }), where k ∈ {1, …, K} enumerates every possible class label, and ρ is a tunable parameter that governs how conservative the prediction set should be. Then, for any bounded loss function l(⋅) that decreases monotonically as the set (Gamma_{rho}(mathbf{x})) grows larger, CRC furnishes a statistical guarantee of the form
$$mathbb{E}!left[ l!left( Gamma_{rho}(mathbf{x}), y right) right] le alpha.$$
(11)
In contrast with standard CP, CRC extends the scope of control from constraining label coverage alone to bounding the expectation of any monotone loss. CRC’s objective is to identify the lowest admissible value of (hat{rho}) using the calibration dataset such that the risk control defined above generalizes to unseen data. In pathology AI deployments where in‑distribution and out‑of‑distribution samples co‑occur, the threshold (hat{rho}) is located through binary search—guided by the observed coverage on the calibration set—until the desired coverage 1 − α is attained in the presence of OOD data. Once the optimal (hat{rho}) is confirmed, it is applied to all new samples, guaranteeing a coverage level of 1 − α.
Model fairness evaluation
To assess fairness of the system, we quantified the disparity between the best‑performing and worst‑performing subgroups defined by sex and race73. Two evaluation criteria were considered: (1) prediction accuracy and (2) the average size of CP prediction sets. Group‑level averages of each metric were computed for every subgroup. Because the gap metric is inherently non‑negative, a narrower gap indicates a fairer model. Both TCGA and CPTAC provide demographic annotations: TCGA comprises 569 male and 372 female patients; CPTAC includes 288 male and 127 female patients, with one case missing sex information. For race, categories containing fewer than 20 patients are merged into “Others.” Concretely, TCGA patients are partitioned into Others (n = 98), White (n = 678), and Not reported (n = 165). CPTAC patients are grouped into Asian (n = 123), Others (n = 49), and White (n = 244). Full demographic breakdowns are listed in Supplementary Table 1.
Reporting summary
Additional details concerning the research design can be found in the Nature Portfolio Reporting Summary associated with this article.



