How do you combine SigLIP2, DINOv3, and SAM3 into a single vision backbone without sacrificing dense or segmentation performance? NVIDIA’s C-RADIOv4 is a new agglomerative vision backbone that distills three strong teacher models, SigLIP2-g-384, DINOv3-7B, and SAM3, into a single student encoder. It extends the AM-RADIO and RADIOv2.5 line, keeping a similar computational cost while improving dense prediction quality, resolution robustness, and drop-in compatibility with SAM3.
The key idea is simple. Instead of choosing between a vision-language model, a self-supervised dense model, and a segmentation model, C-RADIOv4 tries to approximate all three at once with one backbone.

Agglomerative distillation in RADIO
The RADIO family uses agglomerative distillation. A single ViT-style student is trained to match both dense feature maps and summary tokens from several heterogeneous teachers.
Earlier RADIO models combined DFN CLIP, DINOv2, and SAM. They already supported multi-resolution training but showed ‘mode switching’, where the representation changed qualitatively as the input resolution changed. Later work such as PHI-S, RADIOv2.5, and FeatSharp added better multi-resolution distillation and regularization, but the teacher set was still limited.
C-RADIOv4 upgrades the teachers:
- SigLIP2-g-384 for stronger image-text alignment
- DINOv3-7B for high-quality self-supervised dense features
- SAM3 for segmentation-oriented features and compatibility with the SAM3 decoder
The student is trained so that its dense features match DINOv3 and SAM3, while its summary tokens match SigLIP2 and DINOv3. This gives one encoder that can support classification, retrieval, dense prediction, and segmentation.
Stochastic multi-resolution training
C-RADIOv4 uses stochastic multi-resolution training rather than a small fixed set of resolutions.
Training samples input sizes from two partitions:
- Low resolution: {128, 192, 224, 256, 384, 432}
- High resolution: {512, 768, 1024, 1152}
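As a rough illustration of sampling from the two partitions, the sketch below draws one low and one high resolution per training step. The article does not say how sizes are drawn, so the uniform choice, the one-size-per-partition-per-step policy, and the name `sample_resolutions` are all assumptions made for illustration.

```python
import random

# Resolution partitions as listed in the article (pixels per side).
LOW_RES = [128, 192, 224, 256, 384, 432]
HIGH_RES = [512, 768, 1024, 1152]


def sample_resolutions(rng):
    """Draw one (low, high) resolution pair for a training step.

    Assumption: uniform sampling, one size from each partition per step.
    The actual sampling policy is not described in the article.
    """
    return rng.choice(LOW_RES), rng.choice(HIGH_RES)


rng = random.Random(0)  # seeded so the schedule is reproducible
schedule = [sample_resolutions(rng) for _ in range(5)]
for low, high in schedule:
    print(f"low={low}px high={high}px")
```

The point of drawing sizes at random, rather than cycling through a short fixed list, is that the student sees many resolutions during training and has less opportunity to specialize its representation to any single one.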
SigLIP2 operates natively at 384 pixels. Its features are upsampled by a factor of 3 using FeatSharp (384 × 3 = 1152) to align with the 1152 pixel SAM3 features. SAM3 is trained with mosaic augmentation at 1152 × 1152.
This design smooths the performance curve over resolution and improves low-resolution behavior. For example, on ADE20k linear probing, C-RADIOv4-H reaches around:
- 55.20 mIoU at 512 px
- 57.02 mIoU at 1024 px
- 57.72 mIoU at 1536 px
The scaling trend is close to that of DINOv3-7B while using roughly an order of magnitude fewer parameters.
Removing teacher noise with shift equivariant losses and MESA
Distilling from large vision models tends to copy their artifacts, not just their useful structure. SigLIP2 has border noise patterns, and ViTDet-style models can show window boundary artifacts. Direct feature regression can force the student to reproduce these patterns.
C-RADIOv4 introduces two shift equivariant mechanisms to suppress such noise:
- Shift equivariant dense loss: Each teacher and the student see independently shifted crops of an image. Before computing the squared error, features are aligned via a shift mapping and the loss only uses overlapping spatial positions. Because the student never sees the same absolute positions as the teacher, it cannot simply memorize position-fixed noise and is forced to track input-dependent structure instead.
- Shift equivariant MESA: C-RADIOv4 also uses MESA-style regularization between the online network and an EMA copy. Here again, the student and its EMA see different crops, features are aligned by a shift, and the loss is applied after layer normalization. This encourages smooth loss landscapes and robustness, while being invariant to absolute position.
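The alignment step in the shift equivariant dense loss can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes shifts are whole patches, that both crops have the same size, and that feature maps are plain nested lists. The names `shift_equivariant_mse`, `crop`, and `scene` are hypothetical.

```python
def shift_equivariant_mse(student, teacher, s_off, t_off):
    """Mean squared error over the region where two shifted crops overlap.

    student, teacher: H x W grids (lists of lists) of feature vectors.
    s_off, t_off: (row, col) offsets of each crop in the full image,
    measured in patches (assumption: shifts are whole patches).
    """
    h, w = len(student), len(student[0])
    # Overlap rectangle in full-image patch coordinates.
    top, bottom = max(s_off[0], t_off[0]), min(s_off[0], t_off[0]) + h
    left, right = max(s_off[1], t_off[1]), min(s_off[1], t_off[1]) + w
    if top >= bottom or left >= right:
        raise ValueError("crops do not overlap")
    total, count = 0.0, 0
    for r in range(top, bottom):
        for c in range(left, right):
            # Map the shared image position into each crop's local grid.
            s = student[r - s_off[0]][c - s_off[1]]
            t = teacher[r - t_off[0]][c - t_off[1]]
            total += sum((a - b) ** 2 for a, b in zip(s, t))
            count += len(s)
    return total / count


def crop(scene_fn, off, h, w):
    """Build an h x w feature grid by evaluating scene_fn at image positions."""
    return [[scene_fn(off[0] + r, off[1] + c) for c in range(w)] for r in range(h)]


# Toy "input-dependent" features: a function of the absolute image position.
scene = lambda r, c: [float(r + 2 * c), float(r * c)]
s_off, t_off = (0, 1), (2, 0)
student = crop(scene, s_off, 4, 4)
teacher = crop(scene, t_off, 4, 4)
clean_loss = shift_equivariant_mse(student, teacher, s_off, t_off)

# Add position-fixed noise on the teacher's top border row (local row 0).
noisy_teacher = [[list(v) for v in row] for row in teacher]
for vec in noisy_teacher[0]:
    vec[0] += 1.0
noisy_loss = shift_equivariant_mse(student, noisy_teacher, s_off, t_off)
print(clean_loss, noisy_loss)
```

In this toy setup the clean loss is zero because both crops encode the same image content once aligned, while the border noise shows up as a nonzero residual. Since that noise sits at a fixed crop position rather than a fixed image position, a student that sees a differently shifted crop cannot cancel it by memorizing where it occurs.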
In addition, training uses DAMP, which injects multiplicative noise into weights. This further improves robustness to corruptions and small distribution shifts.
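To make "multiplicative noise into weights" concrete, here is a minimal sketch. It assumes each weight is scaled by an independent Gaussian factor centered at 1; the noise distribution, the value of `sigma`, and the function name `perturb_weights` are illustrative assumptions, not details taken from the article.

```python
import random


def perturb_weights(weights, sigma, rng):
    """Return a copy of weights, each scaled by a factor drawn from N(1, sigma^2).

    Assumption: independent Gaussian multiplicative noise per weight.
    """
    return [w * rng.gauss(1.0, sigma) for w in weights]


rng = random.Random(0)
weights = [0.5, -1.2, 0.0, 2.0]
noisy = perturb_weights(weights, sigma=0.1, rng=rng)
print(noisy)
```

Because the noise is multiplicative, its size scales with each weight's magnitude, and a weight that is exactly zero stays zero.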
Balancing teachers with an angular dispersion aware summary loss
The summary loss in earlier RADIO models used cosine distance between student and teacher embeddings. Cosine distance removes magnitude but not directional dispersion on the sphere. Some teachers, such as SigLIP2, produce embeddings concentrated in a narrow cone, while DINOv3 variants produce more spread-out embeddings.
If raw cosine distance is used, teachers with wider angular dispersion contribute larger losses and dominate optimization. In practice, DINOv3 tended to overshadow SigLIP2 in the summary term.
C-RADIOv4 replaces this with an angle-normalized loss. The squared angle between student and teacher embeddings is divided by the teacher’s angular dispersion. Measured dispersions show SigLIP2-g-384 at around 0.694, while DINOv3-H+ and DINOv3-7B are at around 2.12 and 2.19. Normalizing by these values equalizes their influence and preserves both vision-language and dense semantics.
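The normalization described above can be sketched as follows. This is a minimal reading of the text, squared angle divided by the teacher's dispersion, using the dispersion values quoted in the article as fixed constants. How dispersion is measured is not described here, and the function names are hypothetical.

```python
import math

# Angular dispersions quoted in the article, treated here as given constants.
DISPERSION = {"SigLIP2-g-384": 0.694, "DINOv3-H+": 2.12, "DINOv3-7B": 2.19}


def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)


def angle_normalized_loss(student, teacher, dispersion):
    """Squared student-teacher angle divided by the teacher's dispersion."""
    return angle(student, teacher) ** 2 / dispersion


# The same 0.1 rad error, measured against two different teachers.
student_vec = [1.0, 0.0]
teacher_vec = [math.cos(0.1), math.sin(0.1)]
siglip_loss = angle_normalized_loss(student_vec, teacher_vec, DISPERSION["SigLIP2-g-384"])
dino_loss = angle_normalized_loss(student_vec, teacher_vec, DISPERSION["DINOv3-7B"])
print(siglip_loss, dino_loss)
```

For an identical angular error, the narrow-cone teacher (SigLIP2) yields the larger normalized loss, which counteracts the tendency of widely dispersed teachers to dominate the summary term.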
Performance: classification, dense prediction, and Probe3d
On ImageNet-1k zero-shot classification, C-RADIOv4-H reaches about 83.09% top-1 accuracy. It matches or improves on RADIOv2.5-H and C-RADIOv3-H across resolutions, with the best performance near 1024 px.
On k-NN classification, C-RADIOv4-H improves over RADIOv2.5 and C-RADIOv3, and matches or surpasses DINOv3 starting at around 256 px. DINOv3 peaks near 192–256 px and then degrades, while C-RADIOv4 keeps stable or improving performance at higher resolutions.
Dense and 3D-aware metrics show the intended tradeoff. On ADE20k, PASCAL VOC, NAVI, and SPair, C-RADIOv4-H and the SO400M variant outperform earlier RADIO models and are competitive with DINOv3-7B on dense benchmarks. For C-RADIOv4-H, typical scores are:
- ADE20k: 55.20 mIoU
- VOC: 87.24 mIoU
- NAVI: 63.44
- SPair: 60.57


On Probe3d, which includes Depth Normals, Surface Normals, NAVI, and SPair, C-RADIOv4-H achieves the best NAVI and SPair scores in the RADIO family. Depth and Surface metrics are close to those of C-RADIOv3-H, with small differences in either direction, rather than a uniform improvement.
Integration with SAM3 and ViTDet-mode deployment
C-RADIOv4 is designed to be a drop-in replacement for the Perception Encoder backbone in SAM3. The SAM3 decoder and memory components remain unchanged. A reference implementation is provided in a SAM3 fork. Qualitative examples show that segmentation behavior is preserved for both text prompts such as “shoe”, “helmet”, “bike”, “spectator” and box prompts, and in some reported cases C-RADIOv4-based SAM3 resolves failure cases from the original encoder.
For deployment, C-RADIOv4 exposes a ViTDet-mode configuration. Most transformer blocks use windowed attention, while a few use global attention. Supported window sizes range from 6 × 6 to 32 × 32 tokens, subject to divisibility with patch size and image resolution. On an A100, the SO400M model with a window size of at most 12 is faster than the SAM3 ViT-L+ encoder across a wide range of input sizes, and the Huge model with window size 8 is close in latency.
This makes C-RADIOv4 a practical backbone for high-resolution dense tasks where full global attention at all layers is too expensive.
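The divisibility constraint on ViTDet-mode window sizes can be checked with a small helper. This sketch assumes a patch size of 16 and square inputs, and it assumes every integer window size between 6 and 32 is a candidate; none of these details are confirmed by the article, and `compatible_windows` is a hypothetical name.

```python
def compatible_windows(image_size, patch_size=16, min_window=6, max_window=32):
    """List window sizes (in tokens per side) that tile the token grid evenly.

    Assumptions: square image, patch_size=16, and all integers in
    [min_window, max_window] are candidate window sizes.
    """
    if image_size % patch_size != 0:
        return []  # the image does not split into whole patches
    tokens_per_side = image_size // patch_size
    return [
        w for w in range(min_window, max_window + 1)
        if tokens_per_side % w == 0
    ]


print(compatible_windows(1152))  # 72 tokens per side -> [6, 8, 9, 12, 18, 24]
print(compatible_windows(1024))  # 64 tokens per side -> [8, 16, 32]
```

Under these assumptions, a 1152 px input admits both of the window sizes mentioned in the latency comparison (8 and 12), whereas a 1024 px input admits 8 but not 12.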
Key Takeaways
- Single unified backbone: C-RADIOv4 distills SigLIP2-g-384, DINOv3-7B, and SAM3 into one ViT-style encoder that supports classification, retrieval, dense prediction, and segmentation.
- Any-resolution behavior: Stochastic multi-resolution training over {128…1152} px, together with FeatSharp upsampling for SigLIP2, stabilizes performance across resolutions and tracks DINOv3-7B scaling with far fewer parameters.
- Noise suppression via shift equivariance: The shift equivariant dense loss and shift equivariant MESA prevent the student from copying teacher border and window artifacts, focusing learning on input-dependent semantics.
- Balanced multi-teacher distillation: An angular dispersion normalized summary loss equalizes the contribution of SigLIP2 and DINOv3, preserving both text alignment and dense representation quality.
- SAM3 and ViTDet-ready deployment: C-RADIOv4 can directly replace the SAM3 Perception Encoder, offers ViTDet-mode windowed attention for faster high-resolution inference, and is released under the NVIDIA Open Model License.




