Developing a family of large language models (LLMs) has traditionally involved a significant overhead: each model variant in the family—whether it’s an 8B, 30B, or 70B version—typically demands its own full training process, separate storage, and an individual deployment setup. For a development team running inference at scale, this translates to multiplying compute costs by the number of different model sizes they intend to support. NVIDIA researchers are now introducing an alternative strategy known as Star Elastic.
Star Elastic is a post-training technique that integrates multiple nested submodels—each with different parameter budgets—into a single parent reasoning model, all achieved through a single training run. When applied to Nemotron Nano v3 (a hybrid Mamba–Transformer–MoE model featuring 30B total parameters and 3.6B active parameters), Star Elastic generates 23B (2.8B active) and 12B (2.0B active) nested variants, trained using approximately 160B tokens. All three variants reside within one checkpoint and can be extracted without requiring any additional fine-tuning.
What “Nested” Truly Means in This Context
If you’re unfamiliar with elastic or nested architectures, here’s the core concept: instead of training three distinct 30B, 23B, and 12B models separately, you train a single model that encompasses the smaller versions as subsets of itself. These smaller submodels leverage the most critical weights from the parent model, identified through a process called importance estimation.
Star Elastic evaluates each model component—including embedding channels, attention heads, Mamba SSM heads, MoE experts, and FFN channels—based on its contribution to the model’s accuracy. Components are then ranked and organized, ensuring that submodels with smaller budgets consistently utilize the highest-ranked contiguous subset of components from the larger model. This characteristic is referred to as nested weight-sharing.
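The prefix property can be illustrated with a minimal sketch. The importance scores and the helper names `rank_components` and `nested_subsets` below are hypothetical and only show how one shared ranking yields nested subsets; they are not taken from the paper's code.

```python
def rank_components(importance):
    """Return component indices sorted from most to least important."""
    return sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)


def nested_subsets(importance, budgets):
    """Map each budget (number of components kept) to a prefix of one shared ranking."""
    order = rank_components(importance)
    return {k: order[:k] for k in budgets}


# Hypothetical importance scores for 8 FFN channels (illustrative values only).
scores = [0.10, 0.90, 0.30, 0.75, 0.05, 0.60, 0.20, 0.45]
subsets = nested_subsets(scores, budgets=[3, 5, 8])

for budget, kept in subsets.items():
    print(budget, kept)
```

Because every budget takes a prefix of the same ordering, the components kept at budget 3 are always contained in those kept at budget 5, which in turn are contained in the full set.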
The method supports nesting across various dimensions: the SSM (State Space Model) dimension, embedding channels, attention heads, Mamba heads and head channels, MoE expert count, and FFN intermediate dimension. For MoE layers specifically, Star Elastic employs Router-Weighted Expert Activation Pruning (REAP), which ranks experts based on both routing gate values and expert output magnitudes. This provides a more principled signal than naive frequency-based pruning, which overlooks the actual contribution of each expert to the layer’s output.
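The difference between frequency-based ranking and a gate-and-magnitude signal can be sketched as follows. The article only states that REAP combines routing gate values with expert output magnitudes; the specific formula used here (the mean of gate value times output norm over the tokens routed to an expert) and all data values are assumptions for illustration.

```python
from collections import defaultdict


def frequency_scores(tokens):
    """Naive baseline: count how often each expert is selected."""
    counts = defaultdict(int)
    for active in tokens:
        for expert in active:
            counts[expert] += 1
    return dict(counts)


def reap_style_scores(tokens):
    """Assumed form: mean of gate * output_norm over tokens routed to each expert."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for active in tokens:
        for expert, (gate, out_norm) in active.items():
            totals[expert] += gate * out_norm
            counts[expert] += 1
    return {expert: totals[expert] / counts[expert] for expert in totals}


# Each token maps its active experts to (gate value, output norm). Hypothetical data.
tokens = [
    {0: (0.1, 0.5), 1: (0.9, 2.0)},
    {0: (0.2, 0.4), 2: (0.8, 1.0)},
    {0: (0.1, 0.5), 2: (0.9, 1.0)},
]

freq = frequency_scores(tokens)
reap = reap_style_scores(tokens)
by_freq = sorted(freq, key=freq.get, reverse=True)
by_reap = sorted(reap, key=reap.get, reverse=True)
print("frequency ranking:", by_freq)
print("gate-weighted ranking:", by_reap)
```

In this toy data, expert 0 is selected for every token but with small gates and small outputs, so frequency ranks it first while the gate-weighted score ranks it last.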
A Learnable Router, Not a Static Compression Method
A key differentiator from previous compression methods like Minitron is that Star Elastic utilizes an end-to-end trainable router to determine the architectures of the nested submodels. This router takes a target budget (e.g., “provide a model with 2.8B active parameters”) as a one-hot input and generates differentiable masks that select which components are active at that specific budget level. These masks are trained concurrently with the model using Gumbel-Softmax, enabling gradient flow through discrete architectural choices.
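A minimal, forward-only sketch of this idea in plain Python follows. The width options, the router logits, and the way the soft mask is assembled from width probabilities are all assumptions made for illustration; the article does not specify these details. In an autodiff framework the same expressions would be differentiable, which is what allows gradients to reach the router.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)          # avoid log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)                          # subtract max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_width_mask(probs, widths, num_channels):
    """Channel i gets the probability mass of every width option that keeps it."""
    return [sum(p for p, w in zip(probs, widths) if i < w) for i in range(num_channels)]


# Hypothetical setup: three budgets, three candidate widths for one 8-channel layer.
WIDTH_OPTIONS = [4, 6, 8]
ROUTER_LOGITS = [          # one row per budget; a one-hot budget input selects a row
    [2.0, 0.5, -1.0],      # smallest budget prefers width 4
    [0.0, 2.0, 0.0],       # middle budget prefers width 6
    [-1.0, 0.5, 2.0],      # largest budget prefers width 8
]

rng = random.Random(0)
probs = gumbel_softmax(ROUTER_LOGITS[0], tau=0.5, rng=rng)
mask = soft_width_mask(probs, WIDTH_OPTIONS, num_channels=8)
print([round(m, 3) for m in mask])
```

Because channels are assumed to be sorted by importance, the mask is non-increasing along the channel axis, so a lower-ranked channel is never kept while a higher-ranked one is dropped.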
The loss function integrates knowledge distillation (KD), where the non-elastified parent model serves as the teacher, with a router loss that penalizes deviations from the target resource budget (parameter count, memory, or latency). This means the router learns to make architecture choices that genuinely enhance accuracy under KD, rather than simply minimizing a proxy metric.
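As a rough sketch of how the two terms might combine, the code below adds a per-token distillation loss to a budget penalty. The forward KL direction, the squared relative deviation, and the `router_weight` coefficient are assumptions chosen for illustration; the article does not give the exact formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one token's output distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))


def budget_penalty(actual, target):
    """Assumed penalty: squared relative deviation from the target budget."""
    return ((actual - target) / target) ** 2


def total_loss(teacher_logits, student_logits, actual, target, router_weight=1.0):
    """Distillation term plus weighted router (budget) term."""
    return kd_loss(teacher_logits, student_logits) + router_weight * budget_penalty(actual, target)


teacher = [2.0, 1.0, 0.1]
student = [1.5, 1.2, 0.3]
over_budget = total_loss(teacher, student, actual=3.0e9, target=2.8e9)
on_budget = total_loss(teacher, student, actual=2.8e9, target=2.8e9)
print(over_budget, on_budget)
```

The budget here is an active-parameter count, but the same penalty shape would apply to memory or latency targets.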
Training follows a two-stage curriculum: an initial short-context phase (sequence length 8,192 tokens) with uniform budget sampling, succeeded by an extended-context phase (sequence length 49,152 tokens) with non-uniform sampling that prioritizes the full 30B model (p(30B)=0.5, p(23B)=0.3, p(12B)=0.2). The extended context phase is crucial for reasoning performance. The research team’s ablations on Nano v2—explicitly replicated as the empirical foundation for the same curriculum choice on Nano v3—demonstrate gains of up to 19.8% on AIME-2025 for the 6B variant and 4.0 percentage points for the 12B variant from Stage 2 alone, justifying its application here.
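The two sampling schedules can be written down directly. The probabilities and sequence lengths come from the description above; the dictionary layout, the `sample_budget` helper, and the choice to draw one budget per training step are assumptions.

```python
import random
from collections import Counter

# Stage 1: short context, uniform sampling. Stage 2: long context, skewed to 30B.
STAGES = {
    "stage1": {"seq_len": 8192, "probs": {"30B": 1 / 3, "23B": 1 / 3, "12B": 1 / 3}},
    "stage2": {"seq_len": 49152, "probs": {"30B": 0.5, "23B": 0.3, "12B": 0.2}},
}


def sample_budget(probs, rng):
    """Draw one budget name according to the given probabilities."""
    names = list(probs)
    weights = [probs[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)
counts = Counter(sample_budget(STAGES["stage2"]["probs"], rng) for _ in range(10_000))
for name in ("30B", "23B", "12B"):
    print(name, counts[name] / 10_000)
```

Over many steps the empirical frequencies approach 0.5, 0.3, and 0.2, so the full model receives the largest share of long-context updates.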
Elastic Budget Control: Tailoring Models to Different Reasoning Phases
Current budget control mechanisms in reasoning models, including Nemotron Nano v3’s default behavior, typically operate by limiting the number of tokens generated during a phase before forcing a final answer. This approach uses the same model throughout. Star Elastic enables a different strategy: employing different nested submodels for the thinking phase compared to the answering phase.
The researchers assessed four configurations. The most effective one, termed ℳS → ℳL (small model for thinking, large model for answering), assigns a more economical model to generate extensive reasoning traces and reserves the full-capacity model for synthesizing the final answer. The 23B → 30B configuration, in particular, advances the accuracy–latency Pareto frontier, achieving up to 16% higher accuracy and 1.9× lower latency compared to default Nemotron Nano v3 budget control. The rationale is that reasoning tokens are high-volume but can tolerate some reduction in capacity, whereas the final answer demands higher precision.
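The control flow of the small-for-thinking, large-for-answering setup can be sketched with stub models. Everything here is hypothetical: `make_stub`, `elastic_generate`, and the token budgets are placeholders, and a real deployment would call two slices of the same elastic checkpoint rather than string-returning stubs.

```python
from typing import Callable, Tuple

# A "model" here is any callable taking (text, max_tokens) and returning text.
Model = Callable[[str, int], str]


def make_stub(name: str) -> Model:
    """Stand-in for a sliced submodel; it just records which model ran."""
    def generate(text: str, max_tokens: int) -> str:
        return f"[{name}:{max_tokens}]"
    return generate


def elastic_generate(prompt: str, think_model: Model, answer_model: Model,
                     think_budget: int, answer_budget: int) -> Tuple[str, str]:
    """Generate the reasoning trace with one submodel and the answer with another."""
    reasoning = think_model(prompt, think_budget)
    answer = answer_model(prompt + reasoning, answer_budget)
    return reasoning, answer


reasoning, answer = elastic_generate(
    "Q: ...", make_stub("23B"), make_stub("30B"), think_budget=4096, answer_budget=512
)
print(reasoning, answer)
```

The point of the sketch is only the hand-off: the long reasoning phase runs on the cheaper submodel, and the short answer phase sees the prompt plus that trace on the full model.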
Quantization While Preserving the Nested Structure
A straightforward approach to deploying a quantized elastic model would involve quantizing each variant independently after slicing. However, this breaks the nested weight-sharing property and necessitates a separate quantization pass for each size. Instead, Star Elastic applies Quantization-Aware Distillation (QAD) directly to the elastic checkpoint, maintaining the nested mask hierarchy throughout the process.
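A toy example shows why per-variant quantization breaks sharing. It uses a simple symmetric absmax integer scheme with one scale per tensor, which is an assumption made purely for illustration; it is not FP8, NVFP4, or QAD, and the weight values are made up.

```python
def absmax_scale(values, levels=7):
    """One shared scale so the largest magnitude maps to the top integer level."""
    return max(abs(v) for v in values) / levels


def quantize(values, scale, levels=7):
    """Round to the nearest integer code and clamp to [-levels, levels]."""
    return [max(-levels, min(levels, round(v / scale))) for v in values]


# Hypothetical weights, already sorted so the submodel is the first 4 entries.
parent = [0.9, -0.5, 0.3, 0.1, 2.1, -0.2]
sub_width = 4

# Option A: quantize the parent once, then slice the stored codes.
parent_codes = quantize(parent, absmax_scale(parent))
sliced_codes = parent_codes[:sub_width]

# Option B: slice first, then quantize the submodel with its own scale.
sub_weights = parent[:sub_width]
independent_codes = quantize(sub_weights, absmax_scale(sub_weights))

print("sliced from parent:", sliced_codes)
print("quantized separately:", independent_codes)
```

With option A the submodel's stored codes are literally a prefix of the parent's, so one checkpoint serves both. With option B the smaller tensor gets a different scale and therefore different codes, which would have to be stored separately.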
For FP8 (E4M3 format), post-training quantization (PTQ) proves sufficient, recovering 98.69% of BF16 accuracy for the 30B variant. For NVFP4 (NVIDIA’s 4-bit floating-point format), PTQ alone results in a 4.12% average accuracy drop. Therefore, a brief nested QAD phase (~5B tokens at 48K context) is implemented, restoring recovery to 97.79% for the 30B variant. In both scenarios, zero-shot slicing of the 23B and 12B variants from the single quantized checkpoint remains intact.
The memory implications are substantial. Storing separate 12B, 23B, and 30B BF16 checkpoints demands 126.1 GB; the single elastic checkpoint requires only 58.9 GB. The 30B NVFP4 elastic checkpoint fits within 18.7 GB, allowing the 12B NVFP4 variant to operate on an RTX 5080, where all BF16 configurations would run out of memory. On an RTX Pro 6000, the 12B NVFP4 variant achieves 7,426 tokens/s, representing a 3.4× throughput improvement over the 30B BF16 baseline.
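The figures above imply a few derived quantities, worked out below. The inputs are the reported numbers; the storage ratio, the saved gigabytes, and the implied 30B BF16 throughput are back-of-the-envelope derivations, not values reported by the researchers.

```python
# Reported figures from the text.
separate_bf16_gb = 126.1   # three independent BF16 checkpoints
elastic_bf16_gb = 58.9     # one elastic BF16 checkpoint
nvfp4_12b_tps = 7426       # 12B NVFP4 throughput on an RTX Pro 6000 (tokens/s)
speedup_vs_30b_bf16 = 3.4  # reported throughput ratio

# Derived quantities (not reported directly).
storage_ratio = separate_bf16_gb / elastic_bf16_gb
saved_gb = separate_bf16_gb - elastic_bf16_gb
implied_baseline_tps = nvfp4_12b_tps / speedup_vs_30b_bf16

print(f"storage ratio: {storage_ratio:.2f}x")
print(f"storage saved: {saved_gb:.1f} GB")
print(f"implied 30B BF16 throughput: {implied_baseline_tps:.0f} tokens/s")
```

In other words, the elastic checkpoint needs a little under half the storage of three separate models, and the throughput claim implies a 30B BF16 baseline of roughly 2,200 tokens/s on the same GPU.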
Depth vs. Width: The Rationale Behind Star Elastic’s Width Compression
One design choice that merits explicit discussion: the research team compared two compression strategies—completely removing layers (depth compression) versus reducing internal dimensions such as hidden size, expert count, and head count (width compression). With a 15% parameter reduction and 25B tokens of knowledge distillation, width compression recovered 98.1% of baseline performance, whereas depth compression recovered only 95.2%, showing noticeable degradation on HumanEval and MMLU-Pro. Consequently, Star Elastic prioritizes width-based elasticity for its primary results, although depth compression (layer skipping) remains an option for scenarios with extreme latency constraints.
On the evaluation suite, which includes AIME-2025, GPQA, LiveCodeBench v5, MMLU-Pro, and IFBench, the Elastic-30B variant matches its parent model, Nemotron Nano v3 30B, across most tests. Meanwhile, the Elastic-23B and Elastic-12B variants hold their own against separately trained models of comparable size. Notably, the Elastic-23B achieves an AIME-2025 score of 85.63, outperforming Qwen3-30B-A3B’s 80.00, even though it uses fewer active parameters.
In terms of training efficiency, the team reports a 360× reduction in tokens compared to training each variant from the ground up. This also represents a 7× improvement over previous top-tier compression techniques, which require separate distillation runs for each model size. On an H100 GPU using bfloat16 with identical input and output sequence lengths, the 12B variant delivers 2.4× the throughput of the 30B parent model.
Key Takeaways
- Star Elastic trains 30B, 23B, and 12B nested reasoning models from a single 160B-token post-training run, cutting token usage by 360× compared to pretraining from scratch.
- Elastic budget control (23B for thinking, 30B for answering) pushes the accuracy–latency Pareto frontier forward, with up to 16% higher accuracy and 1.9× lower latency.
- A learnable router with Gumbel-Softmax enables end-to-end trainable architecture selection, removing the need for separate compression runs per model size.
- Quantization preserves zero-shot slicing: FP8 needs only PTQ, while NVFP4 uses a brief nested QAD phase and shrinks the 30B elastic checkpoint to just 18.7 GB.
- All three precision variants (BF16, FP8, NVFP4) are publicly available on Hugging Face under nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B.



