NVIDIA AI Unveils Star Elastic: A Single Checkpoint Powering 30B, 23B, And 12B Reasoning Models With Zero-Shot Slicing

Developing a family of large language models (LLMs) has traditionally involved a significant overhead: each model variant in the family—whether it’s an 8B, 30B, or 70B version—typically demands its own full training process, separate storage, and an individual deployment setup. For a development team running inference at scale, this translates to multiplying compute costs by the number of different model sizes they intend to support. NVIDIA researchers are now introducing an alternative strategy known as Star Elastic.

Star Elastic is a post-training technique that integrates multiple nested submodels—each with different parameter budgets—into a single parent reasoning model, all achieved through a single training run. When applied to Nemotron Nano v3 (a hybrid Mamba–Transformer–MoE model featuring 30B total parameters and 3.6B active parameters), Star Elastic generates 23B (2.8B active) and 12B (2.0B active) nested variants, trained using approximately 160B tokens. All three variants reside within one checkpoint and can be extracted without requiring any additional fine-tuning.

What “Nested” Truly Means in This Context

If you’re unfamiliar with elastic or nested architectures, here’s the core concept: instead of training three distinct 30B, 23B, and 12B models separately, you train a single model that encompasses the smaller versions as subsets of itself. These smaller submodels leverage the most critical weights from the parent model, identified through a process called importance estimation.

Star Elastic evaluates each model component—including embedding channels, attention heads, Mamba SSM heads, MoE experts, and FFN channels—based on its contribution to the model’s accuracy. Components are then ranked and organized, ensuring that submodels with smaller budgets consistently utilize the highest-ranked contiguous subset of components from the larger model. This characteristic is referred to as nested weight-sharing.
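
The nesting property described above can be illustrated with a small sketch. The importance scores and budget sizes here are made up for illustration, and the helper names are hypothetical; the point is only that once components are ranked, every budget takes a prefix of the same ordering, so smaller submodels are always subsets of larger ones.

```python
def rank_components(importance):
    """Return component indices sorted from most to least important."""
    return sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)


def nested_subsets(importance, budgets):
    """Map each budget (number of components kept) to the top-ranked prefix."""
    order = rank_components(importance)
    return {k: order[:k] for k in budgets}


# Hypothetical importance scores for 8 attention heads (illustrative values only).
head_importance = [0.12, 0.91, 0.33, 0.05, 0.78, 0.41, 0.66, 0.20]
subsets = nested_subsets(head_importance, budgets=[3, 5, 8])

# Every smaller subset is contained in every larger one.
assert set(subsets[3]) <= set(subsets[5]) <= set(subsets[8])
print(subsets[3])  # [1, 4, 6]
```

If the weights are physically permuted into rank order, each prefix becomes a contiguous slice of the parent tensor, which is what makes extraction a simple slicing operation.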

The method supports nesting across various dimensions: the SSM (State Space Model) dimension, embedding channels, attention heads, Mamba heads and head channels, MoE expert count, and FFN intermediate dimension. For MoE layers specifically, Star Elastic employs Router-Weighted Expert Activation Pruning (REAP), which ranks experts based on both routing gate values and expert output magnitudes. This provides a more principled signal than naive frequency-based pruning, which overlooks the actual contribution of each expert to the layer’s output.
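
To see why a router-weighted score can rank experts differently from a frequency count, consider the following sketch. It assumes the score for an expert is the average, over the tokens routed to it, of the gate value multiplied by the L2 norm of the expert output; that reading and the toy routing records are assumptions for illustration, not the paper's code.

```python
import math


def reap_scores(routed, num_experts):
    """Average gate_value * ||expert_output|| over the tokens routed to each expert.

    `routed` is a list of (expert_id, gate_value, expert_output_vector) records.
    """
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for expert_id, gate, output in routed:
        totals[expert_id] += gate * math.sqrt(sum(v * v for v in output))
        counts[expert_id] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Toy data: expert 0 is picked often but contributes little;
# expert 1 is picked once but dominates the layer output.
routed = [(0, 0.1, [0.1, 0.0])] * 3 + [(1, 0.9, [3.0, 4.0])]

scores = reap_scores(routed, num_experts=2)
frequency = [sum(1 for e, _, _ in routed if e == i) for i in range(2)]

top_by_frequency = max(range(2), key=lambda i: frequency[i])
top_by_reap = max(range(2), key=lambda i: scores[i])
print(top_by_frequency, top_by_reap)  # 0 1
```

A frequency-based criterion would keep expert 0, while the router-weighted score keeps expert 1, the one that actually shapes the output.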

A Learnable Router, Not a Static Compression Method

A key differentiator from previous compression methods like Minitron is that Star Elastic utilizes an end-to-end trainable router to determine the architectures of the nested submodels. This router takes a target budget (e.g., “provide a model with 2.8B active parameters”) as a one-hot input and generates differentiable masks that select which components are active at that specific budget level. These masks are trained concurrently with the model using Gumbel-Softmax, enabling gradient flow through discrete architectural choices.
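
For readers unfamiliar with Gumbel-Softmax, here is a minimal standard-library sketch of the sampling step. The logits, candidate head counts, and temperature are hypothetical. In an actual training setup the same computation would run inside an autograd framework so that gradients reach the router's logits; this version only shows the arithmetic.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed (soft) one-hot sample over the given logits."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical router logits over three candidate attention-head counts.
candidate_heads = [8, 12, 16]
rng = random.Random(0)
soft_mask = gumbel_softmax([0.2, 1.5, 0.3], temperature=0.5, rng=rng)
chosen = candidate_heads[soft_mask.index(max(soft_mask))]
print(soft_mask, chosen)
```

Lower temperatures push the sample closer to a hard one-hot choice, while higher temperatures keep it smooth, which is the usual trade-off when annealing this kind of relaxation.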

The loss function integrates knowledge distillation (KD), where the non-elastified parent model serves as the teacher, with a router loss that penalizes deviations from the target resource budget (parameter count, memory, or latency). This means the router learns to make architecture choices that genuinely enhance accuracy under KD, rather than simply minimizing a proxy metric.
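
The combined objective can be sketched as follows. The article only says that a distillation term is combined with a penalty on deviation from the target budget; the specific forms used here (a KL divergence for distillation, a relative absolute deviation for the budget term, and a scalar `weight`) are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits):
    """KL(teacher || student) for a single next-token distribution."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def router_loss(selected_params, target_params):
    """Relative deviation of the selected architecture from the target budget."""
    return abs(selected_params - target_params) / target_params


def total_loss(teacher_logits, student_logits, selected_params, target_params, weight=1.0):
    """Distillation term plus a weighted budget penalty (weight is hypothetical)."""
    return kd_loss(teacher_logits, student_logits) + weight * router_loss(
        selected_params, target_params
    )


# A submodel that overshoots a 2.8B active-parameter target by 0.2B.
loss = total_loss([2.0, 0.5, -1.0], [1.5, 0.7, -0.8], 3.0e9, 2.8e9)
print(round(loss, 4))
```

Because both terms are minimized together, the router is pushed toward architectures that stay on budget while matching the teacher's outputs.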

Training follows a two-stage curriculum: an initial short-context phase (sequence length 8,192 tokens) with uniform budget sampling, followed by an extended-context phase (sequence length 49,152 tokens) with non-uniform sampling that prioritizes the full 30B model (p(30B)=0.5, p(23B)=0.3, p(12B)=0.2). The extended-context phase is crucial for reasoning performance. The research team’s ablations on Nano v2, which serve as the empirical basis for reusing the same curriculum on Nano v3, show that Stage 2 alone yields gains of up to 19.8% on AIME-2025 for the 6B variant and 4.0 percentage points for the 12B variant.
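
The two sampling schedules can be written down directly. This sketch uses the probabilities quoted above; the function name and the idea of drawing one budget per training step are assumptions about how such a schedule might be wired up.

```python
import random
from collections import Counter

# Stage 1: uniform over the three budgets. Stage 2: weighted toward the 30B model.
STAGE1_WEIGHTS = {"30B": 1 / 3, "23B": 1 / 3, "12B": 1 / 3}
STAGE2_WEIGHTS = {"30B": 0.5, "23B": 0.3, "12B": 0.2}


def sample_budget(weights, rng):
    """Pick the budget to train on for one step."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


rng = random.Random(0)
stage2_counts = Counter(sample_budget(STAGE2_WEIGHTS, rng) for _ in range(10_000))
print(stage2_counts)
```

Over many steps the full model sees roughly half of the Stage 2 updates, with the remainder split between the two smaller variants.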

Elastic Budget Control: Tailoring Models to Different Reasoning Phases

Current budget control mechanisms in reasoning models, including Nemotron Nano v3’s default behavior, typically operate by limiting the number of tokens generated during a phase before forcing a final answer. This approach uses the same model throughout. Star Elastic enables a different strategy: employing different nested submodels for the thinking phase compared to the answering phase.

The researchers assessed four configurations. The most effective one, termed ℳS → ℳL (small model for thinking, large model for answering), assigns a more economical model to generate extensive reasoning traces and reserves the full-capacity model for synthesizing the final answer. The 23B → 30B configuration, in particular, advances the accuracy–latency Pareto frontier, achieving up to 16% higher accuracy and 1.9× lower latency compared to default Nemotron Nano v3 budget control. The underlying reasoning is that reasoning tokens are high-volume but can tolerate some reduction in capacity, whereas the final answer demands higher precision.
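
The control flow of the small-thinks, large-answers configuration is simple to sketch. The two callables below are stand-in stubs, not real model calls, and their signatures are hypothetical; a real implementation would switch nested submodels inside one loaded checkpoint rather than call separate functions.

```python
def elastic_generate(prompt, think_model, answer_model, think_budget):
    """Use one model for the reasoning trace and another for the final answer.

    think_model(prompt, max_tokens) -> reasoning text
    answer_model(prompt, reasoning) -> final answer text
    """
    reasoning = think_model(prompt, think_budget)
    answer = answer_model(prompt, reasoning)
    return reasoning, answer


# Stub standing in for the cheaper submodel (for example, the 23B variant).
def small_thinker(prompt, max_tokens):
    return " ".join(["step"] * max_tokens)


# Stub standing in for the full-capacity model (for example, the 30B variant).
def large_answerer(prompt, reasoning):
    return f"answer after {len(reasoning.split())} reasoning tokens"


reasoning, answer = elastic_generate("2 + 2?", small_thinker, large_answerer, think_budget=5)
print(answer)  # answer after 5 reasoning tokens
```

Most of the generated tokens come from the cheaper thinking model, and the expensive model is only invoked for the comparatively short answer.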

Quantization While Preserving the Nested Structure

A straightforward approach to deploying a quantized elastic model would involve quantizing each variant independently after slicing. However, this breaks the nested weight-sharing property and necessitates a separate quantization pass for each size. Instead, Star Elastic applies Quantization-Aware Distillation (QAD) directly to the elastic checkpoint, maintaining the nested mask hierarchy throughout the process.
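
A toy example shows why slicing first and quantizing second breaks weight sharing. It uses a simple symmetric per-tensor integer quantizer, which is not FP8 or NVFP4 and is chosen only because it fits in a few lines; the weight values are made up.

```python
def quantize(rows, levels=7):
    """Symmetric per-tensor fake quantization onto integers in [-levels, levels]."""
    scale = max(abs(v) for row in rows for v in row) / levels
    return [[round(v / scale) * scale for v in row] for row in rows]


# Rows are assumed to be stored in importance order, so the submodel is parent[:2].
parent = [[0.10, -0.40], [0.25, 0.05], [2.00, -1.00]]

quantize_then_slice = quantize(parent)[:2]  # shares values with the quantized parent
slice_then_quantize = quantize(parent[:2])  # uses a different scale, so values differ

print(quantize_then_slice)
print(slice_then_quantize)
assert quantize_then_slice != slice_then_quantize
```

Because the scale depends on which weights are present, independently quantized variants no longer agree on the shared weights and would each need their own stored copy.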

For FP8 (E4M3 format), post-training quantization (PTQ) proves sufficient, recovering 98.69% of BF16 accuracy for the 30B variant. For NVFP4 (NVIDIA’s 4-bit floating-point format), PTQ alone results in a 4.12% average accuracy drop. Therefore, a brief nested QAD phase (~5B tokens at 48K context) is implemented, restoring recovery to 97.79% for the 30B variant. In both scenarios, zero-shot slicing of the 23B and 12B variants from the single quantized checkpoint remains intact.

The memory implications are substantial. Storing separate 12B, 23B, and 30B BF16 checkpoints demands 126.1 GB; the single elastic checkpoint requires only 58.9 GB. The 30B NVFP4 elastic checkpoint fits within 18.7 GB, allowing the 12B NVFP4 variant to operate on an RTX 5080, where all BF16 configurations would run out of memory. On an RTX Pro 6000, the 12B NVFP4 variant achieves 7,426 tokens/s, representing a 3.4× throughput improvement over the 30B BF16 baseline.
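
The storage figures above can be checked with a few lines of arithmetic. The per-variant BF16 sizes (58.9 GB, 44.0 GB, 23.2 GB) are taken from the memory table later in this article.

```python
# BF16 footprint of each variant in GB, from the article's memory table.
BF16_GB = {"30B": 58.9, "23B": 44.0, "12B": 23.2}

separate_total = sum(BF16_GB.values())  # three independent checkpoints
elastic_total = BF16_GB["30B"]          # one checkpoint holding all three variants
saving = 1 - elastic_total / separate_total

print(f"{separate_total:.1f} GB separate vs {elastic_total:.1f} GB elastic ({saving:.0%} smaller)")
# 126.1 GB separate vs 58.9 GB elastic (53% smaller)
```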

Depth vs. Width: The Rationale Behind Star Elastic’s Width Compression

One design choice that merits explicit discussion: the research team compared two compression strategies—completely removing layers (depth compression) versus reducing internal dimensions such as hidden size, expert count, and head count (width compression). With a 15% parameter reduction and 25B tokens of knowledge distillation, width compression recovered 98.1% of baseline performance, whereas depth compression recovered only 95.2%, showing noticeable degradation on HumanEval and MMLU-Pro. Consequently, Star Elastic prioritizes width-based elasticity for its primary results, although depth compression (layer skipping) remains an option for scenarios with extreme latency constraints.

The evaluation suite includes AIME-2025, GPQA, LiveCodeBench v5, MMLU-Pro, and IFBench, among other benchmarks.

When it comes to benchmark performance, the Elastic-30B variant matches its parent model, Nemotron Nano v3 30B, across most tests. Meanwhile, the Elastic-23B and Elastic-12B variants hold their own against separately trained models of comparable size. Notably, the Elastic-23B achieves an AIME-2025 score of 85.63, outperforming Qwen3-30B-A3B’s 80.00, even though it uses fewer active parameters.

In terms of training efficiency, the team reports a 360× reduction in tokens compared to training each variant from the ground up. This also represents a 7× improvement over previous top-tier compression techniques, which require separate distillation runs for each model size. On an H100 GPU using bfloat16 with identical input and output sequence lengths, the 12B variant delivers 2.4× the throughput of the 30B parent model.

Getting Started with NVIDIA Star Elastic

Step-by-Step Guide

Nemotron Nano v3 Elastic — 30B / 23B / 12B in one checkpoint · BF16 / FP8 / NVFP4

Star Elastic models are available through Hugging Face and work with both
Transformers (ideal for experimentation) and vLLM
(the go-to choice for production inference). Choose the option that best fits your needs.

bash

# Option A — vLLM (recommended for production serving)
pip install vllm

# Option B — Transformers (for local experimentation)
pip install transformers torch accelerate

# Optional: sign in to Hugging Face if needed
pip install huggingface_hub
huggingface-cli login


Hardware note: The full 30B BF16 checkpoint needs roughly 60 GB of VRAM to run the entire nested family.
For H100/A100 or RTX-class GPUs, consider FP8 (~31 GB) or NVFP4 (~19 GB) instead.

A single checkpoint houses all three nested variants: 30B (3.6B active),
23B (2.8B active), and 12B (2.0B active). Load it once and extract any variant
without additional training. Because of the hybrid Mamba–Transformer–MoE architecture, you’ll need
to set trust_remote_code=True.

python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# The 30B BF16 elastic checkpoint — includes all 3 nested variants
model_id = "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"     # automatically spreads across available GPUs
)

print(f"Model loaded: {model_id}")


Active vs. total parameters: “30B total / 3.6B active” means the model stores
30 billion weights but activates only 3.6 billion of them per token in each forward pass,
which is how Mixture-of-Experts (MoE) architectures work.

The model uses a <think/> token to work through a reasoning chain before
delivering its final answer. You can manage the total token budget with max_new_tokens —
higher values let the model reason more deeply on challenging problems.

python

messages = [
    {
        "role": "user",
        "content": "What is the time complexity of QuickSort, and why?"
    }
]

# Apply the chat template and tokenize
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Generate: the model emits its reasoning trace, then the final answer
outputs = model.generate(
    **inputs,
    max_new_tokens=4096,    # combined thinking + answer budget
    temperature=0.6,
    top_p=0.95,
    do_sample=True
)

response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True
)
print(response)


Thinking budget tip: For math or coding challenges, try setting max_new_tokens
between 8192 and 32768. For simpler questions, 2048–4096 works well and keeps latency lower.
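
Once you have the decoded response, you may want to separate the reasoning trace from the final answer. The helper below is a sketch that assumes the reasoning is wrapped in `<think>` and `</think>` delimiters; check the model card for the actual markers, and note that decoding with skip_special_tokens=True may remove them if they are registered as special tokens.

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Return (reasoning, answer). Reasoning is empty if no closing tag is found."""
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


# Hypothetical decoded output, used here only to exercise the helper.
sample = "<think>Pivot choice drives the recursion depth.</think>O(n log n) on average."
reasoning, answer = split_reasoning(sample)
print(answer)  # O(n log n) on average.
```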

For production environments, vLLM lets you serve the model through an
OpenAI-compatible REST API. This unlocks batched inference, continuous batching,
and significantly higher throughput — the 12B variant reaches 2.4× the throughput
of the 30B parent on an H100 GPU.

bash

# Launch the vLLM server (OpenAI-compatible)
vllm serve "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16"

# --- In a separate terminal ---

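# Send a request to the server using curl
# (URL assumes vLLM's default host and port, localhost:8000)
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{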
    "model": "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16",
    "messages": [
      {
        "role": "user",
        "content": "Explain gradient descent in 3 steps."
      }
    ],
    "max_tokens": 4096,
    "temperature": 0.6
  }'

# Or run via Docker
docker model run hf.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16


SGLang alternative: SGLang is also supported —
run python3 -m sglang.launch_server --model-path "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16" --port 30000
as a drop-in replacement for vLLM.
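
The same request can be sent from Python using only the standard library. This sketch assumes the server exposes the OpenAI-compatible /v1/chat/completions route at a base URL you supply (for example http://localhost:8000 for a default vLLM launch, or port 30000 for the SGLang command above); the function names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=4096, temperature=0.6):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, **kwargs):
    """Send the request and return the assistant's message text."""
    request = build_chat_request(base_url, model, prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (requires a running server; base URL is an assumption):
# chat("http://localhost:8000",
#      "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16",
#      "Explain gradient descent in 3 steps.")
```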

Three precision variants of the checkpoint are available: BF16, FP8, and NVFP4. All maintain the nested structure,
so the 23B and 12B submodels can be extracted zero-shot from whichever checkpoint
you load. FP8 relies on post-training quantization (PTQ) alone, while NVFP4 adds Quantization-Aware Distillation (QAD) to recover the accuracy lost through PTQ.

bash

# BF16 — full precision, all nested variants in 58.9 GB
vllm serve "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16"

# FP8 (E4M3) — roughly half the size, 30B fits in 31.4 GB
# Post-training quantization, 98.69% accuracy recovery on 30B
vllm serve "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-FP8"

# NVFP4 — most compact option, 30B fits in 18.7 GB
# 12B NVFP4 variant runs on RTX 5080 (BF16 exceeds memory)
# 12B NVFP4 on RTX Pro 6000: 7,426 tokens/s (3.4× vs 30B BF16)
vllm serve "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-NVFP4"

Variant	30B memory	23B memory	12B memory	Best for
BF16 Full	58.9 GB	44.0 GB	23.2 GB	A100 / H100
FP8 PTQ	31.4 GB	23.7 GB	13.0 GB	H100 / A100 / RTX 5090
NVFP4 QAD	18.7 GB	14.1 GB	8.0 GB	RTX 5080 / 5090 / Pro 6000
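
The table can be turned into a small lookup that suggests a configuration for a given GPU. The preference order (largest variant first, then highest precision) and the headroom reserved for KV cache and activations are assumptions for illustration; the memory figures are the ones in the table.

```python
# Weight memory in GB per precision and variant, from the table above.
MEMORY_GB = {
    "BF16": {"30B": 58.9, "23B": 44.0, "12B": 23.2},
    "FP8": {"30B": 31.4, "23B": 23.7, "12B": 13.0},
    "NVFP4": {"30B": 18.7, "23B": 14.1, "12B": 8.0},
}
SIZE_ORDER = ["30B", "23B", "12B"]
PRECISION_ORDER = ["BF16", "FP8", "NVFP4"]


def pick_config(vram_gb, headroom_gb=0.0):
    """Return (precision, variant) for the largest model that fits, or None."""
    for size in SIZE_ORDER:
        for precision in PRECISION_ORDER:
            if MEMORY_GB[precision][size] + headroom_gb <= vram_gb:
                return precision, size
    return None


print(pick_config(80))                  # ('BF16', '30B')
print(pick_config(16, headroom_gb=4))   # ('NVFP4', '12B')
```

With 4 GB of headroom reserved, a 16 GB card lands on the 12B NVFP4 variant, which matches the RTX 5080 example given earlier.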


Key Takeaways

Star Elastic trains 30B, 23B, and 12B nested reasoning models from a single 160B-token post-training run, cutting token usage by 360× compared to pretraining from scratch.
Elastic budget control (23B for thinking, 30B for answering) pushes the accuracy–latency Pareto frontier forward by up to 16% accuracy and 1.9× latency improvements.
A learnable router with Gumbel-Softmax enables end-to-end trainable architecture selection, removing the need for separate compression runs per model size.
Nested QAD preserves zero-shot slicing across FP8 and NVFP4 quantized checkpoints, shrinking the 30B elastic checkpoint to just 18.7 GB in NVFP4.
All three precision variants (BF16, FP8, NVFP4) are publicly available on Hugging Face under nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B.

Check out the Paper, Elastic Models on Hugging Face in BF16, FP8, and NVFP4.