Meet SANA-WM: NVIDIA's 2.6B-Parameter Open-Source World Model That Crafts Minute-Long 720p Video On One GPU

World models — systems that generate realistic video sequences from a starting image and a set of actions — are rapidly becoming essential to embodied AI, simulation, and robotics research. The central challenge lies in scaling these systems to produce minute-long, high-resolution video without demanding prohibitively large GPU clusters for both training and inference. Most competitive open-source baselines either require multi-GPU inference or compromise on resolution to stay within compute budgets.

NVIDIA’s SANA-WM directly addresses these bottlenecks. Built on the SANA-Video codebase and available through the NVlabs/Sana GitHub repository, it is a 2.6-billion-parameter Diffusion Transformer (DiT) trained natively for one-minute generation at 720p with metric-scale six-degree-of-freedom (6-DoF) camera control. It supports three single-GPU inference variants: a bidirectional generator for high-quality offline synthesis, a chunk-causal autoregressive generator for sequential rollout, and a few-step distilled autoregressive generator for faster deployment. The distilled variant can denoise a 60-second 720p clip in just 34 seconds on a single RTX 5090 with NVFP4 quantization.

The Architecture: Four Core Design Decisions

1. Hybrid Linear Attention with Gated DeltaNet (GDN)

Standard softmax attention scales quadratically in both memory and compute as sequence length grows — a serious obstacle when generating 961 latent frames for a 60-second 720p video. SANA-Video, the predecessor model, used cumulative ReLU-based linear attention, which maintains a constant-size recurrent state. However, this approach lacks any decay mechanism: all past frames accumulate with equal weight, leading to drift over minute-long sequences.

SANA-WM replaces most attention blocks with frame-wise Gated DeltaNet (GDN). Unlike the token-wise GDN used in language models, SANA-WM’s frame-wise variant processes one entire latent frame per recurrent step. The GDN update rule incorporates a decay gate γ (which progressively down-weights stale past frames) and a delta-rule correction (which updates only the residual between the target value and the current state prediction), keeping the recurrent state at a constant D×D size regardless of video length.
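
The two ingredients of that update can be sketched in a few lines of plain Python. This is an illustrative reading of the description above, not the paper's exact equations: the function name `gdn_frame_step`, the scalar `gamma` and `beta`, and the tiny 2×2 state are all assumptions made for the example.

```python
def gdn_frame_step(state, keys, values, gamma, beta):
    """One frame-wise gated delta-rule update (illustrative form only).

    state: D x D matrix (list of rows) that maps a key to a predicted value.
    keys, values: one length-D vector per spatial token in the frame.
    """
    d = len(state)
    # Decay gate: down-weight everything remembered from earlier frames.
    decayed = [[gamma * state[r][c] for c in range(d)] for r in range(d)]
    new_state = [row[:] for row in decayed]
    for k, v in zip(keys, values):
        # What the decayed state currently predicts for this key.
        pred = [sum(decayed[r][c] * k[c] for c in range(d)) for r in range(d)]
        # Delta rule: write only the residual between target and prediction.
        for r in range(d):
            for c in range(d):
                new_state[r][c] += beta * (v[r] - pred[r]) * k[c]
    return new_state


# Hypothetical toy input: one token per frame, the same key/value every frame.
state = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(3):
    state = gdn_frame_step(state, [[1.0, 0.0]], [[2.0, 3.0]], gamma=0.9, beta=1.0)
```

However many frames are fed in, `state` stays a 2×2 matrix, which is the constant-memory property the paragraph describes.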

To stabilize training, the research team introduces an algebraic key-scaling approach: keys are scaled by 1/√(D·S), where D is the head dimension and S is the number of spatial tokens per frame. This ensures the spectral norm of the transition matrix remains bounded and eliminates the NaN divergence events observed with standard L2 key normalization (1/√D) or no scaling at all, both of which triggered NaN events at steps 16 and 1, respectively.
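
One way to see why the extra factor of S matters, offered here as an illustration rather than the paper's own argument: when a whole frame is written in one step, the sum of squared key norms (the trace of KᵀK, which upper-bounds its spectral norm) grows with the number of tokens unless the scale accounts for it. The toy frame below, with S = 6 and D = 4, is hypothetical.

```python
import math


def frame_key_energy(keys, scale):
    """Sum of squared norms of all keys in one frame after multiplying by `scale`."""
    return sum((scale * x) ** 2 for key in keys for x in key)


S, D = 6, 4  # hypothetical: 6 spatial tokens per frame, head dimension 4
frame_keys = [[1.0, -1.0, 1.0, -1.0] for _ in range(S)]

per_token = frame_key_energy(frame_keys, 1.0 / math.sqrt(D))      # grows with S
per_frame = frame_key_energy(frame_keys, 1.0 / math.sqrt(D * S))  # stays near 1
```

With the per-token scale the energy equals S, while the 1/√(D·S) scale keeps it near 1 no matter how many tokens a frame holds.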

The final backbone interleaves 15 frame-wise GDN blocks with 5 softmax attention blocks (at layers 3, 7, 11, 15, and 19) across 20 total transformer blocks. The softmax blocks provide precise long-range recall in situations where GDN’s recurrence alone falls short.
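
The block layout can be written down directly. The sketch assumes the layer numbers quoted above are zero-based indices into the 20 blocks, which the article does not state explicitly.

```python
NUM_BLOCKS = 20
SOFTMAX_LAYERS = {3, 7, 11, 15, 19}  # as listed above; assumed zero-based

layout = ["softmax" if i in SOFTMAX_LAYERS else "gdn" for i in range(NUM_BLOCKS)]
num_gdn = layout.count("gdn")
num_softmax = layout.count("softmax")
```

Under that indexing assumption, every fourth block is softmax attention and the counts come out to 15 GDN and 5 softmax blocks.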

2. Dual-Branch Camera Control

Camera-controlled world modeling requires the model to faithfully follow a continuous 6-DoF trajectory, not merely align with a text description of motion. SANA-WM uses two complementary branches that operate at different temporal rates:

Coarse branch (UCPE attention): Operates at the latent-frame rate. For each latent token, it computes a ray-local camera basis from the camera-to-world pose and intrinsics, then applies a Unified Camera Positional Encoding (UCPE) to the geometric channels of each attention head. This captures global trajectory structure across the full sequence.

Fine branch (Plücker mixing): Addresses a compression mismatch. Each latent token summarizes eight raw frames, each with its own distinct camera pose. The fine branch computes pixel-wise Plücker raymaps (a 6D representation: ray direction d and moment o×d) from all eight raw frames within one VAE temporal stride, packs them into a 48-channel tensor, and adds this embedding after each self-attention output through a zero-initialized projection. This recovers fine-grained camera motion within each stride that the coarse branch misses at latent-frame resolution.
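
The 48-channel figure follows from 6 Plücker values per ray times 8 raw frames per stride. A minimal sketch for a single pixel, with hypothetical helper names (`plucker_ray`, `pack_stride`) and made-up camera positions:

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def plucker_ray(origin, direction):
    """Return 6 values: unit ray direction d followed by the moment o x d."""
    norm = sum(x * x for x in direction) ** 0.5
    d = [x / norm for x in direction]
    return d + cross(origin, d)


def pack_stride(origins, directions):
    """Concatenate one pixel's Plücker rays over the raw frames of a stride."""
    packed = []
    for o, d in zip(origins, directions):
        packed.extend(plucker_ray(o, d))
    return packed


# Hypothetical stride: the camera slides along x while looking down +z.
origins = [[0.1 * i, 0.0, 0.0] for i in range(8)]
directions = [[0.0, 0.0, 1.0] for _ in range(8)]
channels = pack_stride(origins, directions)
```

`channels` holds 48 numbers for this pixel; doing the same for every pixel would give the 48-channel raymap that the fine branch projects and adds to the attention output.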

Ablation studies on OmniWorld show that neither branch alone matches the combined dual-branch approach: Plücker mixing alone yields a Camera Motion Consistency (CamMC) score of 0.4742 and UCPE alone yields 0.2453, whereas UCPE combined with Plücker mixing achieves 0.2047 (lower is better).

3. Two-Stage Generation Pipeline

Although Stage-1 SANA-WM outputs maintain spatiotemporal consistency, they may still exhibit structural artifacts across extended sequences. To address this, a second-stage refiner is employed, built on the 17B LTX-2 model with rank-384 LoRA adapters trained on paired synthetic and real video data. This refiner corrects such artifacts using truncated-σ flow matching: stage-1 latents are corrupted with a high initial noise level (σ_start = 0.9), and the refiner learns to transform this noisy version into a high-fidelity output. At inference time, only three Euler denoising steps are required. The refiner significantly reduces long-horizon visual drift (ΔIQ), bringing it down from 3.79 to 1.17 on the Simple-Trajectory split and from 3.09 to 0.31 on the Hard-Trajectory split.
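
The refinement loop can be sketched on a toy latent. Several things here are assumptions rather than details from the paper: the linear noise blend in `corrupt`, the uniform step schedule from 0.9 to 0, and the oracle velocity function that stands in for the learned refiner.

```python
def corrupt(latent, noise, sigma):
    """Blend a latent with noise at level sigma (linear interpolation path)."""
    return [(1.0 - sigma) * x + sigma * n for x, n in zip(latent, noise)]


def euler_refine(x, velocity_fn, sigma_start=0.9, steps=3):
    """Integrate from sigma_start down to 0 with a fixed number of Euler steps."""
    sigmas = [sigma_start * (1 - i / steps) for i in range(steps + 1)]
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        v = velocity_fn(x, sigma)
        x = [xi - (sigma - sigma_next) * vi for xi, vi in zip(x, v)]
    return x


# Toy data: a stage-1 latent, the clean target it should become, and fixed noise.
stage1 = [1.0, 2.0, 3.0]
target = [1.5, 1.5, 3.5]
noise = [0.3, -0.7, 0.2]


def oracle_velocity(x, sigma):
    """Stand-in for the trained refiner: points along the straight path to target."""
    return [(xi - ti) / sigma for xi, ti in zip(x, target)]


noisy = corrupt(stage1, noise, sigma=0.9)
refined = euler_refine(noisy, oracle_velocity)
```

With a perfect velocity field the three Euler steps land on the target, which is why a short schedule can suffice when the refiner only has to correct artifacts rather than generate from pure noise.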

4. Robust Data Annotation Pipeline

Training camera-controlled video generation demands large-scale 6-DoF pose annotations, which are absent from standard video datasets. The team enhanced VIPE (a camera-pose annotation engine) by swapping its depth backend with Pi3X—chosen for consistent depth estimation over long sequences—combined with MoGe-2 for precise per-frame metric scaling. They also upgraded the bundle adjustment process to treat focal lengths and principal points as per-frame variables instead of fixed global intrinsics, improving annotation robustness for internet videos with varying focal lengths.

The final pipeline processes seven training datasets sourced from multiple open-source repositories: SpatialVID-HQ (real, 10-second clips), DL3DV real clips (10s), DL3DV GS Refined synthetic clips (60s, rendered using 3D Gaussian Splatting), OmniWorld (synthetic, 60s), Sekai Game (synthetic, 60s), Sekai Walking-HQ (real, 60s), and MiraData (real, 60s). Together, these produce 212,975 clips with metric-scale pose annotations. The LTX2-VAE used for compression is 2.0× smaller than ST-DC-AE and 8.0× smaller than Wan2.1-VAE, directly boosting both training and inference efficiency.

For DL3DV—which consists of static 3D scene captures rather than native minute-long videos—the team created one FCGS 3D Gaussian Splatting reconstruction per scene, designed diverse one-minute camera trajectories, rendered long videos with known intrinsics and extrinsics, and then enhanced the outputs using DiFix3D to minimize splatting-related artifacts.

Training Strategy and Infrastructure

SANA-WM’s training runs on 64 H100 GPUs in two phases. Before DiT training begins, the LTX2 VAE is adapted to the SANA-Video SFT training data over approximately 50K steps, taking around 3.5 days. The main DiT training then proceeds through a four-stage progressive schedule spanning roughly 15 days:

Stage 1 (~2.75 days): The pre-trained SANA-Video model is adapted to the frame-wise GDN architecture using short (5-second) video clips. This phase replaces cumulative linear attention with recurrent GDN blocks in a simpler, short-horizon setting where issues are easier to identify and fix.
Stage 2 (~2 days): Hybrid attention is introduced by substituting every fourth GDN block with a standard softmax attention block, still on short clips, to better balance efficiency and output quality.
Stage 3 (~8 days): Training extends to 961-frame (60-second) sequences and incorporates Dual-Branch Camera Control. Context-Parallel (CP=2) sharding distributes latent sequences across GPUs via prefix-sum composition of GDN transition matrices—a mathematically exact parallelization method with minimal communication overhead.
Stage 4 (~2.5 days): A chunk-causal variant is fine-tuned for autoregressive rollout, followed by self-forcing distillation to reduce sampling to just four denoising steps. Attention-sink tokens and local temporal windows are integrated into softmax attention layers to maintain constant memory usage and per-chunk latency during extended rollouts.
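
The reason Stage 3's sharding can be exact is that a linear recurrence composes associatively. The sketch below uses scalar steps of the form s → a·s + b as a stand-in for the D×D GDN transition matrices (where multiplication order would matter); the step values and the two-way split are hypothetical.

```python
def compose(first, second):
    """Combine two affine steps s -> a*s + b, applying `first` and then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def run(steps, state=0.0):
    """Reference: apply every step one after another."""
    for a, b in steps:
        state = a * state + b
    return state


def summarize(steps):
    """Collapse a chunk of steps into one equivalent affine step."""
    total = (1.0, 0.0)  # identity map
    for step in steps:
        total = compose(total, step)
    return total


steps = [(0.5, 1.0), (0.25, 2.0), (0.5, -1.0), (0.75, 0.5)]
sequential = run(steps)

# Two "devices" each summarize their half; only the boundary state is exchanged.
a_left, b_left = summarize(steps[:2])
a_right, b_right = summarize(steps[2:])
boundary_state = a_left * 0.0 + b_left
sharded = a_right * boundary_state + b_right
```

Both paths give the same final state, and the only value passed between the two halves is the boundary state.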

Custom fused Triton kernels for GDN scan and gate operations deliver approximately 1.5× to 2× speedups throughout the training process.

Benchmark Results

The team introduces a custom 60-second world-model benchmark comprising 80 initial scenes generated by Nano Banana Pro across four scene categories—game, indoor, outdoor-city, and outdoor-nature (20 per category)—each paired with Simple and Hard camera trajectory splits. All evaluations are conducted using each model’s multi-step, undistilled autoregressive configuration.

In this benchmark, SANA-WM paired with its second-stage refiner delivers the following results across both evaluation splits:

Camera precision (Simple / Hard): Rotation error (RotErr) of 4.50° / 8.34°; Translation error (TransErr) of 1.39 / 1.39; CamMC of 1.41 / 1.44 — outperforming all competing methods, including LingBot-World (14B+14B parameters, 8 GPUs) and HY-WorldPlay (8B parameters, 8 GPUs).
Visual fidelity: VBench Overall scores of 80.62 / 81.89 on Simple / Hard splits, on par with LingBot-World (81.82 / 81.89) while producing 720p video on just one GPU per clip.
Processing speed: 22.0 videos per hour on 8 H100 GPUs using the complete pipeline (including the refiner), versus 0.6 videos per hour for LingBot-World — a 36× speed advantage.
Memory footprint: The entire pipeline requires 74.7 GB, fitting comfortably within the 80 GB H100 memory limit. Stage-1-only inference uses just 51.1 GB.
Temporal consistency: After refinement, ΔIQ (image quality drop between the first and last 10-second segments) falls to 1.17 on Simple and 0.31 on Hard, compared to 23.59 and 25.88 for HY-WorldPlay.
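
The headline ratios can be rechecked from the reported figures. The GPU-seconds figure is a derived quantity that assumes all 8 GPUs are busy for the full hour.

```python
sana_vph, lingbot_vph = 22.0, 0.6   # videos/hour on 8 H100s, as reported above
pipeline_gb, h100_gb = 74.7, 80.0   # full-pipeline memory vs. H100 capacity

speedup = sana_vph / lingbot_vph
gpu_seconds_per_video = 8 * 3600 / sana_vph
headroom_gb = h100_gb - pipeline_gb

print(f"speedup ~{speedup:.1f}x, {gpu_seconds_per_video:.0f} GPU-s/video, "
      f"{headroom_gb:.1f} GB headroom")
```

The ratio works out to about 36.7, which the article rounds to 36×.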

Marktechpost’s Visual Walkthrough

01 / 09 • Overview

What Exactly Is SANA-WM?

SANA-WM is NVIDIA’s open-source world model that accepts a single still image and a camera path as inputs, then generates a lifelike 60-second, 720p video that accurately follows that camera path. In simple terms: one photo — endless explorable environments.

Most existing world models either demand large multi-GPU clusters for inference or compromise on resolution to stay within hardware limits. SANA-WM makes minute-long, 720p, camera-guided video generation practical — trained on 64 H100 GPUs and running inference on just one GPU.

2.6B: Parameters (open-source)
720p: Native output resolution
60s: Native generation length

Core principle: SANA-WM treats computational efficiency as a foundational design goal — not a secondary consideration. Its distilled version can denoise an entire 60-second 720p video in just 34 seconds on a single RTX 5090 using NVFP4 quantization.

02 / 09 • The Problem

Where Current World Models Come Up Short

Producing a 60-second video at 720p requires modeling 961 latent frames. Conventional softmax attention — the standard mechanism in most video diffusion models — has memory and compute costs that increase quadratically with sequence length. At minute-scale durations, this exhausts the memory of any single GPU.
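
A quick way to see the scaling gap: count the query-key pairs softmax attention must score and compare that with the fixed size of a D×D recurrent state. The frame counts, tokens per frame, and head dimension below are hypothetical placeholders, not SANA-WM's actual configuration.

```python
def softmax_pairs(frames, tokens_per_frame):
    """Number of query-key pairs full softmax attention scores over a sequence."""
    n = frames * tokens_per_frame
    return n * n


def recurrent_state_entries(head_dim):
    """Entries in a D x D recurrent state, independent of sequence length."""
    return head_dim * head_dim


TOKENS_PER_FRAME = 100  # hypothetical
HEAD_DIM = 64           # hypothetical

growth = softmax_pairs(120, TOKENS_PER_FRAME) / softmax_pairs(60, TOKENS_PER_FRAME)
state_short = recurrent_state_entries(HEAD_DIM)
state_long = recurrent_state_entries(HEAD_DIM)
```

Doubling the number of frames quadruples the pair count, while the recurrent state has the same number of entries for both lengths.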

Model	Params	Res	GPUs	Throughput
LingBot-World	14B+14B	480p	8	0.6 vids/hr
HY-WorldPlay	8B	480p	8	1.1 vids/hr
Matrix-Game 3.0	5B	720p	8	3.1 vids/hr
SANA-WM	2.6B	720p	1	24.1 vids/hr

SANA-WM addresses this challenge through four key architectural innovations working in concert: hybrid linear attention, dual-branch camera control, a two-stage refinement pipeline, and a robust data annotation pipeline.

03 / 09 • Architecture

Innovation 1: Hybrid Linear Attention with Gated DeltaNet (GDN)

Standard softmax attention scales quadratically with context length. SANA-Video (the earlier version) relied on cumulative ReLU-based linear attention — constant memory usage, but no decay mechanism: all past frames accumulated with equal importance, leading to visual drift over minute-long sequences.

SANA-WM introduces frame-wise Gated DeltaNet (GDN). Unlike the token-wise GDN variant used in large language models, this version processes an entire latent frame in each recurrent step. It enhances the recurrent state through two correction mechanisms:

γ (decay gate): selectively discards outdated information from prior frames by multiplying the previous state with a learned decay scalar.
β (delta-rule correction): updates only the difference (residual) between the target value and the current state prediction, rather than replacing the entire state.

The state dimension remains fixed at D×D regardless of how long the video is. To avoid gradient instability during training, keys are scaled by 1/√(D·S), where D is the head dimension and S is the number of spatial tokens per frame. Without this scaling, NaN errors occur as early as the very first training step.

Final backbone: 20 transformer blocks in total — 15 frame-wise GDN blocks combined with 5 softmax attention blocks placed at layers {3, 7, 11, 15, 19}. These softmax blocks serve to maintain long-range spatial coherence in cases where GDN alone falls short.

04 / 09 • Architecture

Design 2: Dual-Branch Camera Control

Camera-controlled world modeling demands precise adherence to a continuous 6-DoF camera trajectory — going well beyond what text descriptions of motion can convey. SANA-WM employs two complementary branches that operate at different temporal granularities:

🌎 Coarse Branch — UCPE

Runs at the latent-frame rate. It derives a ray-local camera basis from the camera-to-world pose and intrinsic parameters, then applies Unified Camera Positional Encoding (UCPE) to the geometric channels of each attention head. This captures the overall 6-DoF trajectory structure across the entire sequence.

📷 Fine Branch — Plücker Mixing

Solves a compression mismatch problem: each latent token represents a summary of 8 raw frames, each captured from a different camera pose. The branch computes pixel-wise Plücker raymaps (a 6D representation consisting of ray direction d and moment o×d) for all 8 raw frames within each VAE temporal stride, packs them into a 48-channel tensor, and injects this signal after each self-attention output through a zero-initialized projection layer.

Camera Encoding	RotErr ↓	TransErr ↓	CamMC ↓
No control	16.93	0.2347	0.4937
Plücker only	16.02	0.2340	0.4742
UCPE only	7.73	0.1350	0.2453
UCPE + Plücker	6.21	0.1162	0.2047
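
To put the ablation in relative terms, the snippet below computes how much each metric drops when Plücker mixing is added on top of UCPE. The numbers are copied from the table; the helper name is ours.

```python
ablation = {
    "No control": (16.93, 0.2347, 0.4937),
    "Plücker only": (16.02, 0.2340, 0.4742),
    "UCPE only": (7.73, 0.1350, 0.2453),
    "UCPE + Plücker": (6.21, 0.1162, 0.2047),
}


def relative_reduction(before, after):
    """Percentage drop from `before` to `after` (all three metrics: lower is better)."""
    return 100.0 * (before - after) / before


rot_gain, trans_gain, cammc_gain = (
    relative_reduction(b, a)
    for b, a in zip(ablation["UCPE only"], ablation["UCPE + Plücker"])
)
```

This gives reductions of roughly 20% in RotErr, 14% in TransErr, and 17% in CamMC relative to UCPE alone.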

05 / 09 • Architecture

Design 3: Two-Stage Generation Pipeline

Stage-1 SANA-WM outputs are spatiotemporally coherent, but may still exhibit structural artifacts across long sequences. A dedicated second-stage refiner is designed to fix these issues.

1. Initialization: The refiner is built on the 17B LTX-2 model with rank-384 LoRA adapters applied to attention projections (Q/K/V/O) and feed-forward layers. Using LoRA-only fine-tuning keeps the approach lightweight compared to full 17B parameter optimization.
2. Truncated-σ flow matching: Stage-1 latents are corrupted with a high initial noise level (σ_start=0.9). The refiner is trained to map this noisy version toward the high-fidelity target, focusing on refinement rather than full reconstruction.
3. Inference: Only 3 Euler denoising steps are required. The LoRA adapters are merged into the distilled LTX-2 base model, adding minimal overhead to end-to-end throughput.

1.17: ΔIQ after refiner (Simple split) versus 3.79 before
0.31: ΔIQ after refiner (Hard split) versus 3.09 before
22.0: Videos per hour on 8 H100 GPUs (full pipeline)

ΔIQ = imaging-quality score in the first 10-second window minus the last 10-second window. A lower value means less quality degradation over the full minute.
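
As a small worked example of that definition, the function below takes one imaging-quality score per second and averages the first and last windows. Averaging per-second scores is an assumption made for illustration; the scores themselves are made up.

```python
def delta_iq(scores_per_second, window=10):
    """Mean score over the first `window` seconds minus the mean over the last."""
    first = scores_per_second[:window]
    last = scores_per_second[-window:]
    return sum(first) / len(first) - sum(last) / len(last)


# Hypothetical 60-second clip whose quality sags slightly towards the end.
scores = [70.0] * 10 + [69.0] * 40 + [68.0] * 10
drift = delta_iq(scores)
```

For this toy clip the drift is 2.0 points, and a clip with no degradation would score 0.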

06 / 09 • Architecture

Design 4: Robust Data Annotation Pipeline

Training camera-controlled generation models requires metric-scale 6-DoF pose annotations — data that standard video datasets do not provide. The team adapted the VIPE pose annotation engine with the following improvements:

Depth backend upgrade

Swapped out the single-frame Metric3D-Small for Pi3X (which provides long-sequence-consistent 3D structure) fused with MoGe-2 (which delivers accurate per-frame metric scale). The fusion is performed by solving for a per-frame scale factor that minimizes weighted depth error, with smoothing applied via exponential moving average (momentum 0.99).
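
The fusion step can be sketched as a weighted least-squares fit followed by smoothing. The squared-error objective, the function names, and the choice to seed the moving average with the first frame's scale are assumptions; the article only states that a per-frame scale minimizing weighted depth error is smoothed with momentum 0.99.

```python
def frame_scale(consistent_depth, metric_depth, weights):
    """Scale s minimizing sum_i w_i * (s * consistent_i - metric_i)^2."""
    num = sum(w * c * m for c, m, w in zip(consistent_depth, metric_depth, weights))
    den = sum(w * c * c for c, w in zip(consistent_depth, weights))
    return num / den


def smooth_scales(scales, momentum=0.99):
    """Exponential moving average over per-frame scales."""
    smoothed, ema = [], scales[0]
    for s in scales:
        ema = momentum * ema + (1.0 - momentum) * s
        smoothed.append(ema)
    return smoothed


# Hypothetical pixels: the sequence-consistent depth is off by a factor of 2.
scale = frame_scale([1.0, 2.0, 4.0], [2.0, 4.0, 8.0], [1.0, 1.0, 1.0])
smoothed = smooth_scales([2.0, 2.0, 4.0])
```

A single outlier frame (scale 4.0) moves the smoothed value only slightly, which is the intended effect of a momentum this high.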

Per-frame intrinsics

Extended bundle adjustment to treat focal lengths and principal points as per-frame variables instead of shared global intrinsics — enabling reliable pose annotation on internet videos that contain varying focal lengths.

Source	Type	Duration	Clips
SpatialVID-HQ	Real	10s	158,369
DL3DV (real)	Real	10s	5,691
DL3DV (GS Refined)	Synthetic	60s	14,881
OmniWorld	Synthetic	60s	1,720
Sekai Game	Synthetic	60s	3,560
Sekai Walking-HQ	Real	60s	9,767
MiraData	Real	60s	18,987
Total	—	—	212,975
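
The table's total and its real-versus-synthetic split can be verified with a few lines:

```python
clips = {
    "SpatialVID-HQ": ("real", 158_369),
    "DL3DV (real)": ("real", 5_691),
    "DL3DV (GS Refined)": ("synthetic", 14_881),
    "OmniWorld": ("synthetic", 1_720),
    "Sekai Game": ("synthetic", 3_560),
    "Sekai Walking-HQ": ("real", 9_767),
    "MiraData": ("real", 18_987),
}

total = sum(count for _, count in clips.values())
by_type = {}
for kind, count in clips.values():
    by_type[kind] = by_type.get(kind, 0) + count
```

The rows sum to the stated 212,975 clips, of which 192,814 are real footage and 20,161 are synthetic.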

07 / 09 • Training

Progressive Training Pipeline

The training process unfolds in two phases on 64 H100 GPUs. The first phase is a VAE pre-adaptation stage (~3.5 days, 50K steps), which fine-tunes the LTX2 VAE on the SANA-Video SFT dataset. After that, the main DiT training runs through four progressive stages (~15 days total):

1. Frame-wise GDN (~2.75 days): Adapt SANA-Video to the GDN recurrent architecture using short 5-second clips. The LTX2-VAE is 2.0× smaller than ST-DC-AE and 8.0× smaller than Wan2.1-VAE, which reduces the token count before any attention computation takes place.
2. Hybrid Attention (~2 days): Swap every 4th GDN block with softmax attention, still on 5-second clips, to balance efficiency and quality before moving to longer sequences.
3. Minute-Scale + CamCtrl (~8 days): Scale up to 961-frame (60s) sequences with Dual-Branch Camera Control. Context-Parallel (CP=2) sharding leverages prefix-sum composition of GDN transition matrices, which is mathematically exact with minimal communication overhead.
4. SFT + Distillation (~2.5 days): Fine-tune a chunk-causal autoregressive variant on ~50K high-quality clips. Apply self-forcing distillation to bring sampling down to 4 denoising steps. Introduce attention-sink tokens and local temporal windows to keep softmax memory usage constant during long rollouts.
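
The constant-memory claim for the softmax layers in step 4 can be illustrated with a chunk-level visibility rule. The sink count and window size below are hypothetical (the article does not give them), and `visible_chunks` is a name invented for this sketch.

```python
def visible_chunks(query_chunk, num_sink=1, window=3):
    """Chunk indices a query chunk may attend to: early sinks plus a recent window."""
    sinks = set(range(min(num_sink, query_chunk + 1)))
    local = set(range(max(0, query_chunk - window + 1), query_chunk + 1))
    return sorted(sinks | local)


early = visible_chunks(1)
late = visible_chunks(10)
largest_context = max(len(visible_chunks(q)) for q in range(100))
```

A late chunk sees only the sink and its recent neighbors, so the number of cached chunks never exceeds `num_sink + window`, however long the rollout runs.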

Efficiency: Custom fused Triton kernels for GDN scan and gate operations deliver roughly 1.5× to 2× throughput improvements across all training stages.

08 / 09 • Results

Benchmark Results on the 60-Second World-Model Benchmark

Evaluation covers 80 scenes (game, indoor, outdoor-city, outdoor-nature) across Simple and Hard camera trajectory splits. The main table reports results using the multi-step, undistilled autoregressive setting.

Method	Res	GPUs	RotErr↓	TransErr↓	CamMC↓	VBench↑	Tput↑
LingBot-World	480p	8	10.47/18.99	2.01/1.65	2.05/1.81	81.82/81.89	0.6
HY-WorldPlay	480p	8	17.89/35.46	2.36/2.34	2.45/2.64	68.82/70.46	1.1
Matrix-Game 3.0	720p	8	12.96/18.79	1.83/1.67	1.92/1.82	78.53/78.79	3.1
SANA-WM+refiner	720p	1	4.50/8.34	1.39/1.39	1.41/1.44	80.62/81.89	22.0

Values shown as Simple/Hard split. RotErr in degrees. Tput = videos/hour on 8 H100s. Full pipeline memory: 74.7 GB — within the 80 GB H100 budget.

Best Camera Accuracy
36× Higher Throughput vs LingBot-World
720p on 1 GPU
Comparable Visual Quality

09 / 09 • Access

How to Access SANA-WM

SANA-WM is open-source and available through the NVlabs/Sana GitHub repository (Apache 2.0 license for code; individual dataset and weight licenses vary — see Table 11 of the paper). The repo also hosts SANA, SANA-1.5, SANA-Sprint, and SANA-Video.

# Clone the repo
git clone https://github.com/NVlabs/Sana.git
cd Sana && ./environment_setup.sh sana

Three Inference Variants

▶ Bidirectional — high-quality offline synthesis (best quality, 49.2 GB)

▶ Chunk-causal AR — sequential rollout for streaming (51.1 GB)

▶ Distilled AR + NVFP4 — 34s per 60s clip on RTX 5090

Resources

📄 Paper: arXiv:2605.15178

🌎 Project page: nvlabs.github.io/Sana/WM/

📊 GitHub: github.com/NVlabs/Sana

🤔 Limitations: no explicit 3D scene memory; can drift in dynamic scenes or rare viewpoints

Practical workflow suggested by the authors: Use the stage-1 model to efficiently search trajectories, then selectively refine the most promising rollouts with the second-stage refiner for higher fidelity.
