World models — systems that generate realistic video sequences from a starting image and a set of actions — are rapidly becoming essential to embodied AI, simulation, and robotics research. The central challenge lies in scaling these systems to produce minute-long, high-resolution video without demanding prohibitively large GPU clusters for both training and inference. Most competitive open-source baselines either require multi-GPU inference or compromise on resolution to stay within compute budgets.
NVIDIA’s SANA-WM directly addresses these bottlenecks. Built on the SANA-Video codebase and available through the NVlabs/Sana GitHub repository, it is a 2.6-billion-parameter Diffusion Transformer (DiT) trained natively for one-minute generation at 720p with metric-scale six-degree-of-freedom (6-DoF) camera control. It supports three single-GPU inference variants: a bidirectional generator for high-quality offline synthesis, a chunk-causal autoregressive generator for sequential rollout, and a few-step distilled autoregressive generator for faster deployment. The distilled variant can denoise a 60-second 720p clip in just 34 seconds on a single RTX 5090 with NVFP4 quantization.

The Architecture: Four Core Design Decisions
1. Hybrid Linear Attention with Gated DeltaNet (GDN)
Standard softmax attention scales quadratically in both memory and compute as sequence length grows, a serious obstacle when a 60-second 720p video spans 961 frames. SANA-Video, the predecessor model, used cumulative ReLU-based linear attention, which maintains a constant-size recurrent state. However, that approach has no decay mechanism: all past frames accumulate with equal weight, which leads to drift over minute-long sequences.
SANA-WM replaces most attention blocks with frame-wise Gated DeltaNet (GDN). Unlike the token-wise GDN used in language models, SANA-WM’s frame-wise variant processes one entire latent frame per recurrent step. The GDN update rule incorporates a decay gate γ (which progressively down-weights stale past frames) and a delta-rule correction (which updates only the residual between the target value and the current state prediction), keeping the recurrent state at a constant D×D size regardless of video length.
To stabilize training, the research team introduces an algebraic key-scaling approach: keys are scaled by 1/√(D·S), where D is the head dimension and S is the number of spatial tokens per frame. This keeps the spectral norm of the transition matrix bounded and eliminates the NaN divergence seen with the two alternatives tested: standard L2 key normalization (1/√D) diverged at step 16, and no scaling at all diverged at step 1.
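To make the recurrence concrete, here is a minimal NumPy sketch of a frame-wise gated delta update with 1/√(D·S) key scaling. It is an illustration under stated assumptions, not the SANA-WM implementation: the function names, the scalar write strength `beta`, the fixed `gamma`, and the choice of raw key entries in [-1, 1] are all hypothetical. Under that last assumption the scaled keys satisfy ‖K‖² ≤ 1, so the transition γ(I − βKᵀK) has spectral norm at most γ.

```python
import numpy as np

def scale_keys(raw_keys):
    """Scale keys of shape (S, D) by 1/sqrt(D*S), as described in the article."""
    num_tokens, head_dim = raw_keys.shape
    return raw_keys / np.sqrt(head_dim * num_tokens)

def gdn_frame_step(state, keys, values, gamma, beta=1.0):
    """One recurrent step over a whole latent frame (illustrative sketch).

    state:  (D, D) memory that maps a key to a predicted value
    keys:   (S, D) scaled keys for the S spatial tokens of the frame
    values: (S, D) target values for the same tokens
    gamma:  decay gate in (0, 1] that down-weights older frames
    beta:   write strength in [0, 1] (assumed; not given in the article)
    """
    decayed = gamma * state
    predicted = keys @ decayed.T           # what the decayed state predicts
    residual = values - predicted          # delta rule: write only the error
    return decayed + beta * residual.T @ keys

def transition_norm(keys, gamma, beta=1.0):
    """Spectral norm of the state transition gamma * (I - beta * K^T K)."""
    head_dim = keys.shape[1]
    transition = gamma * (np.eye(head_dim) - beta * keys.T @ keys)
    return np.linalg.norm(transition, ord=2)

rng = np.random.default_rng(0)
num_frames, num_tokens, head_dim = 50, 16, 8
state = np.zeros((head_dim, head_dim))
for _ in range(num_frames):
    raw_keys = rng.uniform(-1.0, 1.0, size=(num_tokens, head_dim))
    values = rng.normal(size=(num_tokens, head_dim))
    state = gdn_frame_step(state, scale_keys(raw_keys), values, gamma=0.95)

scaled_norm = transition_norm(scale_keys(raw_keys), gamma=0.95)
unscaled_norm = transition_norm(raw_keys, gamma=0.95)
```

The state stays D×D no matter how many frames are processed. In this toy setup the unscaled keys give a transition with spectral norm above 1, which is the kind of growth that can blow up over long rollouts.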
The final backbone interleaves 15 frame-wise GDN blocks with 5 softmax attention blocks (at layers 3, 7, 11, 15, and 19) across 20 total transformer blocks. The softmax blocks provide precise long-range recall in situations where GDN’s recurrence alone falls short.
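The interleaving pattern can be written down in a few lines. This sketch assumes zero-based layer indices, which is what makes "every fourth block" land on layers 3, 7, 11, 15, and 19; the names `block_kinds` and `SOFTMAX_EVERY` are illustrative, not taken from the codebase.

```python
NUM_BLOCKS = 20
SOFTMAX_EVERY = 4  # every fourth block uses softmax attention

def block_kinds(num_blocks=NUM_BLOCKS, every=SOFTMAX_EVERY):
    """Return the attention type of each transformer block, in order."""
    return ["softmax" if (i + 1) % every == 0 else "gdn" for i in range(num_blocks)]

layout = block_kinds()
softmax_layers = [i for i, kind in enumerate(layout) if kind == "softmax"]
gdn_count = layout.count("gdn")
```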
2. Dual-Branch Camera Control
Camera-controlled world modeling requires the model to faithfully follow a continuous 6-DoF trajectory, not merely align with a text description of motion. SANA-WM uses two complementary branches that operate at different temporal rates:
- Coarse branch (UCPE attention): Operates at the latent-frame rate. For each latent token, it computes a ray-local camera basis from the camera-to-world pose and intrinsics, then applies a Unified Camera Positional Encoding (UCPE) to the geometric channels of each attention head. This captures global trajectory structure across the full sequence.
- Fine branch (Plücker mixing): Addresses a compression mismatch. Each latent token summarizes eight raw frames, each with its own distinct camera pose. The fine branch computes pixel-wise Plücker raymaps (a 6D representation: ray direction d and moment o×d) from all eight raw frames within one VAE temporal stride, packs them into a 48-channel tensor, and adds this embedding after each self-attention output using a zero-initialized projection. This recovers fine-grained camera motion within each stride that the coarse branch misses at latent-frame resolution.
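The fine branch's input can be sketched as follows: one 6-channel Plücker raymap per raw frame, with the eight frames of a temporal stride stacked into 48 channels. This is a hedged illustration assuming a pinhole camera, pixel-centre sampling, and frame-major channel order; the helper names `plucker_raymap` and `pack_stride` are hypothetical, and the projection into the network is omitted.

```python
import numpy as np

def plucker_raymap(intrinsics, cam_to_world, height, width):
    """Return a (6, H, W) map: ray direction d and moment o x d per pixel."""
    # Pixel centres in homogeneous image coordinates, shape (H, W, 3).
    us, vs = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([us, vs, np.ones_like(us)], axis=-1)
    # Back-project through K^-1, then rotate into the world frame.
    dirs_cam = pixels @ np.linalg.inv(intrinsics).T
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T
    dirs_world = dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    # The camera centre o is the translation part of the camera-to-world pose.
    origin = np.broadcast_to(cam_to_world[:3, 3], dirs_world.shape)
    moments = np.cross(origin, dirs_world)
    return np.concatenate([dirs_world, moments], axis=-1).transpose(2, 0, 1)

def pack_stride(intrinsics_list, poses, height, width):
    """Stack the raymaps of one temporal stride along the channel axis."""
    maps = [plucker_raymap(k, p, height, width)
            for k, p in zip(intrinsics_list, poses)]
    return np.concatenate(maps, axis=0)

height, width, focal = 4, 6, 5.0
intrinsics = np.array([[focal, 0.0, width / 2],
                       [0.0, focal, height / 2],
                       [0.0, 0.0, 1.0]])
poses = []
for i in range(8):                 # eight raw frames per latent frame
    pose = np.eye(4)
    pose[0, 3] = 0.1 * i           # toy trajectory: slide along the x axis
    poses.append(pose)

packed = pack_stride([intrinsics] * 8, poses, height, width)
```

Eight frames times six Plücker channels gives the 48-channel tensor mentioned above, so each latent frame carries every camera pose inside its stride rather than a single representative one.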
Ablation studies on OmniWorld reveal that neither branch on its own performs as well as the combined dual-branch approach: using UCPE alone yields a Camera Motion Consistency (CamMC) score of 0.2453, whereas UCPE combined with Plücker mixing achieves 0.2047.
3. Two-Stage Generation Pipeline
Although Stage-1 SANA-WM outputs maintain spatiotemporal consistency, they may still exhibit structural artifacts across extended sequences. To address this, a second-stage refiner is employed, built on the 17B LTX-2 model with rank-384 LoRA adapters trained on paired synthetic and real video data. This refiner corrects such artifacts using truncated-σ flow matching: stage-1 latents are corrupted with a high initial noise level (σ_start = 0.9), and the refiner learns to transform this noisy version into a high-fidelity output. At inference time, only three Euler denoising steps are required. The refiner significantly reduces long-horizon visual drift (ΔIQ), bringing it down from 3.79 to 1.17 on the Simple-Trajectory split and from 3.09 to 0.31 on the Hard-Trajectory split.
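The truncated-σ idea can be sketched in a few lines: start from a partially noised stage-1 latent at σ = 0.9 instead of pure noise, then integrate a velocity field down to σ = 0 in three Euler steps. The linear interpolation convention, the uniform σ schedule, and the function names are assumptions made for illustration. The `oracle` velocity is a toy stand-in for the trained refiner.

```python
import numpy as np

SIGMA_START = 0.9   # initial noise level reported in the article
NUM_STEPS = 3       # Euler steps used at inference time

def corrupt(stage1_latent, noise, sigma=SIGMA_START):
    """Blend the stage-1 latent with noise (assumed linear interpolation)."""
    return (1.0 - sigma) * stage1_latent + sigma * noise

def refine(noisy_latent, velocity_fn, sigma_start=SIGMA_START, num_steps=NUM_STEPS):
    """Integrate from sigma_start to 0 with Euler steps on a uniform schedule."""
    sigmas = np.linspace(sigma_start, 0.0, num_steps + 1)
    x = noisy_latent
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + (sigma_next - sigma) * velocity_fn(x, sigma)
    return x

rng = np.random.default_rng(0)
target = rng.normal(size=(4, 8))                          # toy high-fidelity latent
stage1 = target + 0.3 * rng.normal(size=target.shape)     # toy stage-1 output
noisy = corrupt(stage1, rng.normal(size=target.shape))

def oracle(x, sigma):
    """Ideal straight-line velocity toward the target (replaces the model)."""
    return (x - target) / sigma

refined = refine(noisy, oracle)
```

With a perfect velocity field the three steps land on the target. A learned refiner only approximates that field, which is why starting at σ = 0.9 rather than 1.0 helps: the stage-1 structure is still partly present in the starting point.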
4. Robust Data Annotation Pipeline
Training camera-controlled video generation demands large-scale 6-DoF pose annotations, which are absent from standard video datasets. The team enhanced VIPE (a camera-pose annotation engine) by swapping its depth backend with Pi3X—chosen for consistent depth estimation over long sequences—combined with MoGe-2 for precise per-frame metric scaling. They also upgraded the bundle adjustment process to treat focal lengths and principal points as per-frame variables instead of fixed global intrinsics, improving annotation robustness for internet videos with varying focal lengths.
The final pipeline processes seven training datasets sourced from multiple open-source repositories: SpatialVID-HQ (real, 10-second clips), DL3DV real clips (10s), DL3DV GS Refined synthetic clips (60s, rendered using 3D Gaussian Splatting), OmniWorld (synthetic, 60s), Sekai Game (synthetic, 60s), Sekai Walking-HQ (real, 60s), and MiraData (real, 60s). Together, these produce 212,975 clips with metric-scale pose annotations. The LTX2-VAE used for compression is 2.0× smaller than ST-DC-AE and 8.0× smaller than Wan2.1-VAE, directly boosting both training and inference efficiency.
For DL3DV—which consists of static 3D scene captures rather than native minute-long videos—the team created one FCGS 3D Gaussian Splatting reconstruction per scene, designed diverse one-minute camera trajectories, rendered long videos with known intrinsics and extrinsics, and then enhanced the outputs using DiFix3D to minimize splatting-related artifacts.
Training Strategy and Infrastructure
SANA-WM’s training runs on 64 H100 GPUs in two phases. Before DiT training begins, the LTX2 VAE is adapted to the SANA-Video SFT training data over approximately 50K steps, taking around 3.5 days. The main DiT training then proceeds through a four-stage progressive schedule spanning roughly 15 days:
- Stage 1 (~2.75 days): The pre-trained SANA-Video model is adapted to the frame-wise GDN architecture using short (5-second) video clips. This phase replaces cumulative linear attention with recurrent GDN blocks in a simpler, short-horizon setting where issues are easier to identify and fix.
- Stage 2 (~2 days): Hybrid attention is introduced by substituting every fourth GDN block with a standard softmax attention block, still on short clips, to better balance efficiency and output quality.
- Stage 3 (~8 days): Training extends to 961-frame (60-second) sequences and incorporates Dual-Branch Camera Control. Context-Parallel (CP=2) sharding distributes latent sequences across GPUs via prefix-sum composition of GDN transition matrices—a mathematically exact parallelization method with minimal communication overhead.
- Stage 4 (~2.5 days): A chunk-causal variant is fine-tuned for autoregressive rollout, followed by self-forcing distillation to reduce sampling to just four denoising steps. Attention-sink tokens and local temporal windows are integrated into softmax attention layers to maintain constant memory usage and per-chunk latency during extended rollouts.
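The context-parallel trick in Stage 3 works because each recurrent step is an affine map on the state, and affine maps compose associatively. The sketch below assumes a recurrence of the form S_t = S_{t-1}·A_t + B_t and splits a sequence across two hypothetical ranks; the helper names and the toy random matrices are illustrative, and the real communication pattern is not described in the article.

```python
import numpy as np
from functools import reduce

def combine(first, second):
    """Compose two affine steps: apply `first`, then `second`."""
    a1, b1 = first
    a2, b2 = second
    return a1 @ a2, b1 @ a2 + b2

def apply_step(state, step):
    """Advance the state by one (possibly composed) step."""
    a, b = step
    return state @ a + b

def sequential(state, steps):
    """Reference rollout: one frame after another."""
    for step in steps:
        state = apply_step(state, step)
    return state

def two_rank_rollout(state, steps):
    """Each rank summarizes its half; only the boundary state is passed on."""
    half = len(steps) // 2
    summary_rank0 = reduce(combine, steps[:half])
    summary_rank1 = reduce(combine, steps[half:])
    boundary = apply_step(state, summary_rank0)
    return apply_step(boundary, summary_rank1)

rng = np.random.default_rng(0)
dim = 4
steps = [(0.9 * np.eye(dim) - 0.05 * rng.normal(size=(dim, dim)),
          rng.normal(size=(dim, dim))) for _ in range(8)]
initial = np.zeros((dim, dim))

reference = sequential(initial, steps)
sharded = two_rank_rollout(initial, steps)
```

Both rollouts agree up to floating-point error, which is what "mathematically exact" means here: the per-rank summaries can be built independently, and only a D×D boundary state has to cross GPUs.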
Custom fused Triton kernels for GDN scan and gate operations deliver approximately 1.5× to 2× speedups throughout the training process.
Benchmark Results
The team introduces a custom 60-second world-model benchmark comprising 80 initial scenes generated by Nano Banana Pro across four scene categories—game, indoor, outdoor-city, and outdoor-nature (20 per category)—each paired with Simple and Hard camera trajectory splits. All evaluations are conducted using each model’s multi-step, undistilled autoregressive configuration.


In this benchmark, SANA-WM paired with its second-stage refiner delivers the following results across both evaluation splits:
- Camera precision (Simple / Hard): Rotation error (RotErr) of 4.50° / 8.34°; Translation error (TransErr) of 1.39 / 1.39; CamMC of 1.41 / 1.44 — outperforming all competing methods, including LingBot-World (14B+14B parameters, 8 GPUs) and HY-WorldPlay (8B parameters, 8 GPUs).
- Visual fidelity: VBench Overall scores of 80.62 / 81.89 on Simple / Hard splits, on par with LingBot-World (81.82 / 81.89) while producing 720p video on just one GPU per clip.
- Processing speed: 22.0 videos per hour on 8 H100 GPUs using the complete pipeline (including the refiner), versus 0.6 videos per hour for LingBot-World — a 36× speed advantage.
- Memory footprint: The entire pipeline requires 74.7 GB, fitting comfortably within the 80 GB H100 memory limit. Stage-1-only inference uses just 51.1 GB.
- Temporal consistency: After refinement, ΔIQ (image quality drop between the first and last 10-second segments) falls to 1.17 on Simple and 0.31 on Hard, compared to 23.59 and 25.88 for HY-WorldPlay.
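The headline ratios follow directly from the reported figures. A quick check, using only numbers quoted above; attributing the memory difference to the refiner is an assumption.

```python
sana_wm_videos_per_hour = 22.0    # 8 H100s, full pipeline with refiner
lingbot_videos_per_hour = 0.6     # LingBot-World on 8 GPUs

speedup = sana_wm_videos_per_hour / lingbot_videos_per_hour
minutes_per_video = 60.0 / sana_wm_videos_per_hour

h100_memory_gb = 80.0
full_pipeline_gb = 74.7
stage1_only_gb = 51.1

headroom_gb = h100_memory_gb - full_pipeline_gb
# Assumed to be the refiner's share: full pipeline minus Stage-1-only usage.
refiner_overhead_gb = full_pipeline_gb - stage1_only_gb
```

The throughput ratio comes out at roughly 36.7, which the article rounds down to 36×, and the full pipeline leaves about 5.3 GB of headroom on an 80 GB H100.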
Key Takeaways
- NVIDIA’s SANA-WM generates 60-second, 720p, camera-controlled videos on a single GPU — trained in ~18.5 days on 64 H100s using only 212,975 public video clips.
- A hybrid Gated DeltaNet + softmax attention backbone keeps the recurrent state at a fixed D×D size regardless of video length, solving the memory explosion problem that makes minute-scale generation impractical with standard softmax attention.
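- Dual-branch camera control, with UCPE at the latent-frame rate and Plücker mixing at the raw-frame rate, lowers CamMC from 0.2453 (UCPE alone) to 0.2047 in the OmniWorld ablation. On the 60-second benchmark, the full system posts the best camera accuracy among all compared methods, including models 5× larger.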
- A second-stage refiner initialized from 17B LTX-2 with rank-384 LoRA reduces long-horizon visual drift (ΔIQ) from 3.09 to 0.31 on Hard trajectories using just 3 Euler denoising steps.
- At 22.0 videos/hour on 8 H100s, SANA-WM + refiner delivers 36× higher throughput than LingBot-World (14B+14B, 8 GPUs) while achieving comparable VBench visual quality scores.