NVIDIA's Gated DeltaNet-2: The Linear Attention Layer Redefining Delta Rule Dynamics By Uncoupling Erasure From Updates

Linear attention replaces the ever-growing key-value cache of softmax attention with a fixed-size recurrent state. This reduces sequence mixing to linear time and decoding to constant memory. The real challenge isn’t deciding what to forget. It’s figuring out how to update a compressed memory without disrupting existing associations.

NVIDIA has released Gated DeltaNet-2, a linear attention layer designed to tackle that specific bottleneck. It splits the active memory update into two separate channel-wise gates. Trained at 1.3B parameters on 100B FineWeb-Edu tokens, it outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 across the paper's benchmark suite.

The scalar gate problem in delta-rule models

A recurrent linear attention layer maintains a matrix state S_t and reads from it with the query. DeltaNet introduces an active update: before writing, it subtracts the value currently linked to the incoming key, with a scalar step size β_t controlling how much is overwritten. Mamba-2 instead uses a data-dependent scalar decay α_t for global forgetting. Gated DeltaNet combines both operations, but both gates remain scalar per head.

Kimi Delta Attention (KDA) improves the decay side. It replaces the scalar α_t with a channel-wise vector. KDA still uses a single scalar β_t for the active update. That scalar handles two different tasks at once. It determines how much old content to erase on the key side. It also determines how much new content to commit on the value side. These two operations act on different dimensions of the state. Binding them together is a modeling limitation, not a requirement of the delta rule.
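
A small NumPy sketch may make the coupling concrete. It assumes the state layout used by the recurrence later in this article (a d_k × d_v matrix read out as S^⊤ q) and a unit-norm key; the function name and toy sizes are illustrative and not taken from the NVIDIA code.

```python
import numpy as np

def scalar_delta_step(S, k, v, beta, alpha=1.0):
    """One scalar-gated delta-rule step: S <- (I - beta k k^T)(alpha S) + beta k v^T."""
    S = alpha * S                        # scalar decay (global forgetting)
    old = S.T @ k                        # value currently linked to key k
    # The same beta scales both the erase of `old` and the write of `v`.
    return S - beta * np.outer(k, old) + beta * np.outer(k, v)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)                   # unit-norm key
v = rng.normal(size=d_v)

full = scalar_delta_step(S, k, v, beta=1.0)
half = scalar_delta_step(S, k, v, beta=0.5)

overwrites = np.allclose(full.T @ k, v)                        # beta = 1 replaces the old value
blends = np.allclose(half.T @ k, 0.5 * (S.T @ k) + 0.5 * v)    # beta = 0.5 mixes old and new
print(overwrites, blends)
```

With a single β_t there is no way to erase strongly along some key channels while committing only part of the new value; both sides move together.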

Gated Delta Rule-2: two gates instead of one

Gated DeltaNet-2 separates these two decisions through Gated Delta Rule-2. It introduces a channel-wise erase gate b_t ∈ [0,1]^{d_k} on the key axis. It also introduces a channel-wise write gate w_t ∈ [0,1]^{d_v} on the value axis. Both gates are generated by sigmoid projections of the token representation. The update applies decay before the active edit.

Written compactly, the recurrence is:

S_t = (I − k_t (b_t ⊙ k_t)^⊤) D_t S_{t−1} + k_t (w_t ⊙ v_t)^⊤

Here D_t = Diag(α_t) is the channel-wise decay inherited from KDA. The left factor of the erase matrix remains k_t, preserving the delta-rule write direction. The right factor becomes b_t ⊙ k_t, making the read direction channel-selective. The write term k_t z_t^⊤ uses z_t = w_t ⊙ v_t, making the value update channel-selective.

When both gates collapse to the same scalar β_t, the update recovers KDA exactly. When the decay α_t also collapses to a scalar, it recovers Gated DeltaNet. Both prior models are preserved as tied subspaces of the new update.
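
The recurrence and its two reductions can be checked numerically. The sketch below is a single-step NumPy reading of the formula above, assuming a state of shape d_k × d_v and a decay vector α_t acting on the key axis; `gdr2_step` and `kda_step` are hypothetical helper names, and `kda_step` is only the tied form implied by this article, not KDA's actual implementation.

```python
import numpy as np

def gdr2_step(S, k, v, alpha, b, w):
    """S <- (I - k (b*k)^T) Diag(alpha) S + k (w*v)^T, with S of shape (d_k, d_v)."""
    S_dec = alpha[:, None] * S                  # channel-wise decay on the key axis
    read = S_dec.T @ (b * k)                    # gated read of the decayed state
    return S_dec - np.outer(k, read) + np.outer(k, w * v)

def kda_step(S, k, v, alpha, beta):
    """Tied case: channel-wise decay, one scalar beta for both erase and write."""
    S_dec = alpha[:, None] * S
    return S_dec - beta * np.outer(k, S_dec.T @ k) + beta * np.outer(k, v)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)
v = rng.normal(size=d_v)
alpha = rng.uniform(0.5, 1.0, size=d_k)
beta = 0.7

# Both gates tied to the same scalar: the update should equal the KDA-style step.
tied = gdr2_step(S, k, v, alpha, np.full(d_k, beta), np.full(d_v, beta))
matches_kda = np.allclose(tied, kda_step(S, k, v, alpha, beta))

# Decay tied to a scalar as well: the Gated DeltaNet-style step.
a = 0.9
tied_scalar = gdr2_step(S, k, v, np.full(d_k, a), np.full(d_k, beta), np.full(d_v, beta))
gdn = a * S - beta * np.outer(k, (a * S).T @ k) + beta * np.outer(k, v)
matches_gdn = np.allclose(tied_scalar, gdn)
print(matches_kda, matches_gdn)
```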

From the fast-weight perspective, Gated Delta Rule-2 is equivalent to one online gradient step on a local regression loss. The decayed state stays close to memory, while the residual edit uses gated read and gated write targets.

Chunkwise training and gate-aware backward

The recurrence admits a chunkwise WY form that matches the structure used by KDA.

The cumulative decay applied across each channel is folded into the two components of every rank-one erase operation. Each chunk update is expressed as a product of asymmetric matrices structured as I − k̄_r ē_r^⊤. The implementation operates with a chunk size of C = 64, leveraging fused Triton kernels for efficiency.
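
To illustrate the WY idea in isolation, the sketch below takes generic per-step factors (rows of `K` and `E`, standing in for k̄_r and ē_r) and shows that the product of asymmetric rank-one factors over a chunk collapses to I − K^⊤Y, where Y comes from a single triangular system. It leaves out the decay folding and the state update itself, uses toy sizes rather than C = 64, and is not the fused Triton kernel.

```python
import numpy as np

def sequential_product(K, E):
    """(I - k_C e_C^T) ... (I - k_1 e_1^T), applied one factor at a time."""
    d = K.shape[1]
    P = np.eye(d)
    for k_r, e_r in zip(K, E):
        P = (np.eye(d) - np.outer(k_r, e_r)) @ P
    return P

def wy_product(K, E):
    """Same product written as I - K^T Y, with Y from one lower-triangular system."""
    C, d = K.shape
    L = np.tril(E @ K.T, k=-1)               # L[r, i] = e_r . k_i for i < r
    Y = np.linalg.solve(np.eye(C) + L, E)    # rows satisfy y_r = e_r - sum_{i<r} L[r, i] y_i
    return np.eye(d) - K.T @ Y

rng = np.random.default_rng(0)
C, d = 8, 5                                  # toy chunk length and key dimension
K = rng.normal(size=(C, d)) / np.sqrt(d)
E = rng.normal(size=(C, d)) / np.sqrt(d)
same = np.allclose(sequential_product(K, E), wy_product(K, E))
print(same)
```

The point of the factorization is that the C sequential rank-one updates become a few dense matrix products per chunk, which map well onto GPU matrix units.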

During the backward pass, the scalar shortcut relied upon by KDA no longer holds. On the write path, a distinct diagonal gate spans the value channels, while on the erase path, a separate diagonal gate spans the key channels. As a result, these gate factors must be incorporated within the dot products used to accumulate gradients. The paper provides an explicit derivation of this gate-aware vector-Jacobian product. On Hopper GPUs, the fused WY backward kernel is limited to two and four warps to sidestep a Triton WGMMA layout assertion.
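
A single-step example shows where the per-channel gate factors enter. The sketch assumes a linear loss ⟨G, S_t⟩ with an arbitrary upstream gradient G and an already-decayed state; it is a hand-derived illustration checked by finite differences, not the paper's chunkwise vector-Jacobian product, and the function names are hypothetical.

```python
import numpy as np

def gdr2_edit(S_dec, k, v, b, w):
    """Active edit on an already-decayed state S_dec of shape (d_k, d_v)."""
    return S_dec - np.outer(k, S_dec.T @ (b * k)) + np.outer(k, w * v)

def gate_aware_grads(S_dec, k, v, w, G):
    """Gradients of <G, S_new> with respect to b, w and v."""
    g = G.T @ k                    # upstream gradient pulled back through k, shape (d_v,)
    grad_b = -k * (S_dec @ g)      # erase gate lives on the key channels
    grad_w = v * g                 # write gate lives on the value channels
    grad_v = w * g                 # the write gate scales the value gradient per channel
    return grad_b, grad_w, grad_v

def numeric_grad(f, x, eps=1e-6):
    """Forward finite differences, one coordinate at a time."""
    return np.array([(f(x + eps * e) - f(x)) / eps for e in np.eye(x.size)])

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S_dec = rng.normal(size=(d_k, d_v))
G = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
v = rng.normal(size=d_v)
b = rng.uniform(size=d_k)
w = rng.uniform(size=d_v)

def loss(b_, w_, v_):
    return np.sum(G * gdr2_edit(S_dec, k, v_, b_, w_))

grad_b, grad_w, grad_v = gate_aware_grads(S_dec, k, v, w, G)
grads_match = (
    np.allclose(grad_b, numeric_grad(lambda x: loss(x, w, v), b), atol=1e-5)
    and np.allclose(grad_w, numeric_grad(lambda x: loss(b, x, v), w), atol=1e-5)
    and np.allclose(grad_v, numeric_grad(lambda x: loss(b, w, x), v), atol=1e-5)
)
print(grads_match)
```

With a scalar gate, the factor multiplying `g` could be pulled out of every product; with vectors `b` and `w` it has to stay inside, channel by channel.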

Block design and hybrid model

Gated DeltaNet-2 serves as the recurrent token mixer within a standard Transformer-style block. The query and key branches pass through a linear projection, a short causal convolution, a SiLU activation, and L2 normalization. The value branch passes through a linear projection, a short convolution, and a SiLU activation. The decay parameter α_t, erase gate b_t, and write gate w_t are each produced by independent linear branches. The recurrent output undergoes RMS normalization, is scaled by a SiLU output gate, and is projected back to the model dimension.
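
The query/key branch and the gate branches can be sketched in NumPy as follows. The kernel size of 4, the depthwise form of the convolution, and all weight shapes are assumptions for illustration; the article only specifies the order of operations.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def causal_depthwise_conv(x, kernel):
    """x: (T, d), kernel: (K, d). Output at step t sees only steps t-K+1..t."""
    K, T = kernel.shape[0], x.shape[0]
    padded = np.concatenate([np.zeros((K - 1, x.shape[1])), x], axis=0)
    return sum(padded[j:j + T] * kernel[j] for j in range(K))

def qk_branch(x, W, kernel, eps=1e-6):
    """Linear projection -> short causal conv -> SiLU -> L2 normalization."""
    h = silu(causal_depthwise_conv(x @ W, kernel))
    return h / (np.linalg.norm(h, axis=-1, keepdims=True) + eps)

def sigmoid_gate(x, W):
    """Gate in [0, 1] from its own linear branch, as for the erase and write gates."""
    return 1.0 / (1.0 + np.exp(-(x @ W)))

rng = np.random.default_rng(0)
T, d_model, d_k = 6, 8, 4                      # toy sizes
x = rng.normal(size=(T, d_model))
W_k = rng.normal(size=(d_model, d_k))
W_b = rng.normal(size=(d_model, d_k))
kernel = rng.normal(size=(4, d_k))             # assumed kernel size of 4

keys = qk_branch(x, W_k, kernel)
erase_gate = sigmoid_gate(x, W_b)

# Causality check: changing the last token must not change earlier keys.
x_future = x.copy()
x_future[-1] += 1.0
is_causal = np.allclose(qk_branch(x_future, W_k, kernel)[:-1], keys[:-1])
print(is_causal)
```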

A hybrid configuration adds Sliding-Window Attention (SWA) immediately after the recurrent mixer. A repeating cell stack consists of Gated DeltaNet-2, an MLP, SWA, and a second MLP. SWA captures precise local interactions, while the recurrent mixer summarizes extended context history. The hybrid design maintains linear scaling with sequence length and keeps the attention cache size bounded.
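
The bounded attention cache follows directly from the sliding-window mask. A minimal sketch, assuming the window counts the current token and using toy sizes rather than the 2K window reported below:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """mask[i, j] is True when query position i may attend to key position j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)          # causal, and at most `window` keys back

mask = sliding_window_mask(seq_len=8, window=3)
max_keys_per_query = int(mask.sum(axis=1).max())
print(max_keys_per_query)                       # 3
```

However long the sequence grows, no query needs more than `window` cached keys and values, so the SWA layers add a constant amount of decoding memory.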

Results at 1.3B parameters

Every model in the study uses 1.3B parameters and is trained on 100B tokens from FineWeb-Edu. Parameter counts and recurrent state sizes are held constant across compared models. Each layer stores 262,144 floating-point values in its recurrent state per batch element. Training sequences are 4K tokens long, and hybrid models use a 2K sliding-window attention span. The Mamba-3 MIMO baseline uses a rank of R = 4.
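
The state size is consistent with the model shape listed in the explainer further down (16 heads, d_k = d_v = 128). The byte count assumes fp32 storage, which the article does not state.

```python
num_heads, d_k, d_v = 16, 128, 128

state_floats = num_heads * d_k * d_v      # one d_k x d_v matrix per head
fp32_bytes = state_floats * 4             # assumption: 4 bytes per value

print(state_floats)                       # 262144
print(fp32_bytes / 1024**2)               # 1.0 (MiB per layer per batch element)
```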

For language modeling and commonsense reasoning tasks, Gated DeltaNet-2 delivers the highest average in both experimental configurations. In the recurrent-only setup, the model scores 53.11 on average across LAMBADA and the reasoning benchmark suite, surpassing Mamba-3 MIMO at 52.39 and KDA at 52.28. In the hybrid setup, Gated DeltaNet-2 averages 53.97, ahead of Mamba-3 MIMO at 52.72. Because the recurrent state size is identical across models, the improvement is attributable to the update rule itself rather than increased memory capacity.

The most pronounced improvements show up on long-context retrieval benchmarks from the RULER suite. In the recurrent-only setup, S-NIAH-2 at 4K improves from 89.0 (KDA) to 93.0. S-NIAH-3 at 2K rises sharply from 63.2 (KDA) to 89.8. MK-NIAH-1 at 4K increases from 28.0 (KDA) to 37.8.

On practical retrieval benchmarks (SWDE, SQuAD, FDA, TriviaQA, NQ, DROP), Gated DeltaNet-2 also leads across both configurations. The recurrent average comes to 29.88, and the hybrid average reaches 42.28.

Marktechpost’s Visual Explainer

NVIDIA · 2026

Gated DeltaNet-2

Separating Erase and Write Operations in Linear Attention. A delta-rule recurrent attention layer featuring independent channel-wise gates for erasing and writing.

PyTorch
Triton kernels
1.3B params
100B FineWeb-Edu tokens

Step 01 · The Idea

Two independent gates instead of a single scalar

Linear attention compresses a potentially unlimited KV cache into a recurrent state of fixed size. The challenge lies in editing this memory without disrupting the associations already stored.

The Problem

Earlier delta-rule architectures (Gated DeltaNet, KDA) couple the removal of old content and the insertion of new content to a single scalar gate β_t.

The Solution

Decouple them: use a channel-wise erase gate b_t along the key dimension and a separate channel-wise write gate w_t along the value dimension.

Erase gate determines which key-axis coordinates of the decayed state get read and cleared.
Write gate determines which value-axis coordinates of the incoming content are committed to memory.
Channel-wise decay is adopted from KDA, enabling fine-grained forgetting across each dimension.

Step 02 · The Update Rule

The Gated Delta Rule-2

With an erase gate b_t ∈ [0,1]^{d_k}, a write gate w_t ∈ [0,1]^{d_v}, and channel-wise decay D_t = Diag(α_t), the recurrent state updates as:

S_t = (I − k_t (b_t ⊙ k_t)^⊤) D_t S_{t−1} + k_t (w_t ⊙ v_t)^⊤

Produces the KDA formulation exactly when both gates reduce to the same scalar value.
Produces the Gated DeltaNet formulation when the decay is also reduced to a scalar.
Supports efficient training through a chunkwise WY factorization with channel-wise decay folded into asymmetric erase factors.

Step 03 · Get the Code

Clone the repo and set up the environment

The official PyTorch implementation includes a Dockerfile, training scripts, and the lit_gpt model codebase.

git clone 
cd GatedDeltaNet-2

# build the environment from the provided Dockerfile
docker build -t gdn2 .
docker run --gpus all -it --ipc=host -v $PWD:/workspace gdn2

Repo layout

lit_gpt/ model code · scripts/ launchers · pretrain.py training entry · data.py, cache.py data & KV cache · paper/ arXiv PDF

Step 04 · Launch Training

Run `pretrain.py`

The streamlined command taken from the official README. Swap the placeholders with your dataset paths and configuration name.

python ../pretrain.py \
  --train_data_dir ${TRAIN_DATA} \
  --val_data_dir ${VALIDATION_DATA} \
  --output_root ${SAVE_DIR} \
  --exp_name ${NAME} \
  --model_name ${MODEL} \
  --train_config ${CONFIG} \
  --eval_iters ${EVAL_ITERS} \
  --learning_rate ${LR} \
  --micro_batch_size ${MICRO_BATCH_SIZE}

Pro tip

Include --interactive_job --code-debug for a real-time debugging session.

Step 05 · Default Recipe

The 1.3B / 100B FineWeb-Edu Configuration

Benchmarked side by side with Mamba-2, Gated DeltaNet, KDA, and Mamba-3 using the same optimizer setup and recurrent state size.

Optimizer

AdamW · peak LR 4e-4 · weight decay 0.1 · gradient clip 1.0 · cosine schedule · 1B token warmup.

Batch & Sequence

Global batch 0.5M tokens · sequence length 4K · hybrid models use 2K sliding-window attention.

Model Shape

16 heads · d_k = d_v = 128 · per-layer recurrent state 262,144 floats, consistent with Mamba-2/3.

Hybrid Block

Sequence: Gated DeltaNet-2 → MLP → SWA → MLP. The recurrent mixer summarizes distant context; SWA captures nearby interactions.
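
The optimizer and batch settings above translate into a step schedule roughly as follows. This sketch assumes "0.5M tokens" means exactly 500,000 tokens per step, a linear warmup, and a decay to zero; none of those three details are stated in the recipe.

```python
import math

PEAK_LR = 4e-4
TOKENS_PER_STEP = 500_000                              # assumption: 0.5M read as 500,000
TOTAL_STEPS = 100_000_000_000 // TOKENS_PER_STEP       # 100B tokens -> 200,000 steps
WARMUP_STEPS = 1_000_000_000 // TOKENS_PER_STEP        # 1B tokens -> 2,000 steps
MIN_LR = 0.0                                           # assumption: no floor is given

def lr_at(step):
    """Linear warmup to the peak, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

print(TOTAL_STEPS, WARMUP_STEPS)                       # 200000 2000
print(lr_at(WARMUP_STEPS))                             # 0.0004
```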

Step 06 · Results

Metrics ready for side-by-side comparison

Top average scores across language modeling and commonsense reasoning, with especially strong improvements on long-context retrieval.

Setting · Metric	KDA	Mamba-3 MIMO	Gated DeltaNet-2
Recurrent avg. (LAMBADA + reasoning)	52.28	52.39	53.11
Hybrid avg. (LAMBADA + reasoning)	52.68	52.72	53.97
S-NIAH-3 @ 2K (recurrent)	63.2	72.4	89.8
MK-NIAH-1 @ 4K (recurrent)	28.0	18.0	37.8
Real-world recall, recurrent avg.	28.67	28.35	29.88
Real-world recall, hybrid avg.	40.14	40.11	42.28

Step 07 · Resources

Paper, Code, and Citation

All the materials needed to review, run, and reference Gated DeltaNet-2, gathered in one spot.

@article{hatamizadeh2026gdn2,
  title   = {Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention},
  author  = {Hatamizadeh, Ali and Choi, Yejin and Kautz, Jan},
  journal = {arXiv preprint},
  year    = {2026}
}


Key Takeaways

Gated DeltaNet-2 separates the scalar β_t into a channel-wise erase gate b_t (along the key axis) and a channel-wise write gate w_t (along the value axis).
The formulation reverts to KDA when both gates reduce to a single scalar, and to Gated DeltaNet when the channel-wise decay also collapses to a scalar.
Parallel training is maintained through a chunkwise WY representation, with channel-wise decay folded into asymmetric erase factors and a gate-aware backward pass fused using Triton.
At 1.3B parameters on 100B FineWeb-Edu tokens with matched state capacity, it outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 in both recurrent and hybrid configurations.
The biggest improvements show up on RULER long-context retrieval: in the recurrent setup, S-NIAH-3 at 2K improves from 63.2 to 89.8 and MK-NIAH-1 at 4K improves from 28.0 to 37.8 over KDA.
