Linear attention replaces the ever-growing key-value cache of softmax attention with a fixed-size recurrent state. This reduces sequence mixing to linear time and decoding to constant memory. The real challenge isn’t deciding what to forget. It’s figuring out how to update a compressed memory without disrupting existing associations.
NVIDIA has released Gated DeltaNet-2, a linear attention layer designed to tackle that specific bottleneck. It splits the single scalar gate of the active memory update into two separate channel-wise gates, one for erasing and one for writing. Trained at 1.3B parameters on 100B FineWeb-Edu tokens, it outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 on average across the paper's benchmark suite.
The scalar gate problem in delta-rule models
A recurrent linear attention layer maintains a matrix state St and retrieves it using the query. DeltaNet introduces an active update by subtracting the value currently linked to the current key. It uses a scalar step size βt to control how much to overwrite. Mamba-2 adds a data-dependent scalar decay αt for global forgetting. Gated DeltaNet combined both operations, but both gates remained scalar per head.
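To make the overwrite behavior concrete, here is a minimal NumPy sketch of the plain delta rule described above. The function name `delta_step`, the toy dimensions, and the dk × dv state layout are illustrative choices for this sketch and are not taken from the paper's code.

```python
import numpy as np

def delta_step(S, k, v, beta):
    """One delta-rule update on a (d_k, d_v) state.

    Subtracts the value currently read out by key k, then writes v,
    both scaled by the scalar step size beta.
    """
    old_v = S.T @ k                      # value currently linked to k
    return S + beta * np.outer(k, v - old_v)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)                   # unit-norm key
v1, v2 = rng.normal(size=d_v), rng.normal(size=d_v)

S = np.zeros((d_k, d_v))
S = delta_step(S, k, v1, beta=1.0)       # store k -> v1
S = delta_step(S, k, v2, beta=1.0)       # overwrite with k -> v2
read = S.T @ k                           # retrieve using the same key as the query
```

With a unit-norm key and a step size of 1, the second write fully replaces the first, so `read` equals `v2`. A smaller step size blends the old and new values instead.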
Kimi Delta Attention (KDA) improves the decay side. It replaces the scalar αt with a channel-wise vector. KDA still uses a single scalar βt for the active update. That scalar handles two different tasks at once. It determines how much old content to erase on the key side. It also determines how much new content to commit on the value side. These two operations act on different dimensions of the state. Binding them together is a modeling limitation, not a requirement of the delta rule.
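The coupling can be made explicit by writing the scalar-β update with channel-wise decay in this article's state layout (a dk × dv matrix). This form is obtained by tying the two gates in the article's own recurrence to one scalar, so treat it as a restatement under that assumption rather than KDA's published notation.

```latex
S_t = \left(I - \beta_t\, k_t k_t^{\top}\right) \mathrm{Diag}(\alpha_t)\, S_{t-1} + \beta_t\, k_t v_t^{\top}
```

The same β_t multiplies both the erase term and the write term, so erasing less necessarily means writing less.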

Gated Delta Rule-2: two gates instead of one
Gated DeltaNet-2 separates these two decisions through Gated Delta Rule-2. It introduces a channel-wise erase gate bt ∈ [0,1]^dk on the key axis. It also introduces a channel-wise write gate wt ∈ [0,1]^dv on the value axis. Both gates are generated by sigmoid projections of the token representation. The update applies decay before the active edit.
Written compactly, the recurrence is:
$$S_t = \left(I - k_t (b_t \odot k_t)^{\top}\right) D_t\, S_{t-1} + k_t (w_t \odot v_t)^{\top}$$
Here Dt = Diag(αt) is the channel-wise decay inherited from KDA. The left factor of the erase matrix remains kt, preserving the delta-rule write direction. The right factor becomes bt ⊙ kt, making the read direction channel-selective. The write term kt zt⊤ uses zt = wt ⊙ vt, making the value update channel-selective.
When both gates collapse to the same scalar βt, the update recovers KDA exactly. When the decay αt also collapses to a scalar, it recovers Gated DeltaNet. Both prior models are preserved as tied subspaces of the new update.
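The recurrence and its two special cases can be checked numerically. The sketch below is a single-step NumPy illustration, not the paper's kernel: the function names `gdr2_step` and `scalar_beta_step`, the toy sizes, and the random inputs are all assumptions made for this example.

```python
import numpy as np

def gdr2_step(S, k, v, alpha, b, w):
    """One Gated Delta Rule-2 update on a (d_k, d_v) state.

    alpha, b: per-key-channel decay and erase gate, shape (d_k,)
    w:        per-value-channel write gate, shape (d_v,)
    """
    S_dec = alpha[:, None] * S                 # D_t S_{t-1}
    erased = np.outer(k, (b * k) @ S_dec)      # k (b*k)^T D_t S_{t-1}
    written = np.outer(k, w * v)               # k (w*v)^T
    return S_dec - erased + written

def scalar_beta_step(S, k, v, alpha, beta):
    """Tied form: one scalar beta for both erase and write."""
    d_k = k.shape[0]
    A = np.eye(d_k) - beta * np.outer(k, k)
    return A @ (alpha[:, None] * S) + beta * np.outer(k, v)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)
v = rng.normal(size=d_v)
alpha = rng.uniform(0.5, 1.0, size=d_k)
beta = 0.7

# Both gates tied to the same scalar: the channel-wise-decay, scalar-beta case
tied = gdr2_step(S, k, v, alpha, np.full(d_k, beta), np.full(d_v, beta))
kda_like = scalar_beta_step(S, k, v, alpha, beta)

# Decay tied to a scalar as well: the fully scalar-gated case
a = 0.9
tied_scalar_decay = gdr2_step(S, k, v, np.full(d_k, a),
                              np.full(d_k, beta), np.full(d_v, beta))
gdn_like = a * (np.eye(d_k) - beta * np.outer(k, k)) @ S + beta * np.outer(k, v)
```

In this sketch `tied` matches `kda_like` and `tied_scalar_decay` matches `gdn_like`, which mirrors the reduction described above.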
From the fast-weight perspective, Gated Delta Rule-2 is equivalent to one online gradient step on a local regression loss. The new state stays close to the decayed memory, while the residual edit uses gated read and gated write targets.
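One way to read this is to expand the recurrence into a residual form. The symbols \tilde{S}_t and r_t are introduced here for illustration and are not the paper's notation, and the exact loss used in the paper's derivation is not reproduced.

```latex
\tilde{S}_t = D_t S_{t-1}, \qquad
r_t = \tilde{S}_t^{\top}(b_t \odot k_t) - w_t \odot v_t, \qquad
S_t = \tilde{S}_t - k_t\, r_t^{\top}
```

Here \tilde{S}_t^{\top}(b_t \odot k_t) is the gated read of the decayed memory and w_t \odot v_t is the gated write target. The state moves along k_t by their difference, and expanding the last expression gives back the recurrence term by term.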
Chunkwise training and gate-aware backward
The recurrence admits a chunkwise WY form that matches the structure used by KDA.
The cumulative decay applied across each channel is folded into the two components of every rank-one erase operation. Each chunk update is expressed as a product of asymmetric matrices structured as I − k̄r ēr⊤. The implementation operates with a chunk size of C = 64, leveraging fused Triton kernels for efficiency.
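The folding step can be illustrated on a small chunk. The sketch below only shows how the cumulative decay moves into the two factors of each rank-one erase; it does not implement the WY representation or the Triton kernels. The variable names, the chunk size of 8, and the direct division by the cumulative decay are assumptions for this example, and a production kernel would likely normalize more carefully for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d_k, d_v = 8, 4, 3                      # toy chunk (the article uses C = 64)
K = rng.normal(size=(C, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.normal(size=(C, d_v))
A = rng.uniform(0.8, 1.0, size=(C, d_k))   # channel-wise decay alpha_r
B = rng.uniform(0.0, 1.0, size=(C, d_k))   # erase gates b_r
W = rng.uniform(0.0, 1.0, size=(C, d_v))   # write gates w_r
S0 = rng.normal(size=(d_k, d_v))

# Reference: step-by-step recurrence with explicit decay
S = S0.copy()
for r in range(C):
    S_dec = A[r][:, None] * S
    S = S_dec - np.outer(K[r], (B[r] * K[r]) @ S_dec) + np.outer(K[r], W[r] * V[r])
S_ref = S

# Folded form: track S_hat where S_r = Diag(gamma_r) @ S_hat_r
gamma = np.cumprod(A, axis=0)              # cumulative decay per channel
K_bar = K / gamma                          # k_bar_r = k_r / gamma_r
E_bar = gamma * B * K                      # e_bar_r = gamma_r * (b_r * k_r)
S_hat = S0.copy()
for r in range(C):
    # (I - k_bar e_bar^T) S_hat + k_bar z^T, with no explicit decay factor
    S_hat = S_hat - np.outer(K_bar[r], E_bar[r] @ S_hat) + np.outer(K_bar[r], W[r] * V[r])
S_folded = gamma[-1][:, None] * S_hat      # reapply the total decay once
```

Both loops end in the same state. In the folded loop each transition is a plain asymmetric rank-one factor, which is the shape a chunkwise product needs.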
During the backward pass, the scalar shortcut relied upon by KDA no longer holds. On the write path, a distinct diagonal gate spans the value channels, while on the erase path, a separate diagonal gate spans the key channels. As a result, these gate factors must be incorporated within the dot products used to accumulate gradients. The paper provides an explicit derivation of this gate-aware vector-Jacobian product. On Hopper GPUs, the fused WY backward kernel is restricted to two or four warps to sidestep a Triton WGMMA layout assertion.
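For intuition, the gate gradients of a single recurrence step can be written out and checked against finite differences. This is a hand-derived single-step sketch, not the chunked WY backward from the paper; `gdr2_step`, `numeric_grad`, and the linear test loss are assumptions made for the example.

```python
import numpy as np

def gdr2_step(S, k, v, alpha, b, w):
    """One Gated Delta Rule-2 update on a (d_k, d_v) state."""
    S_dec = alpha[:, None] * S
    return S_dec - np.outer(k, (b * k) @ S_dec) + np.outer(k, w * v)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)
v = rng.normal(size=d_v)
alpha = rng.uniform(0.5, 1.0, size=d_k)
b = rng.uniform(size=d_k)
w = rng.uniform(size=d_v)
G = rng.normal(size=(d_k, d_v))           # upstream gradient dL/dS_t

# Analytic vector-Jacobian products for one step
g = G.T @ k                                # shape (d_v,)
S_dec = alpha[:, None] * S
grad_w = g * v                             # write path: one entry per value channel
grad_b = -(S_dec @ g) * k                  # erase path: one entry per key channel

def loss(b_, w_):
    # Linear loss whose gradient with respect to S_t is exactly G
    return np.sum(G * gdr2_step(S, k, v, alpha, b_, w_))

def numeric_grad(f, x, eps=1e-5):
    """Central finite differences, one coordinate at a time."""
    out = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        out[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return out

num_w = numeric_grad(lambda w_: loss(b, w_), w)
num_b = numeric_grad(lambda b_: loss(b_, w), b)
```

The write-gate gradient is a vector over value channels and the erase-gate gradient is a vector over key channels, so neither collapses to the single scalar that a shared β would produce.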
Block design and hybrid model
Gated DeltaNet-2 serves as the recurrent token mixer within a standard Transformer-style block. The query and key branches pass through a linear projection, a short causal convolution, a SiLU activation, and L2 normalization. The value branch passes through a linear projection, a short convolution, and a SiLU activation. The decay parameter αt, erase gate bt, and write gate wt are each produced by independent linear branches. The recurrent output undergoes RMS normalization, is scaled by a SiLU output gate, and is projected back to the model dimension.
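The input branches can be sketched in a few lines of NumPy. Everything below is a toy, single-head illustration: the weight names, sizes, random initialization, and convolution width are hypothetical, and using a plain sigmoid for the decay αt is an assumption, since the article only states that the erase and write gates come from sigmoid projections. The output side (RMS normalization, output gate, final projection) is left out.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def l2norm(x, eps=1e-6):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def causal_conv(x, kernel):
    """Depthwise causal convolution over time. x: (T, d), kernel: (width, d)."""
    width = kernel.shape[0]
    padded = np.concatenate([np.zeros((width - 1, x.shape[1])), x], axis=0)
    # Output at step t mixes only steps t - width + 1 .. t
    return sum(kernel[j] * padded[j:j + x.shape[0]] for j in range(width))

rng = np.random.default_rng(0)
T, d_model, d_k, d_v, width = 6, 16, 8, 8, 4     # toy sizes

def proj(d_out):
    return rng.normal(size=(d_model, d_out)) / np.sqrt(d_model)

x = rng.normal(size=(T, d_model))
W_q, W_k, W_v = proj(d_k), proj(d_k), proj(d_v)
W_a, W_b, W_w = proj(d_k), proj(d_k), proj(d_v)
conv_q, conv_k, conv_v = (rng.normal(size=(width, d)) for d in (d_k, d_k, d_v))

q = l2norm(silu(causal_conv(x @ W_q, conv_q)))   # projection, conv, SiLU, L2 norm
k = l2norm(silu(causal_conv(x @ W_k, conv_k)))
v = silu(causal_conv(x @ W_v, conv_v))           # value branch: no L2 norm
b = sigmoid(x @ W_b)                             # erase gate, one value per key channel
w = sigmoid(x @ W_w)                             # write gate, one value per value channel
alpha = sigmoid(x @ W_a)                         # ASSUMPTION: sigmoid decay for illustration
```

Each of the three gate tensors has its own projection, matching the independent branches described above.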
A hybrid configuration adds Sliding-Window Attention (SWA) immediately after the recurrent mixer. A repeating cell stack consists of Gated DeltaNet-2, an MLP, SWA, and a second MLP. SWA captures precise local interactions, while the recurrent mixer summarizes extended context history. The hybrid design maintains linear scaling with sequence length and keeps the attention cache size bounded.
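The layer ordering and the bounded cache can be expressed as a small sketch. The layer names, the helper functions, and the 2,048-token window are placeholders chosen for illustration and do not come from the released code.

```python
def hybrid_layer_pattern(num_cells):
    """Layer names for the repeating cell: recurrent mixer, MLP, SWA, MLP."""
    cell = ["GatedDeltaNet2", "MLP", "SWA", "MLP"]
    return cell * num_cells

def swa_cache_len(tokens_seen, window):
    """Number of key/value entries a sliding-window layer needs to keep."""
    return min(tokens_seen, window)

pattern = hybrid_layer_pattern(2)
cache_sizes = [swa_cache_len(t, window=2048) for t in (512, 2048, 4096, 100_000)]
```

Once the sequence is longer than the window, the attention cache stops growing, while the recurrent state has a fixed size throughout.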
Results at 1.3B parameters
Every model in the study uses 1.3B parameters and is trained on 100B tokens from FineWeb-Edu. Parameter counts and recurrent state sizes are held constant across compared models. Each layer stores 262,144 floating-point values in its recurrent state per batch element. Training sequences are 4K tokens long, and hybrid models use a 2K sliding-window attention span. The Mamba-3 MIMO baseline uses a rank of R = 4.
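As a quick sanity check on scale, the state size above can be converted into memory. The storage precision is an assumption here, since the article does not say which dtype holds the state.

```python
STATE_VALUES = 262_144            # per layer, per batch element (figure from the article)

def state_bytes(num_values, bytes_per_value):
    """Memory needed to hold the recurrent state at a given precision."""
    return num_values * bytes_per_value

fp32_kib = state_bytes(STATE_VALUES, 4) / 1024   # assuming 4-byte floats
bf16_kib = state_bytes(STATE_VALUES, 2) / 1024   # assuming 2-byte floats
```

That works out to 1 MiB per layer per sequence in 32-bit floats, or 512 KiB in a 16-bit format, and the figure does not change with sequence length.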
For language modeling and commonsense reasoning tasks, Gated DeltaNet-2 delivers the highest average in both experimental configurations. In the recurrent-only setup, the model scores 53.11 on average across LAMBADA and the reasoning benchmark suite, surpassing Mamba-3 MIMO at 52.39 and KDA at 52.28. In the hybrid setup, Gated DeltaNet-2 averages 53.97, ahead of Mamba-3 MIMO at 52.72. Because the recurrent state size is identical across models, the improvement is attributable to the update rule itself rather than increased memory capacity.
The most pronounced improvements show up on long-context retrieval benchmarks from the RULER suite. In the recurrent-only setup, S-NIAH-2 at 4K improves from 89.0 (KDA) to 93.0. S-NIAH-3 at 2K rises sharply from 63.2 (KDA) to 89.8. MK-NIAH-1 at 4K increases from 28.0 (KDA) to 37.8.
On practical retrieval benchmarks (SWDE, SQuAD, FDA, TriviaQA, NQ, DROP), Gated DeltaNet-2 also leads across both configurations. The recurrent average comes to 29.88, and the hybrid average reaches 42.28.
Key Takeaways
- Gated DeltaNet-2 separates the scalar βt into a channel-wise erase gate bt (along the key axis) and a channel-wise write gate wt (along the value axis).
- The formulation reverts to KDA when both gates reduce to a single scalar, and to Gated DeltaNet when the decay also reduces to a scalar.
- Parallel training is maintained through a chunkwise WY representation, with channel-wise decay folded into asymmetric erase factors and a gate-aware backward pass fused using Triton.
- At 1.3B parameters on 100B FineWeb-Edu tokens with matched state capacity, it outperforms Mamba-2, Gated DeltaNet, KDA, and Mamba-3 in both recurrent and hybrid configurations.
- The biggest improvements show up on RULER long-context retrieval in the recurrent-only setup: over KDA, S-NIAH-3 at 2K improves from 63.2 to 89.8 and MK-NIAH-1 at 4K improves from 28.0 to 37.8.



