Moonshot AI, the team behind Kimi.ai, has made a major contribution to open-source AI infrastructure with the release of FlashKDA (Flash Kimi Delta Attention), a high-performance CUDA kernel built on CUTLASS that implements the Kimi Delta Attention (KDA) mechanism. Available on GitHub under the MIT license, FlashKDA delivers prefill speedups of 1.72× to 2.22× over the standard flash-linear-attention implementation on NVIDIA H20 GPUs, and it works as a seamless drop-in backend for the widely used flash-linear-attention library.
Exploring Kimi Delta Attention and Its Importance
Before diving into FlashKDA’s technical details, it is useful to understand the broader context of attention mechanisms in large language models.
Traditional softmax attention scales quadratically with sequence length: doubling the context roughly quadruples the compute, so processing long contexts quickly becomes expensive. This limitation has motivated extensive research into linear attention methods, which approximate the softmax operation to achieve linear scaling. Kimi Delta Attention (KDA) is Moonshot AI’s innovation in this area: it enhances the Gated DeltaNet mechanism with a more refined, channel-wise gating approach, allowing better utilization of finite-state RNN memory.
KDA isn’t merely an experimental concept. It serves as the core attention mechanism within Kimi Linear, Moonshot AI’s hybrid model featuring 48B total parameters but only 3B activated at any time. This model uses a 3:1 ratio of KDA to MLA (Multi-Head Latent Attention) layers: three KDA layers for every global attention layer. The design reduces KV cache requirements by up to 75% during long-sequence generation while achieving up to 6× higher decoding throughput at a 1-million-token context length compared to full attention. FlashKDA is the production-ready CUDA kernel that makes this architecture efficient during the prefill phase.
In practical terms, the KDA forward pass accepts queries (q), keys (k), values (v), a pre-activation gate (g), and beta logits (beta), along with a scale factor, an output tensor (out), and gate parameters: A_log (log-gate value per head), dt_bias (gate bias), and lower_bound (gate lower bound, between -5.0 and 0). The sigmoid function is applied to beta within the kernel itself. The mechanism also supports optional initial and final recurrent states, which is particularly valuable for multi-turn conversations where context needs to persist across requests.
The recurrent architecture allows the model to handle long sequences efficiently during text generation. However, making the prefill stage fast still demands heavily optimized GPU kernels — which is precisely what FlashKDA provides.
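To make the recurrence concrete, here is a minimal, single-head PyTorch sketch of a channel-wise gated delta rule, written purely for illustration. It is one plausible reading of the mechanism described above rather than Moonshot AI’s exact formulation: the update order and the assumption that `g` and `beta` arrive as post-activation values in (0, 1) are simplifications.

```python
import torch

def gated_delta_reference(q, k, v, g, beta, scale):
    """Naive per-timestep recurrence for a channel-wise gated delta rule.

    Illustrative sketch only (single head, unbatched); not the exact KDA
    formulation. Shapes: q, k, g: [T, K]; v: [T, V]; beta: [T].
    g and beta are assumed to be post-activation values in (0, 1).
    """
    T, K = k.shape
    S = torch.zeros(K, v.shape[-1], dtype=q.dtype)  # fixed-size recurrent state
    outputs = []
    for t in range(T):
        S = g[t].unsqueeze(-1) * S                   # channel-wise decay of memory
        delta = v[t] - k[t] @ S                      # delta-rule prediction error
        S = S + beta[t] * torch.outer(k[t], delta)   # rank-1 correction of the state
        outputs.append((scale * q[t]) @ S)           # read out with the scaled query
    return torch.stack(outputs)                      # [T, V]
```

The essential property is that the state S stays a fixed K × V matrix no matter how long the sequence grows, which is what gives linear-time processing and constant-size memory during decoding.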
Inside the Engine: CUTLASS on Hopper Architecture
FlashKDA is built on CUTLASS, NVIDIA’s open-source collection of CUDA C++ templates designed for high-performance linear algebra and custom kernel creation. CUTLASS enables developers to fully leverage NVIDIA Tensor Core hardware, and it serves as the foundation for other libraries like FlashAttention-3.
The library targets SM90 and newer GPUs — specifically NVIDIA’s Hopper series (H100, H20) and beyond. Minimum requirements include CUDA 12.9 and PyTorch 2.4. The codebase is primarily CUDA (56.4%), with Python (36.2%) bindings and C++ (6.7%) glue code.
The primary API is `flash_kda.fwd`, which accepts the following inputs:

- `q`, `k`, `v`, `g`: bf16 tensors of shape `[B, T, H, K]` or `[B, T, H, V]` (where `g` is the gate before activation)
- `beta`: bf16 beta logits of shape `[B, T, H]` (sigmoid is applied internally)
- `scale`: fp32 scalar scaling factor
- `out`: bf16 output tensor of shape `[B, T, H, V]`
- `A_log`, `dt_bias`, `lower_bound`: fp32 gate parameters
- `initial_state`, `final_state`: optional bf16 or fp32 recurrent states
- `cu_seqlens`: optional int64 cumulative sequence lengths for variable-length batching
One current limitation: the kernel requires the head dimensions K and V to both be exactly 128.
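For concreteness, a forward call might look like the sketch below. The import path, argument order, and the per-head shapes chosen for the gate parameters are assumptions made for illustration; consult the repository for the authoritative signature.

```python
import torch
import flash_kda  # import path assumed; check the repo's Python bindings

B, T, H, K, V = 1, 8192, 64, 128, 128  # K and V must currently be exactly 128
dev = "cuda"

q = torch.randn(B, T, H, K, device=dev, dtype=torch.bfloat16)
k = torch.randn(B, T, H, K, device=dev, dtype=torch.bfloat16)
v = torch.randn(B, T, H, V, device=dev, dtype=torch.bfloat16)
g = torch.randn(B, T, H, K, device=dev, dtype=torch.bfloat16)  # gate, pre-activation
beta = torch.randn(B, T, H, device=dev, dtype=torch.bfloat16)  # logits; sigmoid applied in-kernel
out = torch.empty(B, T, H, V, device=dev, dtype=torch.bfloat16)

# fp32 gate parameters; per-head shapes here are an assumption for illustration
A_log = torch.zeros(H, device=dev, dtype=torch.float32)    # log-gate value per head
dt_bias = torch.zeros(H, device=dev, dtype=torch.float32)  # gate bias
lower_bound = torch.full((H,), -5.0, device=dev)           # gate lower bound in [-5.0, 0]

flash_kda.fwd(q, k, v, g, beta, K ** -0.5, out,
              A_log, dt_bias, lower_bound)  # argument order illustrative
```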
Support for variable-length batching via `cu_seqlens` is especially important in production environments. In real-world inference serving, different requests in a batch typically have varying sequence lengths. The ability to combine multiple sequences of different lengths into a single kernel call is essential for achieving high-throughput serving.
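Assuming the usual varlen convention of packing all sequences along the token axis and describing them by cumulative start offsets, constructing `cu_seqlens` is straightforward:

```python
import itertools
import torch

# Pack six requests of different lengths (the mixed varlen benchmark case below)
seq_lens = [1300, 547, 2048, 963, 271, 3063]
offsets = [0] + list(itertools.accumulate(seq_lens))  # [0, 1300, 1847, 3895, 4858, 5129, 8192]
cu_seqlens = torch.tensor(offsets, dtype=torch.int64, device="cuda")

# q, k, v, g are then laid out as [1, total_tokens, H, K] with total_tokens = 8192,
# and each [offsets[i], offsets[i+1]) slice is treated as an independent sequence.
```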
Benchmark Results: 1.72× to 2.22× Performance Gains on H20
Benchmark results (as of April 20, 2026) compare flash_kda against fla_chunk_kda (the existing flash-linear-attention implementation) across a sequence length of T=8192, head dimension D=128, and two head count configurations: H=96 and H=64. Each benchmark was run with 30 warmup iterations, 200 measurement iterations, and 5 repeats.
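For readers who want to reproduce such measurements, a CUDA-event timing harness matching the stated methodology might look like this sketch (the kernel under test is passed in as a closure):

```python
import torch

def bench_ms(fn, warmup=30, iters=200):
    """Mean latency of fn in milliseconds, following the stated methodology."""
    for _ in range(warmup):  # warmup iterations
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):   # measurement iterations
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Repeated 5 times per case, e.g.:
# times = [bench_ms(lambda: flash_kda.fwd(...)) for _ in range(5)]
```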
For H=96:
| Case | flash_kda (ms) | fla_chunk_kda (ms) | Speedup |
|---|---|---|---|
| Fixed | 2.6219 | 4.5052 | 1.72× |
| Varlen, seq_lens=[1300, 547, 2048, 963, 271, 3063] | 2.3420 | 4.5717 | 1.95× |
| Varlen, seq_lens=1024 × 8 | 2.0100 | 4.4668 | 2.22× |
For H=64:
| Case | flash_kda (ms) | fla_chunk_kda (ms) | Speedup |
|---|---|---|---|
| Fixed | 1.6199 | 2.9587 | 1.83× |
| Varlen, seq_lens=[1300, 547, 2048, 963, 271, 3063] | 1.7027 | 3.0595 | 1.80× |
| Varlen, seq_lens=1024 × 8 | 1.3930 | 3.0412 | 2.18× |
The highest speedup, 2.22×, occurs in the uniform variable-length scenario (eight sequences of length 1024, totaling T=8192). The lower end of the range, 1.72×, is the fixed-length case at H=96. Across both head configurations and all three sequence scenarios, FlashKDA consistently outperforms the flash-linear-attention baseline by a significant margin.
Seamless Integration with flash-linear-attention
A standout feature of FlashKDA is how seamlessly it integrates into existing workflows. After installation, FlashKDA is automatically invoked by the `chunk_kda` function from the flash-linear-attention library, meaning projects already using flash-linear-attention require no manual configuration to benefit from the accelerated kernel. The integration is documented in flash-linear-attention PR #852.
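In practice, that means an existing flash-linear-attention call site like the sketch below keeps working unchanged. The import path and keyword arguments are assumptions based on the library’s conventions and may differ between versions:

```python
import torch
# Import path and signature are assumptions; check your flash-linear-attention version.
from fla.ops.kda import chunk_kda

B, T, H, D = 1, 8192, 64, 128
mk = lambda: torch.randn(B, T, H, D, device="cuda", dtype=torch.bfloat16)
q, k, v, g = mk(), mk(), mk(), mk()
beta = torch.rand(B, T, H, device="cuda", dtype=torch.bfloat16)

# With FlashKDA installed, this same call is served by the CUTLASS kernel on SM90+.
o, final_state = chunk_kda(q, k, v, g, beta, output_final_state=True)
```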
Getting started is simple:
```bash
git clone <flash-kda repository URL>      # URL elided here; use the official GitHub repo
cd flash-kda
git submodule update --init --recursive   # pull in CUTLASS and other submodules
pip install -v .
```

The validation suite (`tests/test_fwd.py`) performs exact-match comparisons against a PyTorch reference implementation and cross-checks results with flash-linear-attention, giving developers a dependable foundation for verifying kernel correctness before rolling it out in production.
Key Takeaways
- FlashKDA is Moonshot AI’s open-source CUDA kernel for Kimi Delta Attention (KDA), built with CUTLASS, achieving prefill speedups of 1.72× to 2.22× over the `flash-linear-attention` implementation on NVIDIA H20 GPUs.
- KDA enhances Gated DeltaNet with fine-grained, per-channel gating. It serves as the core attention mechanism powering Kimi Linear, a hybrid architecture with 48B total parameters and only 3B active at any time, cutting KV cache consumption by up to 75% and boosting decoding throughput by as much as 6× at 1M context length.
- The kernel is optimized for SM90+ architectures (NVIDIA Hopper, including H100 and H20, and newer), requires CUDA 12.9+ and PyTorch 2.4+, and presently supports only a fixed head dimension of K = V = 128.
- Variable-length batching is supported natively through the `cu_seqlens` parameter, which lets multiple sequences of differing lengths be packed into a single kernel execution, a vital capability for low-latency, high-throughput inference serving.
- After installation, FlashKDA is automatically invoked via `chunk_kda` in `flash-linear-attention`, making it a plug-and-play performance boost for any project already relying on that library, with no need to restructure your pipeline.
Explore the GitHub Repo.