Moonshot AI, the team behind Kimi.ai, has made a major contribution to open-source AI infrastructure with the release of FlashKDA (Flash Kimi Delta Attention), a high-performance CUDA kernel built on CUTLASS that implements the Kimi Delta Attention (KDA) mechanism. Available on GitHub under the MIT license, FlashKDA delivers prefill speedups of 1.72× to 2.22× over the standard flash-linear-attention implementation on NVIDIA H20 GPUs, and it works as a seamless drop-in backend for the widely used flash-linear-attention library.
Exploring Kimi Delta Attention and Its Importance
Before diving into FlashKDA’s technical details, it is useful to understand the broader context of attention mechanisms in large language models.
Traditional softmax attention scales quadratically with sequence length: doubling the context roughly quadruples the compute, so processing long contexts quickly becomes expensive. This limitation has motivated extensive research into linear attention methods, which approximate the softmax operation to achieve linear scaling. Kimi Delta Attention (KDA) is Moonshot AI’s innovation in this area: it enhances the Gated DeltaNet mechanism with a more refined, channel-wise gating approach, allowing better utilization of finite-state RNN memory.
KDA isn’t merely an experimental concept. It serves as the core attention mechanism within Kimi Linear, Moonshot AI’s hybrid model featuring 48B total parameters but only 3B activated at any time. This model uses a 3:1 ratio of KDA to MLA (Multi-Head Latent Attention) layers: three KDA layers for every global attention layer. The design reduces KV cache requirements by up to 75% during long-sequence generation while achieving up to 6× higher decoding throughput at a 1-million-token context length compared to full attention. FlashKDA is the production-ready CUDA kernel that makes this architecture efficient during the prefill phase.
In practical terms, the KDA forward pass accepts queries (q), keys (k), values (v), a pre-activation gate (g), and beta logits (beta), along with a scale factor, an output tensor (out), and gate parameters: A_log (log-gate value per head), dt_bias (gate bias), and lower_bound (gate lower bound, between -5.0 and 0). The sigmoid function is applied to beta within the kernel itself. The mechanism also supports optional initial and final recurrent states, which is particularly valuable for multi-turn conversations where context needs to persist across requests.
The recurrent architecture allows the model to handle long sequences efficiently during text generation. However, making the prefill stage fast still demands heavily optimized GPU kernels — which is precisely what FlashKDA provides.
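To make the recurrence concrete, here is a minimal, single-head PyTorch sketch of a channel-wise gated delta rule, written purely for illustration. It is one plausible reading of the mechanism described above rather than Moonshot AI’s exact formulation: the update order and the assumption that `g` and `beta` arrive as post-activation values in (0, 1) are simplifications.

```python
import torch

def gated_delta_reference(q, k, v, g, beta, scale):
    """Naive per-timestep recurrence for a channel-wise gated delta rule.

    Illustrative sketch only (single head, unbatched); not the exact KDA
    formulation. Shapes: q, k, g: [T, K]; v: [T, V]; beta: [T].
    g and beta are assumed to be post-activation values in (0, 1).
    """
    T, K = k.shape
    S = torch.zeros(K, v.shape[-1], dtype=q.dtype)  # fixed-size recurrent state
    outputs = []
    for t in range(T):
        S = g[t].unsqueeze(-1) * S                   # channel-wise decay of memory
        delta = v[t] - k[t] @ S                      # delta-rule prediction error
        S = S + beta[t] * torch.outer(k[t], delta)   # rank-1 correction of the state
        outputs.append((scale * q[t]) @ S)           # read out with the scaled query
    return torch.stack(outputs)                      # [T, V]
```

The essential property is that the state S stays a fixed K × V matrix no matter how long the sequence grows, which is what gives linear-time processing and constant-size memory during decoding.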
Inside the Engine: CUTLASS on Hopper Architecture
FlashKDA is built on CUTLASS, NVIDIA’s open-source collection of CUDA C++ templates designed for high-performance linear algebra and custom kernel creation. CUTLASS enables developers to fully leverage NVIDIA Tensor Core hardware, and it serves as the foundation for other libraries like FlashAttention-3.
The library targets SM90 and newer GPUs — specifically NVIDIA’s Hopper series (H100, H20) and beyond. Minimum requirements include CUDA 12.9 and PyTorch 2.4. The codebase is primarily CUDA (56.4%), with Python (36.2%) bindings and C++ (6.7%) glue code.
The primary API is `flash_kda.fwd`, which accepts the following inputs:

- `q`, `k`, `v`, `g`: bf16 tensors of shape `[B, T, H, K]` or `[B, T, H, V]` (where `g` is the gate before activation)
- `beta`: bf16 beta logits of shape `[B, T, H]` (sigmoid is applied internally)
- `scale`: fp32 scalar scaling factor
- `out`: bf16 output tensor of shape `[B, T, H, V]`
- `A_log`, `dt_bias`, `lower_bound`: fp32 gate parameters
- `initial_state`, `final_state`: optional bf16 or fp32 recurrent states
- `cu_seqlens`: optional int64 cumulative sequence lengths for variable-length batching
One current limitation: the kernel requires the head dimensions K and V to both be exactly 128.
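For concreteness, a forward call might look like the sketch below. The import path, argument order, and the per-head shapes chosen for the gate parameters are assumptions made for illustration; consult the repository for the authoritative signature.

```python
import torch
import flash_kda  # import path assumed; check the repo's Python bindings

B, T, H, K, V = 1, 8192, 64, 128, 128  # K and V must currently be exactly 128
dev = "cuda"

q = torch.randn(B, T, H, K, device=dev, dtype=torch.bfloat16)
k = torch.randn(B, T, H, K, device=dev, dtype=torch.bfloat16)
v = torch.randn(B, T, H, V, device=dev, dtype=torch.bfloat16)
g = torch.randn(B, T, H, K, device=dev, dtype=torch.bfloat16)  # gate, pre-activation
beta = torch.randn(B, T, H, device=dev, dtype=torch.bfloat16)  # logits; sigmoid applied in-kernel
out = torch.empty(B, T, H, V, device=dev, dtype=torch.bfloat16)

# fp32 gate parameters; per-head shapes here are an assumption for illustration
A_log = torch.zeros(H, device=dev, dtype=torch.float32)    # log-gate value per head
dt_bias = torch.zeros(H, device=dev, dtype=torch.float32)  # gate bias
lower_bound = torch.full((H,), -5.0, device=dev)           # gate lower bound in [-5.0, 0]

flash_kda.fwd(q, k, v, g, beta, K ** -0.5, out,
              A_log, dt_bias, lower_bound)  # argument order illustrative
```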
Support for variable-length batching via `cu_seqlens` is especially important in production environments. In real-world inference serving, different requests in a batch typically have varying sequence lengths. The ability to combine multiple sequences of different lengths into a single kernel call is essential for achieving high-throughput serving.
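Assuming the usual varlen convention of packing all sequences along the token axis and describing them by cumulative start offsets, constructing `cu_seqlens` is straightforward:

```python
import itertools
import torch

# Pack six requests of different lengths (the mixed varlen benchmark case below)
seq_lens = [1300, 547, 2048, 963, 271, 3063]
offsets = [0] + list(itertools.accumulate(seq_lens))  # [0, 1300, 1847, 3895, 4858, 5129, 8192]
cu_seqlens = torch.tensor(offsets, dtype=torch.int64, device="cuda")

# q, k, v, g are then laid out as [1, total_tokens, H, K] with total_tokens = 8192,
# and each [offsets[i], offsets[i+1]) slice is treated as an independent sequence.
```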
Benchmark Results: 1.72× to 2.22× Performance Gains on H20
Benchmark results (as of April 20, 2026) compare flash_kda against fla_chunk_kda (the existing flash-linear-attention implementation) across a sequence length of T=8192, head dimension D=128, and two head count configurations: H=96 and H=64. Each benchmark was run with 30 warmup iterations, 200 measurement iterations, and 5 repeats.
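For readers who want to reproduce such measurements, a CUDA-event timing harness matching the stated methodology might look like this sketch (the kernel under test is passed in as a closure):

```python
import torch

def bench_ms(fn, warmup=30, iters=200):
    """Mean latency of fn in milliseconds, following the stated methodology."""
    for _ in range(warmup):  # warmup iterations
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):   # measurement iterations
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Repeated 5 times per case, e.g.:
# times = [bench_ms(lambda: flash_kda.fwd(...)) for _ in range(5)]
```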
For H=96:
| Case | flash_kda (ms) | fla_chunk_kda (ms) | Speedup |
|---|---|---|---|
| Fixed | 2.6219 | 4.5052 | 1.72× |
| Varlen, seq_lens=[1300, 547, 2048, 963, 271, 3063] | 2.3420 | 4.5717 | 1.95× |
| Varlen, seq_lens=1024 × 8 | 2.0100 | 4.4668 | 2.22× |
For H=64:
| Case | flash_kda (ms) | fla_chunk_kda (ms) | Speedup |
|---|---|---|---|
| Fixed | 1.6199 | 2.9587 | 1.83× |
| Varlen, seq_lens=[1300, 547, 2048, 963, 271, 3063] | 1.7027 | 3.0595 | 1.80× |
| Varlen, seq_lens=1024 × 8 | 1.3930 | 3.0412 | 2.18× |
The highest speedup, 2.22×, occurs in the uniform variable-length scenario (eight sequences of length 1024, totaling T=8192). The lower end of the range, 1.72×, is the fixed-length case at H=96. Across both head configurations and all three sequence scenarios, FlashKDA consistently outperforms the flash-linear-attention baseline by a significant margin.
Seamless Integration with flash-linear-attention
A standout feature of FlashKDA is how seamlessly it integrates into existing workflows. After installation, FlashKDA is automatically invoked by the `chunk_kda` function from the flash-linear-attention library, meaning projects already using flash-linear-attention require no manual configuration to benefit from the accelerated kernel. The integration is documented in flash-linear-attention PR #852.
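In practice, that means an existing flash-linear-attention call site like the sketch below keeps working unchanged. The import path and keyword arguments are assumptions based on the library’s conventions and may differ between versions:

```python
import torch
# Import path and signature are assumptions; check your flash-linear-attention version.
from fla.ops.kda import chunk_kda

B, T, H, D = 1, 8192, 64, 128
mk = lambda: torch.randn(B, T, H, D, device="cuda", dtype=torch.bfloat16)
q, k, v, g = mk(), mk(), mk(), mk()
beta = torch.rand(B, T, H, device="cuda", dtype=torch.bfloat16)

# With FlashKDA installed, this same call is served by the CUTLASS kernel on SM90+.
o, final_state = chunk_kda(q, k, v, g, beta, output_final_state=True)
```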
Getting started is simple:
```bash
git clone <flash-kda repository URL>      # URL elided here; use the official GitHub repo
cd flash-kda
git submodule update --init --recursive   # pull in CUTLASS and other submodules
pip install -v .
```

The validation suite (`tests/test_fwd.py`) performs exact-match comparisons against a PyTorch reference implementation and cross-checks results with flash-linear-attention, giving developers a dependable foundation for verifying kernel correctness before rolling it out in production.
Key Takeaways
- FlashKDA is Moonshot AI’s open-source CUDA kernel for Kimi Delta Attention (KDA), built with CUTLASS, achieving prefill speedups of 1.72× to 2.22× over the `flash-linear-attention` implementation on NVIDIA H20 GPUs.
- KDA enhances Gated DeltaNet with fine-grained, per-channel gating. It serves as the core attention mechanism powering Kimi Linear, a hybrid architecture with 48B total parameters and only 3B active at any time, cutting KV cache consumption by up to 75% and boosting decoding throughput by as much as 6× at 1M context length.
- The kernel is optimized for SM90+ architectures (NVIDIA Hopper, including H100 and H20, and newer), requires CUDA 12.9+ and PyTorch 2.4+, and presently supports only a fixed head dimension of K = V = 128.
- Variable-length batching is supported natively through the `cu_seqlens` parameter, which lets multiple sequences of differing lengths be packed into a single kernel execution, a vital capability for low-latency, high-throughput inference serving.
- After installation, FlashKDA is automatically invoked via `chunk_kda` in `flash-linear-attention`, making it a plug-and-play performance boost for any project already relying on that library, with no need to restructure your pipeline.
Explore the GitHub Repo.