Inference efficiency has quietly become one of the most critical bottlenecks in deploying AI systems. As agentic coding platforms like Claude Code, Codex, and Cursor evolve from niche developer tools into foundational infrastructure powering large-scale software development, the inference engines behind these systems face mounting performance pressure. Researchers at the LightSeek Foundation have responded with TokenSpeed—an open-source LLM inference engine released under the MIT license, purpose-built for the unique demands of agentic workloads. TokenSpeed is currently available in a preview release.
Why Agentic Inference Presents Unique Challenges
To appreciate TokenSpeed’s design, it helps to first understand what makes agentic inference particularly demanding. Unlike standard chatbot interactions, coding agent sessions routinely involve contexts exceeding 50K tokens across dozens of conversational turns. This puts simultaneous pressure on two distinct metrics: per-GPU TPM (tokens per minute), which determines how many users each GPU can serve, and per-user TPS (tokens per second), which determines how responsive the system feels to each user. Conventional public benchmarks rarely capture this dual pressure.
TokenSpeed was architected to excel at both simultaneously. Its core objective is to maximize per-GPU TPM while holding per-user TPS above a minimum threshold: typically 70 TPS, and 200 TPS or more when a deployment requires it.
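The trade-off between the two metrics can be made concrete with a small calculation. The sketch below is only illustrative: the operating points are made-up numbers rather than TokenSpeed measurements, `OperatingPoint` and `best_point` are hypothetical names, and it assumes for simplicity that every concurrent user on a GPU decodes at the same rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    batch_size: int       # concurrent users served by one GPU (simplifying assumption)
    per_user_tps: float   # decode tokens per second seen by each user

    @property
    def per_gpu_tpm(self) -> float:
        # Each concurrent user receives per_user_tps tokens every second.
        return self.batch_size * self.per_user_tps * 60


def best_point(points, min_tps=70.0):
    """Return the point with the highest per-GPU TPM that still meets the TPS floor."""
    eligible = [p for p in points if p.per_user_tps >= min_tps]
    return max(eligible, key=lambda p: p.per_gpu_tpm) if eligible else None


# Made-up numbers: larger batches raise GPU throughput but slow each user down.
points = [
    OperatingPoint(1, 220.0),
    OperatingPoint(8, 130.0),
    OperatingPoint(16, 85.0),
    OperatingPoint(32, 55.0),   # highest TPM, but below the 70 TPS floor
]
chosen = best_point(points)
print(chosen.batch_size, chosen.per_gpu_tpm)
```

With these invented numbers, the largest batch delivers the most tokens per minute but is excluded because each user would fall below the floor, which is the kind of constraint a plain throughput benchmark would miss.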
Architecture: Five Integrated Subsystems
TokenSpeed’s architecture rests on five foundational pillars: a compiler-based parallel modeling framework, a high-performance scheduling engine, a safe KV resource reuse mechanism, a pluggable layered kernel system supporting heterogeneous accelerators, and SMG integration providing a lightweight CPU-side request entrypoint.
The modeling layer employs a local SPMD (Single Program, Multiple Data) execution model. In this distributed deep learning pattern, multiple processes execute identical programs across different data subsets. Instead of requiring manual implementation of inter-process communication logic, TokenSpeed lets developers add I/O placement annotations at module boundaries. A lightweight static compiler then automatically generates all necessary collective operations during model construction, removing the burden of hand-coding communication logic.
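To illustrate the idea of deriving collectives from placement annotations, here is a minimal sketch. It is not TokenSpeed's API: the `Module` class, the placement labels (`replicated`, `sharded`, `partial`), and `plan_collectives` are all hypothetical, and the conversion table simply encodes the standard collectives used to move between those placements.

```python
from dataclasses import dataclass

# Collective needed to convert a tensor from one placement to another.
# "partial" means each rank holds a partial sum of the full result.
COLLECTIVES = {
    ("partial", "replicated"): "all_reduce",
    ("partial", "sharded"): "reduce_scatter",
    ("sharded", "replicated"): "all_gather",
    ("replicated", "sharded"): "local_slice",  # no communication needed
}


@dataclass(frozen=True)
class Module:
    name: str
    input_placement: str    # placement the module expects at its input boundary
    output_placement: str   # placement the module produces at its output boundary


def plan_collectives(modules):
    """Return module names with a collective inserted at each mismatched boundary."""
    if not modules:
        return []
    plan = []
    for prev, nxt in zip(modules, modules[1:]):
        plan.append(prev.name)
        src, dst = prev.output_placement, nxt.input_placement
        if src != dst:
            plan.append(COLLECTIVES[(src, dst)])
    plan.append(modules[-1].name)
    return plan


# Hypothetical annotations for one transformer block.
layers = [
    Module("qkv_proj", "replicated", "sharded"),   # column-parallel projection
    Module("attention", "sharded", "sharded"),
    Module("o_proj", "sharded", "partial"),        # row-parallel projection
    Module("mlp", "replicated", "replicated"),
]
print(plan_collectives(layers))
```

In this toy plan the only mismatch is between `o_proj` and `mlp`, so a single `all_reduce` is inserted there; the developer wrote annotations only and never named a collective.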
The scheduler implements a fundamental architectural separation between control plane and execution plane. The control plane, built in C++ as a finite-state machine, works with the type system to enforce safe resource management—including KV cache state transfers and usage—at compile time rather than runtime. Request lifecycles, KV cache resources, and scheduling timing are tracked through explicit FSM transitions and ownership semantics. Correctness is therefore guaranteed by a verifiable control system instead of developer convention. By encoding these safety constraints into the type system, potential KV cache management errors—among the most bug-prone areas in LLM serving—are caught during compilation. Meanwhile, the execution plane remains in Python for developer productivity, accelerating feature iteration and reducing implementation complexity.
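The finite-state-machine and ownership ideas can be sketched as follows. This is an assumption-laden illustration, not TokenSpeed's control plane: the state names, the transition table, and the `KVBlock` class are hypothetical. It is also written in Python, so illegal transitions are rejected at runtime, whereas the article describes a C++ design that uses the type system to reject them at compile time.

```python
from enum import Enum, auto


class KVState(Enum):
    FREE = auto()
    ALLOCATED = auto()
    IN_USE = auto()
    CACHED = auto()


# Legal lifecycle transitions (hypothetical); anything else is rejected.
ALLOWED = {
    (KVState.FREE, KVState.ALLOCATED),
    (KVState.ALLOCATED, KVState.IN_USE),
    (KVState.IN_USE, KVState.CACHED),   # keep the prefix for later reuse
    (KVState.CACHED, KVState.IN_USE),   # prefix hit
    (KVState.IN_USE, KVState.FREE),
    (KVState.CACHED, KVState.FREE),     # eviction
}
OWNED_STATES = {KVState.ALLOCATED, KVState.IN_USE}


class KVBlock:
    def __init__(self, block_id):
        self.block_id = block_id
        self.state = KVState.FREE
        self.owner = None  # request id that currently holds the block

    def transition(self, new_state, request_id=None):
        if (self.state, new_state) not in ALLOWED:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        if self.owner is not None and request_id != self.owner:
            raise PermissionError(f"block {self.block_id} is owned by {self.owner}")
        if new_state in OWNED_STATES and request_id is None:
            raise ValueError("owned states require a request id")
        self.state = new_state
        self.owner = request_id if new_state in OWNED_STATES else None


block = KVBlock(0)
block.transition(KVState.ALLOCATED, "req-1")
block.transition(KVState.IN_USE, "req-1")
try:
    block.transition(KVState.FREE, "req-2")   # a different request cannot release it
except PermissionError as err:
    print("rejected:", err)
block.transition(KVState.CACHED, "req-1")     # owner releases; prefix kept for reuse
block.transition(KVState.IN_USE, "req-2")     # prefix hit by another request
```

The point of the sketch is that every change to a KV block goes through one explicit transition function, so a use-after-free or a double release shows up as a rejected transition rather than as silent cache corruption.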
The kernel layer elevates GPU kernels to a modular first-class subsystem rather than embedding them within the engine core. It provides a portable public API, a centralized registry and selection framework, and an extensible plugin architecture supporting heterogeneous accelerators, so it is not restricted to NVIDIA hardware. The development team has also built one of the highest-performing MLA (Multi-head Latent Attention) kernels optimized for agentic workloads on NVIDIA Blackwell. In its decode kernel, q_seqlen and num_heads are grouped to maximize Tensor Core utilization, addressing cases where num_heads is relatively small. The binary prefill kernel incorporates a fine-tuned softmax implementation. Notably, TokenSpeed’s MLA kernel has been integrated into vLLM.
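A registry-plus-selection layer of this kind might look like the sketch below. Everything here is hypothetical and chosen for illustration: `register_kernel`, `select_kernel`, the backend names, and the priority scheme are not taken from TokenSpeed, and a plain-Python RMSNorm stands in for a real GPU kernel.

```python
_REGISTRY = {}  # maps (op, backend) -> list of (priority, function)


def register_kernel(op, backend, priority=0):
    """Decorator that registers an implementation of `op` for `backend`."""
    def decorator(fn):
        _REGISTRY.setdefault((op, backend), []).append((priority, fn))
        return fn
    return decorator


def select_kernel(op, backend, fallback="reference"):
    """Pick the highest-priority kernel for the backend, else the fallback backend."""
    candidates = _REGISTRY.get((op, backend)) or _REGISTRY.get((op, fallback))
    if not candidates:
        raise LookupError(f"no kernel registered for {op!r}")
    return max(candidates, key=lambda entry: entry[0])[1]


@register_kernel("rmsnorm", backend="reference")
def rmsnorm_reference(x, eps=1e-6):
    # Portable implementation: scale by the reciprocal root-mean-square.
    scale = (sum(v * v for v in x) / len(x) + eps) ** -0.5
    return [v * scale for v in x]


@register_kernel("rmsnorm", backend="vendor_x", priority=10)
def rmsnorm_vendor_x(x, eps=1e-6):
    # A plugin would call accelerator-specific code here; this stub reuses the reference.
    return rmsnorm_reference(x, eps)


kernel = select_kernel("rmsnorm", "vendor_x")
print(kernel.__name__, kernel([3.0, 4.0]))
```

Because callers ask the registry for an operation instead of importing a specific kernel, a new accelerator can be supported by registering implementations from a plugin without touching the engine core.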

Lastly, TokenSpeed features SMG — a PyTorch-native module — that provides a lightweight CPU-side request entrypoint, minimizing the overhead when switching between CPU orchestration and GPU computation.
Benchmark Results Compared to TensorRT-LLM on NVIDIA B200
Before diving in, it’s important to note that these benchmarks focus solely on single (non-disaggregated) deployments. PD disaggregation support is still being refined and may be addressed in a future deep-dive post from the TokenSpeed team.
In collaboration with the EvalScope team, TokenSpeed was tested using SWE-smith traces — datasets that closely replicate real-world coding-agent workloads — and compared against TensorRT-LLM, the current industry-leading solution on NVIDIA Blackwell. The model used for testing was Kimi K2.5.
For code-generation agents operating above 70 TPS per user, the optimal setup is Attention TP4 + MoE TP4, where TokenSpeed outperforms TensorRT-LLM across every point on the Pareto frontier: about 9% lower latency at minimum-latency settings (batch size 1), and roughly 11% greater throughput near 100 TPS per user. Here, TP4 means tensor parallelism spread across 4 GPUs — a method that distributes model weight matrices across multiple devices to ease per-device memory demands and cut latency.
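For readers unfamiliar with tensor parallelism, the following sketch shows the column-parallel case with four simulated devices. It is a generic illustration in plain Python, not TokenSpeed code: each "GPU" is just a list slice, the helper names are hypothetical, and the final concatenation stands in for an all-gather.

```python
def matmul(x, w):
    """Multiply x of shape [rows][k] by w of shape [k][cols]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split w into `parts` equal column blocks, one per simulated device."""
    cols = len(w[0])
    assert cols % parts == 0, "column count must divide evenly across devices"
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def column_parallel_matmul(x, w, tp=4):
    shards = split_columns(w, tp)                    # each device stores 1/tp of the weights
    partials = [matmul(x, shard) for shard in shards]  # independent per-device work
    # Stand-in for an all-gather: concatenate each device's output columns.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1, 3, 0, 1, 2],
     [0, 1, 1, 2, 0, 3, 2, 1]]
assert column_parallel_matmul(x, w, tp=4) == matmul(x, w)
print(column_parallel_matmul(x, w, tp=4))
```

Each simulated device holds only a quarter of the weight matrix and performs a quarter of the multiply, which is where the memory and latency benefits come from; the cost is the communication step that reassembles the result.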
Regarding the MLA kernel, the improvements are most significant during the decode stage. The decode kernel merges the query-sequence axis with the head axis to enlarge the M tile of BMM1 (the query-key matrix multiply), boosting Tensor Core efficiency. The binary prefill kernel uses NVIDIA-internal tuning parameters to optimize the softmax operation, and it outperforms TensorRT-LLM’s MLA on all five standard prefill workloads for coding agents with an extended prefix KV cache. Combined with additional optimizations, the MLA kernel cuts latency nearly in half compared to TensorRT-LLM on common decode workloads with speculative decoding at batch sizes 4, 8, and 16 with an extended prefix KV cache.
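The benefit of merging the two axes can be estimated with simple tile arithmetic. The sketch below rests on several assumptions not stated in the article: that all heads read the same latent KV cache (as in the weight-absorbed form of MLA), so query rows from every head and every query position can be stacked into one matrix; that each GPU holds 16 heads; that speculative decoding supplies 4 query positions per step; and that the kernel's M tile is 64 rows. The numbers are illustrative only.

```python
import math


def tile_utilization(m_rows, tile_m):
    """Fraction of rows in the padded M dimension that carry real data."""
    padded = math.ceil(m_rows / tile_m) * tile_m
    return m_rows / padded


num_heads = 16   # assumed heads per GPU
q_seqlen = 4     # assumed query positions per decode step (speculative decoding)
tile_m = 64      # assumed M tile height of the matrix-multiply instruction

# If each query position were its own multiply, M would equal num_heads.
separate = tile_utilization(num_heads, tile_m)
# Merging q_seqlen with num_heads gives M = q_seqlen * num_heads.
merged = tile_utilization(q_seqlen * num_heads, tile_m)

print(f"separate: {separate:.2f}, merged: {merged:.2f}")
```

Under these assumptions, processing each query position separately would fill only a quarter of every tile, while the merged layout fills it completely, which is consistent with the article's point that grouping helps most when num_heads is small.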
Key Takeaways
- TokenSpeed is a newly released MIT-licensed, open-source LLM inference engine created by LightSeek Foundation, purpose-built for agentic workloads. (Currently available in preview)
- Its scheduler uses a C++ finite-state machine and the type system to enforce KV cache safety at compile time, while keeping the execution plane in Python for developer productivity.
- On NVIDIA B200 GPUs running Kimi K2.5, TokenSpeed beats TensorRT-LLM with approximately 9% lower latency at minimum-latency settings (batch size 1) and approximately 11% higher throughput at around 100 TPS per user.
- The TokenSpeed MLA kernel cuts decode latency nearly in half compared to TensorRT-LLM on speculative decoding workloads and has already been integrated into vLLM.



