LightSeek Foundation Unveils TokenSpeed: Open-Source LLM Inference Engine Built To Match TensorRT-LLM Performance For Agentic Workloads

Inference efficiency has quietly become one of the most critical bottlenecks in deploying AI systems. As agentic coding platforms like Claude Code, Codex, and Cursor evolve from niche developer tools into foundational infrastructure powering large-scale software development, the inference engines behind these systems face mounting performance pressure. Researchers at the LightSeek Foundation have responded with TokenSpeed—an open-source LLM inference engine released under the MIT license, purpose-built for the unique demands of agentic workloads. TokenSpeed is currently available in a preview release.

Why Agentic Inference Presents Unique Challenges

Understanding TokenSpeed's design starts with what makes agentic inference so demanding. Unlike standard chatbot interactions, coding agent sessions routinely involve contexts exceeding 50K tokens across dozens of conversational turns. This puts simultaneous pressure on two distinct metrics: per-GPU TPM (tokens per minute), which determines how many users each GPU can serve, and per-user TPS (tokens per second), which determines how responsive the system feels to an individual user. Conventional public benchmarks rarely capture this dual pressure.

TokenSpeed was architected to excel at both simultaneously. Its core objective is maximizing per-GPU TPM while preserving a minimum per-user TPS threshold—typically 70 TPS, with capability scaling to 200+ TPS when required.
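The objective described above can be read as a constrained optimization: among the configurations that keep each user above the TPS floor, pick the one with the highest per-GPU TPM. The following is a minimal sketch of that selection, not TokenSpeed code. The `OperatingPoint` class, the `best_point` helper, and all measurements are hypothetical, and it assumes per-GPU TPM is simply the aggregate decode rate across concurrent users divided by the GPU count.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class OperatingPoint:
    """One measured serving configuration (all numbers are made up)."""
    concurrent_users: int        # requests being decoded at the same time
    tokens_per_sec_user: float   # per-user TPS seen by each request
    num_gpus: int                # GPUs used by this deployment

    @property
    def tpm_per_gpu(self) -> float:
        # Aggregate tokens/second over all users, scaled to a minute,
        # then divided by the number of GPUs.
        total_tps = self.concurrent_users * self.tokens_per_sec_user
        return total_tps * 60 / self.num_gpus


def best_point(points: Iterable[OperatingPoint],
               min_tps: float = 70.0) -> Optional[OperatingPoint]:
    """Return the highest per-GPU TPM point that meets the per-user TPS floor."""
    feasible = [p for p in points if p.tokens_per_sec_user >= min_tps]
    return max(feasible, key=lambda p: p.tpm_per_gpu, default=None)


# Hypothetical sweep over batch sizes on a 4-GPU deployment.
points = [
    OperatingPoint(concurrent_users=1, tokens_per_sec_user=210.0, num_gpus=4),
    OperatingPoint(concurrent_users=8, tokens_per_sec_user=120.0, num_gpus=4),
    OperatingPoint(concurrent_users=32, tokens_per_sec_user=75.0, num_gpus=4),
    OperatingPoint(concurrent_users=64, tokens_per_sec_user=48.0, num_gpus=4),
]
chosen = best_point(points)
```

In this made-up sweep the 64-user point has the highest raw throughput but falls below the 70 TPS floor, so the 32-user point is selected. This is the trade-off a Pareto frontier of TPM against TPS makes visible.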

Architecture: Five Integrated Subsystems

TokenSpeed’s architecture rests on five foundational pillars: a compiler-based parallel modeling framework, a high-performance scheduling engine, a safe KV resource reuse mechanism, a pluggable layered kernel system supporting heterogeneous accelerators, and SMG integration providing a lightweight CPU-side request entrypoint.

The modeling layer employs a local SPMD (Single Program, Multiple Data) execution model. In this distributed deep learning pattern, multiple processes execute identical programs across different data subsets. Instead of requiring manual implementation of inter-process communication logic, TokenSpeed lets developers add I/O placement annotations at module boundaries. A lightweight static compiler then automatically generates all necessary collective operations during model construction, removing the burden of hand-coding communication logic.
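To make the idea of placement annotations concrete, here is a toy sketch of a static pass that inserts collectives wherever one module's output placement differs from the next module's input placement. It is an illustration only. The placement names, the `Module` class, and `compile_plan` are all hypothetical and do not reflect TokenSpeed's actual annotation syntax or compiler.

```python
from dataclasses import dataclass

REPLICATED = "replicated"  # every rank holds the full tensor
SHARDED = "sharded"        # every rank holds one slice of the tensor
PARTIAL = "partial"        # every rank holds a partial sum of the tensor


@dataclass(frozen=True)
class Module:
    """A module annotated with the placement of its input and output."""
    name: str
    in_placement: str
    out_placement: str


# Operation needed to turn a producer's output placement into the
# placement its consumer expects.
_CONVERSIONS = {
    (PARTIAL, REPLICATED): "all_reduce",
    (PARTIAL, SHARDED): "reduce_scatter",
    (SHARDED, REPLICATED): "all_gather",
    (REPLICATED, SHARDED): "local_slice",  # no communication required
}


def compile_plan(modules):
    """Walk module boundaries once and insert the required conversions."""
    plan = []
    prev = None
    for mod in modules:
        if prev is not None and prev.out_placement != mod.in_placement:
            key = (prev.out_placement, mod.in_placement)
            if key not in _CONVERSIONS:
                raise ValueError(f"no conversion for {key} before {mod.name}")
            plan.append((_CONVERSIONS[key], f"{prev.name}->{mod.name}"))
        plan.append(("compute", mod.name))
        prev = mod
    return plan


# A hypothetical tensor-parallel transformer layer.
layer = [
    Module("qkv_proj", REPLICATED, SHARDED),
    Module("attention", SHARDED, SHARDED),
    Module("o_proj", SHARDED, PARTIAL),
    Module("mlp_up", REPLICATED, SHARDED),
    Module("mlp_down", SHARDED, PARTIAL),
    Module("norm", REPLICATED, REPLICATED),
]
plan = compile_plan(layer)
```

The model author only declares placements. The two `all_reduce` steps in the resulting plan are derived from the mismatches, which is the kind of hand-written communication logic the annotations are meant to replace.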

The scheduler separates the control plane from the execution plane. The control plane, built in C++ as a finite-state machine, tracks request lifecycles, KV cache resources, and scheduling timing through explicit FSM transitions and ownership semantics. Because these safety constraints are encoded in the type system, unsafe KV cache state transfers and usage, among the most bug-prone areas in LLM serving, are caught at compile time rather than at runtime, so correctness rests on a verifiable control system instead of developer convention. The execution plane remains in Python for developer productivity, which speeds feature iteration and reduces implementation complexity.
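The sketch below illustrates the two ideas in this design: explicit state transitions and single ownership of KV blocks. It is written in Python for consistency with the other examples, so the checks run at runtime. TokenSpeed's control plane is described as enforcing equivalent rules in C++ at compile time, which this sketch does not reproduce. All class names, states, and methods here are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    WAITING = auto()
    PREFILL = auto()
    DECODE = auto()
    FINISHED = auto()


# The only legal lifecycle transitions for a request.
_ALLOWED = {
    State.WAITING: {State.PREFILL},
    State.PREFILL: {State.DECODE, State.FINISHED},
    State.DECODE: {State.FINISHED},
    State.FINISHED: set(),
}


class KVBlocks:
    """Exclusive handle to a set of KV blocks; it can be moved out once."""

    def __init__(self, block_ids):
        self._ids = list(block_ids)

    def take(self):
        # Moving the blocks out invalidates this handle, so a second
        # release (a double free) is rejected instead of going unnoticed.
        if self._ids is None:
            raise RuntimeError("KV blocks already moved or released")
        ids, self._ids = self._ids, None
        return ids


class Request:
    def __init__(self, rid):
        self.rid = rid
        self.state = State.WAITING
        self.kv = None

    def transition(self, new_state):
        if new_state not in _ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state


class Scheduler:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def admit(self, req, num_blocks):
        if num_blocks > len(self.free):
            raise RuntimeError("not enough free KV blocks")
        req.transition(State.PREFILL)  # validate before touching resources
        blocks, self.free = self.free[:num_blocks], self.free[num_blocks:]
        req.kv = KVBlocks(blocks)

    def finish(self, req):
        req.transition(State.FINISHED)
        self.free.extend(req.kv.take())  # ownership returns to the pool


sched = Scheduler(num_blocks=8)
req = Request("r0")
sched.admit(req, 3)
req.transition(State.DECODE)
sched.finish(req)
```

A second `sched.finish(req)` raises because `FINISHED` has no outgoing transitions. A language with move semantics can reject the analogous misuse before the program ever runs.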

The kernel layer treats GPU kernels as a modular, first-class subsystem rather than embedding them in the engine core. It provides a portable public API, a centralized registry and selection framework, and an extensible plugin architecture that supports heterogeneous accelerators, so it is not restricted to NVIDIA hardware. The development team has also built one of the highest-performing MLA (Multi-head Latent Attention) kernels optimized for agentic workloads on NVIDIA Blackwell. In its decode kernel, q_seqlen and num_heads are grouped to maximize Tensor Core utilization, which addresses cases where num_heads is relatively small. The binary prefill kernel incorporates a fine-tuned softmax implementation. Notably, TokenSpeed’s MLA kernel has been integrated into vLLM.
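A registry-and-selection layer of this kind can be sketched in a few lines. The example below is a generic illustration of the pattern, not TokenSpeed's kernel API. The `register_kernel` and `select_kernel` functions and the backend names are hypothetical, and the "kernels" are plain Python functions standing in for accelerator code.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

# Central registry keyed by (operation name, backend name).
_REGISTRY: Dict[Tuple[str, str], Callable] = {}


def register_kernel(op: str, backend: str):
    """Decorator that lets a plugin register its implementation of an op."""
    def decorator(fn: Callable) -> Callable:
        _REGISTRY[(op, backend)] = fn
        return fn
    return decorator


def select_kernel(op: str, preferred: Sequence[str],
                  fallback: str = "reference") -> Tuple[str, Callable]:
    """Return the first registered backend for `op`, in preference order."""
    for backend in (*preferred, fallback):
        fn = _REGISTRY.get((op, backend))
        if fn is not None:
            return backend, fn
    raise LookupError(f"no kernel registered for op {op!r}")


@register_kernel("softmax", "reference")
def softmax_reference(xs):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


@register_kernel("softmax", "vendor_x")
def softmax_vendor_x(xs):
    # Stand-in for an accelerator-specific plugin; it reuses the reference
    # math so that the example stays runnable on any machine.
    return softmax_reference(xs)


backend, kernel = select_kernel("softmax", preferred=["vendor_y", "vendor_x"])
probs = kernel([1.0, 2.0, 3.0])
```

Here `vendor_y` has no registered kernel, so selection falls through to `vendor_x`. With no vendor kernel at all it would fall back to the portable reference implementation, which is what keeps the engine core independent of any one hardware vendor.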

Lastly, TokenSpeed features SMG — a PyTorch-native module — that provides a lightweight CPU-side request entrypoint, minimizing the overhead when switching between CPU orchestration and GPU computation.

Benchmark Results Compared to TensorRT-LLM on NVIDIA B200

These benchmarks cover only single (non-disaggregated) deployments. Support for PD (prefill/decode) disaggregation is still being refined and may be covered in a future deep-dive post from the TokenSpeed team.

In collaboration with the EvalScope team, TokenSpeed was tested using SWE-smith traces — datasets that closely replicate real-world coding-agent workloads — and compared against TensorRT-LLM, the current industry-leading solution on NVIDIA Blackwell. The model used for testing was Kimi K2.5.

For code-generation agents operating above 70 TPS per user, the optimal setup is Attention TP4 + MoE TP4, where TokenSpeed outperforms TensorRT-LLM across every point on the Pareto frontier: about 9% lower latency at minimum-latency settings (batch size 1), and roughly 11% greater throughput near 100 TPS per user. Here, TP4 means tensor parallelism spread across 4 GPUs — a method that distributes model weight matrices across multiple devices to ease per-device memory demands and cut latency.
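As a rough illustration of what TP4 means for a single weight matrix, the sketch below splits a matrix column-wise across four simulated devices, multiplies each shard independently, and concatenates the results. It uses plain Python lists so it runs anywhere. The function names are hypothetical, and a real engine would run the shards on separate GPUs and use a collective such as all-gather for the final step.

```python
def matmul(a, b):
    """Plain matrix product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def split_columns(w, parts):
    """Split weight matrix `w` into `parts` equal column shards."""
    n_cols = len(w[0])
    if n_cols % parts != 0:
        raise ValueError("column count must be divisible by the TP degree")
    step = n_cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def column_parallel_matmul(x, w, tp=4):
    """Compute x @ w with the columns of w sharded over `tp` devices."""
    shards = split_columns(w, tp)              # each device stores 1/tp of w
    partial = [matmul(x, s) for s in shards]   # independent per-device work
    # Stand-in for all-gather: concatenate each row's column slices in order.
    return [sum((p[r] for p in partial), []) for r in range(len(x))]


x = [[1, 2, 3],
     [4, 5, 6]]
w = [[c + 8 * r for c in range(8)] for r in range(3)]  # 3 x 8 weight matrix

sharded_result = column_parallel_matmul(x, w, tp=4)
full_result = matmul(x, w)
```

Each simulated device holds a quarter of the weights and does a quarter of the multiply-accumulate work, and the sharded result equals the unsharded one. The price is the communication step at the end.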

Regarding the MLA kernel, the improvements are most significant during the decode stage. The decode kernel merges the query-sequence axis with the head axis to maximize the BMM1 M tile size, boosting Tensor Core efficiency. The binary prefill kernel leverages NVIDIA-internal tuning parameters to optimize the softmax operation, surpassing TensorRT-LLM’s MLA on all five standard prefill workloads for coding agents using extended prefix KV cache. Paired with additional optimizations, this nearly cuts latency in half compared to TensorRT-LLM on common decode workloads with speculative decoding at batch sizes 4, 8, and 16 with extended prefix KV cache.
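The axis-merging trick can be shown with a small, pure-Python sketch. It assumes that BMM1 refers to the query-key score product and that all heads attend over the same keys, as with MLA's shared latent cache. Both are assumptions for illustration, not details confirmed by the article. Masking and softmax are omitted, and the function names are hypothetical.

```python
def matmul_nt(a, b):
    """Compute a @ b^T for row-major nested lists."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def scores_per_head(q, k):
    """One small matmul per head: the M dimension is only q_seqlen."""
    q_seqlen, num_heads = len(q), len(q[0])
    out = [[None] * num_heads for _ in range(q_seqlen)]
    for h in range(num_heads):
        q_h = [q[t][h] for t in range(q_seqlen)]   # [q_seqlen][d]
        s = matmul_nt(q_h, k)                      # [q_seqlen][kv_len]
        for t in range(q_seqlen):
            out[t][h] = s[t]
    return out


def scores_folded(q, k):
    """One larger matmul: the M dimension is q_seqlen * num_heads."""
    q_seqlen, num_heads = len(q), len(q[0])
    flat = [q[t][h] for t in range(q_seqlen) for h in range(num_heads)]
    s = matmul_nt(flat, k)                         # [q_seqlen*num_heads][kv_len]
    # Unfold the merged axis back into (query position, head).
    return [[s[t * num_heads + h] for h in range(num_heads)]
            for t in range(q_seqlen)]


# Tiny deterministic inputs: 3 query positions (for example, speculative
# draft tokens), 2 heads, head dimension 4, and 5 cached keys.
q_seqlen, num_heads, d, kv_len = 3, 2, 4, 5
q = [[[(t + 1) * (h + 2) + i for i in range(d)] for h in range(num_heads)]
     for t in range(q_seqlen)]
k = [[j - i for i in range(d)] for j in range(kv_len)]

per_head = scores_per_head(q, k)
folded = scores_folded(q, k)
```

Both functions produce the same scores. The folded version presents the hardware with one matrix of `q_seqlen * num_heads` rows instead of several matrices of `q_seqlen` rows each, which is why merging the axes helps fill larger tiles when either dimension alone is small.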
