For years, the way large language models handle inference has been stuck inside a box, literally. The high-bandwidth RDMA networks that make modern LLM serving work have confined both prefill and decode to the same datacenter, sometimes even the same rack. A team of researchers at Moonshot AI and Tsinghua University is making the case that this constraint is about to break down, and that the right architecture can already exploit that shift.
The research team introduces Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode. The result, in a case study using an internal 1T-parameter hybrid model, is 54% higher serving throughput than a homogeneous PD baseline and 32% higher than a naive heterogeneous setup, while consuming only a fraction of the available cross-datacenter bandwidth. The research team notes that, compared at equal hardware cost, the throughput gain is roughly 15%, reflecting that the full 54% advantage comes partly from pairing higher-compute H200 GPUs for prefill with H20 GPUs for decode.

Why the Current Architecture Has Hit a Wall
To understand what PrfaaS solves, it helps to know why LLM serving is split into two phases in the first place. Prefill is the step where the model processes all of the input tokens and generates the KVCache; it is compute-intensive. Decode is where the model generates output tokens one at a time; it is memory-bandwidth-intensive. Prefill-decode (PD) disaggregation separates these two phases onto different hardware, which improves utilization and allows each phase to be independently optimized.
The problem is that separating prefill from decode creates a transport problem. Once prefill runs on one set of machines and decode runs on another, the KVCache produced by prefill must be transferred to the decode side before output generation can begin. In conventional dense-attention models, those using Grouped Query Attention (GQA), this KVCache is huge. The research team benchmarks MiniMax-M2.5, a representative dense model with GQA, producing KVCache at roughly 60 Gbps for a 32K-token request on a single 8×H200 instance. That volume of data requires RDMA-class interconnects to transfer without stalling compute, which is why conventional PD disaggregation is tightly bound to a single datacenter-scale network fabric. Moving prefill and decode to separate clusters, let alone across datacenters, has simply not been feasible.
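A back-of-envelope calculation shows how a dense GQA model reaches RDMA-class transfer rates. The model dimensions below are hypothetical, chosen only to illustrate the arithmetic; the paper reports just the resulting ~60 Gbps figure for MiniMax-M2.5 at 32K tokens.

```python
def kvcache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                  seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of the K and V caches for one request (fp16/bf16 by default)."""
    # Factor of 2 covers the separate K and V tensors per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

def kv_throughput_gbps(cache_bytes: int, prefill_latency_s: float) -> float:
    """KV throughput = KVCache size / prefill latency, expressed in Gbps."""
    return cache_bytes * 8 / prefill_latency_s / 1e9

# Hypothetical GQA model: 60 layers, 8 KV heads, head_dim 128, 32K-token prompt,
# with prefill assumed to take 1 second.
size = kvcache_bytes(num_layers=60, num_kv_heads=8, head_dim=128, seq_len=32_768)
rate = kv_throughput_gbps(size, prefill_latency_s=1.0)
print(f"{size / 2**30:.1f} GiB, {rate:.1f} Gbps")  # 7.5 GiB, 64.4 Gbps
```

Even with these rough assumed dimensions, a single long-context request lands in the tens of gigabits per second, which is why commodity Ethernet cannot keep up for dense models.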
Hybrid Attention Changes the Math
What makes PrfaaS timely is an architectural shift happening at the model level. A growing class of models, including Kimi Linear, MiMo-V2-Flash, Qwen3.5-397B, and Ring-2.5-1T, adopt hybrid attention stacks that interleave a small number of full-attention layers with a larger number of linear-complexity or bounded-state layers such as Kimi Delta Attention (KDA), Multi-head Latent Attention (MLA), and Sliding Window Attention (SWA). In these architectures, only the full-attention layers produce KVCache that scales with sequence length. The linear-complexity layers maintain fixed-size recurrent states whose footprint is negligible at long context.
The KV throughput numbers, defined as KVCache size divided by prefill latency, tell the story clearly. At 32K tokens, MiMo-V2-Flash produces KVCache at 4.66 Gbps versus 59.93 Gbps for MiniMax-M2.5, a 13× reduction. Qwen3.5-397B reaches 8.25 Gbps versus 33.35 Gbps for Qwen3-235B, a 4× reduction. For Ring-2.5-1T specifically, the paper decomposes the savings: MLA contributes roughly a 4.5× compression over GQA, and the 7:1 hybrid ratio contributes another roughly 8× reduction, yielding an overall KV memory saving of roughly 36×. For the internal 1T model used in the case study, KV throughput at 32K tokens is just 3.19 Gbps, a level that modern inter-datacenter Ethernet links can actually sustain.
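The Ring-2.5-1T decomposition above is just multiplication of the two independent savings, which is worth making explicit:

```python
# The two independent sources of KV memory saving reported for Ring-2.5-1T.
mla_compression = 4.5       # MLA vs. GQA, per full-attention layer (~4.5x)
hybrid_layer_reduction = 8  # 7:1 linear:full ratio -> only 1 of 8 layers holds KVCache (~8x)

# The savings compose multiplicatively because they act on different axes:
# per-layer cache size vs. the number of cache-producing layers.
overall_saving = mla_compression * hybrid_layer_reduction
print(f"overall KV memory saving ~ {overall_saving:.0f}x")  # ~ 36x
```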
But the research team is careful to make a distinction that matters for AI devs building real systems: a smaller KVCache is necessary but not sufficient to make cross-datacenter PD disaggregation practical. Real workloads are bursty, request lengths are skewed, prefix caches are distributed unevenly across nodes, and inter-cluster bandwidth fluctuates. A naive design that routes every prefill to a remote cluster still runs into congestion and unstable queuing.


What PrfaaS Actually Does
The PrfaaS-PD architecture sits on top of three subsystems: compute, network, and storage. The compute subsystem separates clusters into two types: local PD clusters that handle end-to-end inference for short requests, and PrfaaS clusters with high-compute-throughput accelerators dedicated to long-context prefill. The network subsystem uses intra-cluster RDMA for fast local transfers and commodity Ethernet for cross-cluster KVCache transport. The storage subsystem builds a distributed hybrid prefix cache pool that handles linear attention recurrent states (request-level, fixed-size, exact-match only) and full-attention KVCache blocks (block-level, growing linearly with input length, supporting partial prefix matching) in separate groups backed by a unified block pool.
The key routing mechanism is length-based threshold routing. Let l denote the incremental prefill length of a request after subtracting any cached prefix, and t a routing threshold. If l > t, the request goes to the PrfaaS cluster and its KVCache is sent over Ethernet to a decode node. If l ≤ t, it stays on the local PD path. In the case study, the optimal threshold is t = 19.4K tokens, which routes roughly 50% of all requests, the longer ones, to the PrfaaS cluster.
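The routing rule is simple enough to sketch directly. The function and field names here are illustrative; the paper specifies only the rule l > t on the incremental (post-prefix-cache) length:

```python
THRESHOLD_TOKENS = 19_400  # the optimal t from the case study (~19.4K tokens)

def route(prompt_tokens: int, cached_prefix_tokens: int,
          threshold: int = THRESHOLD_TOKENS) -> str:
    """Return which cluster should run prefill for this request."""
    incremental = prompt_tokens - cached_prefix_tokens  # l: length left to prefill
    return "prfaas" if incremental > threshold else "local_pd"

# A long prompt with no cache hit crosses the threshold and goes remote;
# the same prompt with half of it already cached stays on the local PD path.
print(route(32_768, 0))       # prfaas
print(route(32_768, 16_384))  # local_pd
```

Note that the threshold applies after prefix-cache subtraction, so a long request with a warm cache can still be served locally.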
Making the Ethernet path reliable in practice requires more than just low KV throughput. The research team specifies three concrete transport mechanisms: layer-wise prefill pipelining to overlap KVCache generation with transmission, multi-connection TCP transport to fully utilize available bandwidth, and congestion monitoring integrated with the scheduler to detect loss and retransmission signals early and prevent congestion from accumulating.
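The first mechanism, layer-wise pipelining, can be sketched as a producer-consumer pair: each layer's KVCache blocks are handed off for transmission as soon as that layer finishes, rather than waiting for the whole prefill. The queue-and-thread structure below is a toy illustration under assumed names, not the paper's implementation:

```python
import queue
import threading

def prefill_worker(num_layers: int, out: queue.Queue) -> None:
    """Simulate prefill: emit each layer's KVCache as soon as it is computed."""
    for layer in range(num_layers):
        kv_block = f"kv_layer_{layer}"  # stand-in for the layer's KVCache tensor
        out.put(kv_block)               # hand off for transmission immediately
    out.put(None)                       # sentinel: prefill finished

def transmit_worker(out: queue.Queue, sent: list) -> None:
    """Drain the queue, overlapping 'network sends' with ongoing prefill."""
    while (block := out.get()) is not None:
        sent.append(block)  # stand-in for a multi-connection TCP send

q, sent = queue.Queue(), []
t = threading.Thread(target=transmit_worker, args=(q, sent))
t.start()
prefill_worker(num_layers=4, out=q)
t.join()
print(sent)  # ['kv_layer_0', 'kv_layer_1', 'kv_layer_2', 'kv_layer_3']
```

The point of the overlap is that by the time the last layer's prefill completes, most of the KVCache is already in flight, hiding the Ethernet latency behind compute.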
On top of this, the research team introduces a dual-timescale scheduler. At short timescales, it monitors PrfaaS egress utilization and queue depth, adjusting routing when the link approaches its bandwidth ceiling. It also handles cache-affine routing: when bandwidth is scarce, each cluster's prefix cache is evaluated independently; when bandwidth is ample, the scheduler considers the best cached prefix across all clusters and performs a cross-cluster cache transfer if doing so reduces redundant computation. At longer timescales, the scheduler rebalances prefill and decode node counts within the local PD cluster as traffic patterns shift, keeping the system near the throughput-optimal operating point.
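The two short-timescale decisions can be condensed into a pair of predicates. The thresholds and data shapes below are assumptions for illustration; the paper describes the behavior but not these specific values:

```python
def should_route_remote(egress_util: float, queue_depth: int,
                        util_ceiling: float = 0.8, max_queue: int = 64) -> bool:
    """Short timescale: throttle remote prefill as the Ethernet link saturates."""
    return egress_util < util_ceiling and queue_depth < max_queue

def best_prefix_cluster(prefix_hits: dict, local: str,
                        bandwidth_ample: bool) -> str:
    """Cache-affine routing: consider remote caches only when bandwidth is ample.

    prefix_hits maps cluster name -> longest cached prefix (tokens).
    """
    if not bandwidth_ample:
        return local  # scarce bandwidth: evaluate only the local cluster's cache
    # Ample bandwidth: pick whichever cluster holds the best cached prefix,
    # accepting a cross-cluster cache transfer to skip redundant prefill.
    return max(prefix_hits, key=prefix_hits.get)

print(should_route_remote(egress_util=0.95, queue_depth=3))               # False
print(best_prefix_cluster({"dc_a": 4096, "dc_b": 12288}, "dc_a", True))   # dc_b
print(best_prefix_cluster({"dc_a": 4096, "dc_b": 12288}, "dc_a", False))  # dc_a
```

A fuller version would also weigh the transfer cost of the remote prefix against the prefill compute it saves, which is the trade the scheduler is making.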
The Numbers
In the case study, a PrfaaS cluster of 32 H200 GPUs is paired with a local PD cluster of 64 H20 GPUs, connected by a VPC network providing roughly 100 Gbps of cross-cluster bandwidth. The aggregate PrfaaS egress load under the optimal configuration is roughly 13 Gbps, just 13% of available Ethernet capacity, and the paper notes that the PrfaaS cluster remains compute-bound with substantial bandwidth headroom to spare. The research also projects this to larger deployments: even at the scale of a 10,000-GPU datacenter, the aggregate egress bandwidth required for KVCache transfer totals only about 1.8 Tbps, well within the capacity of modern inter-datacenter links.
Mean Time to First Token (TTFT) drops by 50% and P90 TTFT drops by 64% compared to the homogeneous baseline. The naive heterogeneous configuration, with all prefill on H200, all decode on H20, and no routing or scheduling logic, achieves only 1.16× throughput over the homogeneous baseline, compared to 1.54× for the full PrfaaS-PD system. The gap between 1.16× and 1.54× isolates the contribution of the scheduling layer and shows that it accounts for the majority of the practical gain.
The research team positions PrfaaS not as a near-future concept but as a design that is viable today for hybrid-architecture models, and argues that as context windows grow, KVCache compression techniques mature, and phase-specialized hardware such as NVIDIA's Rubin CPX for prefill and LPU-style chips for decode becomes more broadly available, the case for cross-datacenter PD disaggregation will only strengthen.



