We’re making Cloudflare one of the best places to build and deploy agents. But reliable agents aren’t built on prompts alone; they require a robust, coordinated infrastructure of underlying primitives.
At Cloudflare, we have been building these primitives for years: Durable Objects for state persistence, Workflows for long-running tasks, and Dynamic Workers or Sandbox containers for secure execution. Powerful abstractions like the Agents SDK are designed to help you build agents on top of Cloudflare’s Developer Platform.
But these primitives only provided the execution environment. The agent still needed a model capable of powering it.
Starting today, Workers AI is officially in the big models game. We now offer frontier open-source models on our AI inference platform, beginning with the release of Moonshot AI’s Kimi K2.5 model on Workers AI. With a full 256k context window and support for multi-turn tool calling, vision inputs, and structured outputs, Kimi K2.5 is great for all kinds of agentic tasks. By bringing a frontier-scale model directly into the Cloudflare Developer Platform, we’re making it possible to run the entire agent lifecycle on a single, unified platform.
The heart of an agent is the AI model that powers it, and that model needs to be smart, with strong reasoning capabilities and a large context window. Workers AI now runs these models.
The price-performance sweet spot
We spent the past few weeks testing Kimi K2.5 as the engine for our internal development tools. Within our OpenCode environment, Cloudflare engineers use Kimi as a daily driver for agentic coding tasks. We have also integrated the model into our automated code review pipeline; you can see this in action via our public code review agent, Bonk, on Cloudflare GitHub repos. In production, the model has proven to be a fast, efficient alternative to larger proprietary models without sacrificing quality.
Serving Kimi K2.5 began as an experiment, but it quickly became essential once we saw how well the model performs and how cost-efficient it is. As an illustrative example: we have an agent that does security reviews of Cloudflare’s codebases. This agent processes over 7B tokens per day, and using Kimi, it has caught more than 15 confirmed issues in a single codebase. Doing some rough math, if we had run this agent on a mid-tier proprietary model, we would have spent $2.4M a year for this single use case, on a single codebase. Running this agent with Kimi K2.5 cost just a fraction of that: we cut costs by 77% simply by making the switch to Workers AI.
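As a back-of-the-envelope sketch of that rough math: the annual figures come from the numbers above, while the implied per-token rates are derived for illustration and are not published price lists.

```javascript
// Back-of-the-envelope version of the rough math above. The annual figures
// come from the post; the per-token rates are derived, not real price lists.
const tokensPerDay = 7e9;                  // ~7B tokens/day for the security-review agent
const tokensPerYear = tokensPerDay * 365;  // ~2.56T tokens/year

const proprietaryAnnual = 2_400_000;       // ~$2.4M/year on a mid-tier proprietary model
const savings = 0.77;                      // the 77% reduction from switching to Workers AI
const kimiAnnual = Math.round(proprietaryAnnual * (1 - savings));

// Implied blended $ per 1M tokens at that volume:
const proprietaryRate = proprietaryAnnual / (tokensPerYear / 1e6);
const kimiRate = proprietaryRate * (1 - savings);
console.log(kimiAnnual, proprietaryRate.toFixed(2), kimiRate.toFixed(2));
```

At that volume, even a sub-dollar difference per million tokens compounds into millions per year, which is why the per-token rate dominates the decision.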
As AI adoption increases, we’re seeing a fundamental shift not only in how engineering teams are working, but in how individuals are working. It’s becoming increasingly common for people to have a personal agent like OpenClaw running 24/7. The volume of inference is skyrocketing.
This rise in personal and coding agents means that cost is no longer a secondary concern; it’s the primary blocker to scaling. When every employee has multiple agents processing hundreds of thousands of tokens per hour, the math for proprietary models stops working. Enterprises will look to transition to open-source models that offer frontier-level reasoning without the proprietary price tag. Workers AI is here to facilitate this shift, providing everything from serverless endpoints for a personal agent to dedicated instances powering autonomous agents across an entire organization.
The big model inference stack
Workers AI has served models, including LLMs, since its launch two years ago, but we’ve historically prioritized smaller models. Part of the reason was that for a while, open-source LLMs fell far behind the models from frontier labs. This changed with models like Kimi K2.5, but to serve this kind of very large LLM, we had to make changes to our inference stack. We wanted to share some of what goes on behind the scenes to support a model like Kimi.
We’ve been working on custom kernels for Kimi K2.5, built on top of our proprietary Infire inference engine, to optimize how we serve the model. Custom kernels improve the model’s performance and GPU utilization, unlocking gains that would otherwise go unclaimed if you were simply running the model out of the box. There are also several strategies and hardware configurations that can be leveraged to serve a large model. Developers often use a combination of data, tensor, and expert parallelism strategies to optimize model performance. Techniques like disaggregated prefill are also important, in which you separate the prefill and generation phases onto different machines in order to get better throughput or higher GPU utilization. Implementing these techniques and incorporating them into the inference stack takes a lot of dedicated talent to get right.
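To make the scale concrete, here is a rough sketch of why sharding across GPUs is unavoidable at this size. Every number below (parameter count, precision, GPU memory) is an illustrative assumption, not Kimi K2.5’s actual configuration or our real deployment.

```javascript
// Illustrative memory math for serving a frontier-scale model.
// All figures are assumptions for illustration, not a real serving spec.
const totalParams = 1e12;       // ~1T parameters for a frontier-scale MoE
const bytesPerParam = 1;        // e.g. FP8 quantization: one byte per weight
const weightsGB = (totalParams * bytesPerParam) / 1e9;

const gpuMemGB = 80;            // a typical 80 GB accelerator
const usableFraction = 0.7;     // leave headroom for KV cache and activations

// GPUs needed just to hold the weights, before serving a single token:
const minGPUs = Math.ceil(weightsGB / (gpuMemGB * usableFraction));
console.log(weightsGB, minGPUs);
```

Even before accounting for the KV cache, the weights alone span many GPUs across multiple nodes, which is exactly why tensor and expert parallelism, and careful placement of prefill versus generation, matter so much.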
Workers AI has already done the experimentation with serving strategies to yield excellent throughput on Kimi K2.5. A lot of this doesn’t come out of the box when you self-host an open-source model. The benefit of using a platform like Workers AI is that you don’t have to be a Machine Learning Engineer, a DevOps expert, or a Site Reliability Engineer to do the optimizations required to host it: we’ve already done the hard part, you just have to call an API.
Beyond the model: platform improvements for agentic workloads
In concert with this launch, we’ve also improved our platform and are releasing several new features to help you build better agents.
Prefix caching and surfacing cached tokens
When you work with agents, you’re likely sending a lot of input tokens as part of the context: this could be detailed system prompts, tool definitions, MCP server tools, or entire codebases. Inputs can be as large as the model context window, so in theory, you could be sending requests with almost 256k input tokens. That’s a lot of tokens.
When an LLM processes a request, the request is broken down into two phases: the prefill phase processes input tokens, and the generation phase produces output tokens. These phases are usually sequential, where input tokens must be fully processed before you can generate output tokens. This means that sometimes the GPU is not fully utilized while the model is doing prefill.
With multi-turn conversations, when you send a new prompt, the client sends all the previous prompts, tools, and context from the session to the model as well. The delta between consecutive requests is usually just a few new lines of input; all the other context has already gone through the prefill phase during a previous request. This is where prefix caching helps. Instead of doing prefill on the entire request, we can cache the input tensors from a previous request and only do prefill on the new input tokens. This saves a lot of time and compute in the prefill phase, which means a faster Time to First Token (TTFT) and higher Tokens Per Second (TPS) throughput, since you’re not blocked on prefill.
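As a toy illustration of the savings (all token counts here are invented):

```javascript
// Toy model of prefill work across a multi-turn session (token counts invented).
// Each turn resends the full conversation; with prefix caching, only the new
// suffix has to go through prefill.
function totalPrefill(turns, usePrefixCache) {
  let context = 0; // tokens accumulated in the conversation so far
  let work = 0;    // input tokens actually pushed through prefill
  for (const newTokens of turns) {
    context += newTokens;
    work += usePrefixCache ? newTokens : context; // cache hit: only the delta
  }
  return work;
}

// A big first prompt (system prompt + tool definitions), then small deltas:
const turns = [8000, 120, 95, 300];
const withoutCache = totalPrefill(turns, false); // re-prefills the whole context each turn
const withCache = totalPrefill(turns, true);     // prefills only what is new
console.log(withoutCache, withCache);
```

In this made-up four-turn session, caching cuts prefill work by roughly 4x, and the gap only widens as the conversation grows.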
Workers AI has always done prefix caching, but we are now surfacing cached tokens as a usage metric and offering a discount on cached tokens compared to input tokens. (Pricing can be found on the model page.) We also have new ways for you to achieve a higher prefix cache hit rate, reducing your costs.
In order to route to the same model instance and take advantage of prefix caching, we use a new x-session-affinity header. When you send this header, you’ll improve your cache hit ratio, leading to more cached tokens and therefore faster TTFT, higher TPS, and lower inference costs.
You can pass the new header as shown below, with a unique string per session or per agent. Some clients like OpenCode implement this automatically out of the box. Our Agents SDK starter has already set up the wiring to do this for you, too.
curl -X POST \
  "https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/moonshotai/kimi-k2.5" \
  -H "Authorization: Bearer {API_TOKEN}" \
  -H "Content-Type: application/json" \
  -H "x-session-affinity: ses_12345678" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is prefix caching and why does it matter?"
      }
    ],
    "max_tokens": 2400,
    "stream": true
  }'
A revamped Asynchronous API
Serverless inference is really hard. With a pay-per-token business model, it’s cheaper on a per-request basis because you don’t have to pay for entire GPUs to service your requests. But there’s a trade-off: you have to deal with other people’s traffic and capacity constraints, and there’s no strict guarantee that your request will be processed. This isn’t unique to Workers AI; it’s evidently the case across serverless model providers, given the frequent news reports of overloaded providers and service disruptions. While we always strive to serve your request and have built-in autoscaling and rebalancing, there are hard limitations (like hardware) that make this a challenge.
For volumes of requests that would exceed synchronous rate limits, you can submit batches of inferences to be completed asynchronously. We’re introducing a revamped Asynchronous API, which means that for asynchronous use cases, you won’t run into Out of Capacity errors and inference will execute durably at some point. Our async API looks more like flex processing than a batch API, where we process requests in the async queue as long as we have headroom on our model instances. In internal testing, our async requests usually execute within 5 minutes, but this will depend on what live traffic looks like. As we bring Kimi to the public, we’ll tune our scaling accordingly, but the async API is the best way to guarantee you don’t run into capacity errors in durable workflows. This is perfect for use cases that aren’t real-time, such as code scanning agents or research agents.
Workers AI previously had an asynchronous API, but we’ve recently revamped the systems under the hood. We now rely on a pull-based system as opposed to the historical push-based system, allowing us to pull in queued requests as soon as we have capacity. We’ve also added better controls to tune the throughput of async requests, monitoring GPU utilization in real time and pulling in async requests when utilization is low, so that critical synchronous requests get priority while asynchronous requests are still processed efficiently.
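Conceptually, the pull-based scheduler behaves something like the sketch below. The threshold and load numbers are invented for illustration; this is not our actual implementation.

```javascript
// Simplified sketch of pull-based async scheduling (numbers invented):
// a serving node pulls queued requests only while live synchronous traffic
// leaves GPU headroom, so synchronous requests keep priority.
function pullAsyncRequests(utilizationPct, queue, thresholdPct = 70) {
  const pulled = [];
  while (queue.length > 0 && utilizationPct < thresholdPct) {
    const req = queue.shift();       // the node pulls work; nothing is pushed to it
    pulled.push(req);
    utilizationPct += req.loadPct;   // each pulled request consumes some headroom
  }
  return { pulled, utilizationPct };
}

const makeQueue = () => [
  { id: "a", loadPct: 20 },
  { id: "b", loadPct: 20 },
  { id: "c", loadPct: 20 },
];
const busyNode = pullAsyncRequests(90, makeQueue()); // saturated by live traffic: pulls nothing
const idleNode = pullAsyncRequests(30, makeQueue()); // headroom available: drains the queue
console.log(busyNode.pulled.length, idleNode.pulled.length);
```

The key difference from a push-based design is that a saturated node simply stops pulling, rather than rejecting work that was pushed to it; queued requests just wait until some node has headroom.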
To use the asynchronous API, you send your requests as shown below. We also have a way to set up event notifications so that you know when the inference is complete instead of polling for the request.
// (1.) Queue a request by passing queueRequest: true
let res = await env.AI.run("@cf/moonshotai/kimi-k2.5", {
  "requests": [{
    "messages": [{
      "role": "user",
      "content": "Tell me a joke"
    }]
  }, {
    "messages": [{
      "role": "user",
      "content": "Explain the Pythagorean theorem"
    }]
  }] // ...plus any additional requests
}, {
  queueRequest: true,
});

// (2.) Grab the request id
let request_id;
if (res && res.request_id) {
  request_id = res.request_id;
}

// (3.) Poll the status
res = await env.AI.run("@cf/moonshotai/kimi-k2.5", {
  request_id: request_id
});
if (res && (res.status === "queued" || res.status === "running")) {
  // retry by polling again
  // ...
} else {
  return Response.json(res); // This will contain the final completed response
}
Get started with Kimi K2.5 on Workers AI today. You can read our developer docs to find model information and pricing, and take advantage of prompt caching via session affinity headers and the asynchronous API. The Agents SDK starter also now uses Kimi K2.5 as its default model. You can also connect to Kimi K2.5 on Workers AI via OpenCode. For a live demo, try it in our playground.
And if this set of problems around serverless inference, ML optimizations, and GPU infrastructure sounds interesting to you, we’re hiring!



