When we originally built Workflows, our durable execution engine for multi-step applications, it was designed for a world in which workflows were triggered by human actions, like a user signing up or placing an order. For use cases like onboarding flows, workflows only had to support one instance per person, and people can only click so fast.
Over time, what we’ve actually seen is a quantitative shift in the workload and access pattern: fewer human-triggered workflows, and more agent-triggered workflows, created at machine speed.
As agents become persistent and autonomous infrastructure, working on behalf of users for hours or days, they need a durable, asynchronous execution engine for the work they’re doing. Workflows provides exactly that: every step is independently retryable, the workflow can pause for human-in-the-loop approval, and each instance survives failures without losing progress.
Moreover, workflows themselves are being used to implement agent loops and serve as the durable harnesses that manage agents and keep them alive. Our Agents SDK integration accelerated this, making it easy for agents to spawn workflow instances and get real-time progress back. A single agent session can now kick off dozens of workflows, and many agents running concurrently means thousands of instances created in seconds. With Project Think now available, we anticipate that velocity will only increase.
To help developers scale their agents and applications on Workflows, we’re excited to announce that we now support:
50,000 concurrent instances (number of workflow executions running in parallel), originally 4,500
300 instances/second created per account, previously 100
2 million queued instances (meaning instances that have been created or awoken and are waiting for a concurrency slot) per workflow, up from 1 million
We redesigned the Workflows control plane from usage data and first principles to support these increases. For V1 of the control plane, a single Durable Object (DO) could serve as the central registry and coordinator of an entire account. For V2, we built two new components to help horizontally scale the system and alleviate the bottlenecks that V1 introduced, before migrating all customers, with live traffic, seamlessly onto the new version.
V1: initial architecture of Workflows
As described in our public beta blog post, we built Workflows entirely on our own developer platform. Fundamentally, a workflow is a sequence of durable steps, each independently retryable, that can execute tasks, wait for external events, or sleep until a predetermined time.
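import { WorkflowEntrypoint } from "cloudflare:workers";

// fetchFromAPI, transform, and store are placeholders for application code.
export class MyWorkflow extends WorkflowEntrypoint {
  async run(event, step) {
    // Each step is durable and retried independently of the others.
    const data = await step.do("fetch-data", async () => {
      return fetchFromAPI();
    });
    // Pause until an "approval" event arrives, or give up after 24 hours.
    const approval = await step.waitForEvent("approval", {
      type: "approval",
      timeout: "24 hours",
    });
    await step.do("process-and-save", async () => {
      return store(transform(data));
    });
  }
}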
To trigger each instance, execute its logic, and store its metadata, we leverage SQLite-backed Durable Objects, which are a simple but powerful primitive for coordination and storage within a distributed system.
In the control plane, some Durable Objects, like the Engine (which executes the actual workflow instance, including its step, retry, and sleep logic), are spun up at a ratio of 1:1 per instance. On the other hand, the Account is an account-level Durable Object that manages all workflows and workflow instances for that account.
To learn more about the V1 control plane, refer to our Workflows announcement blog post.
After we launched Workflows into beta, we were thrilled to see customers quickly scaling their use of the product, but we also realized that having a single Durable Object to store all that account-level information introduced a bottleneck. Many customers needed to create and execute hundreds or even thousands of Workflow instances per minute, which could quickly overwhelm the Account in our original architecture. The original rate limits (4,500 concurrency slots and 100 instance creations per 10 seconds) were a result of this limitation.
On the V1 control plane, these limits were a hard cap. Any and all operations depending on Account, including create, update, and list, had to go through that single DO. Users with high-concurrency workloads could have thousands of instances starting and ending at any given moment, building up to thousands of requests per second to Account. To solve this, we rearchitected the workflow control plane so that it scales horizontally to higher concurrency and creation rate limits.
V2: horizontal scale for higher throughput
For the new version, we rethought every single operation from the ground up with the goal of optimizing for high-volume workflows. Ultimately, Workflows should scale to support whatever developers need, whether that’s thousands of instances created per second or millions of instances running at a time. We also wanted to ensure that V2 allowed for flexible limits, which we can toggle and continue raising, rather than the hard cap that the V1 limits imposed. After many design iterations, we settled on the following pillars for our new architecture:
The source of truth for the existence of a given instance should be its Engine and nothing else.
In the V1 control plane architecture, we lacked a check before queuing the instance as to whether its Engine actually existed. This allowed for a bad state where an instance may have been queued without its corresponding Engine having spun up.
Instance lifecycle and liveness mechanisms must be horizontally scalable per-workflow and distributed throughout many regions.
The new Account singleton should only store the minimal necessary metadata and have an invariant maximum number of concurrent requests.
There are two new, critical components in the V2 control plane which allowed us to improve the scalability of Workflows: SousChef and Gatekeeper. The first component, SousChef, is a “second in command” to the Account. Recall that previously, the Account managed the metadata and lifecycle for all of the instances across all of the workflows within a given account. SousChef was introduced to keep track of metadata and lifecycle on a subset of instances in a given workflow. Within an account, a distribution of SousChefs can then report back to Account in a more efficient and manageable way. (An added benefit of this design: not only did we already have per-account isolation, but we also inadvertently gained “per-workflow” isolation within the same account, since each SousChef only takes care of one specific workflow).
The second component, Gatekeeper, is a mechanism to distribute concurrency “slots” (derived from concurrency limits) across all SousChefs within the account. It acts as a leasing system. When an instance is created, it’s randomly assigned to one of the SousChefs within that account. Then the SousChef makes a request to Account to trigger that instance. Either a slot is granted, or the instance is queued. Once the slot is granted, the SousChef triggers execution of the instance and assumes responsibility for making sure that the instance never gets stuck.
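To make the grant-or-queue decision concrete, here is a minimal sketch of a slot ledger. It is an illustration under assumptions, not the Gatekeeper implementation: the `SlotLedger` class, its method names, and the first-in, first-out ordering of the queue are all hypothetical.

```typescript
// Hypothetical sketch of a concurrency-slot ledger; not the Workflows source.
type SlotDecision = "granted" | "queued";

class SlotLedger {
  private readonly limit: number;
  private inUse = 0;
  private readonly waiting: string[] = []; // queued instance IDs, oldest first

  constructor(limit: number) {
    this.limit = limit;
  }

  // Ask for a slot: grant it if one is free, otherwise queue the instance.
  request(instanceId: string): SlotDecision {
    if (this.inUse < this.limit) {
      this.inUse += 1;
      return "granted";
    }
    this.waiting.push(instanceId);
    return "queued";
  }

  // Give a slot back when an instance finishes. If an instance is waiting,
  // the slot moves directly to the oldest one, whose ID is returned.
  release(): string | undefined {
    if (this.inUse === 0) return undefined; // nothing to release
    const next = this.waiting.shift();
    if (next === undefined) this.inUse -= 1; // nobody waiting: slot is free
    return next;
  }

  get active(): number {
    return this.inUse;
  }

  get queued(): number {
    return this.waiting.length;
  }
}
```

With a limit of 2, a third `request` returns `"queued"`, and the next `release` hands the freed slot to that queued instance instead of lowering the active count.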
Gatekeeper was needed to make sure that Engines never overloaded their Account (a pressing risk on V1), so every communication between SousChefs and their Account happens on a periodic cycle, once per second. Each cycle also batches all slot requests, ensuring that only one JSRPC call is made. This ensures the instance creation rate can never overload or affect the most important component, Account (as an aside: if the SousChef count is too high, we rate-limit calls or spread them across different SousChefs throughout different time periods). This periodic property also allows us to preserve fairness for older instances and to ensure max-min fairness across the many SousChefs, allowing them all to progress. For example, if an instance wakes up, it should be prioritized for a slot over a newly created instance, but each SousChef ensures that its own instances don’t get stuck.
This architecture is more distributed and, therefore, more scalable. Now, when an instance is created, the request path is:
Check the control plane version
Check if a cached version of the workflow and version details is available in that location
If not, check Account to get the workflow name, unique ID, and version, and cache that information
Store only the necessary metadata (instance payload, creation date) onto its own Engine
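The four steps above can be sketched as follows. This is a simplified, synchronous model under assumptions: `CreateRequestPath`, `WorkflowInfo`, `EngineMetadata`, and the `askAccount` callback are hypothetical names, and the real path involves asynchronous calls between Durable Objects that are not shown here.

```typescript
// Hypothetical sketch of the V2 creation path; all names are illustrative.
type ControlPlaneVersion = "v1" | "v2";

interface WorkflowInfo {
  name: string;
  id: string;
  version: string;
}

interface EngineMetadata {
  workflowId: string;
  workflowVersion: string;
  payload: unknown;
  createdAt: string;
}

class CreateRequestPath {
  // Location-local cache of workflow details, keyed by workflow name.
  private readonly cache = new Map<string, WorkflowInfo>();
  private readonly askAccount: (workflowName: string) => WorkflowInfo;

  constructor(askAccount: (workflowName: string) => WorkflowInfo) {
    this.askAccount = askAccount;
  }

  create(
    version: ControlPlaneVersion,
    workflowName: string,
    payload: unknown,
    createdAt: Date,
  ): EngineMetadata {
    // 1. Check the control plane version.
    if (version !== "v2") {
      throw new Error("V1 accounts take the legacy path (not shown)");
    }
    // 2. Look for cached workflow details in this location.
    let info = this.cache.get(workflowName);
    if (info === undefined) {
      // 3. Cache miss: ask Account once, then remember the answer.
      info = this.askAccount(workflowName);
      this.cache.set(workflowName, info);
    }
    // 4. Return only the metadata that the instance's own Engine stores.
    return {
      workflowId: info.id,
      workflowVersion: info.version,
      payload,
      createdAt: createdAt.toISOString(),
    };
  }
}
```

In this model, Account is consulted only on a cache miss, so repeated creations for the same workflow in one location never touch it again.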
So, how does Engine tell the control plane that it now exists? That happens in the background after instance metadata is set. As background operations on a Durable Object can fail, due to eviction or server failure, we also set an “alarm” on Engine in the creation hot-path. That way, if the background task doesn’t finish, the alarm ensures that the instance will begin.
A Durable Object alarm allows a Durable Object instance to be awakened at a fine-grained time in the future with an at-least-once execution model, with automatic retries built in. We extensively use this combination of background “tasks” and alarms to move operations off the hot-path while still ensuring that everything will happen as planned. That’s how we keep critical operations like creating an instance fast without ever compromising on reliability.
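The background-task-plus-alarm pattern can be modeled in a few lines. The sketch below is an assumption-laden simulation, not Engine code: `FakeAlarmStorage` is a synchronous stand-in for the alarm part of Durable Object storage (the real alarm API is asynchronous), and `EngineSketch`, `announce`, and the one-second delay are hypothetical. The key point it illustrates is that the fallback handler must be idempotent, because an at-least-once alarm may fire more than once.

```typescript
// Synchronous stand-in for the alarm portion of Durable Object storage.
class FakeAlarmStorage {
  alarmTime: number | null = null;

  setAlarm(time: number): void {
    this.alarmTime = time;
  }

  deleteAlarm(): void {
    this.alarmTime = null;
  }
}

// Hypothetical model of an Engine announcing itself to the control plane.
class EngineSketch {
  private readonly storage: FakeAlarmStorage;
  private readonly announce: () => void;
  private announced = false;

  constructor(storage: FakeAlarmStorage, announce: () => void) {
    this.storage = storage;
    this.announce = announce;
  }

  // Hot path: persist metadata (omitted) and arm the safety net, then return.
  create(now: number): void {
    this.storage.setAlarm(now + 1000);
  }

  // Background task: may never run if the object is evicted or the host fails.
  backgroundTask(): void {
    this.ensureAnnounced();
    this.storage.deleteAlarm(); // the fallback is no longer needed
  }

  // Alarm handler: delivered at least once, so it must be safe to repeat.
  alarm(): void {
    this.ensureAnnounced();
  }

  private ensureAnnounced(): void {
    if (this.announced) return;
    this.announce();
    this.announced = true;
  }
}
```

Whether the background task succeeds or is lost, `announce` ends up being called exactly once in this model, and the creation call itself never waits on it.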
Other than unlocking scale, this version of the control plane means that:
Instance listing performance is faster, and actually consistent with cursor pagination;
Any operation on an instance takes exactly one network hop (as it can go directly to its Engine, ensuring that eyeball request latency is as small as we can manage);
We can ensure that more instances are actually behaving correctly (by running on time) concurrently (and correct them if not, making sure that Engines are never late to continue execution).
Now that we had a new version of the Workflows control plane that can handle a higher volume of user load, we needed to do the “boring” part: migrating our customers and instances to the new system. At Cloudflare’s scale, this becomes a problem in and of itself, so the “boring” part becomes the biggest challenge. Well before its one-year mark, Workflows had already racked up millions of instances and thousands of customers. Also, some tech debt on V1’s control plane meant that a queued instance might not have its own Engine Durable Object created yet, complicating matters further.
Such a migration is tricky because customers might have instances running at any given moment; we needed a way to add the SousChef and Gatekeeper components into older accounts without causing any disruption or downtime.
We ultimately decided that we would migrate existing Accounts (which we’ll refer to as AccountOlds) to behave like SousChefs. By persisting the Account DOs, we maintained the instance metadata, and simply converted the DO into a SousChef “DO”:
// You might be wondering what this SousChef class is. It is the SousChef DO class!
import { DurableObject } from "cloudflare:workers";
import { SousChef } from "@repo/souschef";

class AccountOld extends DurableObject {
  constructor(state: DurableObjectState, env: Env) {
    super(state, env);
    // We added the following snippet to the end of our AccountOld DO's
    // constructor. This ensures that, if we want, we can use any primitive
    // that is available on the SousChef DO.
    if (this.currentVersion === ControlPlaneVersions.SOUS_CHEFS) {
      this.sousChef = new SousChef(this.ctx, this.env);
      // A constructor cannot await; blockConcurrencyWhile is one way to run
      // the async setup here before other events are delivered.
      this.ctx.blockConcurrencyWhile(() => this.sousChef.setup());
    }
  }

  async updateInstance(params: UpdateInstanceParams) {
    if (this.currentVersion === ControlPlaneVersions.SOUS_CHEFS) {
      assert(this.sousChef !== undefined, "SousChef should exist on v2");
      return this.sousChef.updateInstance(params);
    }
    // old logic stays the same
  }

  @RequiresVersion(ControlPlaneVersions.V1)
  async getMetadata() {
    // this method can only be run if
    // this.currentVersion === ControlPlaneVersions.V1
  }
}

We can instantiate the SousChef class within the AccountOld because the SQL tables that track instance metadata are the same on both SousChef and AccountOld DOs. As such, we could simply decide which version of the code to use. If this hadn’t been the case, we would have been forced to migrate the metadata of millions of instances, which would have made the migration harder and longer-running for each account. So, how did the migration work?
First, we prepared AccountOld DOs to be switched to behave as SousChefs (which meant making a release with a version of the snippet above). Then, we enabled control plane V2 per account, which triggered the next three steps at roughly the same time:
All new instance creation requests are now routed to the new SousChefs (SousChefs are created when they receive their first request); new instances never go to AccountOld again;
AccountOld DOs start migrating themselves to behave like SousChefs;
The new Account DO is spun up with the corresponding metadata.
After all accounts were migrated to the new control plane version, we were able to sunset AccountOld DOs as their instance retention periods expired. Once all instances on all accounts on AccountOlds had been migrated, we could spin down those DOs permanently. The migration was completed with no downtime, in a process that truly felt like changing a car’s wheels while driving.
If you’re new to Workflows, try our Get Started guide or build your first durable agent with Workflows.
If your use case requires higher limits than our new defaults (a concurrency limit of 50,000 slots and an account-level creation rate limit of 300 instances per second, 100 per workflow), reach out via your account team or the Workers Limit Request Form. You can also reach out with feedback, feature requests, or just to share how you’re using Workflows on our Discord server.



