Crafting A Saga Rollback System For Cloudflare Workflows

Cloudflare Workflows lets you create sturdy, multi-step applications that include automatic retries and keep track of state even during long-running processes. When a Workflow runs, each step can reach out to outside systems, automatically retry if something goes wrong, and remember its progress across restarts. But if one step runs into trouble, it might leave behind work from earlier steps in a half-finished or inconsistent state.

Today, we're rolling out saga rollbacks for Workflows, so you can specify how to undo a step right inside that step's definition, in case a later part of the Workflow fails.

For instance, imagine a workflow that moves money between accounts at two different banks:

1. Subtract funds from an account at Bank A
2. Add funds to an account at Bank B
3. Email both account owners to confirm the transfer

What if Step 2—adding funds to the account at Bank B—fails? Once the subtraction at Bank A has succeeded, the bank has already processed it and the money is gone from their records. As the one coordinating this transaction, you can’t just hit “undo” on Bank A’s end. Instead, you need to put the money back into the Bank A account through a new operation that basically reverses the original one.

This combination of an action and its matching reversal logic is known as the saga pattern.

Previously, developers had to write their own reversal logic separately, keeping track of what worked, what didn’t, and what should be undone if something broke—all outside the actual step definitions. Now, you can set up your own reversal logic directly inside each step.do() call, passing it as an argument.

// keep log of what finished so we know what to reverse
let debitA;
let creditB;
try {
  debitA = await step.do("debit-bank-a", () => bankA.debit(from, amount));
  creditB = await step.do("credit-bank-b", () => bankB.credit(to, amount));
  await step.do("notify", () => notifyBoth(from, to, amount));
} catch (error) {
  // reverse steps, starting from the end. each undo is its own durable step,
  // must do nothing if called more than once, and shouldn't stop if one fails.
  if (creditB) {
    try {
      await step.do("reverse-credit-b", () => bankB.debit(to, amount, creditB.id));
    } catch (e) {
      await alertOnCall("reverse-credit-b failed", e);
    }
  }
  if (debitA) {
    try {
      await step.do("refund-debit-a", () => bankA.credit(from, amount, debitA.id));
    } catch (e) {
      await alertOnCall("refund-debit-a failed", e);
    }
  }
  throw error;
}

Without rollbacks

// every step comes with its own undo. include a step,
// add its rollback right here. no expanding catch
// block, no manual sequencing, no re-execution logic.
await step.do("debit-bank-a", () => bankA.debit(from, amount), {
  rollback: async ({ output }) => bankA.credit(from, amount, output.id),
});
await step.do("credit-bank-b", () => bankB.credit(to, amount), {
  rollback: async ({ output }) => bankB.debit(to, amount, output.id),
});
await step.do("notify", () => notifyBoth(from, to, amount));

With rollbacks

To use rollbacks, simply pass an options object with a rollback function as the final argument to step.do().

const debit = await step.do(
  "debit-account-a",
  async () => {
    return await bankA.debit({
      accountId: fromAccountId,
      amount,
      idempotencyKey: `${transferId}:debit-account-a`,
    });
  },
  {
    rollback: async () => {
      await bankA.credit({
        accountId: fromAccountId,
        amount,
        idempotencyKey: `${transferId}:rollback-debit-account-a`,
      });
    },
  }
);

// The idempotency keys make both the forward operations and their rollbacks safe to retry without double-processing

const credit = await step.do(
  "credit-account-b",
  async () => {
    return await bankB.credit({
      accountId: toAccountId,
      amount,
      idempotencyKey: `${transferId}:credit-account-b`,
    });
  },
  {
    rollback: async ({ output }) => {
      if (output === undefined) {
        return;
      }

      await bankB.debit({
        accountId: toAccountId,
        amount,
        idempotencyKey: `${transferId}:rollback-credit-account-b`,
      });
    },
  }
);


// If an error happens here, we might want to undo every previous payment. Developers shouldn't have to build complicated try-catch logic just for reversing a couple of simple payments (see below)

await step.do("send-confirmation", async () => {
  await sendTransferConfirmation({ ... });
});

Rollback functions should be able to run multiple times safely, just like the main Workflow steps. When you refund a charge, make use of the payment provider’s idempotency key. If you’re releasing reserved stock, be sure it’s okay to free it more than once.
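As a rough illustration of that requirement, the sketch below uses an in-memory stand-in for a payment provider. FakeLedger, its credit() method, and refundDebit() are hypothetical names invented for this example, not a Cloudflare or bank API. The only point is that a repeated call with the same idempotency key has no further effect.

```typescript
// Hypothetical in-memory ledger, used only to illustrate idempotent rollbacks.
interface CreditRequest {
  accountId: string;
  amount: number;
  idempotencyKey: string;
}

class FakeLedger {
  private balances = new Map<string, number>();
  private seenKeys = new Set<string>();

  // Applies the credit once per idempotency key; repeated keys are ignored.
  credit(req: CreditRequest): void {
    if (this.seenKeys.has(req.idempotencyKey)) {
      return;
    }
    this.seenKeys.add(req.idempotencyKey);
    this.balances.set(req.accountId, this.balance(req.accountId) + req.amount);
  }

  balance(accountId: string): number {
    return this.balances.get(accountId) ?? 0;
  }
}

// Stand-in for a rollback handler body. Because rollbacks can be retried,
// this function may run more than once for the same transfer.
function refundDebit(ledger: FakeLedger, transferId: string): void {
  ledger.credit({
    accountId: "from-account",
    amount: 100,
    idempotencyKey: `${transferId}:rollback-debit-account-a`,
  });
}
```

Calling refundDebit() twice with the same transferId credits the account once, which is the property a retried rollback handler depends on.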

If a step's main action fails and the Workflow fails as a result, the registered rollback handlers run in reverse step-start order. It sounds straightforward: run the undo steps when something goes wrong. In reality, there are a few specifics that shape how the API and execution model work.

1. The failing step might need rollback too. A step.do() that experienced an error can still require reversal if it has a rollback handler attached.

Rollback won't activate just because user code catches an error and keeps going. But if user code catches a step's error and the Workflow later fails for another reason, rollback still runs for any handlers already registered, in reverse step-start order.

Why? The step may have partially engaged with an external service before failing. For example, a payment provider might successfully capture a charge but the step doesn’t return the chargeId back to Workflows. That’s why rollback handlers accept an output argument but must be ready for when output === undefined.

2. Rollback only kicks in when the Workflow itself fails. Having a rollback handler on a step doesn’t mean every error in that step triggers it. If user code catches an error and keeps going, the Workflow moves on. Rollback kicks in when the entire Workflow is about to fail permanently.

Once rollback begins, Workflows identifies eligible step.do() calls, triggers their rollback handlers, and then logs the final Workflow failure.

3. Order must be predictable. For workflows where steps run one after another, rollback order feels intuitive:

Reserve stock.
Charge the card.
Create shipment.
If shipment can’t go through, refund the card and free up the inventory.

With things happening in parallel, it gets a bit more complicated. Steps might finish in a different order from how they started, so Workflows uses reverse step-start order rather than reverse completion order.

The practical rules are:

Any steps that started or completed and have rollback handlers are eligible.
The failing step.do() is also eligible if it has a rollback handler.
Handlers activate in reverse step-start order, not completion order.
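The rules above can be sketched as a small pure function. The StepRecord shape and the rollbackOrder() name are assumptions made for this example; they are not the engine's actual data model.

```typescript
// Hypothetical shape of a persisted step record; field names are illustrative.
interface StepRecord {
  name: string;
  startSeq: number; // order in which step.do() was called
  status: "running" | "completed" | "failed";
  hasRollback: boolean;
}

// Returns the names of the steps to compensate, most recently started first.
function rollbackOrder(records: StepRecord[]): string[] {
  return records
    .filter((s) => s.hasRollback) // started, completed, or failed: all eligible
    .sort((a, b) => b.startSeq - a.startSeq) // start order, never completion order
    .map((s) => s.name);
}

const stepHistory: StepRecord[] = [
  { name: "reserve-stock", startSeq: 1, status: "completed", hasRollback: true },
  { name: "charge-card", startSeq: 2, status: "completed", hasRollback: true },
  { name: "send-receipt", startSeq: 3, status: "completed", hasRollback: false },
  { name: "create-shipment", startSeq: 4, status: "failed", hasRollback: true },
];

const plannedOrder = rollbackOrder(stepHistory);
```

Here plannedOrder lists create-shipment first (the failing step is eligible because it registered a handler), then charge-card, then reserve-stock. send-receipt is skipped because it has no rollback handler.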

Once we aligned on the expected behavior, we still needed to introduce this new pattern into the Workflows API. The rollbacks feature went through several design rounds before we settled on the rollback options approach.

Why not use a fluent or builder-style API?

The initial design used a fluent pattern: step.do(...).rollback(...). This reads naturally — the forward action and its compensation sit side by side, and the call site resembles standard JavaScript chaining.

The issue is that step.do() already carries an important responsibility: it initiates a durable step and returns a Promise for that step’s result. In Workers, promise-like values carry special significance because Workers RPC supports promise pipelining, a technique borrowed from systems like Cap’n Proto.

Promise pipelining allows code to invoke a method on a value that hasn’t fully resolved yet. For instance:

const session = api.authenticate(apiKey);
const name = await session.whoami();

In this case, session isn’t the actual session object yet. It’s more like a placeholder for the session that will eventually exist. When you call session.whoami(), Workers can dispatch that call to the remote side immediately and instruct: “once authentication produces the session, invoke whoami() on it.”

This eliminates a round trip. The caller doesn’t need to wait for authenticate() to fully complete before requesting whoami().

We explored a fluent API:

step.do("charge-card", chargeCard).rollback(refundCharge);

To someone reading the code, this could appear as “call .rollback() on the result of charge-card.” But rollback isn’t part of the step’s output. It’s part of the step.do() configuration, registered before the step begins, so Workflows knows how to compensate the step if a later step fails.

A fluent API also makes step timing harder to follow. Currently, step.do() starts the step when it’s invoked, so developers can kick off a step, perform other work, and await the first step later:

const first = step.do("first", () => serviceA.call());

await step.do("second", () => serviceB.call());

await first;

Under the current model, first begins immediately, before second. A fluent API would muddy this. Workflows would need to hold off and check whether .rollback() gets attached before it can determine the full step definition. That could postpone when the step is dispatched to the engine.

In the earlier example, first could begin at await first instead of at step.do("first", ...), after second has already finished.

That makes concurrent Workflows harder to reason about: step timing would hinge on when the returned Promise is consumed, not just where step.do() is invoked.
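A toy model makes the timing difference concrete. Neither eagerStep() nor DeferredStep below is the Workflows API; both are hypothetical stand-ins that only record when a step would start. The deferred variant assumes the fluent design would have to wait until the result is consumed before dispatching the step.

```typescript
// Toy model only: records the order in which steps would start.
const startLog: string[] = [];

// Models the current behavior: the step starts as soon as it is called.
function eagerStep(name: string): Promise<string> {
  startLog.push(name);
  return Promise.resolve(name);
}

// Models a hypothetical fluent design that cannot start until it knows
// whether .rollback() was attached, so it starts when the result is consumed.
class DeferredStep implements PromiseLike<string> {
  private readonly stepName: string;
  private started?: Promise<string>;

  constructor(stepName: string) {
    this.stepName = stepName;
  }

  rollback(_handler: () => void): this {
    return this;
  }

  then<R1 = string, R2 = never>(
    onFulfilled?: ((value: string) => R1 | PromiseLike<R1>) | null,
    onRejected?: ((reason: any) => R2 | PromiseLike<R2>) | null,
  ): PromiseLike<R1 | R2> {
    if (!this.started) {
      startLog.push(this.stepName); // the step only starts here
      this.started = Promise.resolve(this.stepName);
    }
    return this.started.then(onFulfilled, onRejected);
  }
}

// Same shape as the first/second example, using the deferred model.
async function deferredDemo(): Promise<string[]> {
  startLog.length = 0;
  const first = new DeferredStep("first").rollback(() => {});
  await new DeferredStep("second");
  await first;
  return [...startLog];
}
```

In this model deferredDemo() resolves to ["second", "first"], whereas calling eagerStep("first") and then eagerStep("second") records the steps in the order they were written.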

We also weighed a builder-style API:

const charge = await step
	.saga("charge")
	.do(() => chargeCard())
	.rollback(() => refundCharge())
	.run();

A builder API sidesteps the Promise ambiguity. It also provides a natural home for future step-level options, and makes it clear that the forward action and rollback action belong to the same saga step.

But it introduces extra ceremony. Every step requires a final .run(); omitting .run() would be easy to do and hard to catch without tooling, and simple one-step cases start to resemble configuration chains. It also introduces a new step.saga() builder, breaking from the existing convention of calling methods directly on step. Most critically, it makes step.do() feel like a legacy API rather than the core Workflows primitive. The aim of rollback was to augment step.do(), not supplant it.

Rollback as step metadata

step.do(..., { rollback })

In the end, we settled on the explicit form where rollback is metadata attached to the step.

This approach keeps each rollback defined alongside the forward step itself. Each handler receives the error that triggered the rollback, the step context, and the output. The output is the persisted value returned by the forward step, or undefined if the step failed before persisting a value. A step that returns nothing also yields undefined.

Rollbacks emit lifecycle events, so you can track whether compensation started, which rollback handler failed, and whether rollback completed successfully.

Importantly, the original Workflow failure stays separate: rollback is what Workflows does in response to the failure, not the reason the Workflow failed.

Just as you can customize retry and timeout behavior in the step configuration via WorkflowStepConfig, you supply rollback-specific settings in rollbackConfig.

{
  rollback: async ({ output }) => {
    await bankA.credit({ accountId: fromAccountId, amount, transferId: `${transferId}-reversal` });
  },
  rollbackConfig: {
    retries: { limit: 10, delay: '30 seconds', backoff: 'exponential' },
    timeout: '2 minutes',
  },
}

This aligns with the lifecycle-event mental model we were aiming for. A step.do() already describes a durable unit of work that Workflows records, retries, and later displays in logs. Rollback is another lifecycle behavior for that same unit of work. It should travel with the step definition, not reside in a separate wrapper or builder.

The step still begins when step.do() normally begins.
The returned promise still represents the step output.
Concurrent Workflow code retains the same execution model.
Retry and timeout options for rollback sit next to the rollback handler.
Existing step.do() calls continue to work exactly as they do today.

This shape is slightly more explicit than the fluent API, but that explicitness is beneficial. The operation and its compensation remain in one place, and the API doesn't introduce a new step builder or a new kind of promise. Developers who already understand step.do() need to learn just one extra options object.

This approach is less magical than a fluent chain, but it's easier to adopt and simpler to grasp.

How it works under the hood

Rollback may seem like a minor API addition, but it changes what Workflows must track about each step.

A standard step.do() already maintains a durable record. Workflows logs that the step began, whether it finished, what it returned, and whether it should be skipped rather than re-executed if the Workflow resumes later.

Rollbacks introduce one additional item to that record: whether the step registered compensation logic.

This means Workflows has two pieces of information to piece together if the Workflow fails.

The first is durable step history. The Workflow engine stores data to track what ran, what completed, what output was saved, and whether rollback was registered.

The second is the rollback handler itself — the function written to compensate for that step. Workflows does not store the source of that function as data. Instead, it maintains a callable reference to the handler while the Workflow is running.

In Workers RPC, this kind of callable reference is known as a stub. A stub allows one part of the system to invoke code running elsewhere. Stubs also have lifetimes, so they can be cleaned up when a call or execution context ends. If you need to retain a stub beyond that point, Workers RPC offers a dup() method, which creates another handle pointing to the same target.

For rollback, that model is handy. The durable step history records what needs compensation. The rollback stub gives Workflows a way to call the compensation code. And because rollback handlers may need to outlive the step.do() call that registered them, Workflows retains its own callable reference to the handler for the rollback phase.

In the typical case, when a Workflow enters rollback within the same engine lifetime, Workflows already holds the rollback stubs it needs. It can use the durable step history to identify eligible steps, then invoke the rollback stubs that were registered during forward execution.

This becomes more nuanced when Workflows must recover after a restart.

If the engine is evicted, crashes, or restarts while rollback is needed, Workflows still has the durable step history, but it may no longer have the in-memory rollback stubs. To recover, Workflows relies on replay: a recovery mode where it re-runs the Workflow code without re-executing completed forward step bodies.

When replay reaches a completed step.do(), Workflows reads the persisted result instead of running the step body again. For rollback recovery, Workflows only needs to rebuild handlers for steps that had rollback attached and are eligible for rollback. As those step.do() calls are encountered, their rollback options can register the callable stubs once more.

That allows Workflows to recover the rollback handlers it needs without duplicating the original external side effects.
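The replay idea can be modeled in a few lines. MiniEngine below is a hypothetical, synchronous toy written for this example; it is not the Workflows engine, and the real step.do() is asynchronous. It only shows how a persisted result lets a replayed step skip its body while still registering its rollback handler again.

```typescript
// Toy, synchronous model of replay; not the Workflows engine.
type Rollback = (ctx: { output: unknown }) => void;

class MiniEngine {
  // Durable part: survives a restart.
  readonly persisted = new Map<string, unknown>();
  // In-memory part: lost on restart, rebuilt during replay.
  handlers = new Map<string, Rollback>();

  do<T>(name: string, body: () => T, opts?: { rollback?: Rollback }): T {
    if (opts?.rollback) {
      this.handlers.set(name, opts.rollback); // registered on every run
    }
    if (this.persisted.has(name)) {
      return this.persisted.get(name) as T; // replay: skip the body
    }
    const output = body();
    this.persisted.set(name, output);
    return output;
  }

  simulateRestart(): void {
    this.handlers = new Map();
  }
}

function replayDemo(): { bodyRuns: number; output: string; handlerRebuilt: boolean } {
  const engine = new MiniEngine();
  let bodyRuns = 0;
  const workflow = (): string =>
    engine.do(
      "charge-card",
      () => {
        bodyRuns++; // stands in for the external side effect
        return "charge-1";
      },
      { rollback: () => {} },
    );

  workflow(); // first run executes the body and persists its output
  engine.simulateRestart(); // handlers are gone, step history remains
  const output = workflow(); // replay: body skipped, handler registered again
  return { bodyRuns, output, handlerRebuilt: engine.handlers.has("charge-card") };
}
```

In this model the step body runs once across both passes, the replayed call returns the persisted "charge-1", and the rollback handler is available again after the simulated restart.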

With those pieces in place, rollback can work whether the handler is still available in memory or must be rebuilt during recovery.

When the workflow is about to fail, Workflows does not ask your application to reconstruct what happened. It already has the step history. It can examine the persisted record and answer the critical questions:

Which steps started?
Which steps finished?
Which failed step may still need cleanup?
Which steps registered rollback handlers?
What output should each rollback handler receive?
What order should compensation run in?

Then Workflows invokes each rollback stub with a rollback context: the original error, the step context, and the step output, if one was persisted.

The ordering detail matters. In normal JavaScript, especially with Promise.all(), completion order does not always match start order. If step A starts first and step B starts second, step B might finish first. For rollback, Workflows uses the persisted start order as the stable source of truth, then unwinds it in reverse.

Rollback handlers also run through Workflows’ normal step machinery. That means compensation receives the same operational properties you expect from Workflows: retries, timeouts, lifecycle events, logs, and a final recorded outcome. If a rollback handler keeps failing after its configured retries, Workflows records the rollback outcome as failed, halts the remaining rollback handlers, and the Workflow instance ultimately ends in the Errored state.
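A minimal sketch of that sequential behavior follows. The RollbackTask and RollbackResult shapes and the runRollbacks() function are hypothetical, and the sketch assumes that a retry limit means one initial attempt plus up to that many retries; the real engine's accounting may differ.

```typescript
// Toy model of the rollback phase; names and shapes are illustrative only.
interface RollbackTask {
  name: string;
  run: () => void; // throws to signal a failed attempt
}

interface RollbackResult {
  succeeded: string[];
  failed?: string; // handler that exhausted its retries, if any
  skipped: string[]; // handlers that were never attempted
}

// Runs handlers one at a time, in the order given (already reverse start order).
function runRollbacks(tasks: RollbackTask[], retryLimit: number): RollbackResult {
  const result: RollbackResult = { succeeded: [], skipped: [] };
  for (const task of tasks) {
    if (result.failed !== undefined) {
      result.skipped.push(task.name); // an earlier handler failed: halt the rest
      continue;
    }
    let ok = false;
    // Assumption: one initial attempt plus up to retryLimit retries.
    for (let attempt = 0; attempt <= retryLimit && !ok; attempt++) {
      try {
        task.run();
        ok = true;
      } catch {
        // swallow the error and try again until the limit is reached
      }
    }
    if (ok) {
      result.succeeded.push(task.name);
    } else {
      result.failed = task.name;
    }
  }
  return result;
}
```

A handler that succeeds on a later attempt counts as succeeded. One that never succeeds is recorded as failed, and every handler after it is reported as skipped rather than run.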

This is the key difference between saga rollbacks and a catch block. A catch block only knows what is still in memory at its exact point in your JavaScript execution. Workflows rollback uses persisted step history to determine what already happened, invokes the stubs it already has in the common case, and safely rebuilds missing stubs during recovery when necessary.

That is also why the API places rollback on step.do() itself. Rollback is not a separate global error handler — it is metadata attached to the durable unit of work Workflows already understands.

Our initial rollout of rollbacks includes:

Explicit per-step rollback handlers for step.do()
Sequential rollback execution
Retry and timeout configuration for compensation

Next, we want to explore:

When a multi-step application fails halfway through, the hardest part is often not knowing that it failed. It is knowing what already happened, and what needs to happen next.

Saga rollbacks let you place that answer directly beside each step. If you are building multi-step applications with Workflows, try saga rollbacks and tell us what compensation patterns you want next. Get started with the Workflows documentation and share feedback in the Cloudflare Community.
