For the last two and a half quarters, we’ve been deeply involved in an internal engineering initiative called “Code Orange: Fail Small”. The goal? To boost the resilience, security, and reliability of Cloudflare’s infrastructure for all our customers.
Just earlier this month, the Cloudflare team wrapped up this major project.
While building resilience is an ongoing priority that will always be part of our development process, we’ve now finished the specific work that would have prevented the global outages on November 18, 2025 and December 5, 2025.
This initiative targeted several critical areas: making configuration changes safer, minimizing the blast radius when things go wrong, updating our emergency “break glass” procedures and incident response protocols, putting safeguards in place to prevent future drift and regressions, and improving how we communicate with customers during outages.
Below, we break down exactly what we’ve implemented and what it means for you.
Safer configuration changes
What it means for you: Most Cloudflare internal configuration changes no longer go live across our entire network all at once. Instead, they’re rolled out gradually with real-time health checks. This means our monitoring tools can spot issues and automatically roll them back before they ever impact your traffic.
To catch risky deployments before they hit production, we’ve pinpointed high-risk configuration pipelines and developed new tools to manage changes more effectively.
For products running on our network that handle customer traffic and receive configuration updates, we’ve stopped pushing changes out instantly. Instead, the relevant teams now use a “health-mediated deployment” approach — the same method we already use for software releases — for all configuration deployments. This covers, but isn’t limited to, the product teams directly involved in the recent incidents.
At the heart of this is a new internal tool we’ve built called Snapstone, designed specifically to bring health-mediated deployment to configuration changes. Snapstone packages configuration changes into a bundle and enables gradual rollout with built-in health checks. Before Snapstone, applying this approach to configurations was possible but cumbersome — it demanded significant effort from each team and wasn’t applied consistently. Snapstone bridges this gap by offering a standardized way to add progressive rollouts, real-time health monitoring, and automatic rollbacks to configuration deployments by default.
What makes Snapstone especially powerful is its adaptability. It’s not just a band-aid for past failures — it lets teams dynamically define any configuration unit that needs health mediation, whether that’s a data file like the one behind the November 18 outage, or a control flag in our global configuration system like the one tied to the December 5 outage. Teams can create these configuration units as needed, and Snapstone ensures they’re deployed safely wherever they’re used.
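To make the idea concrete, here is a rough sketch of what a health-mediated rollout loop can look like. This is not Snapstone’s actual code or API (every name below is hypothetical), but it captures the core loop: apply a change to a small slice of the fleet, check health, expand, and roll back automatically if anything regresses.

```rust
// A minimal sketch of health-mediated configuration rollout, in the spirit
// of what this post describes. Illustrative only: Snapstone's real
// interfaces are internal, and every name below is hypothetical.

#[derive(Clone)]
pub struct ConfigBundle {
    pub version: u64,
    pub payload: Vec<u8>,
}

/// What a configuration unit would need to expose so a rollout engine can
/// deploy it gradually, observe health, and roll it back automatically.
pub trait ConfigUnit {
    fn apply(&mut self, bundle: &ConfigBundle, percent_of_fleet: u8) -> Result<(), String>;
    fn healthy(&self) -> bool;
    fn rollback(&mut self, previous: &ConfigBundle) -> Result<(), String>;
}

/// Push a new bundle out in progressively larger waves, checking health after
/// each wave and reverting to the last known-good bundle if anything regresses.
pub fn health_mediated_rollout<U: ConfigUnit>(
    unit: &mut U,
    previous: &ConfigBundle,
    next: &ConfigBundle,
) -> Result<(), String> {
    for percent in [1u8, 5, 25, 50, 100] {
        unit.apply(next, percent)?;
        if !unit.healthy() {
            // Health regression detected: revert before the change reaches
            // the rest of the fleet.
            unit.rollback(previous)?;
            return Err(format!(
                "rollout of v{} halted at {}% and rolled back to v{}",
                next.version, percent, previous.version
            ));
        }
    }
    Ok(())
}
```

The rollout percentages and health signals come from each team’s definition of the configuration unit; the point of the tool is that this loop is the default behavior, rather than something every team has to rebuild on its own.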
This gives us a capability we previously lacked: when a risk review or operational experience uncovers a dangerous configuration pattern, the remedy is simple — bring it into Snapstone, and that pattern immediately benefits from safe deployment practices.
Reducing the impact of failure
What it means for you: If an issue does arise on our network, our systems are now designed to fail more gracefully. This dramatically shrinks the potential blast radius, ensuring your traffic keeps flowing even in worst-case scenarios.
Product teams have thoroughly reviewed — both manually and through automated analysis — the potential failure modes for products critical to serving customer traffic. Teams have stripped out non-essential runtime dependencies and built in smarter failure responses. Where possible, the system now falls back to the last known working configuration (“fail stale”). When that’s not an option, we’ve evaluated each failure scenario and implemented either “fail open” or “fail closed” — choosing based on whether serving traffic with reduced functionality is better than not serving it at all.
Here’s a concrete example. Our November 2025 outage was caused by a failed rollout of our Bot Management machine learning classifier. Under our new procedures, if the system again encountered data it couldn’t read, it would reject the updated configuration and revert to the previous one. And if the old configuration happened to be unavailable for any reason, the system would fail open to keep customer production traffic flowing — a far better outcome than an outage.
So if the same Bot Management change that triggered the November failure were deployed today, the system would catch the problem in the early stages of rollout, before it could affect more than a small fraction of traffic.
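Here is a simplified sketch, in Rust, of that order of preference. The types and function names are hypothetical, but the decision order mirrors what we described: accept a valid new configuration, otherwise fall back to the last known-good one, and only if neither is usable, fail open rather than drop traffic.

```rust
// A simplified sketch of the "fail stale, then fail open" order of preference
// described above. Types and function names are hypothetical; only the
// decision order reflects the behavior this post describes.

pub struct BotModel {
    pub feature_count: usize,
}

pub enum ClassifierState {
    /// The new configuration parsed and validated successfully.
    Fresh(BotModel),
    /// The new configuration was rejected; keep serving with the previous one.
    Stale(BotModel),
    /// No usable configuration at all: serve traffic without bot scores
    /// rather than dropping it (fail open).
    Open,
}

pub fn parse_model(raw: &[u8]) -> Result<BotModel, String> {
    // Validate the input instead of assuming it is well-formed.
    if raw.is_empty() {
        return Err("empty feature file".to_string());
    }
    Ok(BotModel { feature_count: raw.len() })
}

pub fn load_classifier(new_raw: &[u8], last_known_good: Option<BotModel>) -> ClassifierState {
    match parse_model(new_raw) {
        Ok(model) => ClassifierState::Fresh(model),
        Err(_) => match last_known_good {
            Some(previous) => ClassifierState::Stale(previous), // fail stale
            None => ClassifierState::Open,                      // fail open
        },
    }
}
```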
We’ve also started further segmenting our systems so that independent copies of services handle different groups of traffic. Cloudflare already uses customer cohorts for blast radius control through existing traffic management techniques, and this additional process segmentation gives us a powerful new reliability tool going forward.
Take the Workers runtime system as an example: it’s now divided into multiple independent services, each handling different traffic cohorts — with one dedicated solely to free customer traffic. Changes are deployed to these segments based on customer tier, starting with free customers first. We also push updates more rapidly and frequently to the least critical segments, while moving more cautiously with the most critical ones.
This means that if a change deployed to the Workers runtime system were to break traffic, it would now only impact a small percentage of our free customers before being automatically detected and rolled back.
Using the Workers runtime system as a case study: in a single seven-day period earlier this month, the deployment process was triggered over 50 times, with each deployment rolling out in “waves” as the change spread to the edge, often running in parallel with both the previous and next releases.
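As a rough illustration of how tier-ordered waves work (with hypothetical cohort names, tiers, and soak times, not our production values), a deployment plan can be as simple as sorting cohorts from least to most critical and pausing to observe health between waves:

```rust
// An illustrative sketch of tier-ordered, wave-based deployment. The cohort
// names, tiers, and soak times are hypothetical, not production values.

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Tier {
    Free,       // least critical: receives changes first and fastest
    Paid,
    Enterprise, // most critical: receives changes last and slowest
}

pub struct Cohort {
    pub name: &'static str,
    pub tier: Tier,
    pub soak_minutes: u32, // how long to watch health before the next wave
}

/// Order waves from least to most critical traffic, so a bad release is
/// detected and rolled back before it can touch the critical cohorts.
pub fn deployment_waves(mut cohorts: Vec<Cohort>) -> Vec<Cohort> {
    cohorts.sort_by_key(|c| c.tier);
    cohorts
}

fn main() {
    let waves = deployment_waves(vec![
        Cohort { name: "enterprise-traffic", tier: Tier::Enterprise, soak_minutes: 240 },
        Cohort { name: "free-traffic", tier: Tier::Free, soak_minutes: 15 },
        Cohort { name: "paid-traffic", tier: Tier::Paid, soak_minutes: 60 },
    ]);
    for (wave, cohort) in waves.iter().enumerate() {
        println!("wave {}: {} (soak {} min)", wave + 1, cohort.name, cohort.soak_minutes);
    }
}
```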
We’re working toward applying this rapid deployment approach to many more of our systems going forward.
Revised “break glass” and incident management procedures
What this means for you: Should an incident occur, we now have better tools and trained teams to communicate more effectively and resolve issues quickly, reducing downtime.
Cloudflare runs on its own infrastructure. We rely on our own Zero Trust products to secure operations, but this creates a dependency risk: if a network-wide outage affects these tools, we lose the very pathways we need to restore service. Before the Code Orange initiative, our emergency “break glass” paths were accessible to only a few individuals and provided limited tool access. We needed these pathways to be more broadly available when outages strike.
To address this, we carried out a thorough audit of the tools critical for system monitoring, debugging, and making production changes. We then built backup authorization pathways for 18 essential services, supported by new emergency scripts and proxy systems.
Throughout the Code Orange program, we turned plans into practice. After running exercises with small teams, we held a company-wide engineering drill on April 7, 2026, with over 200 participants. While automation keeps these pathways ready, drills like these ensure our engineers can use them confidently under real pressure.
This effort also focused on improving information flow. When internal visibility breaks down, our incident response slows, and our ability to communicate with customers suffers. In the past, real-time technical observations didn’t always translate into clear, timely updates for those relying on our services.
To close this gap, we created a dedicated communications team that works alongside incident responders during major events. Just as engineers drilled their “break glass” procedures, this team used the Code Orange program to practice streamlining the timing and clarity of customer updates. By ensuring we have both the tools to diagnose issues and the structure to communicate clearly, we can resolve incidents faster and keep customers better informed.
We have documented our improvements
What this means for you: We commit to learning from every incident and formalizing the fixes. Our network will continue to grow more resilient.
To prevent backsliding and avoid reintroducing issues that Code Orange addressed, the team has created an internal Codex that captures all our engineering guidelines in clear, concise rules.
The Codex is now required reading for every engineering and product team and has become a core part of Cloudflare’s internal standards. Its rules are enforced through AI-powered code reviews that automatically flag any deviation from the guidelines, triggering additional manual reviews. This applies uniformly across our entire codebase. Our goal is straightforward: build institutional memory that enforces itself.
The November and December outages shared a common root cause: code that assumed inputs would always be valid, with no graceful fallback when that assumption failed. A Rust service called .unwrap() instead of properly handling an error; Lua code indexed a nonexistent object. Both patterns are entirely preventable if lessons are properly captured and enforced.
The Codex is one key part of our response. It’s a living collection of engineering standards authored by domain experts through our Request for Comments (RFC) process, then refined into actionable guidelines. Best practices that previously existed only in the minds of senior engineers—or were discovered only after an incident occurred—are now shared knowledge accessible to all. Each rule follows a simple format: “If you need X, use Y,” with a link to the RFC explaining the reasoning.
For instance, one RFC now states: “Do not use .unwrap() outside of tests and `build.rs`.” Another captures a broader principle: “Services MUST verify that upstream dependencies are in an expected state before processing.”
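As an illustration of what that first rule asks for, here is a small, hypothetical Rust example. The non-compliant version panics on malformed input; the compliant version verifies the input and degrades to a safe default instead.

```rust
// An illustration of the kind of change the .unwrap() rule asks for. The
// function and type names are hypothetical; only the error-handling shape
// matters here.

#[derive(Default)]
pub struct FeatureConfig {
    pub rules: Vec<String>,
}

pub fn parse_config(raw: &str) -> Result<FeatureConfig, String> {
    // Verify the upstream input is in an expected state before using it.
    if raw.trim().is_empty() {
        return Err("empty configuration file".to_string());
    }
    Ok(FeatureConfig {
        rules: raw.lines().map(String::from).collect(),
    })
}

// Non-compliant: panics on bad input, which can take the whole service down.
//
//     pub fn load(raw: &str) -> FeatureConfig {
//         parse_config(raw).unwrap()
//     }

// Compliant: reject the bad input and degrade to a safe default instead of
// panicking, keeping the service up.
pub fn load(raw: &str) -> FeatureConfig {
    match parse_config(raw) {
        Ok(config) => config,
        Err(reason) => {
            eprintln!("rejecting new configuration: {reason}; keeping defaults");
            FeatureConfig::default()
        }
    }
}
```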
Had these rules been enforced beforehand, the November and December outages would have been caught as rejected merge requests rather than becoming global incidents.
Rules without enforcement are just suggestions. The Codex integrates with AI-powered agents throughout the full software development lifecycle, from design review through deployment to incident analysis. This shifts enforcement earlier in the process—from “global outage” to “rejected merge request.” The impact of a violation shrinks from millions of affected requests down to a single developer receiving actionable feedback before their code ever reaches production.
The Codex is a continuously evolving resource. Domain experts write RFCs to capture best practices. Incidents reveal gaps that become new RFCs. Every approved RFC generates Codex rules. Those rules inform the agents that review the next merge request. It’s a virtuous cycle: expertise becomes standards, standards become enforcement, and enforcement raises the bar for everyone.
It’s not just about code: clear communication matters
What this means for you: Transparency is a priority for us. If something goes wrong, we’re committed to keeping you informed at every stage so you can stay focused on what matters most to you.
The global outages prompted us to re-examine core processes and cultural approaches well beyond engineering and product development. As part of the broader Code Orange effort, we’ve added service-level objectives (SLOs) for every service, rolled out a mandatory global changelog, brought all teams onto our maintenance coordination system, and improved company-wide transparency around our backlog of incident-prevention tickets.
We’ve also strengthened how we communicate with customers during an outage. Our aim is to notify you of an issue the moment we confirm it—before you even notice a problem. By the time you see a lag or an error, we want an update already waiting for you.
During an active incident, we now provide updates at predictable intervals (for example, every 30 or 60 minutes), even if the update is simply, “We are still testing the fix; no new changes yet.” This lets you plan your day instead of constantly refreshing a status page.
Our responsibility doesn’t end when service returns to normal. We publish detailed post-mortems explaining what happened, why it happened, and the specific structural changes we’re making to prevent recurrence.
This initiative is complete, but our work on resilience never stops
We take these incidents very seriously, and we have fostered a culture of shared ownership across the entire Cloudflare organization by asking every team: What could we have done better? This question guided the work we carried out over the past two quarters.
While this work is never truly finished, we’re confident that we’re in a much stronger position today, and Cloudflare is more resilient because of it.