Cloudflare's Swift Defense Against The Critical "Copy Fail" Linux Flaw

On April 29, 2026, a Linux kernel local privilege escalation vulnerability was publicly disclosed under the name “Copy Fail” (CVE-2026-31431). As soon as it was disclosed, Cloudflare’s Security and Engineering teams began assessing the vulnerability. We reviewed the exploit technique, evaluated its impact on our infrastructure, and confirmed that our existing behavioral detections could spot the exploit within minutes.

There was no impact to the Cloudflare environment, no customer data was exposed, and all services continued to operate without interruption. This post explains how our preparedness made that possible.

Our approach to Linux kernel releases

Cloudflare manages a massive global Linux server infrastructure spread across more than 330 cities. We use a customized Linux kernel build based on the community’s Long-Term Support (LTS) versions to efficiently handle updates at this scale. Currently, we use multiple LTS versions from series such as 6.12 and 6.18, both of which receive long-term support.

The community continually merges and publishes security and stability fixes, which automatically trigger a new internal kernel build every week. These builds undergo thorough testing in our staging data centers to verify reliability before being deployed globally. Once a release is greenlit, our Edge Reboot Release (ERR) pipeline handles the rollout, rebooting edge infrastructure systematically over a four-week cycle. Control plane infrastructure is kept on the latest kernel, with reboots planned to suit workload demands.

By the time a CVE is made public, the fix has usually been merged into stable Linux LTS releases for several weeks. Our processes mean that, in most cases, we have already deployed those patches by then.

When “Copy Fail” was disclosed, most of our systems were already on LTS version 6.12, while others had started migrating to the newer 6.18 LTS release.

Understanding the Copy Fail vulnerability

Before diving into our response, it helps to first grasp the vulnerability. You can find a detailed write-up in the original Xint Code disclosure post.

AF_ALG and the kernel crypto API

The Linux kernel’s built-in crypto API underpins features like kTLS and IPsec. Userspace applications can reach it through the AF_ALG socket family, which lets unprivileged processes request encryption or decryption operations. The algif_aead module handles this for Authenticated Encryption with Associated Data (AEAD) ciphers.

An unprivileged program typically does the following:

Opens an AF_ALG socket and binds it to an AEAD template.
Sets an encryption key and accepts a request socket.
Sends input data through sendmsg() or splice().
Performs the operation by calling recvmsg().
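
The four steps above can be sketched in Python, whose socket module exposes AF_ALG on Linux. This is a minimal illustration of ordinary, legitimate use of the interface, not code from the original post: the gcm(aes) template, the key and IV sizes, and the function name afalg_aead_encrypt are assumptions chosen for the example, and Python's recv() stands in for recvmsg().

```python
import socket


def afalg_aead_encrypt(key, iv, assoc, plaintext, taglen=16):
    """Encrypt with the kernel's gcm(aes) over AF_ALG; return None if unavailable."""
    if not hasattr(socket, "AF_ALG"):
        return None  # not Linux, or Python was built without AF_ALG support
    try:
        # Step 1: open an AF_ALG socket and bind it to an AEAD template.
        with socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0) as alg:
            alg.bind(("aead", "gcm(aes)"))
            # Step 2: set the key and tag length, then accept a request socket.
            alg.setsockopt(socket.SOL_ALG, socket.ALG_SET_KEY, key)
            alg.setsockopt(socket.SOL_ALG, socket.ALG_SET_AEAD_AUTHSIZE,
                           None, taglen)
            op, _ = alg.accept()
            with op:
                # Step 3: send the input (associated data, then plaintext).
                op.sendmsg_afalg([assoc + plaintext],
                                 op=socket.ALG_OP_ENCRYPT,
                                 iv=iv, assoclen=len(assoc))
                # Step 4: receiving triggers the operation and returns the result.
                return op.recv(len(assoc) + len(plaintext) + taglen)
    except (OSError, AttributeError):
        # The interface is missing, blocked, or lacks a constant on this host.
        return None


result = afalg_aead_encrypt(bytes(16), bytes(12), b"header", b"hello world")
print("AF_ALG unavailable" if result is None else f"{len(result)} bytes returned")
```

On a host where the bind is denied or the module is unavailable, the function simply returns None instead of raising.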

The splice() system call is a key ingredient here, since it transfers data by sharing page cache references rather than copying data.

Memory mechanics: page cache and in-place crypto

The page cache is a shared system cache that stores file contents. Changing a page linked to a setuid binary effectively alters that program for everyone until the page is evicted from the cache.

The crypto API relies on scatterlists, which are structures that tie together different memory pages. In 2017, algif_aead was tuned for in-place operations by linking destination and source pages together. This approach lacked guardrails to stop algorithms from writing beyond the correct boundaries.

The vulnerability: out-of-bounds write

When the user calls recvmsg(), the authencesn wrapper in the kernel carries out a 4-byte write that extends beyond the designated output region:

scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1);

By leveraging splice(), an attacker can connect a target file’s page cache pages to the scatterlist. The out-of-bounds write then corrupts the cached file, giving the attacker full control over which file gets modified, at what offset, and which 4 bytes are written.
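
To make the mechanics concrete, here is a toy Python model of a write that lands just past an output region and into whatever buffer is chained after it. It is not kernel code and cannot affect any real file: scatter_write, the buffer sizes, and the byte values are all hypothetical stand-ins for the scatterlist walk described above.

```python
def scatter_write(chain, offset, data):
    """Copy data into a chain of buffers at a logical offset.

    Like a scatterlist walk, the offset counts across buffer boundaries,
    so a write that starts past one buffer lands in the next one.
    """
    for i, byte in enumerate(data):
        pos = offset + i
        for buf in chain:
            if pos < len(buf):
                buf[pos] = byte
                break
            pos -= len(buf)
        else:
            raise IndexError("write runs past the end of the chain")


assoclen, cryptlen = 8, 16
output = bytearray(assoclen + cryptlen)   # the region the caller asked for
cached_page = bytearray(b"FILEDATA")      # stands in for a spliced page cache page

# The write starts exactly where the output region ends, so all 4 bytes
# land in the buffer that happens to be chained next.
scatter_write([output, cached_page], assoclen + cryptlen, b"\x01\x02\x03\x04")

print(output == bytearray(assoclen + cryptlen))  # True: output region untouched
print(cached_page)  # bytearray(b'\x01\x02\x03\x04DATA')
```

In the model, the requested output region stays intact while the first 4 bytes of the chained buffer are overwritten, which mirrors why the corruption shows up in the cached file rather than in the caller's own data.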

The exploit, step by step

Timeline

2026-04-29: Copy Fail vulnerability was publicly disclosed.
2026-04-29 ~21:00: Security and Engineering teams started evaluating fleet exposure and potential mitigations before formally initiating the Incident Response process.
2026-04-29 22:52: Security verified that existing behavioral detection mechanisms already covered the Copy Fail exploit pattern. During authorized internal testing, the system flagged the suspicious activity within minutes.
2026-04-29 23:01: The existing behavioral detection system triggered a high-severity alert for exploit-like behavior, confirming that the technique was already detectable.
2026-04-29 (evening): First mitigation attempt deployed to our staging datacenter. The deployment encountered a dependency conflict, so the change was rolled back. No production systems were impacted.
2026-04-29 (overnight): Engineering team developed a bpf-lsm mitigation program.
2026-04-30 03:14: Security incident was declared to enable cross-functional collaboration and establish urgency. Security conducted fleet-wide threat hunting across historical data to verify no malicious activity had occurred on Cloudflare systems.
2026-04-30 (morning): Engineering tested and finalized the bpf-lsm mitigation program for production deployment.
2026-04-30 14:25: Engineering incident was declared to coordinate the mitigation program and Linux patch rollout.
2026-04-30 ~17:00: Decision reached: deploy a patched build of the previous LTS kernel via reboot automation; avoid accelerating the new LTS release; rely on bpf-lsm as an interim solution.
2026-04-30 (afternoon): Visibility pipeline (eBPF tracing of AF_ALG socket usage) deployed across the entire fleet. Provides comprehensive visibility into all legitimate AF_ALG users.
2026-04-30 (evening): bpf-lsm mitigation program deployed behind a separate control gate to fully protect the fleet. End-to-end verification on a previously vulnerable test node confirmed the exploit was no longer effective.
2026-05-04 (morning): Reboot automation resumed at normal speed with the patched kernel.
2026-05-04 onward: Servers that had already completed reboot automation earlier in the week were manually rebooted to apply the patched kernel. Remaining unpatched servers updated through our standard reboot automation process.

This graph illustrates the progress of our mitigation program across our infrastructure.

Given the extended timeline required to deploy a patched Linux kernel, we also explored ways to mitigate this exploit without requiring a reboot.

The vulnerability was located in the algif_aead kernel module. The straightforward solution was to simply remove this module and prevent it from being reloaded.

This approach aligns exactly with the recommendation in the Copy Fail report from the security researchers who discovered the vulnerability.

echo "install algif_aead /bin/false" > /etc/modprobe.d/disable-algif.conf
rmmod algif_aead 2>/dev/null || true

However, removing the module would have disrupted software that depends on the kernel crypto API. This required us to develop a more targeted mitigation approach.

We had already created and deployed a tool specifically for this type of scenario: bpf-lsm. Rather than removing the module, this solution keeps it loaded for legitimate users while a BPF Linux Security Module program, attached to the socket_bind LSM hook, denies AF_ALG binds from every other binary.

For everyone else, this completely shuts the front door on the exploit.

A draft of the eBPF program was put together overnight. Team members picked it up the following morning, ran validations, and made it production-ready. The program is fairly straightforward. On every socket_bind call:

If the socket family is not AF_ALG, allow the call through unchanged.
If the family is AF_ALG, check the calling binary’s path against an allow-list of the binaries we know to be legitimate users.
If the binary is on the allow-list, allow the bind. Otherwise, deny it.
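
The decision logic above can be modeled in a few lines of Python. The production program is an eBPF program attached to the LSM hook, so this is only a sketch of the control flow: the function name socket_bind_decision and the allow-list entry are hypothetical, and the real list of legitimate binaries is not given in this post.

```python
import errno

AF_ALG = 38  # Linux address family number for kernel crypto API sockets

# Hypothetical allow-list entry, used purely for illustration.
ALLOWED_BINARIES = frozenset({"/usr/local/bin/example-crypto-service"})


def socket_bind_decision(family, binary_path, allowed=ALLOWED_BINARIES):
    """Return 0 to allow the bind, or -EPERM to deny it, as an LSM hook would."""
    if family != AF_ALG:
        return 0  # not AF_ALG: pass through unchanged
    if binary_path in allowed:
        return 0  # known legitimate user of the crypto API
    return -errno.EPERM  # every other caller is denied


print(socket_bind_decision(2, "/usr/bin/python3"))       # AF_INET: allowed
print(socket_bind_decision(AF_ALG, "/usr/bin/python3"))  # AF_ALG: denied
```

Checking the family first keeps the common case cheap: almost every bind on a server is for a network socket, and those return immediately without touching the allow-list.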

To verify the mitigation on a given machine without exploiting it, the Copy Fail write-up gives a one-liner:

python3 -c 'import socket; s = socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0); s.bind(("aead","authencesn(hmac(sha256),cbc(aes))"));'

On a mitigated machine you get PermissionError: [Errno 1] Operation not permitted (or FileNotFoundError, depending on which mitigation is active) instead of a successful bind.
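
For fleet tooling, the same check can be wrapped in a function that reports which outcome was observed. This is a sketch built on the one-liner above: the function name and the returned labels are our own illustrative choices, and a successful bind only shows that the AF_ALG AEAD interface is reachable, not that the kernel is unpatched.

```python
import socket


def check_copy_fail_mitigation():
    """Classify this host by attempting the AF_ALG bind from the one-liner."""
    if not hasattr(socket, "AF_ALG"):
        return "unsupported"  # not Linux, so the check does not apply
    try:
        with socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0) as s:
            s.bind(("aead", "authencesn(hmac(sha256),cbc(aes))"))
    except PermissionError:
        return "blocked"  # bind denied, for example by the bpf-lsm program
    except FileNotFoundError:
        return "module-unavailable"  # the AEAD interface could not be loaded
    except OSError:
        return "error"  # some other failure; inspect the host manually
    return "bind-succeeded"  # the AF_ALG AEAD interface is reachable


print(check_copy_fail_mitigation())
```

PermissionError and FileNotFoundError are both subclasses of OSError, so they have to be caught before the generic handler.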

Before enabling enforcement, we verified that our known internal service was the sole legitimate AF_ALG user to avoid accidental outages. We used prometheus-ebpf-exporter to hook the socket() syscall and track AF_ALG usage per binary across the fleet. This required no kernel changes and provided aggregate data from hundreds of thousands of servers within hours. Results confirmed the identified service was indeed the only legitimate user.

So the bpf-lsm rollout was deliberately staged in two steps:

Get visibility first. Push the ebpf-exporter config gated by salt. Confirm at the metric layer that the known service is effectively the only thing creating AF_ALG sockets.
Then enforce. Push the bpf-lsm program behind a separate enforcement gate.
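
The visibility step amounts to aggregating per-binary AF_ALG socket counts across hosts and confirming that nothing outside the allow-list appears. The sketch below assumes the exported metrics have already been collected into (host, binary, count) tuples; the sample data, host names, and binary paths are hypothetical and are not taken from our actual metrics.

```python
from collections import Counter


def unexpected_af_alg_users(samples, allowed):
    """Sum AF_ALG socket counts per binary and return the non-allow-listed ones."""
    totals = Counter()
    for _host, binary, count in samples:
        totals[binary] += count
    return {b: n for b, n in totals.items() if n > 0 and b not in allowed}


# Hypothetical samples in the form (host, binary path, sockets created).
samples = [
    ("edge-1", "/usr/local/bin/example-crypto-service", 120),
    ("edge-2", "/usr/local/bin/example-crypto-service", 95),
    ("edge-2", "/usr/bin/python3", 1),
]
allowed = {"/usr/local/bin/example-crypto-service"}

print(unexpected_af_alg_users(samples, allowed))  # {'/usr/bin/python3': 1}
```

An empty result is the signal that enforcement can be enabled without breaking a legitimate user; anything else needs investigating first.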

In parallel, the upstream backport for our majority LTS line finally became available, and our internal automation built a patched kernel against it.

We started to test the patched kernel in our staging datacenters as soon as possible, then we resumed the longer reboot process in order to fully patch our fleet.

While we were prepared for this scenario, at Cloudflare we’re always learning and improving. Key areas we identified for improvement:

Better visibility into kernel-API dependencies. We will review kernel-subsystem usage across production services, so we can continue to quickly mitigate exploits without service disruption.
Better runtime mitigation. bpf-lsm is a valuable tool for mitigations, but we want to make this tool even better. This will include looking into faster deployments, better playbooks, and better logging and visibility of the tool.
Reduce the attack surface of the Linux kernel. Review and audit our kernel configuration. Proactively identify unused modules or features so that we can remove them from our build entirely.

The “Copy Fail” vulnerability presented a unique challenge for us. Despite our practice of deploying Linux patch updates every two weeks, we remained vulnerable because a month-old mainline fix had yet to be backported to our primary kernel line. Even so, we were able to roll out patched kernels within hours of the backport’s release. In the interim, bpf-lsm provided a surgical, no-reboot mitigation that secured our fleet. Our initial attempt to disable the problematic module failed, but it failed safely within our internal staging environment rather than in production, which allowed us to identify the dependency on the module.

By the end of the rollout, every machine in our fleet was protected by either a patched kernel or a bpf-lsm program denying the vulnerable code path to non-allow-listed binaries. There was no customer impact at any point during this incident, and we have committed to the follow-up work above to make our response faster and our visibility better the next time something like this lands. Responsible disclosure works, in-kernel visibility tooling pays off in moments exactly like this one, and bpf-lsm continues to be one of the most useful primitives we have for runtime kernel mitigation.

At Cloudflare, critical vulnerability response is a coordinated effort across Security, Engineering, Product, and many other teams. Special thanks to Ali Adnan, Ivan Babrou, Frederik Baetens, Curtis Bray, Piers Cornwell, Everton Didone Foscarini, Rob Dinh, Elle Dougherty, Kevin Flansburg, Matt Fleming, Kimberley Hall, Brandon Harris, Jerry Ho, Oxana Kharitonova, Marek Kroemeke, Fred Lawler, James Munson, Nafeez Nazer, Walead Parviz, Miguel Pato, Evan Pratten, Josh Seba, June Slater, Ryan Timken, Michael Wolf, Jianxin Zeng and everyone else who contributed to the investigation, mitigation, and remediation of Copy Fail. We’d also like to thank the Linux upstream maintainers and Copy Fail researchers whose work helped make a rapid response possible.
