The Surprising Networking Choices Powering OpenAI’s 131,000-GPU Training Fabric

Deliberately tolerate packet loss. Distribute every data transfer across hundreds of random routes. If someone presented you with this set of design choices for a network linking 131,000 GPUs, you’d think it was drafted by someone with zero experience running a production network.

A group formed by OpenAI, AMD, Broadcom, Intel, Microsoft, and NVIDIA constructed precisely this — and quietly overturned thirty years of established thinking about how high-performance data center networks should be built.

The protocol is named MRC, which stands for Multipath Reliable Connection. It was published on May 5, 2026 via the Open Compute Project. The supporting research paper (Araujo et al., 2026) describes its rollout across OpenAI’s largest NVIDIA GB200 supercomputer clusters, including the Stargate facility with Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft’s Fairwater supercomputers. MRC has already been used to train the most recent frontier models powering ChatGPT and Codex.

What stands out most upon a careful reading of the paper is something the media coverage has overlooked: MRC effectively removes the entire Layer 3 control plane from the data center fabric. No OSPF. No BGP. No IS-IS. No FIB. The switches in the deployment carry zero dynamic forwarding state. To the author’s knowledge, this represents the most thorough removal of dynamic routing in any production AI training fabric that has been publicly described to date.

The paper’s central claim is that at scales beyond 100,000 GPUs, tail latency caused by network congestion and failures becomes the dominant factor limiting training performance, and the traditional networking stack cannot address this without fundamentally rethinking how packets travel between GPUs. MRC embodies those fundamental changes, realized in 800 Gb/s NICs from three different chip vendors and running in production.

What makes MRC worth examining closely is not merely its speed. It is that the design choices behind it go against several principles the networking field has long considered settled. Grasping why those choices succeed at this scale — and where they might fall short — is important for anyone designing or managing AI infrastructure.

Figure 1. The failure cascade that MRC eliminates.
Left: conventional RoCE with single-path routing. A congested T1 link triggers PFC PAUSE that propagates backward, blocking GPU 2 even though its own path was clear. All 100,000 GPUs idle until GPU 2’s transfer completes. Right: MRC sprays packets across 8 independent planes. When a link fails in Plane 2, the NIC retires that entropy value and redistributes traffic to the remaining 7 planes in microseconds. No GPU ever stalls. The five numbered design decisions at the bottom are the subject of this article.
[Source: Author (SVG created using Inkscape). Reference: arXiv:2605.04333 (2026)]

Each of MRC’s design choices is individually recognizable to anyone who has kept up with networking research. What is radical is how they are combined. The networking community has investigated every one of these concepts separately — multi-plane fabrics, source routing, packet spraying, lossy transports with selective retransmission, ECN as a load-balancing signal. What makes MRC worth serious attention is that the OpenAI consortium committed to all of them at once, in production, across 131,000 GPUs.

The problem: one slow transfer stalls 100,000 GPUs

Synchronous pretraining operates in lock-step. Every training step requires millions of data exchanges among thousands of GPUs carrying out a mix of tensor parallelism, pipeline parallelism, data parallelism, and expert parallelism. The step cannot proceed until the slowest transfer finishes. At 100,000 GPUs, the length of each communication round is governed by the tail of the transfer latency distribution, not the average.
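A short calculation shows why the tail, rather than the average, sets the step time. This is a sketch under a deliberately simple assumption: every transfer independently hits a slow event with the same small probability. The probability and the transfer counts are illustrative values, not figures from the paper.

```python
def prob_step_delayed(num_transfers: int, p_slow: float) -> float:
    """Probability that at least one of num_transfers independent
    transfers is slow, which is enough to delay the whole step."""
    return 1.0 - (1.0 - p_slow) ** num_transfers


if __name__ == "__main__":
    p_slow = 1e-6  # assumed per-transfer chance of hitting a slow path
    for n in (1_000, 100_000, 10_000_000):
        p = prob_step_delayed(n, p_slow)
        print(f"{n:>10,} transfers -> P(step delayed) = {p:.4f}")
```

Under these assumed numbers, an event that is one-in-a-million per transfer delays roughly one step in ten at 100,000 transfers and nearly every step at ten million.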

The paper states this clearly: “As computations scale, communication becomes increasingly outlier-dominated.” A single congested link, a single flow collision, a single switch buffer overflow can hold up thousands of GPUs for milliseconds. At the hourly expense of 100,000 H100-class GPUs (approximately $300,000 per hour at cloud pricing), a 10-millisecond delay occurring once per training step and repeating over thousands of steps is not a minor rounding error. It is a significant cost.
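The arithmetic behind that estimate is easy to reproduce. The sketch below uses the article's $300,000 per hour figure; the number of training steps is a free parameter chosen for illustration, not a value from the paper.

```python
CLUSTER_COST_PER_HOUR = 300_000.0  # USD, the article's estimate for 100,000 GPUs


def stall_cost(stall_seconds: float, num_steps: int = 1) -> float:
    """Cost in USD of the whole cluster idling for stall_seconds on each of num_steps steps."""
    cost_per_second = CLUSTER_COST_PER_HOUR / 3600.0
    return cost_per_second * stall_seconds * num_steps


if __name__ == "__main__":
    print(f"one second idle:      ${stall_cost(1.0):,.2f}")
    print(f"one 10 ms stall:      ${stall_cost(0.010):,.2f}")
    # Assumed run lengths, for illustration only.
    for steps in (10_000, 100_000, 1_000_000):
        print(f"10 ms x {steps:>9,} steps: ${stall_cost(0.010, steps):,.0f}")
```

One idle second costs about $83, so a single 10 ms stall costs under a dollar; the total depends on how many steps the stall repeats across.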

Network failures make things worse. At this scale, link flaps, optic failures, and switch reboots are not unusual incidents. They are statistical inevitabilities happening multiple times daily across a fabric containing hundreds of thousands of links. The paper recounts a production event where an optical transceiver on a T0 switch “suffered a glitch, and flapped all its four links in rapid succession,” impacting three active training nodes at the same time. In a traditional network, this would have crashed the training job.

MRC’s design objective was not simply higher bandwidth. It was dependable bandwidth, even when failures occur, paired with a control plane simple enough that a small team can oversee multiple supercomputers at once.

The topology: 131,000 GPUs across two switch tiers

The first design choice is architectural rather than protocol-level. Instead of treating an 800 Gb/s NIC as a single high-capacity pipe, MRC divides it into eight 100 Gb/s links, each attached to a different switch. This produces eight independent network planes, each functioning on its own.

Consider a traditional setup. Today’s fastest data center Ethernet switches provide 51.2 Tb/s of switching capacity, delivering 64 ports at 800 Gb/s. In a standard fat-tree Clos topology, each Tier-0 (T0) switch connects downward to 32 NICs and upward to 32 Tier-1 (T1) switches. Each T1 switch in turn connects upward to 32 Tier-2 (T2) switches, and each T2 switch links to 64 pods. That results in a 3-tier network supporting roughly 64,000 GPUs (65,536 at the theoretical limit) at full bisection bandwidth. To reach 100,000, a fourth tier is required, which increases latency, cost, and failure domains.

Now split the NIC. The same 51.2 Tb/s switch at 100 Gb/s per port yields 512 ports instead of 64. Each T0 switch connects downward to 256 NIC ports and upward to 256 T1 switches. Each T1 connects to 512 T0s. A single two-tier plane supports 131,072 GPUs at full bisection bandwidth.

The paper quantifies the savings:

Conventional 3-tier (800 Gb/s):
  - 3 switch tiers, 64-port switches
  - Max ~64K GPUs at full bisection BW
  - 5-hop or 7-hop worst-case path

Multi-plane 2-tier (8 × 100 Gb/s):
  - 2 switch tiers, 512-port switches
  - 131K GPUs at full bisection BW
  - 3-hop worst-case path
  - 2/3 the optics of a 3-tier network
  - 3/5 the number of switches
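The port counts, GPU counts, and the 3/5 switch ratio above can be checked with standard fat-tree arithmetic. The sketch below assumes textbook Clos formulas (half of each lower-tier switch's ports face down, half face up) and compares switch counts per GPU; the function names are illustrative and not taken from the paper.

```python
from fractions import Fraction

SWITCH_CAPACITY_GBPS = 51_200  # 51.2 Tb/s switch chip, as in the article
PLANES = 8                     # 800 Gb/s NIC split into 8 x 100 Gb/s links


def radix(port_speed_gbps: int) -> int:
    """Number of switch ports available at a given port speed."""
    return SWITCH_CAPACITY_GBPS // port_speed_gbps


def two_tier_hosts(k: int) -> int:
    """Hosts in a 2-tier Clos: k T0 switches, each with k/2 host-facing ports."""
    return k * (k // 2)


def two_tier_switches(k: int) -> int:
    """k T0 switches plus k/2 T1 switches."""
    return k + k // 2


def three_tier_hosts(k: int) -> int:
    """Hosts in a classic 3-tier fat-tree built from k-port switches."""
    return k ** 3 // 4


def three_tier_switches(k: int) -> int:
    """k pods of k switches each, plus (k/2)^2 core switches."""
    return k * k + (k // 2) ** 2


def switch_ratio() -> Fraction:
    """Switches per GPU, multi-plane 2-tier divided by conventional 3-tier."""
    k100, k800 = radix(100), radix(800)
    multi_plane = Fraction(PLANES * two_tier_switches(k100), two_tier_hosts(k100))
    conventional = Fraction(three_tier_switches(k800), three_tier_hosts(k800))
    return multi_plane / conventional


if __name__ == "__main__":
    print("ports at 800G:", radix(800), "| ports at 100G:", radix(100))
    print("3-tier hosts:", three_tier_hosts(radix(800)))
    print("2-tier hosts per plane:", two_tier_hosts(radix(100)))
    print("switch ratio per GPU:", switch_ratio())
```

The 2/3 optics figure appears consistent with the same picture if one counts link layers of equal aggregate bandwidth: two layers (NIC to T0, T0 to T1) against three in the conventional design.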

Figure 2. Traditional 3-tier fat-tree compared to MRC 2-tier multi-plane design. Both setups use identical 51.2 Tb/s switch chips. The traditional method uses 64 ports running at 800 Gb/s, needs three tiers, and tops out at around 64,000 GPUs. MRC divides each NIC into 8 separate 100 Gb/s connections, building 8 independent two-tier Clos networks that can handle 131,072 GPUs while using fewer switches and fewer optical components. The red dashed line on the left shows the longest possible path at 7 hops. On the right side, no path exceeds 3 hops.
[Source: Author (SVG created using Inkscape). Reference: arXiv:2605.04333 (2026)]

Putting it all together: the MRC data transfer

A single MRC data transfer works as follows. The sender’s GPU issues a collective operation (e.g., an AllReduce). The MRC NIC translates this into a set of unicast RDMA writes, each tagged with a different entropy value. Each write is source-routed over the SRv6 multi-plane fabric. Switches forward packets statelessly based on the segment list in the packet header. The receiver’s NIC writes each packet directly to its final memory location, regardless of order. ECN marks guide the sender to avoid congested paths. Selective retransmission and packet trimming handle any drops. The collective completes when all packets arrive and the GPU is notified.
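The receiver-side behaviour described above (each packet lands at its final memory location, whatever order it arrives in) can be illustrated with a toy model. This is a minimal sketch, not MRC's wire format: the `Packet` fields, the round-robin EV assignment, and the helper names are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    seq: int        # sequence number, used to detect missing packets
    ev: int         # entropy value selecting a path (hypothetical layout)
    offset: int     # byte offset in the destination buffer
    payload: bytes


def packetize(message: bytes, mtu: int, num_evs: int) -> list[Packet]:
    """Split a message into packets and spread them across EVs round-robin."""
    packets = []
    for seq, offset in enumerate(range(0, len(message), mtu)):
        chunk = message[offset:offset + mtu]
        packets.append(Packet(seq, seq % num_evs, offset, chunk))
    return packets


def receive(packets: list[Packet], total_len: int) -> tuple[bytes, set[int]]:
    """Write each packet straight to its offset; no reorder buffer is needed."""
    buf = bytearray(total_len)
    seen = set()
    for p in packets:
        buf[p.offset:p.offset + len(p.payload)] = p.payload
        seen.add(p.seq)
    return bytes(buf), seen


def missing_seqs(seen: set[int], expected: int) -> set[int]:
    """Sequence numbers a selective retransmission would have to resend."""
    return set(range(expected)) - seen


if __name__ == "__main__":
    message = bytes(range(256)) * 10
    packets = packetize(message, mtu=100, num_evs=8)
    random.Random(0).shuffle(packets)        # simulate out-of-order arrival
    received, seen = receive(packets, len(message))
    print("reassembled correctly:", received == message)
    print("missing:", missing_seqs(seen, len(packets)))
```

Because every packet carries its own destination offset, arrival order does not matter, and a lost packet shows up only as a gap in the set of sequence numbers seen.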

The paper reports that this design achieves 94% of the theoretical bisection bandwidth on a 10,000-GPU cluster, compared to 30–50% for conventional RoCE-based designs. Tail latencies are reduced by 10x, and the network operates without PFC, without dynamic routing protocols, and without per-flow state in switches.

The key insight is that by moving intelligence to the endpoints (NICs) and simplifying the network fabric to stateless forwarding, MRC eliminates the root causes of tail latency in large-scale GPU clusters. The network becomes a dumb, fast, predictable pipe.

This is a radical departure from decades of networking orthodoxy. It works because the consortium designed the entire stack (NIC, switch, and protocol) as a single system, rather than bolting new features onto existing standards.

Figure 8. Failure recovery comparison. Top: conventional RoCE with dynamic routing. A link failure triggers OSPF/BGP reconvergence across all switches, taking 1-30 seconds. During this time, all 100,000 GPUs are idle, NCCL timeouts become likely, and the training job may need to restart. At $300K/hour for a 100,000-GPU cluster, each second of idle time costs $83. Bottom: MRC with static SRv6. The NIC detects the loss via SACK within microseconds, retires the affected entropy value, and redistributes traffic to the remaining planes. No routing protocol needs to converge. The timescale is zoomed 1,000× to show the microsecond-level response. The training step completes without interruption.
[Source: Author (SVG created using Inkscape). Reference: arXiv:2605.04333 (2026)]

What the production evidence shows

The paper presents findings from two settings: production frontier model training and controlled testbed experiments.

In real-world production use, MRC enabled training jobs to survive network disruptions that would have previously caused them to fail. The paper details the optical transceiver glitch described earlier: four links went down in quick succession across three active training nodes. MRC identified the path failures, stopped routing traffic through the affected EVs, and shifted the load across the remaining available paths. The training job kept running without any interruption. In a standard RoCE setup, this same event would have caused PFC storms, NCCL timeouts, and a job restart that would waste hours of GPU compute time.
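The recovery behaviour in that incident (stop using the affected EVs, keep spraying over the rest) can be sketched as a small piece of sender-side state. This is an illustrative model only: the class name, the round-robin policy, and the way a failure is reported are assumptions, not the NIC's actual logic.

```python
class EVManager:
    """Toy model of a sender that sprays over a set of entropy values (EVs)."""

    def __init__(self, num_evs: int = 256):
        self.active = list(range(num_evs))  # EVs currently considered healthy
        self._next = 0                      # round-robin cursor

    def next_ev(self) -> int:
        """Pick the EV for the next packet, cycling over healthy EVs."""
        if not self.active:
            raise RuntimeError("no healthy paths left")
        ev = self.active[self._next % len(self.active)]
        self._next += 1
        return ev

    def retire(self, ev: int) -> None:
        """Stop using an EV whose path has been reported as failed."""
        if ev in self.active:
            self.active.remove(ev)


if __name__ == "__main__":
    manager = EVManager(num_evs=8)
    manager.retire(2)  # assume loss detection flagged the path behind EV 2
    print([manager.next_ev() for _ in range(14)])
```

Nothing outside the sender has to change for the traffic to move: retiring an EV is a local decision, which is why no routing protocol needs to converge.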

The testbed experiments measure MRC’s key performance characteristics:

Point-to-point bandwidth: MRC reaches near-line-rate throughput on 800 Gb/s links when using packet spraying. The paper includes a comparison with standard RoCE, highlighting MRC’s advantage in multi-path scenarios.

Link failure recovery: when a link drops, MRC detects the failure and shifts traffic within tens of microseconds. There are no sender-side timeouts and no routing protocol convergence needed. The EV tied to the failed path is immediately retired, and the remaining EVs take over the traffic.

Load balancing across EVs: the paper measures how traffic is spread across planes and paths, demonstrating near-even utilization under production workloads.

NCCL collective performance at scale: the paper assesses MRC’s performance on all-reduce operations, which are the primary communication pattern in data-parallel training. MRC’s packet spraying removes the flow-collision issue that degrades all-reduce performance at scale when using conventional ECMP hashing.
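The flow-collision issue with ECMP hashing is a birthday-problem effect, and a short calculation makes it concrete. The sketch assumes an idealized hash that places each flow on a path uniformly and independently; the flow and path counts are illustrative, not measurements from the paper.

```python
def prob_any_collision(num_flows: int, num_paths: int) -> float:
    """Probability that at least two flows hash onto the same path."""
    if num_flows > num_paths:
        return 1.0  # pigeonhole: some path must carry two flows
    p_none = 1.0
    for i in range(num_flows):
        # The (i+1)-th flow must avoid the i paths already taken.
        p_none *= (num_paths - i) / num_paths
    return 1.0 - p_none


if __name__ == "__main__":
    for flows, paths in ((8, 256), (32, 256), (64, 256)):
        p = prob_any_collision(flows, paths)
        print(f"{flows:>3} flows over {paths} paths -> P(collision) = {p:.3f}")
```

A collision means two large flows share one link while other links sit idle. Spraying each transfer's packets over many paths avoids pinning a whole flow to a single hashed path.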

The operational data supports the choice of static routing. The paper notes that T1 core switches were rebooted during live training runs without disrupting the job. In a traditional network using dynamic routing, rebooting a core switch triggers reconvergence across the entire fabric. With static SRv6, the switch simply reloads its static forwarding state and resumes forwarding. MRC’s transport layer managed the temporary loss of paths through that switch by shifting traffic to other planes.

Where these design decisions are strongest

MRC was built for a particular workload profile: synchronous pretraining with all-reduce-dominated communication, running on a single-tenant fabric with full bisection bandwidth. Within these parameters, the three core design choices align well with the problem:

Static routing is effective because the topology is fixed and known before deployment. Training clusters don’t add or remove switches mid-run. The failure modes are link-level (managed by MRC’s EV management), not topology-level (which would demand routing protocol reconvergence).

Lossy Ethernet is viable because the selective retransmission and packet trimming mechanisms recover faster than PFC pause frames can propagate. The cross-collective head-of-line blocking created by PFC is more harmful to tail latency than the occasional cost of retransmission.

ECN-as-load-balancing is effective because the multi-plane topology delivers full bisection bandwidth, meaning aggregate congestion doesn’t occur. Local imbalances are the only source of congestion, and ECN-guided EV avoidance is a precise, low-overhead way to smooth them out.
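ECN-guided EV avoidance can be pictured as a per-EV score that rises when acknowledgements report ECN marks and decays otherwise. The sketch below is a hypothetical policy for illustration: the decay factor, the threshold, and the class itself are assumptions and do not describe MRC's actual algorithm.

```python
class EcnAwareSelector:
    """Toy selector that steers packets away from EVs with recent ECN marks."""

    def __init__(self, num_evs: int, decay: float = 0.5, threshold: float = 0.25):
        self.score = [0.0] * num_evs  # recent congestion estimate per EV
        self.decay = decay            # assumed smoothing factor
        self.threshold = threshold    # assumed cutoff for "looks congested"
        self._rr = 0                  # round-robin cursor

    def on_ack(self, ev: int, ecn_marked: bool) -> None:
        """Update an EV's score from one acknowledgement."""
        mark = 1.0 if ecn_marked else 0.0
        self.score[ev] = self.score[ev] * self.decay + mark

    def pick(self) -> int:
        """Round-robin over EVs whose score is below the threshold."""
        n = len(self.score)
        eligible = {ev for ev in range(n) if self.score[ev] < self.threshold}
        if not eligible:
            eligible = set(range(n))  # everything looks congested; spray anyway
        for step in range(n):
            ev = (self._rr + step) % n
            if ev in eligible:
                self._rr = (ev + 1) % n
                return ev
        raise AssertionError("unreachable: eligible is never empty")


if __name__ == "__main__":
    selector = EcnAwareSelector(num_evs=4)
    selector.on_ack(1, ecn_marked=True)       # EV 1 reports congestion
    print([selector.pick() for _ in range(6)])
    for _ in range(3):
        selector.on_ack(1, ecn_marked=False)  # congestion clears over time
    print([selector.pick() for _ in range(4)])
```

The point of the sketch is that the signal is used to choose paths rather than to slow the sender down, which only makes sense when the fabric has spare capacity elsewhere.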

Figure 9. How MRC works at the GPU level. The GPU node (left) consists of a Blackwell-class GPU plus 192 GB of HBM3e, connected to the MRC NIC via PCIe Gen5 or NVLink-C2C. The MRC NIC contains four key modules: transport (QP + SACK), SRv6 path encoder, EV manager (256 EVs per QP), and the RDMA engine. The NIC’s single 800G port is broken out into 8 × 100G links, one per network plane. A typical MRC packet (right top) carries the SRv6 path, the entropy value, the RDMA header (vaddr + rkey), the sequence number, and the payload. The 4-step data flow (right middle) shows how a collective operation becomes a sprayed write across 8 planes. At the receiver (right bottom), packets arrive out of order and write directly to their target HBM offsets with no reorder buffer. The GPU never sees the network — it issues memory operations, and the NIC handles everything else.
[Source: Author (SVG created using Inkscape). Reference: arXiv:2605.04333 (2026)]

The boundary conditions: where MRC works and where it doesn’t

MRC is a production-proven protocol for its intended workload. The natural questions for the broader AI infrastructure community revolve around its limits.

First, multi-tenancy. OpenAI’s training clusters run a single training job at a time across the entire fabric. Most cloud providers and enterprise deployments share GPU clusters across many workloads. MRC’s static routing assumes a stable topology database at the NIC level. In a multi-tenant setting where workloads are placed dynamically, the topology each NIC sees changes frequently. Whether MRC’s path-generation logic can adapt to this or needs modifications remains an open engineering question.

Second, inference workloads. MRC was designed for synchronous training’s all-reduce pattern: large bulk transfers between known sets of GPUs. Inference workloads — especially disaggregated inference with KV cache transfers between prefill and decode pools — have a different communication profile: smaller transfers, point-to-point rather than collective, and latency-sensitive at the per-request level rather than the aggregate step level. Packet spraying across hundreds of paths introduces jitter into individual transfer latency, which may or may not be acceptable depending on the SLO requirements.

Third, oversubscribed networks. MRC’s ECN-based load-balancing approach depends on having full bisection bandwidth available. In oversubscribed networks—typical in cloud settings where cost efficiency shapes topology choices—overall congestion is a real issue, not merely a local imbalance. In such cases, ECN must act as a true congestion indicator, which alters how MRC’s flow control behaves.

Fourth, interoperability. MRC currently runs on particular NIC hardware (NVIDIA ConnectX-8, AMD Pollara/Vulcano, Broadcom Thor Ultra) and specific switch platforms (NVIDIA Spectrum-4/5, Arista EOS on Broadcom Tomahawk 5). The OCP specification release allows for wider adoption, but developing and validating silicon-level protocol support requires 12–18 months. In the near term, only organizations using these specific hardware platforms will be able to adopt it.

These aren’t criticisms of MRC. They are the natural engineering questions that emerge when a protocol built for a tightly controlled environment encounters the variety of the broader infrastructure market. MRC’s success in solving tail latency at 131,000-GPU scale is a real accomplishment. The key question for the wider community is which of its design choices apply broadly and which are unique to the constraints of single-tenant, full-bisection-bandwidth training fabrics.

What MRC signals about the future of AI networking

MRC reflects a larger shift in how AI infrastructure approaches networking. The traditional model treats the network as a transparent pipe: packets enter one end and exit the other, and the transport protocol’s role is to fill that pipe as efficiently as possible. MRC treats the network as a managed resource with visible, per-path health indicators that the transport protocol actively uses.

This concept isn’t new in networking research. Multipath TCP, Valiant load balancing, and ECMP have explored similar ideas for years. What’s new is the scale at which MRC operates, the boldness of its design choices (no PFC, no dynamic routing, full packet spraying), and the real-world evidence that it works on the largest AI training clusters globally.

For networking professionals, MRC confirms a theory debated for years: at large enough scale, smarter endpoints beat smarter networks. Making the NIC more capable and the switch simpler creates a more robust system than making the switch more capable and the NIC simpler. Regardless of whether you agree with every design choice, the production results from OpenAI and Microsoft make this case harder to ignore than it was a week ago.

The MRC specification is available through OCP under an open license. The research paper includes detailed experimental findings. For anyone building GPU clusters at scale, both deserve careful attention. The three rules MRC breaks might be the same three rules limiting your network’s performance.
