Deliberately tolerate packet loss. Distribute every data transfer across hundreds of random routes. If someone presented you with this set of design choices for a network linking 131,000 GPUs, you’d think it was drafted by someone with zero experience running a production network.
A group formed by OpenAI, AMD, Broadcom, Intel, Microsoft, and NVIDIA constructed precisely this — and quietly overturned thirty years of established thinking about how high-performance data center networks should be built.
The protocol is named MRC, which stands for Multipath Reliable Connection. It was published on May 5, 2026 via the Open Compute Project. The supporting research paper (Araujo et al., 2026) describes its rollout across OpenAI’s largest NVIDIA GB200 supercomputer clusters, including the Stargate facility with Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft’s Fairwater supercomputers. MRC has already been used to train the most recent frontier models powering ChatGPT and Codex.
What stands out most upon a careful reading of the paper is something the media coverage has overlooked: MRC effectively removes the entire Layer 3 control plane from the data center fabric. No OSPF. No BGP. No IS-IS. No FIB. The switches in the deployment carry zero dynamic forwarding state. To the author’s knowledge, this represents the most thorough removal of dynamic routing in any production AI training fabric that has been publicly described to date.
The paper’s central claim is that at scales beyond 100,000 GPUs, tail latency caused by network congestion and failures becomes the dominant factor limiting training performance, and the traditional networking stack cannot address this without fundamentally rethinking how packets travel between GPUs. MRC embodies those fundamental changes, realized in 800 Gb/s NICs from three different chip vendors and running in production.
What makes MRC worth examining closely is not merely its speed. It is that the design choices behind it go against several principles the networking field has long considered settled. Grasping why those choices succeed at this scale — and where they might fall short — is important for anyone designing or managing AI infrastructure.
Left: conventional RoCE with single-path routing. A congested T1 link triggers PFC PAUSE that propagates backward, blocking GPU 2 even though its own path was clear. All 100,000 GPUs idle until GPU 2’s transfer completes. Right: MRC sprays packets across 8 independent planes. When a link fails in Plane 2, the NIC retires that entropy value and redistributes traffic to the remaining 7 planes in microseconds. No GPU ever stalls. The five numbered design decisions at the bottom are the subject of this article.
[Source: Author (SVG created using Inkscape) – Reference:arXiv:2605.04333 (2026)]
Each of MRC’s design choices is individually recognizable to anyone who has kept up with networking research. What is radical is how they are combined. The networking community has investigated every one of these concepts separately — multi-plane fabrics, source routing, packet spraying, lossy transports with selective retransmission, ECN as a load-balancing signal. What makes MRC worth serious attention is that the OpenAI consortium committed to all of them at once, in production, across 131,000 GPUs.
The problem: one slow transfer stalls 100,000 GPUs
Synchronous pretraining operates in lock-step. Every training step requires millions of data exchanges among thousands of GPUs carrying out a mix of tensor parallelism, pipeline parallelism, data parallelism, and expert parallelism. The step cannot proceed until the slowest transfer finishes. At 100,000 GPUs, the length of each communication round is governed by the tail of the transfer latency distribution, not the average.
The paper states this clearly: “As computations scale, communication becomes increasingly outlier-dominated.” A single congested link, a single flow collision, a single switch buffer overflow can hold up thousands of GPUs for milliseconds. At the hourly expense of 100,000 H100-class GPUs (approximately $300,000 per hour at cloud pricing), a 10-millisecond delay occurring once per training step and repeating over thousands of steps is not a minor rounding error. It is a significant cost.
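A rough way to see why outliers dominate: if each transfer independently hits a slow event with probability p, a step with n transfers contains at least one slow transfer with probability 1 - (1 - p)^n. The sketch below is illustrative only; the one-in-a-million probability and the function name are assumptions, not figures from the paper.

```python
import math

def prob_step_delayed(n_transfers, p_slow):
    """Probability that at least one of n independent transfers is slow."""
    # log1p/expm1 keep precision when p_slow is tiny and n_transfers is huge.
    return -math.expm1(n_transfers * math.log1p(-p_slow))

# Hypothetical per-transfer slow-event probability of one in a million.
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} transfers -> P(step delayed) = {prob_step_delayed(n, 1e-6):.3f}")
```

Under these assumptions, an event that is negligible for any single transfer becomes more likely than not once a step involves a million transfers.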
Network failures make things worse. At this scale, link flaps, optic failures, and switch reboots are not unusual incidents. They are statistical inevitabilities happening multiple times daily across a fabric containing hundreds of thousands of links. The paper recounts a production event where an optical transceiver on a T0 switch “suffered a glitch, and flapped all its four links in rapid succession,” impacting three active training nodes at the same time. In a traditional network, this would have crashed the training job.
MRC’s design objective was not simply higher bandwidth. It was dependable bandwidth, even when failures occur, paired with a control plane simple enough that a small team can oversee multiple supercomputers at once.
The topology: 131,000 GPUs across two switch tiers
The first design choice is architectural rather than protocol-level. Instead of treating an 800 Gb/s NIC as a single high-capacity pipe, MRC divides it into eight 100 Gb/s links, each attached to a different switch. This produces eight independent network planes, each functioning on its own.
Consider a traditional setup. Today’s fastest data center Ethernet switches provide 51.2 Tb/s of switching capacity, delivering 64 ports at 800 Gb/s. In a standard fat-tree Clos topology, each Tier-0 (T0) switch connects downward to 32 NICs and upward to 32 Tier-1 (T1) switches. Each T1 switch links to 64 pods. That results in a 3-tier network supporting roughly 64,000 GPUs at full bisection bandwidth. To reach 100,000, a fourth tier is required, which increases latency, cost, and failure domains.
Now split the NIC. The same 51.2 Tb/s switch at 100 Gb/s per port yields 512 ports instead of 64. Each T0 switch connects downward to 256 NIC ports and upward to 256 T1 switches. Each T1 connects to 512 T0s. A single two-tier plane supports 131,072 GPUs at full bisection bandwidth.
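The port arithmetic in the two paragraphs above can be checked with a few lines. This is a minimal sketch assuming a standard folded-Clos fat tree in which every switch below the top tier splits its ports evenly between downlinks and uplinks; the function name is hypothetical.

```python
def max_endpoints(radix, tiers):
    """Endpoint ports supported at full bisection bandwidth by a fat tree.

    Switches below the top tier use half their ports downward and half upward;
    top-tier switches use all ports downward.
    """
    if tiers < 1 or radix % 2:
        raise ValueError("need tiers >= 1 and an even radix")
    return radix * (radix // 2) ** (tiers - 1)

SWITCH_CAPACITY_GBPS = 51_200  # 51.2 Tb/s switch, as in the text

for port_gbps, tiers in ((800, 3), (100, 2)):
    radix = SWITCH_CAPACITY_GBPS // port_gbps
    print(f"{port_gbps} Gb/s ports: radix {radix}, {tiers} tiers -> "
          f"{max_endpoints(radix, tiers):,} endpoints per plane")
```

With 64-port switches and three tiers this gives 65,536 endpoints; with 512-port switches and two tiers it gives 131,072. In the multi-plane design each GPU's NIC has one 100 Gb/s port in each of the eight planes, so the per-plane endpoint count equals the GPU count.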
The paper quantifies the savings:
Conventional 3-tier (800 Gb/s):
- 3 switch tiers, 64-port switches
- Max ~64K GPUs at full bisection BW
- 5-hop or 7-hop worst-case path
Multi-plane 2-tier (8 × 100 Gb/s):
- 2 switch tiers, 512-port switches
- 131K GPUs at full bisection BW
- 3-hop worst-case path
- 2/3 the optics of a 3-tier network
- 3/5 the number of switches
[Source: Author (SVG created using Inkscape) – Reference:arXiv:2605.04333 (2026)]
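The 3/5 and 2/3 ratios can be reproduced with simple port counting. The sketch below is not taken from the paper. It assumes a full-bisection fat tree in which every endpoint port consumes one switch port at T0 plus two switch ports per inter-tier link, and it treats optics as proportional to the number of link layers that carry each NIC's full bandwidth. The function names are hypothetical.

```python
from fractions import Fraction

def switches_per_nic(radix, tiers, ports_per_nic=1):
    """Switches needed per NIC in a full-bisection fat tree.

    Each endpoint port uses 1 switch port at T0 and 2 per inter-tier link,
    so (2 * tiers - 1) switch ports in total.
    """
    return Fraction((2 * tiers - 1) * ports_per_nic, radix)

def link_layers_per_nic(tiers):
    """Layers of links carrying each NIC's full bandwidth (NIC-T0, T0-T1, ...)."""
    return tiers

conventional = switches_per_nic(radix=64, tiers=3)                  # 800 Gb/s ports
multiplane = switches_per_nic(radix=512, tiers=2, ports_per_nic=8)  # 8 x 100 Gb/s

switch_ratio = multiplane / conventional
optics_ratio = Fraction(link_layers_per_nic(2), link_layers_per_nic(3))
print("switch ratio:", switch_ratio)
print("optics ratio:", optics_ratio)
```

Under these counting assumptions the multi-plane design needs 3/64 of a switch per NIC against 5/64 for the conventional design, which matches the 3/5 figure, and two link layers instead of three, which matches the 2/3 figure.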
The reliability advantages are just as impressive. If one T0 uplink fails in an 800 Gb/s single-plane setup, that switch loses 1 of its 32 uplinks, about 3% of its upstream bandwidth. In a 100 Gb/s multi-plane configuration, the same failure costs 1 of 256, about 0.4%. Even better, if a NIC-to-T0 link fails, the NIC keeps working on its other seven links, one per remaining plane, while the broken one gets fixed. The training process never has to pause.
This approach does come with costs. Eight independent planes mean eight times more links to keep track of, eight times as many places where failures could happen overall, and a transport protocol smart enough to spread traffic evenly across all of them. That is exactly what MRC is designed to do.
Distributing packets using entropy values
Standard RDMA transport methods (RoCEv2, InfiniBand RC) lock each connection to one fixed network path. Switches pick this path by hashing the flow’s five-tuple (source and destination IP addresses, source and destination ports, and protocol). Once locked in, every packet in that connection travels the same route until the connection ends.
This approach works fine at smaller scales. It breaks down with 100,000 or more GPUs because of flow collisions. When two connections get hashed onto the same path through the same congested link, both connections experience problems. The chance of collisions grows as the network scales up, and the impact on tail latency is especially severe.
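The collision risk follows the same arithmetic as the birthday problem. The sketch below assumes each flow is hashed uniformly and independently onto one of a fixed number of equal-cost paths, which is a simplification of real ECMP behavior; the function name is hypothetical.

```python
def prob_any_collision(n_flows, n_paths):
    """Probability that at least two of n_flows hash to the same path,
    assuming uniform, independent hashing."""
    if n_flows > n_paths:
        return 1.0  # pigeonhole: a collision is certain
    p_distinct = 1.0
    for i in range(n_flows):
        p_distinct *= (n_paths - i) / n_paths
    return 1.0 - p_distinct

# Example: flows sharing 256 equal-cost paths.
for n in (8, 32, 128):
    print(f"{n:>3} flows over 256 paths -> P(collision) = {prob_any_collision(n, 256):.3f}")
```

Even when flows are far fewer than paths, the chance that some pair shares a path rises quickly, and in a lock-step workload a single shared link is enough to delay the whole step.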
MRC gets rid of fixed path assignments completely. Instead, it gives each Queue Pair (QP) a collection of 128 to 256 entropy values (EVs) when the connection is first established. Each EV represents a specific route through a specific network plane. The sender cycles through its EV set one packet at a time, spreading consecutive packets across hundreds of different paths spanning all eight planes. No two back-to-back packets from the same transfer ever follow the same route.
The EV is a 32-bit number divided between the UDP source port and the IPv6 flow label in each MRC packet. Switches use these fields for hashing, so switching to a different EV changes the path. The sender does not need any knowledge of the network layout. It just needs to understand that different EVs lead to different routes.
Per-QP state:
    EV set: 128-256 entropy values (32-bit each)
    Per-EV health: {active, congested, suspected_failed, confirmed_failed}
Packet sending:
    for each packet in transfer:
        ev = next_active_ev(qp.ev_set)
        packet.udp_src_port = ev[0:15]
        packet.ipv6_flow_label = ev[16:31]
        send(packet)
Each EV includes a small amount of health status information. When the receiver notices congestion on a path (through ECN markings from switches), it sends this information back to the sender, which then temporarily stops using that EV. When a packet is genuinely lost (not just trimmed), MRC treats the path as failed and immediately stops sending traffic on that EV. Background probe messages periodically check retired EVs to see if the problem was temporary, reactivating them if the probes succeed.
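The rotation and health tracking described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the NIC implementation: the class and method names are hypothetical, the health states mirror the pseudocode, and the choice to put the low 16 bits of the EV in the UDP source port is a guess at how the 32 bits are divided.

```python
import random
from enum import Enum

class Health(Enum):
    ACTIVE = "active"
    CONGESTED = "congested"
    SUSPECTED_FAILED = "suspected_failed"
    CONFIRMED_FAILED = "confirmed_failed"

class EvSet:
    """Hypothetical per-QP entropy-value set with round-robin selection."""

    def __init__(self, size=128, seed=None):
        rng = random.Random(seed)
        self.evs = rng.sample(range(2**32), size)  # distinct 32-bit values
        self.health = {ev: Health.ACTIVE for ev in self.evs}
        self._cursor = 0

    def next_active_ev(self):
        """Return the next ACTIVE EV in rotation, skipping unhealthy ones."""
        for _ in range(len(self.evs)):
            ev = self.evs[self._cursor]
            self._cursor = (self._cursor + 1) % len(self.evs)
            if self.health[ev] is Health.ACTIVE:
                return ev
        raise RuntimeError("no active entropy values left")

    def set_health(self, ev, state):
        """Record feedback: an ECN echo, a detected loss, or a successful probe."""
        self.health[ev] = state

def split_ev(ev):
    """Split a 32-bit EV into two 16-bit header fields.

    Assumed layout: low half -> UDP source port, high half -> flow label bits.
    """
    return ev & 0xFFFF, (ev >> 16) & 0xFFFF
```

A sender would call `next_active_ev()` once per packet, call `set_health(ev, Health.CONGESTED)` or `set_health(ev, Health.SUSPECTED_FAILED)` when feedback arrives, and set the EV back to `Health.ACTIVE` when a probe succeeds. How long a congested EV stays out of rotation is not modeled here.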
This method balances traffic very effectively. Since different senders independently create random EV sets, the overall traffic spread across paths is nearly even. Minor imbalances get corrected by the ECN feedback loop: if one path picks up slightly more traffic, ECN markings go up on that path, and senders shift traffic to less crowded alternatives.
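A small simulation makes the balancing effect concrete. It is a toy model under loud assumptions: paths are chosen uniformly at random, every sender transmits the same number of packets, each EV maps to one random path, and the ECN feedback loop is not modeled at all. All names and parameter values are illustrative.

```python
import random
from collections import Counter

def imbalance(loads, n_paths):
    """Max path load divided by the mean path load (1.0 is perfectly even)."""
    total = sum(loads.values())
    return max(loads.values()) / (total / n_paths)

def simulate(n_senders=256, n_paths=256, packets=1000, ev_set_size=128, seed=0):
    """Return (pinned_imbalance, sprayed_imbalance) for one random trial."""
    rng = random.Random(seed)
    pinned, sprayed = Counter(), Counter()
    for _ in range(n_senders):
        # Single-path transport: the flow hash pins every packet to one path.
        pinned[rng.randrange(n_paths)] += packets
        # Spraying: rotate packets over a random set of EVs, one path per EV.
        ev_paths = [rng.randrange(n_paths) for _ in range(ev_set_size)]
        for i in range(packets):
            sprayed[ev_paths[i % ev_set_size]] += 1
    return imbalance(pinned, n_paths), imbalance(sprayed, n_paths)

pinned_ratio, sprayed_ratio = simulate()
print(f"pinned:  busiest path carries {pinned_ratio:.2f}x the mean load")
print(f"sprayed: busiest path carries {sprayed_ratio:.2f}x the mean load")
```

In this model the busiest path under pinning carries several times the mean load, while spraying stays close to the mean even before any feedback is applied.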

Fixed source routing with SRv6
This is the most surprising choice in the paper. Nearly every production data center network relies on dynamic routing protocols (BGP, OSPF, IS-IS) that calculate forwarding tables, respond to topology changes, and reconverge after failures. MRC turns all of them off.
Instead, MRC uses Segment Routing over IPv6 (SRv6) to specify the complete path each packet should follow. The sender includes the full list of switch identifiers directly in the packet’s destination address. Each switch along the way checks whether its own identifier appears, removes it by shifting the address, and forwards the packet to the next hop. There is no routing table lookup, no forwarding information base, and no control plane convergence process.
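The shift-and-forward step can be sketched with Python's `ipaddress` module. The address layout here is an assumption made for illustration (a 32-bit block prefix followed by 16-bit switch identifiers, similar in spirit to compressed SRv6 segment lists); the article does not state the field widths MRC uses, and the function names are hypothetical.

```python
import ipaddress

BLOCK_BITS = 32               # assumed prefix shared by all fabric addresses
SID_BITS = 16                 # assumed width of one switch identifier
LIST_BITS = 128 - BLOCK_BITS  # room left for the ordered identifier list

def encode_path(block, sids):
    """Pack a block prefix and an ordered list of switch IDs into one address."""
    if len(sids) * SID_BITS > LIST_BITS:
        raise ValueError("too many hops for one destination address")
    value = block << LIST_BITS
    for i, sid in enumerate(sids):
        value |= sid << (LIST_BITS - SID_BITS * (i + 1))
    return ipaddress.IPv6Address(value)

def forward(dst, my_sid):
    """Return the rewritten destination if this switch is the active hop, else None."""
    value = int(dst)
    block = value >> LIST_BITS
    sid_list = value & ((1 << LIST_BITS) - 1)
    active = sid_list >> (LIST_BITS - SID_BITS)
    if active != my_sid:
        return None
    # Drop the consumed identifier by shifting the list left; zeros fill the tail.
    shifted = (sid_list << SID_BITS) & ((1 << LIST_BITS) - 1)
    return ipaddress.IPv6Address((block << LIST_BITS) | shifted)

dst = encode_path(0xFC000000, [0x0101, 0x0202, 0x0303])
print(dst, "->", forward(dst, 0x0101))
```

The point of the sketch is that the switch's decision depends only on bits already in the packet plus its own identifier, so nothing needs to be recomputed when the topology elsewhere changes.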
The paper explains the reasoning: “We took the unusual position of disabling dynamic routing in the switches because we didn’t want two adaptive routing mechanisms interacting with each other and dynamic routing wasn’t adding anything.”
MRC’s transport-layer adaptation (EV management, ECN feedback, path probing) already deals with failures at microsecond speeds. Dynamic routing protocols take seconds to minutes to converge. Running both at the same time creates a risk of conflicting behavior: MRC might avoid a failed path at the transport layer while the routing protocol is still recalculating its forwarding state, potentially causing routing loops or unstable oscillations.
By completely removing dynamic routing, MRC gains three operational advantages:
First, predictable forwarding. Every packet follows a known, pre-calculated
transfer continues at full speed. No GPU stall. The sender adapts its path selection, not its sending rate.
[Source: Author (SVG created using Inkscape) – Reference:arXiv:2605.04333 (2026)]
Putting it all together: the MRC data transfer
A single MRC data transfer works as follows. The sender’s GPU issues a collective operation (e.g., an AllReduce). The MRC NIC translates this into a set of unicast RDMA writes, each tagged with a different entropy value. Each write is source-routed over the SRv6 multi-plane fabric. Switches forward packets statelessly based on the segment list in the packet header. The receiver’s NIC writes each packet directly to its final memory location, regardless of order. ECN marks guide the sender to avoid congested paths. Selective retransmission and packet trimming handle any drops. The collective completes when all packets arrive and the GPU is notified.
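The receiver side of that sequence, placing packets at their final location regardless of arrival order, can be illustrated as follows. This is a host-memory toy, not NIC behavior: the class name, the fixed-size packet numbering, and the set-based completion tracking are all assumptions made for the example.

```python
class ReceiveBuffer:
    """Places each packet at its final offset and tracks completion."""

    def __init__(self, total_len, mtu):
        self.data = bytearray(total_len)
        self.mtu = mtu
        n_packets = -(-total_len // mtu)  # ceiling division
        self.missing = set(range(n_packets))

    def on_packet(self, seq, payload):
        """Write payload at seq * mtu; return True once every packet has arrived."""
        if seq in self.missing:  # duplicates from retransmission are ignored
            start = seq * self.mtu
            self.data[start:start + len(payload)] = payload
            self.missing.discard(seq)
        return not self.missing

buf = ReceiveBuffer(total_len=10, mtu=4)
for seq, payload in [(2, b"ij"), (0, b"abcd"), (0, b"abcd"), (1, b"efgh")]:
    done = buf.on_packet(seq, payload)
print(bytes(buf.data), done)
```

Because every packet carries enough information to be placed on its own, the receiver never needs a reorder queue, and the `missing` set is exactly what a selective retransmission request would ask for.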
The paper reports that this design achieves 94% of the theoretical bisection bandwidth on a 10,000-GPU cluster, compared to 30–50% for conventional RoCE-based designs. Tail latencies are reduced by 10x, and the network operates without PFC, without dynamic routing protocols, and without per-flow state in switches.
The key insight is that by moving intelligence to the endpoints (NICs) and simplifying the network fabric to stateless forwarding, MRC eliminates the root causes of tail latency in large-scale GPU clusters. The network becomes a dumb, fast, predictable pipe.
This is a radical departure from decades of networking orthodoxy. It works because the entire stack (NIC, switch, and protocol) was designed as a single system, rather than assembled by bolting new features onto existing standards.

What the production evidence shows
The paper presents findings from two settings: production frontier model training and controlled testbed experiments.
In real-world production use, MRC enabled training jobs to survive network disruptions that would have previously caused them to fail. The paper details the optical transceiver glitch described earlier: four links went down in quick succession across three active training nodes. MRC identified the path failures, stopped routing traffic through the affected EVs, and shifted the load across the remaining available paths. The training job kept running without any interruption. In a standard RoCE setup, this same event would have caused PFC storms, NCCL timeouts, and a job restart that would waste hours of GPU compute time.
The testbed experiments measure MRC’s key performance characteristics:
Point-to-point bandwidth: MRC reaches near-line-rate throughput on 800 Gb/s links when using packet spraying. The paper includes a comparison with standard RoCE, highlighting MRC’s advantage in multi-path scenarios.
Link failure recovery: when a link drops, MRC detects the failure and shifts traffic within tens of microseconds. There are no sender-side timeouts and no routing protocol convergence needed. The EV tied to the failed path is immediately retired, and the remaining EVs take over the traffic.
Load balancing across EVs: the paper measures how traffic is spread across planes and paths, demonstrating near-even utilization under production workloads.
NCCL collective performance at scale: the paper assesses MRC’s performance on all-reduce operations, which are the primary communication pattern in data-parallel training. MRC’s packet spraying removes the flow-collision issue that degrades all-reduce performance at scale when using conventional ECMP hashing.
The operational data supports the choice of static routing. The paper notes that T1 core switches were rebooted during live training runs without disrupting the job. In a traditional network using dynamic routing, rebooting a core switch triggers reconvergence across the entire fabric. With static SRv6, the switch simply reloads its static forwarding state and resumes forwarding. MRC’s transport layer managed the temporary loss of paths through that switch by shifting traffic to other planes.

Where these design decisions are strongest
MRC was built for a particular workload profile: synchronous pretraining with all-reduce-dominated communication, running on a single-tenant fabric with full bisection bandwidth. Within these parameters, the three core design choices align well with the problem:
Static routing is effective because the topology is fixed and known before deployment. Training clusters don’t add or remove switches mid-run. The failure modes are link-level (managed by MRC’s EV management), not topology-level (which would demand routing protocol reconvergence).
Lossy Ethernet is viable because the selective retransmission and packet trimming mechanisms recover faster than PFC pause frames can propagate. The cross-collective head-of-line blocking created by PFC is more harmful to tail latency than the occasional cost of retransmission.
ECN-as-load-balancing is effective because the multi-plane topology delivers full bisection bandwidth, meaning aggregate congestion doesn’t occur. Local imbalances are the only source of congestion, and ECN-guided EV avoidance is a precise, low-overhead way to smooth them out.

The boundary conditions: where MRC works and where it doesn’t
MRC is a production-proven protocol for its intended workload. The natural questions for the broader AI infrastructure community revolve around its limits.
First, multi-tenancy. OpenAI’s training clusters run a single training job at a time across the entire fabric. Most cloud providers and enterprise deployments share GPU clusters across many workloads. MRC’s static routing assumes a stable topology database at the NIC level. In a multi-tenant setting where workloads are placed dynamically, the topology each NIC sees changes frequently. Whether MRC’s path-generation logic can adapt to this or needs modifications remains an open engineering question.
Second, inference workloads. MRC was designed for synchronous training’s all-reduce pattern: large bulk transfers between known sets of GPUs. Inference workloads — especially disaggregated inference with KV cache transfers between prefill and decode pools — have a different communication profile: smaller transfers, point-to-point rather than collective, and latency-sensitive at the per-request level rather than the aggregate step level. Packet spraying across hundreds of paths introduces jitter into individual transfer latency, which may or may not be acceptable depending on the SLO requirements.
Third, oversubscribed networks. MRC’s ECN-based load-balancing approach depends on having full bisection bandwidth available. In oversubscribed networks—typical in cloud settings where cost efficiency shapes topology choices—overall congestion is a real issue, not merely a local imbalance. In such cases, ECN must act as a true congestion indicator, which alters how MRC’s flow control behaves.
Fourth, interoperability. MRC currently runs on particular NIC hardware (NVIDIA ConnectX-8, AMD Pollara/Vulcano, Broadcom Thor Ultra) and specific switch platforms (NVIDIA Spectrum-4/5, Arista EOS on Broadcom Tomahawk 5). The OCP specification release allows for wider adoption, but developing and validating silicon-level protocol support requires 12–18 months. In the near term, only organizations using these specific hardware platforms will be able to adopt it.
These aren’t criticisms of MRC. They are the natural engineering questions that emerge when a protocol built for a tightly controlled environment encounters the variety of the broader infrastructure market. MRC’s success in solving tail latency at 131,000-GPU scale is a real accomplishment. The key question for the wider community is which of its design choices apply broadly and which are unique to the constraints of single-tenant, full-bisection-bandwidth training fabrics.
What MRC signals about the future of AI networking
MRC reflects a larger shift in how AI infrastructure approaches networking. The traditional model treats the network as a transparent pipe: packets enter one end and exit the other, and the transport protocol’s role is to fill that pipe as efficiently as possible. MRC treats the network as a managed resource with visible, per-path health indicators that the transport protocol actively uses.
This concept isn’t new in networking research. Multipath TCP, Valiant load balancing, and ECMP have explored similar ideas for years. What’s new is the scale at which MRC operates, the boldness of its design choices (no PFC, no dynamic routing, full packet spraying), and the real-world evidence that it works on the largest AI training clusters globally.
For networking professionals, MRC supports a theory debated for years: at large enough scale, smarter endpoints beat smarter networks. Making the NIC more capable and the switch simpler creates a more robust system than making the switch more capable and the NIC simpler. Regardless of whether you agree with every design choice, the production results from OpenAI and Microsoft make this case harder to ignore than it was a week ago.
The MRC specification is available through OCP under an open license. The research paper includes detailed experimental findings. For anyone building GPU clusters at scale, both deserve careful attention. The three rules MRC breaks might be the same three rules limiting your network’s performance.



