When we launched Gaudi accelerators on Amazon’s EC2 DL1 instances, we confronted a problem that threatened the entire deployment. The performance numbers weren’t just disappointing; they were disastrous. Models that needed to train efficiently were seeing up to 50% performance degradation when scaling across multiple nodes. The problem? A network topology that routed every byte of data through host memory, creating a bottleneck that undermined everything Gaudi was designed to do.
I led the engineering effort to address this issue, which ultimately resulted in the development of what we now call Peer Direct. It’s a feature that transformed the way Gaudi accelerators communicate in cloud environments, and its story holds some useful lessons about distributed AI training at scale.
The Problem with Host NICs
Gaudi was designed with the NIC (Network Interface Card) embedded directly in the silicon. Each chip has ten network interfaces that can handle 100 Gbps and support RDMA over RoCE v2, allowing devices to access each other’s memory directly without involving the CPU or operating system. This architecture is extremely efficient for AI training workloads, where collective operations like AllReduce must accumulate gradients from dozens or hundreds of devices per training iteration.
But cloud deployments don’t always comply with ideal architectures. When Amazon evaluated Gaudi for DL1 instances, they chose to use ordinary host NICs rather than Gaudi’s built-in networking. The reasons were pragmatic: cost savings and the logistics of fitting a new network topology into existing data center infrastructure. From their business perspective, leveraging established network infrastructure made perfect sense.
From a performance perspective, it was a disaster. Instead of peer-to-peer RDMA transfers between Gaudi cards, all communication went the long way around. Data had to be copied out of Gaudi’s high-bandwidth memory into host DRAM, processed by the host CPU, sent out the host NIC over TCP/IP, received by the far host, and copied back into the far Gaudi’s memory. All these extra hops added latency, stole CPU cycles, and imposed bandwidth restrictions that completely ruined the scalability of distributed training.
The performance shortfall was bad enough to call into question whether the deployment would be worthwhile at all. This wasn’t a matter of some minor optimization; it was an existential threat to the entire arrangement with AWS.
Why Performance Matters This Much
It’s worth understanding why a 50% performance loss is so disastrous when training models, especially large models such as GPT-5. Training huge language models takes weeks or months even on massive clusters. When you’re working with models that have billions or trillions of parameters, every percentage point of performance translates directly into time and dollars.
Consider the economics. If it takes 30 days to train a model versus 15, you’re not only waiting longer; you’re paying for double the compute time. At cloud scale, with hundreds or thousands of accelerators in continuous use, this adds up to millions of dollars. Worse, it halves your iteration speed. In a competitive AI landscape where companies are racing to develop better models, doubling the number of experiments you can run in the same timeframe can be the difference between being ahead and being behind.
The environmental cost also matters. Large models require a great deal of electricity to train. Better performance means less compute time, which cuts energy consumption and carbon emissions. As pressure mounts on the AI industry to reduce its carbon footprint, efficiency gains are no longer a luxury but a necessity.
The solution we designed, Peer Direct, delivered RDMA-class performance even though the physical network layout wasn’t suited to conventional RDMA. We needed direct memory access between Gaudi devices on different systems without traversing host memory, over host NICs that were never designed for this in the first place.
The enabler was AWS Elastic Fabric Adapter (EFA), a high-performance network interface for HPC and AI workloads on EC2. EFA provides low-latency OS-bypass communication, typically with sub-10-microsecond latency. It exposes RDMA-like semantics through libfabric, a user-space communication library that offers a common interface across multiple networking technologies.
The task was to integrate libfabric with Habana’s Collective Communication Library, HCCL, which handles all distributed training communication. HCCL was built on the assumption of native RDMA using Gaudi’s on-chip NICs. We needed to create a bridge that let HCCL leverage libfabric transparently, without compromising its performance guarantees and communication semantics.
The solution required several technical advances. First, we introduced a memory registration system that allowed libfabric to directly access Gaudi’s high-bandwidth memory. We used the Linux kernel’s DMA-BUF framework, which provides a common mechanism for sharing device driver buffers. When HCCL needs to transfer data, the Gaudi driver exports a DMA-BUF file descriptor for the memory region, which libfabric can use to perform RDMA transfers directly from device memory.
Second, we added an LRU cache for memory registrations. Memory registration is expensive; it involves kernel calls and setup operations that can cause significant overhead. By caching the mapping of memory addresses to their libfabric handles, we could reuse registrations for hot memory regions, eliminating most registration overhead from actual training.
The result was a communication pipeline that looked something like this: HCCL calls the OFI wrapper, which uses the cached libfabric handle to perform an RDMA transfer directly from source Gaudi memory to destination Gaudi memory, with neither CPU ever being involved. The OFI wrapper was introduced to keep the codebase clean and avoid direct header inclusions; it’s a lightweight library that HCCL links dynamically, enabling the use of libfabric without requiring direct integration.
After the transfer completes, libfabric reports through a completion queue, and HCCL continues computation with the newly received data.
The Development Experience
Building Peer Direct meant venturing into new territory on tight schedules. Libfabric wasn’t yet mainstream in the AI accelerator space. There wasn’t much public documentation available, and community discussion was sparse. Much of the work came down to diving into the libfabric source code and reverse-engineering behavior through experimentation.
Communication with AWS engineers was essential but time-zone constrained. Working with a team twelve hours ahead meant that debug iterations had 24-hour turnarounds. Every issue needed careful documentation and precise communication, since real-time collaboration wasn’t possible.
The stakes were high: the entire DL1 deployment was riding on this functionality working. Delays would have derailed a major product launch. Nobody on our team had deep background knowledge of libfabric internals, so we were learning a complex codebase while designing a critical integration at the same time.
The Results
When we actually deployed Peer Direct, the performance improvements were worth all the effort. We saw a 1.5 to 2x throughput increase for collective operations at a 32MB message size. On larger messages the gains held up, with up to 1.76x better throughput at a 256MB message size. The bottleneck created by CPU overhead disappeared entirely.
Most importantly, these microbenchmark improvements translated directly into real model training performance. Training Habana’s DeepSpeed BERT model with 5 billion parameters across 128 Gaudi devices, we saw a substantial throughput gain. Models using more aggressive memory optimization techniques, like ZeRO-2, which depend more heavily on collective operations, benefited disproportionately from Peer Direct.
Peer Direct was one of the primary enablers of Gaudi performance on AWS DL1 instances, allowing large-scale distributed training to run smoothly on launch day. Beyond this initial impact, the effort laid the groundwork for future high-performance communication features and proved that AI accelerators could remain competitive despite the constraints of cloud infrastructure.
The experience reinforced an important lesson in systems engineering: often the most significant performance improvements come not from optimizing the fast path, but from eliminating unnecessary detours altogether. In distributed AI training, having data travel directly between accelerators, with no needless copies and no CPU intervention, is the difference between a system that works and one that scales.
Key takeaways? One important takeaway from this project is that assumptions about network topology should be tested at the earliest point in a distributed training effort. Many accelerator stacks were built around an idealized environment; they don’t account for the extra hops, translation layers, and cost-driven compromises that exist in cloud environments. So before focusing on model-level or kernel-level optimization, engineers should run simple collective microbenchmarks across the target topology. If scaling efficiency drops dramatically with increasing node counts or message sizes, the likely culprit is the data path, not the kernels. By identifying a host-memory detour early, engineers can focus their efforts where they will have the greatest impact.
Another important lesson was the need to treat both memory registration and data transfer as first-class performance concerns. Memory registration overhead can greatly exceed the time spent communicating if every transfer requires a new registration. The LRU cache for registered memory was an unglamorous addition to HCCL; nevertheless, it effectively eliminated a systemic source of latency and made the RDMA path viable for real-world workloads. When building distributed systems, engineers should profile not only the available network bandwidth but also the lifecycle costs of allocating buffers, registering them, and tearing registrations down. Small changes to these control paths can yield large increases in end-to-end throughput.
Finally, the integration approach used in this project provides a pattern worth reusing. Instead of rewriting HCCL to use libfabric directly, we created a thin abstraction layer that preserved existing semantics while replacing the underlying transport. This brought several benefits: it minimized risk, reduced code churn, and allowed incremental testing. Teams facing a similar challenge (adapting accelerator-native communication libraries to cloud-native fabrics) should aim to isolate the transport layer, preserve collective semantics, and create small, testable interfaces between the two. This not only speeds development but also simplifies support for future transport backends.
Disclosure: I work as an AI Runtime Group Manager at Intel. The views shared in this article are my own.



