GPU communication is often a significant bottleneck in real-world AI workloads. According to figures cited by the mKernel project, communication can account for 43.6% of the forward pass and 32% of total training time. For common Mixture-of-Experts (MoE) architectures, moving data between devices can take as much as 47% of total runtime. A team from UC Berkeley’s UCCL project has introduced mKernel, a library of persistent CUDA kernels that fuse intra-node NVLink traffic, inter-node RDMA transfers, and computation into a single kernel.
The Challenge: CPU-Managed Communication
Traditionally, data exchange between GPUs follows a CPU-managed pattern. The CPU owns the control flow and calls libraries such as NCCL or NVSHMEM to run operations like AllReduce or AllGather. Computation and communication run on separate CUDA streams and synchronize only at kernel boundaries.
The researchers highlight two core limitations with this traditional setup:
(1) CPU performance is not keeping pace with GPU performance. Consider a GB300 NVL72 rack, which packs 72 Blackwell Ultra GPUs alongside 36 Grace CPUs and provides 720 PFLOP/s of FP8/FP6 performance, 1.44 EFLOP/s of FP4 Tensor Core throughput, and 130 TB/s of rack-wide NVLink bandwidth. At these speeds, even small host-side delays (a cudaLaunchKernel call, a CPU check that writes have completed, or a stream-synchronization event) show up as stalls in the compute pipeline.
(2) CPU-managed systems can synchronize computation and communication only at coarse, kernel-level boundaries. Finer, chunk-level coordination cannot be driven from the host.
The proposed solution is GPU-managed communication: the GPU autonomously handles data transfers, integrating communication operations directly into the same kernel executing the math. While many current fused-kernel tools are confined to one node or one GPU, mKernel is engineered specifically for multi-node environments.
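To make the difference between kernel-boundary and chunk-level synchronization concrete, here is a toy timing model. It is only a sketch: the function names, the chunk counts, and the `host_overhead` parameter are hypothetical and are not taken from mKernel. It assumes a dependent pair of operations (such as a GEMM whose output feeds a collective), equal-sized chunks, and a single link.

```python
def kernel_boundary_time(n_chunks, compute_per_chunk, comm_per_chunk, host_overhead=0.0):
    """Dependent compute -> communication, synchronized only at kernel boundaries.

    The collective cannot start until the whole compute kernel has finished,
    so the two phases run back to back. host_overhead is a hypothetical
    per-launch cost paid once for each of the two host-issued operations.
    """
    compute = n_chunks * compute_per_chunk
    comm = n_chunks * comm_per_chunk
    return 2 * host_overhead + compute + comm


def chunk_pipelined_time(n_chunks, compute_per_chunk, comm_per_chunk):
    """Same work inside one kernel: chunk i is sent while chunk i+1 is computed."""
    compute_done = 0.0  # time at which the latest chunk finished computing
    comm_done = 0.0     # time at which the link finished sending the latest chunk
    for _ in range(n_chunks):
        compute_done += compute_per_chunk
        # A chunk is sent once it exists and the link is free.
        comm_done = max(comm_done, compute_done) + comm_per_chunk
    return comm_done


if __name__ == "__main__":
    # Illustrative numbers only: 8 chunks, 1.0 time unit each for compute and comm.
    print(kernel_boundary_time(8, 1.0, 1.0))   # 16.0
    print(chunk_pipelined_time(8, 1.0, 1.0))   # 9.0
```

In this model the pipelined time approaches the larger of the two phases as the chunk count grows, which is the benefit that fine-grained overlap targets. Real kernels also contend for SMs and memory bandwidth, which the model ignores.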
mKernel’s Functionality
mKernel provides a set of persistent CUDA kernels. Each kernel combines intra-node NVLink transfers, inter-node RDMA, and core computation within a single execution unit.
Unified Multi-GPU, Multi-node Execution: Both NVLink communication and cross-node RDMA traffic are integrated into the same persistent kernel.
Fine-Grained Intra-kernel Synchronization: Computation and communication overlap at tile or chunk granularity, improving efficiency for both local and remote GPU communication.
Persistent Kernel with Specialized Streaming Multiprocessors (SMs): Cooperative Thread Arrays (CTAs) are assigned to specific roles: compute, intra-comm, inter-send, and inter-reduce. The number of SMs allocated to each role can be tuned to the workload shape.
GPU-Initiated Networking via libibverbs: mKernel executes direct GPU-initiated RDMA writes without relying on NCCL or NVSHMEM. This custom-built backend is engineered for peak performance and compatibility with various networking hardware types.
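The role specialization described above can be pictured as a mapping from CTA index to role. The sketch below is an assumption about how such a split could be expressed, not mKernel's actual code: the function name, the contiguous-range layout, and the example split are all hypothetical. It assumes one persistent CTA per SM; an H200 has 132 SMs.

```python
# Role names follow the article; the identifiers themselves are hypothetical.
ROLES = ("compute", "intra_comm", "inter_send", "inter_reduce")


def assign_cta_roles(num_ctas, split):
    """Return a list where entry i is the role of CTA i.

    split maps each role to a CTA count. Roles occupy contiguous index
    ranges in ROLES order, so a kernel could branch once on its block index.
    """
    unknown = set(split) - set(ROLES)
    if unknown:
        raise ValueError(f"unknown roles: {sorted(unknown)}")
    if sum(split.values()) != num_ctas:
        raise ValueError("role counts must add up to num_ctas")
    roles = []
    for role in ROLES:
        roles.extend([role] * split.get(role, 0))
    return roles


if __name__ == "__main__":
    # Hypothetical split for a compute-heavy shape on a 132-SM GPU.
    example = {"compute": 112, "intra_comm": 8, "inter_send": 8, "inter_reduce": 4}
    roles = assign_cta_roles(132, example)
    print(roles[0], roles[112], roles[120], roles[128])
```

A communication-bound shape would shift CTAs from `compute` toward the three communication roles; the right balance depends on the workload and has to be measured.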
The Five Integrated Operations
| Kernel | Integrated Components | Description |
|---|---|---|
| AllGather + GEMM | AllGather → GEMM | Each rank manages a portion of matrix A. As peers’ segments are fetched via NVLink/RDMA, the local GEMM operation begins using the data immediately upon arrival. |
| GEMM + AllReduce | GEMM → AllReduce | Executes C = A @ B and combines partial results from all ranks in one step. Result tiles enter the reduction process the moment they are generated. |
| MoE Dispatch + GEMM | All-to-All routing → grouped GEMM | Routes MoE tokens to their designated expert ranks (over NVLink and inter-node links) and runs the corresponding grouped GEMM. Tokens are processed as soon as they reach their target, avoiding the overhead of temporary storage. |
| Ring Attention | Ring KV exchange → FlashAttention | Performs sequence-parallel attention over multiple ranks. During each cycle, a KV block circulates through the ring while local FlashAttention processes the previously received block. Computation and ring communication (send/receive) operate simultaneously inside one persistent kernel. |
| GEMM + ReduceScatter | GEMM → ReduceScatter | Executes C = A @ B and reduce-scatters the output. Every output tile is forwarded to its owning rank as soon as it is computed. |
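The AllGather + GEMM row relies on a simple property: each rank's shard of A produces its own block of output rows, so shards can be multiplied in whatever order they arrive. The pure-Python sketch below illustrates that property only. It assumes A is sharded by rows, and the function names and the `arrival_order` argument are hypothetical stand-ins for NVLink/RDMA arrival; it does not model real transfers or mKernel's API.

```python
def matmul(a, b):
    """Plain-Python matrix product; a and b are lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def allgather_gemm(shards, b, arrival_order):
    """Compute A @ B where shards[r] is rank r's row block of A.

    Each shard is multiplied as soon as it "arrives" (in arrival_order),
    and its result is written to the output rows that shard owns.
    """
    blocks = [None] * len(shards)
    for rank in arrival_order:
        blocks[rank] = matmul(shards[rank], b)
    # Stitch the per-rank output blocks back together in rank order.
    return [row for block in blocks for row in block]


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6], [7, 8]]
    b = [[1, 2], [3, 4]]
    shards = [a[0:2], a[2:4]]          # two ranks, two rows each
    # The remote shard (rank 1) arrives before the local one is processed.
    print(allgather_gemm(shards, b, arrival_order=[1, 0]) == matmul(a, b))  # True
```

Because the result is independent of arrival order, a fused kernel can start computing on whichever shard lands first instead of waiting for the full gather.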
Testing Environment
The mKernel team tested the library on two separate clusters, each consisting of 2 nodes with 8 H200 GPUs per node (16 GPUs in total). The clusters differ in their inter-node networking fabric:
| Testbed | Nodes × GPUs | Intra-node | Inter-node Link | Adapter Specifications |
|---|---|---|---|---|
| AWS EFA | 2 × 8 H200 | NVLink | AWS EFA / SRD | 16 × 200 Gb/s EFA adapters per node |
| ConnectX-7 | 2 × 8 H200 | NVLink | InfiniBand | 8 × 400 Gb/s NVIDIA ConnectX-7 adapters per node |
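As a quick sanity check on the adapter column, the arithmetic below works out the nominal inter-node bandwidth per node and per GPU for each testbed. The adapter counts and speeds come from the table; the helper names are hypothetical, and these are line-rate figures, not measured throughput.

```python
GPUS_PER_NODE = 8

# (adapters per node, Gb/s per adapter), taken from the testbed table.
TESTBEDS = {
    "AWS EFA": (16, 200),
    "ConnectX-7": (8, 400),
}


def node_bandwidth_gbps(adapters_per_node, gbps_per_adapter):
    """Nominal aggregate inter-node bandwidth of one node, in Gb/s."""
    return adapters_per_node * gbps_per_adapter


def summarize(testbeds=TESTBEDS, gpus_per_node=GPUS_PER_NODE):
    """Return nominal per-node and per-GPU bandwidth for each testbed."""
    summary = {}
    for name, (adapters, gbps) in testbeds.items():
        total = node_bandwidth_gbps(adapters, gbps)
        per_gpu = total / gpus_per_node
        summary[name] = {
            "node_gbps": total,
            "per_gpu_gbps": per_gpu,
            "per_gpu_gigabytes_per_s": per_gpu / 8,  # 8 bits per byte
        }
    return summary


if __name__ == "__main__":
    for name, row in summarize().items():
        print(name, row)
```

Both configurations come out to the same nominal 3,200 Gb/s per node (400 Gb/s, or 50 GB/s, per GPU), so the two testbeds differ mainly in transport (SRD versus InfiniBand) and in how many adapters that bandwidth is spread across.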
mKernel was benchmarked against several established tools, including NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. The developers note that comprehensive scaling evaluations are still in progress.
Supported Backends and System Requirements
mKernel accommodates two primary networking backends:
| Backend | Build Flag | Transport Protocol | Target Environment |
|---|---|---|---|
| CX7 | -DINTERNODE_BACKEND_IBVERBS | libibverbs RC | ConnectX-7 / InfiniBand / RoCE |
| EFA | -DINTERNODE_BACKEND_EFA | libibverbs + efadv (SRD) | AWS p5/p5e instances (H200, equipped with EFA) |
Both backends use the same host API and the same GPU-side kernel. Only the proxy/session implementation differs (session.h for CX7, session_efa.h for EFA).
Requirements: NVIDIA Hopper GPUs (the default build targets sm_90a), CUDA 12.9, and Python with PyTorch. The CX7 backend requires the libibverbs development libraries. The EFA backend requires the AWS EFA software stack with libfabric, libibverbs, efadv, and the EFA headers, typically located at EFA_HOME=/opt/amazon/efa.
Marktechpost’s Visual Explainer
Combines NVLink, inter-node RDMA, and computation within a single persistent CUDA kernel
Five integrated kernels: AllGather+GEMM, GEMM+AllReduce, MoE Dispatch+GEMM, Ring Attention, GEMM+ReduceScatter
Direct GPU-driven RDMA using libibverbs — removes reliance on NCCL or NVSHMEM
Currently needs Hopper GPUs (sm_90a) and ConnectX-7 or AWS EFA networking



