mKernel: Unleashing Multi-GPU, Multi-Node Fused Kernel Power for GPU-Driven Communication

GPU communication is often a major bottleneck in real-world AI workloads. According to figures cited by the mKernel project, communication can account for 43.6% of the forward pass and 32% of end-to-end training time in production workloads. For widely used Mixture-of-Experts (MoE) models, moving data between devices can take as much as 47% of total execution time. A team from UC Berkeley’s UCCL project has introduced mKernel, a library of persistent CUDA kernels that fuse NVLink traffic within a node, RDMA transfers across nodes, and computation into a single kernel.

The Challenge: CPU-Managed Communication

Traditionally, handling data exchange between GPUs follows a CPU-managed pattern. The CPU manages the control flow, invoking libraries such as NCCL or NVSHMEM to execute operations like AllReduce or AllGather. Processing and data transfer occur on independent CUDA streams and are synchronized only at kernel boundaries.

The researchers highlight two core limitations with this traditional setup:

(1) CPU capabilities fail to keep pace with advancing GPU power. Consider a GB300 NVL72 rack, which packs 72 Blackwell Ultra GPUs alongside 36 Grace CPUs, providing 720 PFLOP/s of FP8/FP6 performance, 1.44 EFLOP/s of FP4 Tensor Core throughput, and 130 TB/s of rack-wide NVLink bandwidth. At these extreme speeds, even minor delays from host-side coordination — like a cudaLaunchKernel call, a CPU verifying writes are complete, or a stream-synchronization event — manifest as stalls in the processing pipeline.

(2) CPU-mediated systems can only synchronize processing and communication at coarse, kernel-level intervals. More detailed, chunk-level coordination is impossible to achieve from the host processor.

The proposed solution is GPU-managed communication: the GPU autonomously handles data transfers, integrating communication operations directly into the same kernel executing the math. While many current fused-kernel tools are confined to one node or one GPU, mKernel is engineered specifically for multi-node environments.
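To make the granularity argument concrete, here is a toy timing model. It is not taken from mKernel: the function names and per-chunk costs are illustrative assumptions. It compares a collective followed by a dependent GEMM that can only synchronize at the kernel boundary against the same work pipelined chunk by chunk.

```python
def kernel_boundary_time(num_chunks, compute_per_chunk, comm_per_chunk):
    """All chunks are transferred before the dependent compute kernel starts."""
    return num_chunks * comm_per_chunk + num_chunks * compute_per_chunk


def chunk_pipelined_time(num_chunks, compute_per_chunk, comm_per_chunk):
    """Compute on a chunk starts once it has arrived and the previous chunk is done."""
    arrival = 0.0
    compute_done = 0.0
    for _ in range(num_chunks):
        arrival += comm_per_chunk  # transfers are issued back to back
        compute_done = max(compute_done, arrival) + compute_per_chunk
    return compute_done


if __name__ == "__main__":
    # Hypothetical costs: 8 chunks, 2 time units of compute and 1 of transfer each.
    n, c, m = 8, 2.0, 1.0
    print(kernel_boundary_time(n, c, m))  # 24.0
    print(chunk_pipelined_time(n, c, m))  # 17.0
```

In this simplified model only the first transfer and the last compute step stay exposed; everything in between is hidden behind whichever of the two is slower. Real kernels also contend for SMs and memory bandwidth, which the model ignores.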

mKernel’s Functionality

mKernel provides a set of persistent CUDA kernels. Each kernel combines intra-node NVLink transfers, inter-node RDMA, and core computation within a single execution unit.

Unified Multi-GPU, Multi-node Execution: Both NVLink communication and cross-node RDMA traffic are integrated into the same persistent kernel.

Detailed Intra-kernel Synchronization: Processing and data exchange happen concurrently at a tile-by-tile or chunk-by-chunk level, improving efficiency for both local and remote GPU communication.

Persistent Kernel with Specialized Streaming Multiprocessors (SMs): Cooperative thread arrays (CTAs) are assigned to specific roles: compute, intra-comm, inter-send, and inter-reduce. The split of SMs across these roles can be configured per workload shape.

GPU-Initiated Networking via libibverbs: mKernel executes direct GPU-initiated RDMA writes without relying on NCCL or NVSHMEM. This custom-built backend is engineered for peak performance and compatibility with various networking hardware types.
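The SM specialization described above can be pictured as each CTA deriving its role from its own index. The following host-side Python sketch models that idea only; the contiguous-range mapping, the `role_for_block` helper, and the example split are assumptions for illustration, not mKernel's actual scheme.

```python
ROLES = ("compute", "intra_comm", "inter_send", "inter_reduce")


def role_for_block(block_idx, partition):
    """Map a CTA index to a role using one contiguous index range per role."""
    upper = 0
    for role in ROLES:
        upper += partition[role]
        if block_idx < upper:
            return role
    raise IndexError(f"block {block_idx} is outside the {upper}-CTA grid")


if __name__ == "__main__":
    # Hypothetical split of a 132-CTA persistent grid (one CTA per SM).
    partition = {"compute": 112, "intra_comm": 8, "inter_send": 8, "inter_reduce": 4}
    print(role_for_block(0, partition))    # compute
    print(role_for_block(115, partition))  # intra_comm
    print(role_for_block(131, partition))  # inter_reduce
```

In a CUDA kernel the same lookup would presumably key off `blockIdx.x`. Changing the numbers in `partition` is the analogue of tuning the SM split for a given workload shape.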

The Five Integrated Operations

Kernel	Integrated Components	Description
AllGather + GEMM	AllGather → GEMM	Each rank manages a portion of matrix `A`. As peers’ segments are fetched via NVLink/RDMA, the local GEMM operation begins using the data immediately upon arrival.
GEMM + AllReduce	GEMM → AllReduce	Executes `C = A @ B` and combines partial results from all ranks in one step. Result tiles enter the reduction process the moment they are generated.
MoE Dispatch + GEMM	All-to-All routing → grouped GEMM	Routes MoE tokens to their expert ranks (over NVLink and inter-node links) and runs the corresponding grouped GEMM. Tokens are processed as soon as they reach their target rank, avoiding a round trip through a staging buffer.
Ring Attention	Ring KV exchange → FlashAttention	Performs sequence-parallel attention over multiple ranks. During each cycle, a KV block circulates through the ring while local FlashAttention processes the previously received block. Computation and ring communication (send/receive) operate simultaneously inside one persistent kernel.
GEMM + ReduceScatter	GEMM → ReduceScatter	Executes `C = A @ B` and reduce-scatters the output. Every output tile is reduced and forwarded to its owning rank as soon as it is computed.
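As an illustration of why the AllGather + GEMM fusion in the table is safe, the sketch below checks that multiplying each shard of `A` as it arrives gives the same result as gathering everything first. It is plain Python on small lists, not mKernel code. It assumes `A` is sharded by rows (the article only says each rank holds a portion of `A`), and all function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gather_then_gemm(shards, b):
    """Baseline: wait for every shard of A, then run one GEMM."""
    full_a = [row for shard in shards for row in shard]
    return matmul(full_a, b)


def gemm_as_shards_arrive(shards, b, arrival_order):
    """Fused style: multiply each row shard of A as soon as it arrives."""
    blocks = [None] * len(shards)
    for rank in arrival_order:
        blocks[rank] = matmul(shards[rank], b)  # rows of C produced by this shard
    return [row for block in blocks for row in block]


if __name__ == "__main__":
    shards = [[[1, 2]], [[3, 4]], [[5, 6]]]  # one row of A per rank
    b = [[1, 2], [3, 4]]
    print(gather_then_gemm(shards, b))                  # [[7, 10], [15, 22], [23, 34]]
    print(gemm_as_shards_arrive(shards, b, [2, 0, 1]))  # same result, any arrival order
```

Because each row block of `C` depends on only one shard of `A`, the order in which shards arrive does not affect the result. That independence is what lets the local GEMM start before the collective finishes.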

Testing Environment

The mKernel team tested the library on two separate clusters, each consisting of 2 nodes with 8 H200 GPUs per node. The key difference between the clusters is their inter-node networking fabric:

Testbed	Nodes × GPUs	Intra-node	Inter-node Link	Adapter Specifications
AWS EFA	2 × 8 H200	NVLink	AWS EFA / SRD	16 × 200 Gb/s EFA adapters per node
ConnectX-7	2 × 8 H200	NVLink	InfiniBand	8 × 400 Gb/s NVIDIA ConnectX-7 adapters per node
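The adapter figures in the table can be turned into per-node and per-GPU numbers with a little arithmetic. The helper below is an illustrative sketch: `node_bandwidth` is a made-up name, and the values are nominal line rates from the table, not measured throughput.

```python
GPUS_PER_NODE = 8

# (adapters per node, Gb/s per adapter), as listed in the testbed table.
TESTBEDS = {
    "AWS EFA": (16, 200),
    "ConnectX-7": (8, 400),
}


def node_bandwidth(num_adapters, gbps_per_adapter):
    """Nominal aggregate inter-node bandwidth for one node."""
    total_gbps = num_adapters * gbps_per_adapter
    return {
        "total_gbps": total_gbps,
        "total_gigabytes_per_s": total_gbps / 8,  # 8 bits per byte
        "gbps_per_gpu": total_gbps / GPUS_PER_NODE,
    }


if __name__ == "__main__":
    for name, (count, rate) in TESTBEDS.items():
        print(name, node_bandwidth(count, rate))
```

Both testbeds come out to the same nominal 3,200 Gb/s (400 GB/s) per node, or 400 Gb/s per GPU. On paper, then, the two clusters differ in transport and adapter count, not in aggregate bandwidth.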

mKernel was benchmarked against several established tools: NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. The developers note that larger-scale evaluation is still underway.

Supported Backends and System Requirements

mKernel accommodates two primary networking backends:

Backend	Build Flag	Transport Protocol	Target Environment
CX7	`-DINTERNODE_BACKEND_IBVERBS`	libibverbs RC	ConnectX-7 / InfiniBand / RoCE
EFA	`-DINTERNODE_BACKEND_EFA`	libibverbs + efadv (SRD)	AWS p5/p5e instances (H200, equipped with EFA)

Both backends share the same host API and the same GPU-side kernel. Only the proxy/session implementation differs (session.h for CX7, session_efa.h for EFA). Requirements: NVIDIA Hopper GPUs (default build targets sm_90a), CUDA 12.9, Python with PyTorch. The CX7 backend requires the libibverbs development libraries. The EFA backend requires the AWS EFA software stack, including libfabric, libibverbs, efadv, and the EFA headers, typically installed under EFA_HOME=/opt/amazon/efa.

Marktechpost’s Visual Explainer

01 / 07 — Overview

What is mKernel?

mKernel is an open-source library of persistent CUDA kernels from UC Berkeley’s UCCL project. It fuses intra-node NVLink transfers, inter-node RDMA, and heavy computation into a single persistent kernel.

Most existing fused-kernel libraries are confined to one node or one GPU. mKernel is built from the ground up to cross node boundaries.

43.6% of the forward pass is consumed by communication in production workloads

47% of total execution time in widely used MoE models

32% of end-to-end training time is spent on communication

02 / 07 — The Problem

Why Host-Driven Communication Falls Short

The common approach is host-driven: the CPU invokes NCCL or NVSHMEM to launch collective operations across GPUs. The UCCL team highlights two key shortcomings.

⚡

CPUs cannot keep pace with GPUs. A GB300 NVL72 rack delivers 720 PFLOP/s FP8/FP6 and 1.44 EFLOP/s FP4. At those speeds, even microsecond-level overhead from cudaLaunchKernel, CPU-side synchronization checks, and inter-stream events translates directly into pipeline bubbles.

🔲

Overlap granularity is too coarse. Host-driven approaches can only overlap compute and communication at kernel boundaries. Finer-grained tile-level or chunk-level overlap is unattainable from the host.

🔀

The solution: GPU-driven communication. The GPU itself triggers fine-grained data transfers, all fused into the same kernel that performs the computation.

03 / 07 — Design

Four Core Design Properties

🖧

Multi-GPU and multi-node, unified in one kernel. Both intra-node NVLink and inter-node RDMA operate inside the same persistent kernel.

🔬

Fine-grained intra-kernel overlap. Compute and communication overlap at the tile or chunk level, covering transfers both within and across nodes.

⚙️

Persistent kernel with SM specialization. CTAs self-assign roles: compute, intra-comm, inter-send, inter-reduce. The SM partition is configurable per shape.

📡

GPU-driven networking via libibverbs. Employs GPU-initiated RDMA writes, removing any dependency on NCCL or NVSHMEM. The communication backend is written entirely from scratch.

04 / 07 — Kernels

The Five Fused Kernels

AllGather + GEMM

AllGather → GEMM

Each rank holds a shard of A. The local GEMM starts consuming tiles over NVLink/RDMA as they arrive — the matmul begins before the collective is complete.

GEMM + AllReduce

GEMM → AllReduce

Computes C = A @ B and reduces partial results across all ranks in a single launch. Output tiles enter the reduction tree the moment they are produced.

MoE Dispatch + GEMM

All-to-All dispatch → grouped GEMM

Routes MoE tokens to expert ranks via NVLink plus inter-node all-to-all, then executes per-expert grouped GEMM in the same kernel — no staging-buffer round-trip needed.

Ring Attention

Ring KV exchange → FlashAttention

Implements sequence-parallel attention across ranks. Each step rotates a KV chunk around the ring while the local FlashAttention processes the previously received chunk.
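The rotation pattern can be sketched as a schedule of which KV chunk each rank attends over at each step. This is a generic ring schedule, not mKernel's implementation; it assumes every rank forwards its current chunk to rank + 1, and `ring_schedule` is a hypothetical helper.

```python
def ring_schedule(num_ranks):
    """For each step, list the origin rank of the KV chunk each rank processes."""
    schedule = []
    for step in range(num_ranks):
        # After `step` rotations, rank r holds the chunk that started on rank r - step.
        schedule.append([(rank - step) % num_ranks for rank in range(num_ranks)])
    return schedule


if __name__ == "__main__":
    for step, owners in enumerate(ring_schedule(4)):
        print(step, owners)
    # 0 [0, 1, 2, 3]
    # 1 [3, 0, 1, 2]
    # 2 [2, 3, 0, 1]
    # 3 [1, 2, 3, 0]
```

After as many steps as there are ranks, every rank has seen every KV chunk exactly once. While a rank computes on the chunk for step `s`, the chunk for step `s + 1` can already be in flight.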

GEMM + ReduceScatter

GEMM → ReduceScatter

Computes C = A @ B and reduce-scatters the output. Every tile is reduced and forwarded to its destination rank as soon as it is ready.
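A small reference model shows what the reduce-scatter step has to produce. This is a sketch under stated assumptions, not the library's code: the inner dimension K is assumed to be split across ranks, so each rank's local GEMM yields a full-size partial `C`. Output rows are assumed to be owned in contiguous, equal blocks. All names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix multiply on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gemm_reduce_scatter(a_shards, b_shards):
    """Return, per rank, the fully reduced rows of C that the rank owns."""
    num_ranks = len(a_shards)
    # Each rank's local GEMM over its slice of K gives a partial sum of all of C.
    partials = [matmul(a, b) for a, b in zip(a_shards, b_shards)]
    num_rows, num_cols = len(partials[0]), len(partials[0][0])
    rows_per_rank = num_rows // num_ranks  # assumes rows divide evenly
    owned = []
    for rank in range(num_ranks):
        start = rank * rows_per_rank
        owned.append([
            [sum(p[i][j] for p in partials) for j in range(num_cols)]
            for i in range(start, start + rows_per_rank)
        ])
    return owned


if __name__ == "__main__":
    # A = [[1, 2], [3, 4]] split by columns; B = [[5, 6], [7, 8]] split by rows.
    a_shards = [[[1], [3]], [[2], [4]]]
    b_shards = [[[5, 6]], [[7, 8]]]
    print(gemm_reduce_scatter(a_shards, b_shards))  # [[[19, 22]], [[43, 50]]]
```

In a fused kernel the summation would happen tile by tile as partial tiles are produced, not after every partial `C` is complete. The final ownership of rows matches this reference.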

05 / 07 — Evaluation

Evaluation Setup

Results were gathered on two 2-node × 8-H200 clusters that differ only in their inter-node fabric.

Testbed	Nodes × GPUs	Inter-node	NIC
AWS EFA	2 × 8 H200	AWS EFA / SRD	16 × 200 Gb/s EFA per node
ConnectX-7	2 × 8 H200	InfiniBand	8 × 400 Gb/s CX7 per node

Both clusters use NVLink for intra-node communication. Performance was benchmarked against NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. Larger-scale evaluation is still underway.

06 / 07 — Backends & Requirements

Backends & Requirements

Backend	Transport	Where it runs
CX7	libibverbs RC	ConnectX-7 / InfiniBand / RoCE
EFA	libibverbs + efadv (SRD)	AWS p5/p5e (H200, EFA)

📋

Requirements: NVIDIA Hopper GPUs (default sm_90a), CUDA 12.9, Python with PyTorch. CX7 needs libibverbs headers. EFA requires libfabric, libibverbs, and efadv under EFA_HOME=/opt/amazon/efa.

📝

License & Attribution: Released under the MIT license. MMA and compute code are adapted from ThunderKittens (HazyResearch).

07 / 07 — Roadmap & Key Takeaways

Roadmap & Key Takeaways

✅

Fused GPU-driven multi-node kernels (AG+GEMM, GEMM+AR, MoE Dispatch+GEMM, Ring Attention, GEMM+RS)

✅

ConnectX-7 and AWS EFA backends

🚧

Full heterogeneous accelerator / NIC support with topology-aware discovery, placement, and routing

🚧

Inter-node megakernels: merging multiple fused operations into one megakernel that spans an entire transformer layer

🚧

Support for Blackwell GPUs

Combines NVLink, inter-node RDMA, and computation within a single persistent CUDA kernel

Five integrated kernels: AllGather+GEMM, GEMM+AllReduce, MoE Dispatch+GEMM, Ring Attention, GEMM+ReduceScatter

Direct GPU-driven RDMA using libibverbs — removes reliance on NCCL or NVSHMEM

Currently needs Hopper GPUs (sm_90a) and ConnectX-7 or AWS EFA networking
