If you’ve ever trained or fine-tuned an LLM, you’ve likely hit a wall at the very last step: the Cross-Entropy Loss.
The culprit is the logit bottleneck. To predict the next token, we project a hidden state into a massive vocabulary space. For Llama 3 (128,256 tokens), the weight matrix alone is over 525 million parameters. While that’s only ~1GB in bfloat16, the intermediate logit tensor is the real issue. For large batches, it can easily exceed 80GB of VRAM just to compute a single scalar loss.
Optimising this layer is how libraries like Unsloth and Liger-Kernel achieve such massive memory reductions. In this article, we’ll build a fused Linear + Cross Entropy kernel from scratch in Triton. We will derive the math and implement a tiled forward and backward pass that slashes peak memory usage by 84%.
Note on Performance: This implementation is primarily educational. We prioritise mathematical clarity and readable Triton code by using global atomic operations. While it solves the memory bottleneck, matching production-grade speeds would require significantly more complex implementations which are out of scope for this article.
This post is part of my Triton series. We’ll be using concepts like tiling and online softmax that we’ve covered previously. If those sound unfamiliar, I recommend catching up there first!
The Logit Bottleneck
To get us started, let’s put some more numbers on the logit bottleneck. We consider an input matrix X with shape [NxD], a weight matrix W with shape [DxV] and a logit matrix Y=X@W with shape [NxV]. In the context of an LLM, N would be the sequence length multiplied by the batch size (i.e. the total number of tokens in the batch), D the size of the hidden state and V the vocabulary size.
For a Llama3 8B model, we would have a context window of 8192 tokens, a hidden state with 4096 dimensions and a vocabulary size of 128,256 tokens. Using a modest batch size of 8, we get N = 8192x8 = 65,536.
This results in the Y matrix having shape [NxV]=[65,536x128,256], or roughly 8.4 billion elements. In bfloat16, this would take up 16.8GB of memory. However, if we follow best practices and use float32 for the loss calculation to ensure numerical stability, the requirements double to 33.6GB.
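These figures are easy to verify with a few lines of Python (sizes in decimal gigabytes):

```python
# Memory footprint of the [N, V] logit matrix for Llama3-style hyperparameters.
N = 8192 * 8        # tokens per batch (sequence length x batch size)
V = 128_256         # vocabulary size

elements = N * V                  # entries in the logit matrix Y = X @ W
gb_bf16 = elements * 2 / 1e9      # bfloat16: 2 bytes per element
gb_fp32 = elements * 4 / 1e9      # float32: 4 bytes per element

print(f"{elements / 1e9:.1f}B elements: {gb_bf16:.1f}GB in bf16, {gb_fp32:.1f}GB in fp32")
```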
To put this number in perspective, the weights of Llama3 8B themselves take around 16GB in bfloat16. On most GPUs, this leaves no space for the massive overhead of the optimiser states (e.g. Adam’s moments) and other activations, resulting in the infamous PyTorch OOM error.
Generally, this problem is dealt with by using:
- Gradient accumulation: Use a smaller batch size and accumulate gradients over multiple batches between each optimiser step, emulating a larger batch size while holding less data in memory.
- Activation checkpointing: PyTorch stores all intermediate activations for reuse in the backward pass; checkpointing clears these activations and recomputes them on-the-fly during the backward pass. This leads to large memory savings but increases training time, since the number of required forward passes is doubled.
- Micro-batching the loss: Instead of computing the loss over the `N` dimension at once, we can slice it and accumulate the loss over smaller chunks of size `n < N`. Now, we only hold a slice of size `[n, V]` in memory at a time.
- Mixed precision training: Using half precision during training provides 2x memory reduction and significant speedups on Tensor Cores.
While these solutions seem attractive, they all have significant drawbacks: gradient accumulation and activation checkpointing slow down training, mixed precision can be unstable, and micro-batching requires (slow) PyTorch-level iteration; even with n chosen smaller than N, the vocabulary size remains huge in comparison.
More importantly, these solutions do not address the problem we have dealt with repeatedly throughout this series: data movement. Indeed, we are still wasting time by writing billions of logits to VRAM only to read them back milliseconds later.
The Kernel Solution
As we’ll see in a minute, the forward and backward pass of the cross-entropy loss involve dot products, matrix multiplication and a softmax. As we learned in this series, these are all operations that can be tiled efficiently. In other words, we can perform them iteratively while only holding a small piece of the inputs in memory at any time.
Furthermore, cross-entropy is generally preceded by a matrix multiplication: the linear projection from the hidden state into the vocabulary space. This is a great opportunity for operator fusion: fusing multiple operations within a single kernel, resulting in large speedups and potential memory gains.
In the following sections, we’ll take a look at how to derive and efficiently fuse the forward and backward passes through a kernel combining a linear layer with cross-entropy.

As mentioned in the last article, Triton kernels do not natively register in PyTorch’s autograd. Therefore we need to derive the gradient ourselves, a wonderful occasion to brush up on some calculus 😉
The math behind Fused Linear Cross-Entropy
Definition and Forward Pass
In this section, we derive the mathematical expression for our Fused Linear Cross-Entropy layer to see how it naturally lends itself to tiling.
For two discrete probability distributions p and q, cross-entropy is defined as:
$$
H(p, q) = -\sum_i p_i \log q_i
$$
In our context, p is the one-hot vector representing the target token, while q is the model’s distribution over the vocabulary. We obtain q by applying a softmax to the logits l, themselves the outputs of the preceding linear layer.
Since p is non-zero only at the target token y, the summation collapses. We can then substitute the numerically stable softmax (as discussed in the last article) to derive the final expression:
$$
\mathcal{L} = -\log \frac{e^{l_y - m}}{\sum_j e^{l_j - m}} = \log \sum_j e^{l_j - m} + m - l_y, \qquad m = \max_j l_j
$$
By substituting the logits l with the linear layer x . w, we see that the forward pass boils down to three primary quantities:
- The target logit `x . w_y`.
- The log-sum-exp (LSE) of all dot products.
- The global maximum logit used for numerical stability.
Thanks to the online softmax algorithm, we can compute these quantities without ever materialising the full vocabulary in memory. Instead of an O(V) memory bottleneck, we iterate over the hidden dimension D and the vocabulary V in small tiles (D_block and V_block). This transforms the calculation into an O(1) register problem.
To parallelise this effectively, we launch one GPU program per row of the input matrix. Each program independently executes the following steps:
- Pre-compute the target logit: Perform a tiled dot product between the current row of `X` and the column of `W` associated with token `Y`.
- Online reduction: Iterate through the hidden and vocabulary blocks to:
  1. Track the running maximum (`m`)
  2. Update the running sum of exponentials (`d`) using the online softmax formula:
$$
m_{\text{new}} = \max\left(m, \max_j l_j\right), \qquad d_{\text{new}} = d \, e^{m - m_{\text{new}}} + \sum_j e^{l_j - m_{\text{new}}}
$$

where the maximum and sum run over the logits l_j of the current tile.
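The online reduction described above can be sketched in a few lines of NumPy (the block size and function name are illustrative, not part of the kernel):

```python
import numpy as np

def online_lse(logits, block=4):
    """Streaming log-sum-exp: process `logits` in tiles, keeping only a
    running maximum m and a running sum of exponentials d in memory."""
    m, d = -np.inf, 0.0
    for start in range(0, len(logits), block):
        tile = logits[start:start + block]
        m_new = max(m, tile.max())
        # Rescale the old sum to the new maximum, then add the tile's terms.
        d = d * np.exp(m - m_new) + np.exp(tile - m_new).sum()
        m = m_new
    return m + np.log(d)

x = np.array([1.0, -2.0, 3.5, 0.1, 2.2, -1.0, 4.0])
assert np.isclose(online_lse(x), np.log(np.exp(x).sum()))
```

At no point does the function hold more than one tile of logits, which is exactly the property the kernel exploits to avoid materialising the [N, V] matrix.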
Now that we have a better understanding of the forward pass, let’s take a look at the derivation of the backward pass.
Backward Pass
Notation
To derive our gradients efficiently, we’ll use Einstein notation and the Kronecker delta.
In Einstein notation, repeated indices are implicitly summed over. For example, a standard matrix multiplication Y = X@W simplifies from a verbose summation to a clean index pairing:
$$
Y_{ij} = \sum_k X_{ik} W_{kj} \quad \longrightarrow \quad Y_{ij} = X_{ik} W_{kj}
$$
The Kronecker delta (δ_ij) is used alongside this notation to handle identity logic. It is equal to 1 if i=j and 0 otherwise. As we’ll see, this is particularly useful for collapsing indices during differentiation.
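To make the notation concrete, here is a small NumPy check that the index pairing Y_ij = X_ik W_kj is ordinary matrix multiplication, and that contracting against the Kronecker delta (an identity matrix) collapses an index:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
W = rng.standard_normal((4, 5))

# Repeated index k is implicitly summed over: Y_ij = X_ik W_kj.
Y = np.einsum("ik,kj->ij", X, W)
assert np.allclose(Y, X @ W)

# The Kronecker delta as an identity matrix: delta_kn collapses w_kj to w_nj.
delta = np.eye(4)
assert np.allclose(np.einsum("kj,kn->nj", W, delta), W)
```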
Matrix Multiplication
In this section, we derive the back-propagated gradients for matrix multiplication. We assume the existence of a scalar loss ℓ whose upstream gradient with respect to the multiplication’s output is known.
To determine how it back-propagates through matrix multiplication, we apply the chain rule to the inputs x and the weight matrix w. Here y represents the multiplication’s outputs:
$$
\frac{\partial \ell}{\partial x_{mn}} = \frac{\partial \ell}{\partial y_{ij}} \frac{\partial y_{ij}}{\partial x_{mn}}, \qquad \frac{\partial \ell}{\partial w_{mn}} = \frac{\partial \ell}{\partial y_{ij}} \frac{\partial y_{ij}}{\partial w_{mn}}
$$
We start by deriving the partial derivatives of y with respect to x, following these steps:
- Express `y` in terms of `x` and `w`.
- Notice that `w` is a constant with respect to the derivative of `x`, so we can pull it out of the derivative.
- Express the fact that the partial derivative of `x_ik` with respect to `x_mn` is 1 only when `i=m` and `k=n` using the Kronecker delta.
- Notice that `ẟ_kn` enforces `k=n`, therefore `w_kj * ẟ_kn` reduces to `w_nj`.
$$
\frac{\partial y_{ij}}{\partial x_{mn}} = \frac{\partial}{\partial x_{mn}} \left( x_{ik} w_{kj} \right) = w_{kj} \frac{\partial x_{ik}}{\partial x_{mn}} = w_{kj} \, \delta_{im} \delta_{kn} = \delta_{im} \, w_{nj}
$$
Then, we consider the full expression and obtain the gradient. We derive the last step by noticing once again that ∂ℓ/∂y_ij · ẟ_im reduces to ∂ℓ/∂y_mj.
$$
\frac{\partial \ell}{\partial x_{mn}} = \frac{\partial \ell}{\partial y_{ij}} \, \delta_{im} \, w_{nj} = \frac{\partial \ell}{\partial y_{mj}} \, w_{nj}
$$
However, matrix notation is conceptually closer to our Triton kernel, therefore, we rewrite this expression as a matrix multiplication by using the identity X_ij = [X^T]_ji:
$$
\frac{\partial \ell}{\partial X} = \frac{\partial \ell}{\partial Y} \, W^T
$$
We follow the exact same steps to derive the gradient with respect to W:
$$
\frac{\partial y_{ij}}{\partial w_{mn}} = \frac{\partial}{\partial w_{mn}} \left( x_{ik} w_{kj} \right) = x_{ik} \, \delta_{km} \delta_{jn} = x_{im} \, \delta_{jn}
$$
Then, the back-propagated gradient follows:
$$
\frac{\partial \ell}{\partial w_{mn}} = \frac{\partial \ell}{\partial y_{ij}} \, x_{im} \, \delta_{jn} = x_{im} \, \frac{\partial \ell}{\partial y_{in}} = [X^T]_{mi} \, \frac{\partial \ell}{\partial y_{in}}
$$
Which is equivalent to the matrix notation:
$$
\frac{\partial \ell}{\partial W} = X^T \, \frac{\partial \ell}{\partial Y}
$$
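Both matrix-notation results are easy to sanity-check numerically. Picking an arbitrary upstream gradient G = ∂ℓ/∂Y, the formulas predict ∂ℓ/∂X = G W^T and ∂ℓ/∂W = X^T G; a NumPy sketch with a central-difference check on one entry:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 4))
W = rng.standard_normal((4, 5))
G = rng.standard_normal((3, 5))              # upstream gradient dL/dY

loss = lambda X_, W_: ((X_ @ W_) * G).sum()  # scalar loss whose dL/dY is exactly G

dX = G @ W.T    # derived gradient w.r.t. X
dW = X.T @ G    # derived gradient w.r.t. W

# Central-difference check on one entry of X.
eps = 1e-6
Xp, Xm = X.copy(), X.copy()
Xp[1, 2] += eps
Xm[1, 2] -= eps
numeric = (loss(Xp, W) - loss(Xm, W)) / (2 * eps)
assert np.isclose(dX[1, 2], numeric, atol=1e-5)
```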
Cross-Entropy
In this section, we’ll focus on cross-entropy applied to discrete probability distributions. Considering a vector of logits x with a label y, the cross-entropy is computed as follows:
$$
\mathcal{L} = -\log \left( \frac{e^{x_y}}{\sum_j e^{x_j}} \right) = \log \sum_j e^{x_j} - x_y
$$
Where x_y corresponds to the logit associated with the label.
Once again, we are interested in the partial derivative of the loss with respect to any input logit x_k. Because of the normalising factor, every logit affects the loss, and the derivative takes a piecewise form depending on whether k equals the label y:
$$
\frac{\partial \mathcal{L}}{\partial x_k} =
\begin{cases}
p_k - 1 & \text{if } k = y \\
p_k & \text{if } k \neq y
\end{cases}
\qquad \text{where } p_k = \frac{e^{x_k}}{\sum_j e^{x_j}}
$$
Combining both cases, we obtain the gradient:
$$
\frac{\partial \mathcal{L}}{\partial x_k} = p_k - \delta_{ky}
$$
And in matrix notation:
$$
\nabla_x \mathcal{L} = \operatorname{softmax}(x) - y_{\text{one hot}}
$$
Where y_{one hot} is a vector of zeros with the entry corresponding to the label set to one. This result tells us that the gradient is simply the difference between the prediction and the ground truth.
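This “prediction minus target” result can be verified with finite differences (a NumPy sketch; the helper name is illustrative):

```python
import numpy as np

def ce_loss(logits, y):
    """Numerically stable cross-entropy: LSE minus the target logit."""
    m = logits.max()
    return np.log(np.exp(logits - m).sum()) + m - logits[y]

rng = np.random.default_rng(2)
logits, y = rng.standard_normal(7), 3

softmax = np.exp(logits - logits.max())
softmax /= softmax.sum()
one_hot = np.eye(7)[y]
grad = softmax - one_hot          # the derived gradient: prediction minus target

# Central-difference check on every coordinate.
eps = 1e-6
for k in range(7):
    lp, lm = logits.copy(), logits.copy()
    lp[k] += eps
    lm[k] -= eps
    numeric = (ce_loss(lp, y) - ce_loss(lm, y)) / (2 * eps)
    assert np.isclose(grad[k], numeric, atol=1e-5)
```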
Fused Linear Cross-Entropy
Combining the linear projection with cross-entropy in a single expression, we get:
$$
\mathcal{L} = \log \sum_j e^{x \cdot w_j} - x \cdot w_y
$$
Thanks to the chain rule, deriving the gradient of this expression boils down to multiplying the gradients we computed previously:
$$
\frac{\partial \mathcal{L}}{\partial x} = \left(\operatorname{softmax}(x \cdot w) - y_{\text{one hot}}\right) w^T, \qquad \frac{\partial \mathcal{L}}{\partial w} = x^T \left(\operatorname{softmax}(x \cdot w) - y_{\text{one hot}}\right)
$$
Where x and y refer to the inputs and outputs to the linear layer respectively and w to the associated weight matrix.
Note: in a batched setting, we’ll need to reduce the `W` gradients over the batch dimension. Generally, we use a sum or mean reduction.
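Putting the chain rule together, a NumPy check of the fused gradients for a single token (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
D, V = 4, 6
x = rng.standard_normal(D)          # one token's hidden state
W = rng.standard_normal((D, V))
y = 2                               # target token

def fused_loss(x_, W_):
    """Linear projection followed by stable cross-entropy."""
    l = x_ @ W_
    m = l.max()
    return np.log(np.exp(l - m).sum()) + m - l[y]

l = x @ W
p = np.exp(l - l.max())
p /= p.sum()
delta = p - np.eye(V)[y]            # softmax minus one-hot

dx = delta @ W.T                    # chain rule: dL/dx
dW = np.outer(x, delta)             # chain rule: dL/dW

# Central-difference check on one entry of x.
eps = 1e-6
xp, xm = x.copy(), x.copy()
xp[1] += eps
xm[1] -= eps
assert np.isclose(dx[1], (fused_loss(xp, W) - fused_loss(xm, W)) / (2 * eps), atol=1e-5)
```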
Kernel Implementation
With the theory established, we can implement the fused kernel in Triton. Since cross-entropy is typically the final layer in a language model, we can combine the forward and backward passes into a single kernel. This fusion offers two advantages: it minimises the overhead of multiple kernel launches and significantly improves data locality by keeping intermediate values on-chip.
We will analyse the kernel step-by-step from the perspective of a single program instance, which, in our parallelisation strategy, handles one specific row of the input matrix.
1. Setup and Target Logit Pre-computation
The initial phase involves standard Triton setup:
- Program Identification: We use `tl.program_id` to determine which row of the input matrix the current program is responsible for.
- Parameter Initialisation: We define tiles using `D_BLOCK` and `V_BLOCK` and initialise the running maximum (`m`) and sum (`d`) required for the online softmax algorithm.
- Pointer Arithmetic: We calculate the base memory addresses for our tensors. Pointers for `X` (input) and `dX` (gradient) are offset using the row stride so each program accesses its unique token vector. Conversely, the `W` (weight) pointer remains at the base address because every program must eventually iterate through the entire vocabulary space.
- Masking and Early Exit: We define an `ignore_index` (defaulting to `-100`). If a program encounters this label (e.g. for padding tokens), it terminates early with a loss of 0 to save cycles.
2. Computing the Target Logit
Before the main loop, we must isolate the target logit `x . w_y`. We iterate over the hidden dimension `D` in `D_BLOCK` chunks, performing a dot product between the input row `X` and the specific column of `W` corresponding to the ground-truth label `Y`.
Because `W` is a 2D matrix, calculating the pointers for these specific column tiles requires precise stride manipulation. The illustration below helps visualise how we “jump” through memory to extract only the necessary weights for the target token.

Once the tiles are loaded, we cast them to float32 to ensure numerical stability and add their dot product to an accumulator variable before moving to the next iteration.
Here’s the code so far:
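To pin down the logic of this stage, here is a NumPy emulation of the tiled target-logit dot product (block size and names are illustrative; the real kernel works on Triton block pointers):

```python
import numpy as np

def target_logit_tiled(x, W, y, d_block=4):
    """Tiled dot product x . W[:, y]: iterate over the hidden dimension in
    D_BLOCK chunks, accumulating partial products in float32 for stability."""
    acc = np.float32(0.0)
    D = x.shape[0]
    for start in range(0, D, d_block):
        x_tile = x[start:start + d_block].astype(np.float32)
        w_tile = W[start:start + d_block, y].astype(np.float32)
        acc += x_tile @ w_tile
    return acc

rng = np.random.default_rng(4)
x, W, y = rng.standard_normal(16), rng.standard_normal((16, 8)), 5
assert np.isclose(target_logit_tiled(x, W, y), x @ W[:, y])
```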
Next, we execute the forward pass, which processes the vocabulary space in two nested stages:
- Tiled Logit Computation: We compute the logits one `V_BLOCK` at a time. This is achieved by iterating over the vocabulary dimension `V` (outer loop) and the hidden dimension `D` (inner loop). Within the inner loop, we load a tile of `X` and a block of `W`, accumulating their partial dot products into a high-precision register.
- Online Softmax Update: Once the full dot product for a logit tile is finalised, we don’t store it to VRAM. Instead, we immediately update our running statistics: the maximum value `m` and the running sum of exponentials `d` using the online softmax formula. By doing this “on the fly”, we ensure that we only ever hold a small `V_BLOCK` of logits in the GPU’s registers at any given moment.
Following these iterations, the final values of m and d are used to reconstruct the LSE. The final scalar loss for the row is then computed by subtracting the target logit (x . w_y) from this LSE value.
Here’s a visual representation of the forward pass:

Here’s the code for the forward pass:
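The structure of the forward pass can be emulated in plain NumPy, one row at a time (block sizes and names are illustrative; the real kernel keeps all of this in registers):

```python
import numpy as np

def fused_ce_forward(x, W, y, d_block=4, v_block=4):
    """Per-row forward pass: tiled logits plus online softmax, never holding
    more than one V_BLOCK of logits at a time. Returns (loss, lse)."""
    D, V = W.shape
    # Tiled target logit x . w_y.
    target = sum(x[i:i + d_block] @ W[i:i + d_block, y] for i in range(0, D, d_block))
    m, d = -np.inf, 0.0
    for v0 in range(0, V, v_block):                 # outer loop: vocabulary
        logits = np.zeros(W[:, v0:v0 + v_block].shape[1])
        for d0 in range(0, D, d_block):             # inner loop: hidden dim
            logits += x[d0:d0 + d_block] @ W[d0:d0 + d_block, v0:v0 + v_block]
        # Online softmax update with the finished logit tile.
        m_new = max(m, logits.max())
        d = d * np.exp(m - m_new) + np.exp(logits - m_new).sum()
        m = m_new
    lse = m + np.log(d)
    return lse - target, lse

rng = np.random.default_rng(5)
x, W, y = rng.standard_normal(8), rng.standard_normal((8, 10)), 3
loss, _ = fused_ce_forward(x, W, y)
l = x @ W
assert np.isclose(loss, np.log(np.exp(l - l.max()).sum()) + l.max() - l[y])
```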
We are now down to the last part of the kernel: the backward pass. Our goal is to compute the gradients with respect to X and W using the expression we derived earlier:

To remain memory-efficient, we once again process the vocabulary in tiles using a two-staged approach:
1. Recomputing Normalised Probabilities (`P`): Because we didn’t store the full logit matrix during the forward pass, we must recompute the activations for each tile. By reusing the log-sum-exp calculated in the forward pass, we can normalise these activations on-the-fly. Subtracting the one-hot ground-truth label `Y` from the probabilities within this tile gives us a local chunk of the logit gradient, `P`.
2. Gradient Accumulation: With a tile of `P` in hand, we calculate the partial gradients. For `dX`, we perform a dot product with blocks of `W^T`; for `dW`, we multiply by tiles of `X^T`. To safely aggregate these values across the entire batch, we use Triton’s `tl.atomic_add`. This operation acts as a thread-safe `+=`, ensuring that different programs updating the same weight gradient do not overwrite one another.
Here are some additional details on the implementation:
- The Stride Swap: When computing `P . W^T`, we don’t actually need to physically transpose the massive `W` matrix in memory. Instead, we invert the shapes and strides in `W`’s block pointer to read the rows of `W` as columns of `W^T`. This results in a “free” transpose that saves both time and VRAM.
- Numerical Precision: While `X` and `W` might be in `bfloat16`, the accumulation of `dW` and `dX` via `atomic_add` is usually performed in `float32` to prevent the accumulation of tiny rounding errors across thousands of rows.
- Contention Note: While `atomic_add` is necessary for `dW` (because every program updates the same weights), `dX` is private to each program, meaning there is zero contention between program IDs for that specific tensor.
- Atomic Add Masking: `tl.atomic_add` doesn’t support block pointers, therefore we implement the pointer and mask logic for `dW` explicitly.
The following figure is a representation of the backward pass for one iteration of the outer loop (i.e. one block along V and all blocks along D):

Here’s the full code for the backward pass:
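The backward logic can again be emulated per-row in NumPy, with an ordinary `+=` standing in for `tl.atomic_add` (block size and names are illustrative):

```python
import numpy as np

def fused_ce_backward(x, W, y, lse, dX, dW, v_block=4):
    """Per-row backward pass: recompute each logit tile, normalise it with
    the forward-pass LSE, subtract the one-hot target, then accumulate the
    dX and dW contributions (+= plays the role of tl.atomic_add)."""
    D, V = W.shape
    for v0 in range(0, V, v_block):
        cols = slice(v0, v0 + v_block)
        p = np.exp(x @ W[:, cols] - lse)     # recomputed probabilities for the tile
        idx = np.arange(v0, min(v0 + v_block, V))
        p -= (idx == y)                      # subtract the one-hot target if it falls here
        dX += p @ W[:, cols].T               # dL/dx contribution: P . W^T
        dW[:, cols] += np.outer(x, p)        # dL/dW contribution: x^T . P (atomic in the kernel)

rng = np.random.default_rng(6)
x, W, y = rng.standard_normal(6), rng.standard_normal((6, 10)), 7
l = x @ W
lse = np.log(np.exp(l).sum())
dX, dW = np.zeros_like(x), np.zeros_like(W)
fused_ce_backward(x, W, y, lse, dX, dW)

# The tiled result matches the closed-form gradient (softmax - one-hot).
p_full = np.exp(l - lse)
assert np.allclose(dX, (p_full - np.eye(10)[y]) @ W.T)
assert np.allclose(dW, np.outer(x, p_full - np.eye(10)[y]))
```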
This concludes the implementation of our kernel! The full code including the kernel and benchmark script is available here.
Memory Benchmark
Finally, we compare our kernel with the PyTorch baseline using hyperparameters inspired by Llama3 and an A100 GPU. Specifically, we consider a sequence length of S=16,384, a batch size of B=1 and an embedding dimension of D=4096; the vocabulary size is set to V=128,256.
As expected, the PyTorch baseline allocates a massive intermediate tensor to store the activations, resulting in a peak memory usage of 36.02GB. In comparison, our Triton kernel reduces the peak memory usage by 84% by allocating only 5.04GB using D_BLOCK=64 and V_BLOCK=64!
Using even smaller block sizes would allow for further memory gains at the cost of efficiency.

Atomic Limitations and Production Scaling
In this article, we focused on the technical and mathematical intuition behind fused Linear Cross-Entropy kernels. We used atomic operations like tl.atomic_add to keep the code minimal and readable. However, while our kernel successfully slashed memory usage by a staggering 84%, it remains significantly slower than native PyTorch.
Unfortunately, the same atomic operations which make this kernel easier to write and comprehend come at the cost of a massive traffic jam since thousands of threads try to modify the same memory address at once. Generally, tl.atomic_add is performant when contention is low. In our current implementation, we have:
- High Contention: For the weight gradient, every single program in the batch (up to `16,384` in our test) is trying to update the same memory tiles simultaneously. The hardware must serialise these updates, forcing thousands of threads to wait in line.
- Numerical Non-associativity: Floating-point addition is non-associative: rounding errors can accumulate differently depending on the order of operations, which is why correctness tests might pass on a T4 but fail on an A100; the latter has more streaming multiprocessors (SMs) performing more concurrent, non-deterministic additions.
Note on Precision: On Ampere and newer architectures, the `TF32` format can further contribute to these discrepancies. For strict numerical parity, one should set `allow_tf32=False` or use higher precision types during the accumulation steps.
Path to Production
To move beyond this educational implementation and toward a production-ready kernel (I recommend looking at the Liger-Kernel implementation), one could implement several optimisations:
- Replacing `dX` Atomics: Since each program “owns” its row of `X`, we can use simple register accumulation followed by a `tl.store`, eliminating atomics for the input gradients entirely.
- A Dedicated `dW` Kernel: To optimise the computation of `dW`, production kernels generally use a different grid strategy where each program handles a block of `W` and iterates through the batch dimension, accumulating gradients locally before a single global write.
- Micro-batching: Advanced implementations, such as those in the Liger-Kernel library, process the sequence in blocks along the `N` dimension, making the memory scaling constant in the sequence length rather than linear. This enables the use of much larger batch sizes at a reduced memory cost.
Conclusion
This concludes our deep dive into fused linear cross-entropy kernels. Thanks for reading all the way through, and I hope this article gave you both the intuition and the practical understanding needed to build on these ideas and explore them further.
If you found this useful, consider sharing the article; it genuinely helps support the time and effort that goes into producing this work. And as always, feel free to contact me if you have questions, thoughts, or ideas for follow-ups.
Until next time! 👋