Scaling large language models (LLMs) is expensive. Every token processed during inference and every gradient computed during training flows through feedforward layers that account for over two-thirds of model parameters and more than 80% of total FLOPs in larger models. A team of researchers from Sakana AI and NVIDIA has released new research that directly targets this bottleneck — not by changing the architecture, but by making the computation inside feedforward layers significantly cheaper through unstructured sparsity.
Sparsity Exists, But GPUs Ignore It
Inside a transformer’s feedforward block, for any given input token, only a small fraction of hidden neurons actually fire — the rest produce zero after passing through the activation function. This is called activation sparsity, and prior work has documented this phenomenon in models with ReLU activations.
The frustrating reality is that these theoretical savings rarely translate into actual speedups. NVIDIA GPUs are heavily optimized for dense matrix multiplications using Tensor Cores, which operate on large contiguous tiles of data. Traditional sparse formats like ELLPACK (ELL) require a separate kernel pass to convert activations from a dense to a sparse representation, and that conversion overhead often cancels out what’s saved by skipping the zeros.
Critically, prior work on sparse LLM kernels (including TurboSparse, ProSparse, and Q-Sparse) has focused on memory-bound GEMV operations — the single- or few-token inference regime. The research team instead targets compute-bound GEMM operations in the batched setting with thousands of input tokens, where dense baselines on modern devices can execute orders-of-magnitude higher FLOP/s with large tiles and Tensor Cores. That is a fundamentally harder problem, and the reason prior approaches didn’t generalize to batched training or high-throughput inference.
01 — The Problem
Feedforward layers dominate LLM cost — and most of that work is wasted.
Over ⅔
of all model parameters live in feedforward layers
80%+
of total FLOPs consumed by feedforward layers
99%+
of hidden activations can be zero with no accuracy drop
For any given token, only a tiny fraction of hidden neurons actually fire. The rest output zero after the activation function. This is called activation sparsity — and it has historically been impossible to exploit on modern GPUs because sparse operations ran slower than dense ones.
Prior sparse LLM kernels (TurboSparse, ProSparse, Q-Sparse) only targeted single-token GEMV operations. Sakana AI and NVIDIA tackle the harder problem: batched GEMM with thousands of tokens — the regime that covers both training and high-throughput inference.
02 — The Innovation
TwELL: a sparse format built around how GPU kernels actually work.
Old Way — ELL
Row-wide packing, costly to build
Standard ELLPACK packs non-zeros row-by-row across the entire matrix. To construct it from a tiled matmul output you need a separate kernel launch, a full global memory read, and synchronization across all CTAs. Those overheads cancel out the savings from skipping zeros.
New Way — TwELL
Tile-wise packing, built in the epilogue
TwELL partitions columns into horizontal tiles matching the matmul kernel’s tile size T_n. Non-zeros are packed locally within each tile. By matching dimensions, TwELL is constructed inside the existing gate projection kernel epilogue — no extra kernel, no extra memory read, no synchronization overhead.
The inference pipeline uses one fused kernel that reads gate activations in TwELL format and performs up + down projections together. The intermediate hidden state is never written to global memory, cutting DRAM traffic at every forward pass.
For training, a hybrid sparse format dynamically routes rows into a compact ELL matrix (sparse rows) or a dense backup (overflow rows). Sparsity during training is highly non-uniform — max non-zeros per row can be orders of magnitude above the average — so the hybrid design handles this without becoming brittle.
03 — Training Recipe
Two changes to your training config. Nothing else.
01
Replace SiLU with ReLU as the gate activation function. ReLU produces exact zeros for negative inputs — this is what enables unstructured sparsity. No other architectural change is needed. (Unregularized ReLU sits slightly below SiLU on task accuracy: 46.4% vs 47.1% on the 1.5B model, offset by the efficiency gains.)
02
Add an L1 loss term on the hidden feedforward activations, averaged over all tokens and hidden dimensions across all layers. Recommended coefficient: L1 = 2×10⁻⁵. Add it to your standard cross-entropy loss. No changes to learning rate, weight decay, batch size, or optimizer. (Both changes are sketched in code after the Watch Out note below.)
03
Sparsity stabilizes fast. The non-zero count settles within ~1,000 training steps (~1B tokens). The training kernels deliver memory and throughput benefits for almost the entire training run, not just toward the end.
Watch Out
At L1 = 2×10⁻⁵, over 30% of neurons become permanently inactive (dead neurons) on average across layers. Downstream accuracy is not visibly affected at this level. The paper explores targeted gate weight reinitialization as a mitigation — yielding +19.1% speedup vs +17.9% baseline with no accuracy cost.
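To make the two config changes concrete, here is a minimal PyTorch-style sketch. The names (SparseGatedFFN, training_loss, l1_coeff) and the exact tensor being penalized are illustrative assumptions, not the released code, and the dense matmuls below stand in for the paper's sparse kernels:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseGatedFFN(nn.Module):
    """Gated feedforward block with a ReLU gate (change #1).

    Hypothetical module for illustration; the paper's released kernels
    replace the dense matmuls below with TwELL-based sparse ones.
    """
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        # ReLU (not SiLU) produces exact zeros, which is what enables
        # unstructured activation sparsity downstream.
        gate = F.relu(self.w_gate(x))
        hidden = gate * self.w_up(x)   # most entries are exactly zero
        self.last_hidden = hidden      # cached for the L1 penalty below
        return self.w_down(hidden)

def training_loss(logits, targets, ffn_blocks, l1_coeff=2e-5):
    """Cross-entropy plus an L1 penalty on hidden FFN activations (change #2).

    The penalty is averaged over tokens and hidden dimensions across all
    layers and added with coefficient l1_coeff = 2e-5. Whether the paper
    penalizes the gate output or the gated hidden state is a detail not
    restated here; this sketch penalizes the gated hidden state.
    """
    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    l1 = torch.stack([b.last_hidden.abs().mean() for b in ffn_blocks]).mean()
    return ce + l1_coeff * l1
```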
04 — Benchmark Results
Accuracy preserved. Efficiency scales up with model size.
| Model | Accuracy (dense → sparse) | Inference throughput | Energy / token | Training throughput | Peak training memory |
|---|---|---|---|---|---|
| 0.5B | 40.4% → 40.4% | +17.0% | −11.8% | −1.5% | −19.2% |
| 1B | 44.6% → 44.7% | +18.1% | −14.6% | +7.1% | −25.5% |
| 1.5B | 46.4% → 46.2% | +18.8% | −15.0% | +11.6% | −28.1% |
| 2B | 49.1% → 48.8% | +20.5% | −17.0% | +21.9% | +22.3% * |
All results at L1 = 2×10⁻⁵ on a single node of eight H100 PCIe GPUs, sequence length 2048. Efficiency gains grow with scale — average non-zero activations drop from 39 (0.5B) to 24 (2B), giving the sparse kernels proportionally more computation to skip. * The 2B sparse model uses a larger micro-batch enabled by reduced activation memory, raising peak usage while improving throughput.
05 — Key Findings
What the paper reveals about where sparsity actually lives.
◆
Early layers are least active. In a 28-layer 1.5B model, the first two layers have the fewest non-zero activations. Activity peaks in the early-to-middle layers — consistent with prior work showing LLM reasoning and knowledge retrieval concentrate there.
◆
First tokens in a sequence fire far more neurons. The model allocates exponentially more computation to early sequence positions where contextual cues from prior tokens are absent. This non-uniformity is exactly what the sparse kernels exploit for speedups.
◆
Strong inverse correlation between sparsity and speedup. The paper measures a Pearson correlation of −0.996 between each layer’s average non-zero count and its inference speedup contribution. Sparser layers deliver proportionally larger gains.
◆
Larger gains on less specialized hardware. On NVIDIA RTX PRO 6000 (188 SMs vs 114 on H100), training speedups are significantly higher. Dense GEMM is slower on the RTX 6000, while sparse ops run faster — widening the relative advantage of sparsity on accessible hardware.
06 — Get Started
Open-source. All kernels and training code released.
■
Architecture: Works with gated feedforward LLMs — Llama, Qwen, and any Transformer++ design. Non-gated (original transformer) variant also supported: 11.2% inference speedup vs 17.9% for gated at the same L1.
■
Hardware: CUDA kernels written for H100 GPUs using TMA-based pipelining and persistent cooperative design. Gains verified on RTX PRO 6000 with even larger speedups.
■
Existing models: Fine-tuning via sparsification approaches is flagged as a future direction for bringing these kernels to pretrained dense models — not yet demonstrated in this paper.
So, What Exactly Is Proposed?
The research team addresses the mismatch between abundant activation sparsity and dense-optimized GPU hardware with two primary contributions: a new sparse data format called TwELL (Tile-wise ELLPACK), and a set of custom CUDA kernels for inference and training built around it.
TwELL is designed around one key insight: modern matmul kernels already divide computation across small 2D tiles (of size T_m × T_n) assigned to individual cooperative thread arrays (CTAs). Standard ELL packs non-zeros row-by-row across the entire matrix, which requires global synchronization to construct from tiled matmul outputs. TwELL instead partitions the columns of the gate activation matrix into horizontal tiles of size T, and within each tile stores non-zero values and their indices in a local ELL-style layout. By matching the tile dimension T to the column tile size T_n of the matmul kernel, TwELL can be produced directly in the epilogue of the gate projection kernel — no extra kernel launch, no additional global memory read, no synchronization across CTAs. The format uses a compression factor C such that T/C exceeds the maximum non-zeros per tile, and packages values, indices, and non-zero counts into a single 32-bit matrix for locality.
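To illustrate the layout, here is a rough NumPy sketch of tile-wise packing on the host. Names such as tile_size, compression, and slots are assumptions of this sketch; in the actual kernels the packing is produced inside the gate-projection GEMM epilogue on the GPU and fused into a single 32-bit matrix rather than separate arrays:

```python
import numpy as np

def twell_pack(gate_acts: np.ndarray, tile_size: int, compression: int):
    """Pack post-ReLU gate activations tile by tile (illustrative only).

    gate_acts:   (n_rows, n_cols) gate activations, mostly exact zeros.
    tile_size:   matches the matmul kernel's column tile size T_n.
    compression: factor C; each tile stores at most tile_size // C non-zeros.
    Assumes n_cols is divisible by tile_size.
    """
    n_rows, n_cols = gate_acts.shape
    n_tiles = n_cols // tile_size
    slots = tile_size // compression

    values = np.zeros((n_rows, n_tiles, slots), dtype=gate_acts.dtype)
    indices = np.zeros((n_rows, n_tiles, slots), dtype=np.int32)
    counts = np.zeros((n_rows, n_tiles), dtype=np.int32)

    for r in range(n_rows):
        for t in range(n_tiles):
            tile = gate_acts[r, t * tile_size:(t + 1) * tile_size]
            nz = np.nonzero(tile)[0]  # local (within-tile) column indices
            assert len(nz) <= slots, "compression factor C is too aggressive"
            counts[r, t] = len(nz)
            values[r, t, :len(nz)] = tile[nz]
            indices[r, t, :len(nz)] = nz
    return values, indices, counts
```

Because each tile is packed independently, a CTA that has just produced a tile of gate activations can emit its packed values without waiting on any other CTA, which is what removes the separate conversion pass.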

During inference, a single combined kernel processes gate activations in TwELL format and carries out both the up-projection and down-projection in one step. Each CTA is responsible for one row of inputs, first stepping through column tiles in a fixed order and then dynamically processing the non-zero entries within each tile. For every active neuron at position n, the CTA fetches the n-th column of the up-projection weight matrix W_u, computes its dot product with the input row to get the up-projection value, scales that value by the stored gate activation, and accumulates the correspondingly scaled n-th row of the down-projection weight matrix W_d into the output. The intermediate hidden state h_u is never written to global memory, which dramatically reduces DRAM bandwidth usage.
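Per input row, the arithmetic this fused kernel performs can be written as the reference loop below (plain NumPy, reusing the packed layout from the sketch above). This is exposition only; the real kernel runs it per CTA with Tensor Cores and TMA pipelining, and never materializes the full hidden vector:

```python
import numpy as np

def sparse_ffn_row(x_row, values, indices, counts, W_u, W_d, tile_size):
    """Reference gated-FFN computation for one input row in TwELL format.

    x_row:  (d_model,) input activations for this row.
    values, indices, counts: packed TwELL data for this row
        (shapes (n_tiles, slots), (n_tiles, slots), (n_tiles,)).
    W_u: (d_model, d_hidden) up-projection; W_d: (d_hidden, d_model) down-projection.
    """
    out = np.zeros(W_d.shape[1], dtype=x_row.dtype)
    for t in range(values.shape[0]):            # fixed order over column tiles
        for s in range(counts[t]):              # dynamic work within the tile
            n = t * tile_size + indices[t, s]   # global neuron index
            gate_n = values[t, s]               # stored ReLU gate activation
            h_u_n = x_row @ W_u[:, n]           # up-projection value for neuron n
            out += gate_n * h_u_n * W_d[n, :]   # accumulate scaled W_d row
    return out
```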
During training, things get trickier because sparsity patterns vary widely across tokens and layers — the peak number of non-zeros per row can far exceed the average, making a pure ELL layout unreliable. To address this, the team proposes a hybrid sparse format that assigns each row either to a compact ELL matrix (for rows with non-zeros below a set threshold) or to a dense backup matrix (for rows that exceed the threshold). This design enables efficient sparse gradient computation in the backward pass while avoiding dense matrix multiplications for the majority of rows. The team also provides CUDA kernels for the standard non-gated transformer feedforward block; at the recommended sparsity level, the non-gated version achieves an 11.2% inference speedup, compared to 17.9% for the gated variant.
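The row-routing idea behind the hybrid training format can be sketched in a few lines as well. The threshold name and the host-side construction here are illustrative assumptions; the paper builds these structures on the GPU and uses them for the sparse backward-pass matmuls:

```python
import numpy as np

def hybrid_split(hidden: np.ndarray, max_nnz: int):
    """Route each row to a compact ELL buffer or a dense backup (illustrative).

    hidden:  (n_rows, d_hidden) sparse hidden activations from the forward pass.
    max_nnz: threshold; rows with more non-zeros than this take the dense path.
    """
    nnz_per_row = np.count_nonzero(hidden, axis=1)
    sparse_rows = np.nonzero(nnz_per_row <= max_nnz)[0]
    dense_rows = np.nonzero(nnz_per_row > max_nnz)[0]

    # Compact ELL storage for the (typically vast) majority of rows.
    ell_values = np.zeros((len(sparse_rows), max_nnz), dtype=hidden.dtype)
    ell_indices = np.zeros((len(sparse_rows), max_nnz), dtype=np.int32)
    for i, r in enumerate(sparse_rows):
        cols = np.nonzero(hidden[r])[0]
        ell_values[i, :len(cols)] = hidden[r, cols]
        ell_indices[i, :len(cols)] = cols

    # Dense backup for the few overflow rows, which fall back to dense matmuls.
    dense_backup = hidden[dense_rows]
    return (sparse_rows, ell_values, ell_indices), (dense_rows, dense_backup)
```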
Simply ReLU and L1 Regularization
The approach to inducing sparsity is intentionally straightforward. The team uses ReLU as the gate activation function and adds a basic L1 loss term on the hidden feedforward activations, scaled by a coefficient called L1. No other modifications to the architecture are needed, and the researchers confirmed that introducing L1 regularization required no changes to other hyperparameters such as learning rate, weight decay, or optimizer configuration.
All models were trained on a deduplicated subset of the FineWeb-Edu dataset using Chinchilla-optimal token counts — roughly 10 billion tokens for a 0.5B-parameter model up to 40 billion tokens for a 2B-parameter model — with a context window of 2048 tokens and a batch size of 1 million tokens.
After testing eight different L1 coefficient values on a 1.5B-parameter model, the researchers found that up to L1 = 3 × 10⁻⁵, there is virtually no decline in average task accuracy across seven downstream benchmarks (ARC Easy/Challenge, HellaSwag, OpenBookQA, PIQA, WinoGrande, CommonsenseQA), with the final cross-entropy rising by less than 2% compared to the baseline without regularization. The recommended setting of L1 = 2 × 10⁻⁵ cuts the average number of non-zero activations per layer from 911 (in the unregularized 1.5B model with a feedforward hidden dimension of 5632) down to just 29 — achieving roughly 99.5% sparsity — with no detectable loss in downstream performance.
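Per-layer non-zero counts like these are straightforward to reproduce on any model with a forward hook that inspects the hidden FFN activations. The sketch below is a generic diagnostic (the module names to hook are a model-specific assumption), not the paper's evaluation harness:

```python
import torch

@torch.no_grad()
def measure_ffn_sparsity(model, batch, hidden_module_names):
    """Average number of non-zero hidden activations per token, per FFN layer.

    hidden_module_names: names of the modules whose outputs are the hidden
    feedforward activations (model-specific; an assumption of this sketch).
    """
    stats, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # Count non-zero entries per token (exact zeros come from the
            # ReLU gate), then average over the batch and sequence.
            stats[name] = (output != 0).float().sum(dim=-1).mean().item()
        return hook

    for name, module in model.named_modules():
        if name in hidden_module_names:
            handles.append(module.register_forward_hook(make_hook(name)))
    model(batch)
    for h in handles:
        h.remove()
    return stats  # e.g. {"layers.0.mlp.act": 31.2, ...}
```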
One notable observation: at L1 = 2 × 10⁻⁵, more than 30% of neurons become permanently inactive (so-called dead neurons) on average across layers. The team investigates two strategies to counteract this — gradually warming up the L1 coefficient during training and reinitializing the gate projection columns corresponding to dead neurons — and finds that the reinitialization method preserves similar sparsity levels while modestly boosting both downstream accuracy and efficiency (+19.1% inference speedup versus +17.9% for the baseline). This is highlighted as an area for future investigation.
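The reinitialization strategy can be sketched as follows. The criterion for declaring a neuron dead (zero activations over some counting window) and the reinit scale are illustrative assumptions rather than the paper's exact procedure:

```python
import torch

@torch.no_grad()
def reinit_dead_gate_columns(w_gate: torch.nn.Linear,
                             activation_counts: torch.Tensor,
                             std: float = 0.02) -> int:
    """Reinitialize gate-projection weights for neurons that never fired.

    w_gate:            the gate projection as an nn.Linear(d_model, d_hidden).
    activation_counts: per-neuron count of non-zero activations accumulated
                       over some window of training tokens (an assumption).
    std:               scale of the fresh Gaussian init (an assumption).
    """
    dead = activation_counts == 0  # neurons that were never active
    # nn.Linear stores its weight as (d_hidden, d_model), so the rows indexed
    # here correspond to columns of the mathematical gate matrix W_g.
    w_gate.weight[dead] = torch.randn_like(w_gate.weight[dead]) * std
    return int(dead.sum())
```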
Measured Efficiency Gains
The efficiency measurements were conducted on a single node equipped with eight H100 PCIe GPUs, using a fixed sequence length of 2048 tokens. For the cross-scale comparison, the L1 coefficient was held constant at 2 × 10⁻⁵.
At smaller model scales, sparsity delivers substantial peak memory savings during training:
| Model | Dense Peak Memory | Sparse Peak Memory | Change |
|---|---|---|---|
| 0.5B | 26.2 GB | 21.2 GB | −19.2% |
| 1B | 44.5 GB | 33.1 GB | −25.5% |
| 1.5B | 62.8 GB | 45.1 GB | −28.1% |
At the 2B-parameter scale, the sparse model benefits from a larger micro-batch size (made possible by the reduced activation memory footprint), which leads to higher peak GPU memory usage (46.7 → 57.1 GB) but significantly faster training throughput (+21.9%). The full set of efficiency improvements for the 2B model:
- Forward pass throughput: 87.8 → 106 input tokens/ms (+20.5%)
- Energy per token: 7.85 → 6.51 mJ (−17.0%)
- Training step throughput: 22.4 → 27.3 input tokens/ms (+21.9%)
Across the entire 0.5B–2B range, the average task accuracy of sparse and dense models remains statistically equivalent. Efficiency gains increase with model scale: larger models naturally exhibit lower average non-zero counts (falling from 39 at 0.5B to 24 at 2B), meaning the sparse kernels skip a proportionally larger fraction of computation.
Training speedups are even larger on NVIDIA’s RTX PRO 6000 GPU, where the higher Streaming Multiprocessor count (188 versus 114 on the H100) allows the sparse operations to execute more quickly while dense GEMM runs comparatively slower — indicating that these benefits carry over to less specialized hardware as well.
What the Sparsity Patterns Reveal
Sparsity is distributed unevenly throughout the model: in a 28-layer 1.5B model, the first two layers are the least active, followed by a sharp rise in non-zero activations across the early-middle layers — a pattern consistent with prior research suggesting that much of an LLM’s reasoning and knowledge retrieval happens in this region. Additionally, the first tokens in an input sequence trigger far more neurons than subsequent tokens, with an exponential drop-off thereafter. The team measured an inverse Pearson correlation of −0.996 between each layer’s average non-zero count and its contribution to inference speedup, confirming that the sparsest layers deliver the greatest per-layer performance gains.
Check out the Paper, Repo and Technical details.



