Researchers from Meta, Stanford University, and the University of Washington have unveiled three innovative techniques that dramatically speed up text generation in the Byte Latent Transformer (BLT) — a language model architecture that processes raw bytes directly, bypassing traditional tokenization.
Why Byte-Level Models Struggle with Speed
To appreciate the significance of this new work, it helps to understand the fundamental tradeoff in byte-level language modeling.
Most modern language models operate on tokens — text fragments generated by subword tokenizers such as byte-pair encoding (BPE). A single token often covers multiple characters or even an entire word. While this approach is computationally efficient, tokenization introduces well-known limitations: vulnerability to noisy inputs, difficulty handling multilingual content, limited character-level reasoning, and brittleness when dealing with structured data like code and numerical values.
Byte-level models avoid these pitfalls entirely by working directly on raw bytes — the most granular representation of text. The Byte Latent Transformer (BLT) marked a significant breakthrough: it achieved performance comparable to tokenization-based models at scale by dynamically grouping bytes into variable-length patches through an entropy-driven segmentation approach. Regions with high entropy (harder to predict) are assigned shorter patches, while more predictable stretches receive longer ones. The architecture rests on three core components — a local encoder, a large global Transformer, and a local decoder — and concentrates most of the heavy computation on latent token representations rather than raw bytes, with an average patch size of 4 bytes and a maximum of 8.
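As a rough illustration of how entropy-driven segmentation might work, here is a minimal PyTorch sketch; the threshold value and the simple per-position boundary rule are illustrative assumptions, not BLT's actual patcher:

```python
import torch

def entropy_patch_boundaries(byte_logits: torch.Tensor, threshold: float = 2.0):
    """Start a new patch wherever a small byte-level LM's next-byte
    entropy exceeds a threshold (simplified stand-in for BLT's patcher).

    byte_logits: (seq_len, 256) next-byte logits over the 256 byte values.
    Returns the patch start indices.
    """
    probs = torch.softmax(byte_logits, dim=-1)
    # Shannon entropy of the predicted next-byte distribution at each position.
    entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum(dim=-1)
    boundaries = [0]
    for i in range(1, byte_logits.size(0)):
        # Hard-to-predict (high-entropy) regions get shorter patches.
        if entropy[i].item() > threshold:
            boundaries.append(i)
    return boundaries
```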
The lingering challenge, however, is inference speed. Despite BLT’s hierarchical architecture, the local decoder still produces bytes one at a time in an autoregressive fashion. Since a typical subword token maps to multiple bytes, BLT requires several decoder forward passes to generate the same volume of text that a token-level model produces in a single step. In today’s LLM serving infrastructure, the primary bottleneck is often memory bandwidth rather than raw compute — the repeated loading of model weights and key-value caches from memory. More decoder passes mean more memory transfers, which directly slows down generation.

Three Techniques, One Objective: Cutting Down Forward Passes
The research team presents three methods designed to alleviate this bottleneck, each balancing speed and output quality differently.
BLT Diffusion (BLT-D)
This is the flagship contribution and the fastest of the three approaches. The central concept is to swap out autoregressive byte-by-byte decoding for block-wise discrete diffusion within the local decoder.
During training, the decoder is fed two types of input: a clean byte sequence (the original text) and a corrupted version made up of fixed-length byte blocks. For each block, a continuous diffusion timestep t is drawn from U(0,1), and every byte in the block is independently swapped with a [MASK] token with probability t. This means the masking intensity varies across training examples — a lower t preserves most bytes, while a higher t obscures the majority. The block size B (tested at 4, 8, or 16 bytes) typically exceeds BLT’s average patch size of 4 bytes, training the decoder to predict bytes further ahead than it normally would. The overall training loss merges the standard autoregressive next-byte prediction loss on clean sequences with a masked-byte prediction loss on the corrupted blocks — conceptually akin to BERT’s masked language modeling, but implemented at the byte level within BLT’s hierarchical framework.
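A minimal sketch of this corruption step, assuming bytes are stored as integer ids and a reserved MASK_ID one past the 255 byte values (both illustrative choices, not the paper's implementation):

```python
import torch

MASK_ID = 256  # assumed: one id past the 0-255 byte values

def corrupt_blocks(byte_ids: torch.Tensor, block_size: int = 8):
    """Block-wise discrete-diffusion corruption as described above.

    byte_ids: (seq_len,) long tensor of byte values in [0, 255].
    For each fixed-length block, draw t ~ U(0, 1) and independently
    replace each byte in the block with [MASK] with probability t.
    """
    corrupted = byte_ids.clone()
    masked = torch.zeros_like(byte_ids, dtype=torch.bool)
    for start in range(0, byte_ids.numel(), block_size):
        end = min(start + block_size, byte_ids.numel())
        t = torch.rand(())                      # diffusion timestep for this block
        drop = torch.rand(end - start) < t      # per-byte masking decision
        corrupted[start:end][drop] = MASK_ID
        masked[start:end] = drop
    return corrupted, masked  # `masked` marks positions for the diffusion loss
```

The masked-byte loss would then be computed only at the positions flagged in `masked`, while the autoregressive next-byte loss uses the clean sequence.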
At inference time, BLT-D starts with a block of [MASK] positions and progressively reveals multiple byte positions per decoder step using one of two strategies: confidence-based unmasking, which reveals positions whose predicted probability surpasses a threshold α, and entropy-bounded (EB) sampling, which picks the largest group of positions whose combined entropy remains under a threshold γ. In both cases, multiple bytes are generated in each forward pass instead of just one. The encoder and global model — the most resource-intensive parts of BLT — are called once per block rather than once per patch, which further cuts down the total number of model invocations. BLT-D also works with KV caching, taking advantage of any methods that shrink the KV-cache memory footprint.
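Both unmasking rules can be sketched compactly; the threshold values and the lowest-entropy-first selection order in the EB variant are illustrative assumptions:

```python
import torch

def confidence_unmask(logits: torch.Tensor, alpha: float = 0.9):
    """Reveal every masked position whose top predicted byte probability
    exceeds the confidence threshold alpha."""
    probs = torch.softmax(logits, dim=-1)        # (block, vocab)
    top_p, top_byte = probs.max(dim=-1)
    return top_p > alpha, top_byte

def entropy_bounded_unmask(logits: torch.Tensor, gamma: float = 1.5):
    """Reveal the largest set of positions, taken in ascending order of
    entropy, whose summed entropy stays below the budget gamma."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum(dim=-1)
    order = torch.argsort(entropy)                # most confident positions first
    cumulative = torch.cumsum(entropy[order], dim=0)
    k = max(int((cumulative <= gamma).sum()), 1)  # always reveal at least one byte
    reveal = torch.zeros_like(entropy, dtype=torch.bool)
    reveal[order[:k]] = True
    return reveal, probs.argmax(dim=-1)
```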
With 3 billion parameters, BLT-D-4 (block size 4) comes close to matching BLT’s task performance while using less than half the memory bandwidth. BLT-D-16 (block size 16) delivers an 87–92% drop in estimated memory-bandwidth cost relative to BLT, making it the quickest setup tested — though it does show weaker pass@1 scores on coding benchmarks (HumanEval, MBPP).
BLT Self-Speculation (BLT-S)
This approach follows a different path, building on speculative decoding — a method where an inexpensive draft model suggests tokens and a larger model checks them in parallel. What sets BLT-S apart is that it needs no separate draft model and no modifications to the architecture or extra training. It reuses BLT’s built-in lightweight local decoder as the drafter.
During normal BLT inference, the decoder halts generation as soon as the entropy-based patcher detects that a new patch boundary has been hit — usually every four bytes. BLT-S, on the other hand, allows the decoder to generate autoregressively up to a set window size k (8 or 16 bytes in the experiments) without regard for entropy spikes, using the most recent latent token as context. Once a draft of k bytes is produced, the full model re-encodes the candidate sequence through the encoder, global model, and decoder to produce next-byte predictions. Drafted bytes are kept up to the first mismatch, and the first incorrect byte is swapped out for the verified prediction.
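One draft-and-verify round might look like the following sketch, where `local_decoder_draft` and `full_model_next_bytes` are hypothetical stand-ins for the corresponding BLT calls:

```python
def blt_self_speculate(blt, prefix: list[int], k: int = 16) -> list[int]:
    """One round of self-speculation: the cheap local decoder drafts k
    bytes, the full encoder/global/decoder stack verifies them in one
    pass, and bytes are accepted up to the first mismatch."""
    draft = blt.local_decoder_draft(prefix, num_bytes=k)   # cheap drafting
    verified = blt.full_model_next_bytes(prefix, draft)    # parallel verification
    accepted = []
    for drafted, correct in zip(draft, verified):
        if drafted == correct:
            accepted.append(drafted)
        else:
            accepted.append(correct)  # swap in the verified byte, stop here
            break
    return accepted
```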
With greedy decoding, this process guarantees outputs identical to those of standard autoregressive BLT decoding — there is no drop in quality. BLT-S adds a few extra decoder forward passes but significantly cuts down on encoder and global model calls. At 3B parameters with k=16, BLT-S can deliver up to a 77% reduction in memory bandwidth with no decline in task performance.
BLT Diffusion+Verification (BLT-DV)
This method occupies a middle ground. Since BLT-D is trained with both a diffusion objective and a standard next-byte prediction objective, the same model weights can run autoregressively using causal decoder masks — no separate model or extra training is required. BLT-DV takes advantage of this: diffusion drafts a block of bytes first, then a single autoregressive forward pass verifies the draft, accepting bytes up to the first mismatch. In practice, one-step diffusion paired with verification turned out to be the fastest BLT-DV setup. While one-step diffusion by itself usually causes generation quality to deteriorate quickly, the verification step effectively blocks this from happening. At 3B parameters, BLT-DV can achieve up to an 81% reduction in memory bandwidth compared to BLT.
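A sketch of one BLT-DV round, with hypothetical method names for the one-step diffusion draft and the causal-mask verification pass over the same weights:

```python
def blt_dv_step(model, prefix: list[int], block_size: int = 16) -> list[int]:
    """Draft a block with one-step diffusion, then verify it with a single
    autoregressive (causal-mask) pass through the same BLT-D weights."""
    draft = model.diffusion_draft(prefix, block_size, steps=1)  # fast, lower quality
    verified = model.causal_next_bytes(prefix, draft)           # quality gate
    accepted = []
    for drafted, correct in zip(draft, verified):
        if drafted == correct:
            accepted.append(drafted)
        else:
            accepted.append(correct)  # verification stops the quality decay
            break
    return accepted
```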
Understanding the Numbers
All models were trained on the BLT-1T dataset (1 trillion tokens drawn from public sources, including a subset of Datacomp-LM), with 1B-parameter models trained for 240,000 steps and 3B-parameter models for 480,000 steps. Evaluation spanned four generation tasks: French-to-English and German-to-English translation using the FLORES-101 benchmark (4-shot, SentencePiece BLEU) and two coding benchmarks — HumanEval (0-shot, pass@1) and MBPP (3-shot, pass@1).
Beyond generation tasks, the research team also tested BLT-D on five likelihood-based benchmarks: ARC-Easy, ARC-Challenge, PIQA, HellaSwag, and MMLU. Because BLT-D is trained with a next-byte prediction objective in addition to the diffusion objective, it can compute autoregressive likelihoods by applying a causal mask to the decoder — the same mechanism that BLT-DV’s verification step depends on. The results show BLT-D variants reach scores close to BLT’s baseline across all five benchmarks, confirming that adding block diffusion does not weaken the model’s autoregressive reasoning ability.
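Conceptually, scoring a candidate answer reduces to summing next-byte log-probabilities under a causal mask; in this sketch, `decoder` is a hypothetical callable returning next-byte logits:

```python
import torch.nn.functional as F

def sequence_log_likelihood(decoder, byte_ids):
    """Autoregressive log-likelihood of a byte sequence (a long tensor of
    byte ids) using BLT-D's own weights with a causal attention mask."""
    logits = decoder(byte_ids[:-1], causal=True)   # (seq_len - 1, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    targets = byte_ids[1:].unsqueeze(-1)           # each position predicts the next byte
    return log_probs.gather(-1, targets).sum()
```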
Efficiency is measured through three proxy metrics: decoder network function evaluations (NFEs), encoder/global model NFEs, and an estimated memory-bandwidth figure in gigabytes calculated from parameter counts and forward-pass counts under 16-bit precision. The research team is clear that these are proxy metrics — turning NFE reductions into real wall-clock speedups demands a highly optimized inference implementation, which the team identifies as the most critical area for future work.
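Under that proxy, the estimate is simple arithmetic: each forward pass is assumed to stream a component's weights from memory once at 2 bytes per parameter. A hedged sketch (the component names are illustrative):

```python
def estimated_bandwidth_gb(params: dict[str, int], nfes: dict[str, int],
                           bytes_per_param: int = 2) -> float:
    """Proxy memory-bandwidth cost in GB: sum over components of
    parameter count x forward-pass count x bytes per parameter (16-bit)."""
    total = sum(params[name] * nfes[name] * bytes_per_param for name in params)
    return total / 1e9
```

For intuition, 3 billion parameters at 16-bit precision amount to roughly 6 GB of weight traffic per full forward pass, which is why reducing the number of passes translates so directly into bandwidth savings.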
Translation tasks see the biggest gains from BLT-D across all block sizes. Coding tasks are more sensitive to block size: BLT-D-16 provides the greatest efficiency improvements but shows noticeable score drops on HumanEval and MBPP. An interesting additional finding emerges from the generation diversity analysis: when entropy-bounded sampling is combined with top-p sampling at inference, more decoder NFEs are linked to a higher type-token ratio (a measure of lexical diversity). This means the tradeoff between efficiency and diversity can be adjusted at inference time with no retraining needed.
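Type-token ratio itself is a simple lexical-diversity statistic, the number of distinct tokens divided by the total count:

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Distinct tokens divided by total tokens; higher means more diverse."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```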


Key Takeaways
- BLT-D integrates block-wise discrete diffusion into BLT’s local decoder, using a joint training objective that combines next-byte prediction and masked-byte prediction losses, enabling the model to generate several bytes in a single forward pass rather than one byte at a time
- BLT-S repurposes BLT’s built-in lightweight decoder as a speculative drafting module — requiring no extra model, no modifications to the architecture, and no additional training — while guaranteeing bit-for-bit identical output to standard BLT under greedy decoding
- BLT-DV merges diffusion-based drafting with an autoregressive verification stage that leverages the same BLT-D model weights, effectively restoring the quality lost during diffusion-only decoding without any extra training overhead
- All proposed techniques can deliver an estimated memory-bandwidth reduction exceeding 50% compared to standard BLT on generation workloads; BLT-D-16 in particular may achieve reductions as high as 87–92%
- BLT-D maintains strong autoregressive performance on likelihood-based evaluation benchmarks (ARC-Easy, ARC-Challenge, PIQA, HellaSwag, MMLU), and its output diversity can be adjusted at inference time through entropy-bounded sampling thresholds