Zyphra Unveils ZAYA1-8B-Diffusion-Preview: Pioneering The First MoE Diffusion Model Transformed From An Autoregressive LLM, Achieving Up To 7.7x Speedup


Zyphra, the San Francisco-based AI lab behind the ZAYA1 model family, has released ZAYA1-8B-Diffusion-Preview — an early look at its work in diffusion-language models. The release shows that an existing autoregressive language model can be transformed into a discrete diffusion model without any significant drop in evaluation performance, while achieving major inference speedups on AMD hardware.

The Problem With Autoregressive Decoding

To grasp why this is important, it helps to first understand how most language models produce text today. Standard large language models use autoregressive decoding: they generate one token at a time in order. For each new token, the attention mechanism must look back at all previously generated tokens and load their stored representations — known as the KV-cache — from GPU memory. Importantly, because every user in a batch has a unique history of tokens, each user’s KV-cache must be loaded individually and cannot be shared between requests.

This creates a performance bottleneck. When the GPU spends more time fetching data from memory than doing actual computation, the system becomes limited by memory bandwidth rather than compute power. This restricts how effectively modern GPU hardware — which has been increasing compute FLOPs faster than memory bandwidth — can be utilized during inference.

Diffusion provides a different approach. Rather than generating one token at a time, a diffusion model creates multiple drafts of N tokens at once and repeats this drafting process several times. Since all N tokens in the block share the same KV-cache, the operation shifts from being memory-bandwidth limited to compute-limited, allowing the GPU to be used more efficiently. In ZAYA1-8B-Diffusion-Preview specifically, the model performs a single-step transformation from mask to token for each token in the block — meaning it directly predicts the unmasked token in one step rather than through iterative denoising.
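The shift from memory-bound to compute-bound can be illustrated with a rough roofline-style estimate. The sketch below is not taken from Zyphra's release: the function name and every hardware and model number are hypothetical, and it assumes that the bytes read per forward pass stay constant as the block grows and that every drafted token is kept.

```python
def time_per_token(block_size, bytes_per_pass, flops_per_token, peak_flops, peak_bandwidth):
    """Roofline-style estimate of seconds per token for one forward pass."""
    # Weights and KV-cache are read once per pass, regardless of block size.
    memory_time = bytes_per_pass / peak_bandwidth
    # Compute grows linearly with the number of tokens in the pass.
    compute_time = block_size * flops_per_token / peak_flops
    # The pass takes as long as the slower of the two resources.
    return max(memory_time, compute_time) / block_size


# Hypothetical hardware and model numbers, chosen only for illustration.
PEAK_FLOPS = 1.0e15        # FLOP/s
PEAK_BANDWIDTH = 5.0e12    # bytes/s
BYTES_PER_PASS = 2.0e10    # bytes read from memory per forward pass
FLOPS_PER_TOKEN = 2.0e10   # FLOPs spent per token in the pass

for block_size in (1, 16, 400):
    t = time_per_token(block_size, BYTES_PER_PASS, FLOPS_PER_TOKEN,
                       PEAK_FLOPS, PEAK_BANDWIDTH)
    print(f"block_size={block_size:>3}  time per token = {t * 1e6:.1f} us")
```

With these made-up numbers, a 16-token pass takes the same wall-clock time as a 1-token pass because both are limited by the memory read, so the per-token cost falls by the block size. Only at much larger block sizes does compute become the limit. Real speedups are lower because only a portion of the drafted tokens is accepted.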

Converting Autoregression to Diffusion Without Training From Scratch

Training a diffusion language model from the ground up is technically challenging, and there are few well-established methods for doing so. The Zyphra team gives two reasons for choosing conversion over training from scratch: first, it is simply difficult, with few proven methods; second, there is no benefit to training in diffusion mode because training is already compute-limited — the memory-bandwidth bottleneck that diffusion addresses only occurs at inference time. This means all the advantages of diffusion are inference-time advantages, and an existing pretraining pipeline can be reused without modification.

Building on the TiDAR recipe, Zyphra took the ZAYA1-8B-base checkpoint and carried out an additional 600 billion tokens of diffusion-conversion mid-training at a 32k context length, followed by 500 billion tokens of native context extension to 128k, and then a diffusion supervised fine-tuning (SFT) phase.

ZAYA1-8B-Diffusion-Preview is the first MoE diffusion model converted from an autoregressive LLM, and the first diffusion-language model to be trained on AMD GPUs. Zyphra reports minimal evaluation degradation compared to the base autoregressive checkpoint, with improvements on some benchmarks such as LCB-v6. They attribute this partly to better mid-training datasets and partly to the greater expressiveness of diffusion-style within-block non-causal inference compared to causal autoregression.

How the Diffusion Sampler Works

During inference, ZAYA1-8B-Diffusion-Preview generates a draft of 16 tokens at the same time. A portion of these tokens is accepted based on a sampling criterion borrowed from speculative decoding. The key benefit here is that the same model acts as both speculator and verifier within a single forward pass, which eliminates the overhead of running two separate models as in traditional methods like EAGLE or dFlash. In heavily memory-bandwidth-bound regimes, almost all accepted tokens generated via diffusion represent a free speedup over traditional autoregressive decoding: the weights and KV-cache have already been loaded for the pass, so the extra tokens require very little additional computation.

The Zyphra team introduces two sampling methods, each offering a different balance between speed and output quality:

Lossless diffusion sampler: This method applies the standard speculative decoding acceptance rule: a drafted token x is accepted with probability min(1, p(x)/q(x)), where p is the autoregressive model’s probability distribution and q is the diffusion model’s distribution. If a token is rejected, a replacement is sampled from the residual distribution, that is, the normalized positive part of p(x) − q(x). This approach delivers a 4.6× speedup with no measurable drop in evaluation performance.
Logit-mixing sampler: Here, the logits from both the diffusion speculator and the autoregressive model are blended first, and the combined distribution is used for verification. This boosts acceptance rates because the verification logits more closely match the diffusion logits, though it slightly affects output quality. This method achieves a 7.7× speedup. Users can adjust the speed-quality trade-off at runtime.
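The two samplers can be sketched in a few lines of plain Python. This is a minimal illustration of the standard speculative decoding rule, not Zyphra's implementation. The article does not say how the logits are blended, so the linear blend in `mix_logits` and its `alpha` parameter are assumptions, as are all function names and the toy three-token vocabulary.

```python
import math
import random


def softmax(logits):
    """Convert logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def residual_distribution(p, q):
    """Normalized positive part of p - q, used to resample after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:  # p and q are identical, so fall back to p
        return list(p)
    return [d / total for d in diff]


def verify_token(token, p, q, rng):
    """Accept the drafted token with probability min(1, p/q), else resample."""
    accept_prob = min(1.0, p[token] / q[token])
    if rng.random() < accept_prob:
        return token, True
    residual = residual_distribution(p, q)
    replacement = rng.choices(range(len(p)), weights=residual)[0]
    return replacement, False


def expected_acceptance(p, q):
    """Probability that a token drafted from q passes verification against p."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def mix_logits(ar_logits, diffusion_logits, alpha):
    """Hypothetical linear blend; alpha = 0 recovers the pure AR verifier."""
    return [(1.0 - alpha) * a + alpha * d
            for a, d in zip(ar_logits, diffusion_logits)]


# Toy three-token vocabulary with made-up logits.
ar_logits = [2.0, 1.0, 0.0]
diffusion_logits = [1.0, 2.0, 0.0]
p = softmax(ar_logits)
q = softmax(diffusion_logits)
p_mixed = softmax(mix_logits(ar_logits, diffusion_logits, alpha=0.5))

rng = random.Random(0)
draft = rng.choices(range(len(q)), weights=q)[0]
token, accepted = verify_token(draft, p, q, rng)
print("lossless acceptance rate:", round(expected_acceptance(p, q), 3))
print("mixed acceptance rate:   ", round(expected_acceptance(p_mixed, q), 3))
```

In this toy case the blended verifier sits closer to the draft distribution, so the expected acceptance rate rises. That is the mechanism behind the higher speedup, and also the reason the output no longer follows the autoregressive distribution exactly.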

An important note about these results: ZAYA1-8B-Diffusion-Preview is a base mid-train checkpoint that has not yet undergone RL training. Therefore, Zyphra uses pass@k evaluations instead of standard accuracy benchmarks to better reflect the model’s potential after RL training. Readers should keep this in mind when comparing these numbers to benchmarks reported for other models.

The Zyphra team also highlights that diffusion-based speedups exceed those achieved by alternative techniques such as multi-token prediction (MTP) and various speculative decoding methods like EAGLE3. Because TiDAR-style diffusion models require only a single forward pass, acceptance rates comparable to dFlash still translate into significant speedups.

Architecture Details

ZAYA1-8B-Diffusion-Preview is a single-step speculative diffusion model that uses order-constrained generation, meaning the diffusion model can only produce tokens in a contiguous subsequence starting from the prefix. This constraint greatly improves training stability compared to unconstrained mask diffusion objectives or set block decoding, and was a key reason Zyphra adopted the TiDAR recipe.
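The order constraint can be pictured as keeping only the leading run of accepted draft tokens. The sketch below is an illustration under that reading. The function name and the per-position accept flags are hypothetical, and how the flags are produced is outside its scope.

```python
def commit_contiguous_prefix(draft_tokens, accept_flags):
    """Keep drafted tokens up to, but not including, the first rejection."""
    committed = []
    for token, accepted in zip(draft_tokens, accept_flags):
        if not accepted:
            break  # later positions are discarded even if individually accepted
        committed.append(token)
    return committed


# Made-up draft and verification outcome for illustration.
draft = ["The", "cat", "sat", "on", "a", "mat"]
flags = [True, True, True, False, True, True]
print(commit_contiguous_prefix(draft, flags))
```

Positions after the first rejection are dropped, so the committed text always extends the prefix without gaps. Unconstrained mask diffusion, by contrast, may fill positions in any order.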

The model leverages ZAYA1-8B’s existing CCA attention variant from Zyphra. CCA significantly reduces prefill FLOPs in attention, which directly benefits diffusion since diffusion transforms decoding into a prefill-like operation. This allows the model to diffuse more tokens in parallel before reaching compute limits.

More specifically, the architecture employs CCGQA with a 4:1 ratio of query heads to key heads. One design decision was to avoid MLA (Multi-Head Latent Attention), whose high arithmetic intensity was considered a poor fit compared to CCGQA. Since block diffusion accesses the same cache, arithmetic intensity scales with block size and the number of blocks per forward pass. On AMD MI300x hardware in bf16, the system supports roughly three block-sized proposals per single forward pass; on MI355x, this increases to approximately five. CCGQA also operates at 2× compression, which allowed Zyphra to accommodate the additional training FLOPs associated with TiDAR mid-training. The larger VRAM capacity of AMD GPU hardware further enabled more efficient diffusion training overall.
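Two pieces of arithmetic sit behind this paragraph. The sketch below shows how a 4:1 query-to-key head ratio shrinks the KV-cache in ordinary grouped-query attention, and how many tokens one forward pass covers if each proposal is assumed to be one 16-token block. The layer count, head count, and head dimension are hypothetical placeholders, not ZAYA1-8B's real configuration, and the additional 2× compression of CCGQA is not modeled.

```python
def kv_cache_bytes_per_token(num_layers, num_query_heads, gqa_ratio,
                             head_dim, bytes_per_value=2):
    """Bytes of K and V stored per token under grouped-query attention."""
    num_kv_heads = num_query_heads // gqa_ratio
    # Factor 2 covers keys and values; bytes_per_value=2 corresponds to bf16.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def tokens_per_forward_pass(block_size, proposals_per_pass):
    """Tokens processed per pass, assuming each proposal is one full block."""
    return block_size * proposals_per_pass


# Hypothetical model dimensions, for illustration only.
mha = kv_cache_bytes_per_token(num_layers=32, num_query_heads=32,
                               gqa_ratio=1, head_dim=128)
gqa = kv_cache_bytes_per_token(num_layers=32, num_query_heads=32,
                               gqa_ratio=4, head_dim=128)
print("KV-cache bytes per token, 1:1 heads:", mha)
print("KV-cache bytes per token, 4:1 heads:", gqa)

# Block size 16 and the proposal counts come from the article.
print("MI300x tokens per pass:", tokens_per_forward_pass(16, 3))
print("MI355x tokens per pass:", tokens_per_forward_pass(16, 5))
```

A smaller cache means fewer bytes to read per pass, which lowers the memory floor that block diffusion then amortizes across all tokens in the pass.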

In practice, achieving the theoretical speedups is more difficult because diffusion introduces additional operational overhead, and the inference stack for diffusion models is far less optimized than the mature tooling available for autoregressive inference.

Marktechpost’s Visual Explainer


01 / 08 — Overview
What is ZAYA1-8B-Diffusion-Preview?
On May 14, 2026, Zyphra introduced ZAYA1-8B-Diffusion-Preview. This model transforms a standard autoregressive MoE language model into a discrete diffusion model without any measurable drop in evaluation performance, while achieving up to 7.7x faster inference on AMD hardware.
Rather than generating one token at a time, it produces 16 tokens in parallel through a single-step process that replaces masks with actual tokens.
Released: May 14, 2026, San Francisco
By: Zyphra
Base model: ZAYA1-8B (autoregressive MoE)
Hardware: AMD MI300x / MI355x
First of kind: First MoE diffusion model converted from an AR LLM; first diffusion-LM trained on AMD

02 / 08 — The Problem
Why Autoregressive Decoding Creates a Bottleneck
Conventional LLMs use autoregressive decoding: they produce one token per step. Each time a new token is generated, the model must load each user’s KV-cache from GPU memory individually. Because every user in a batch has a unique token history, caches cannot be shared between requests.
This turns decoding into a memory-bandwidth bound operation in many serving scenarios — the GPU spends more time waiting for data transfers than performing calculations. As modern GPUs increase their compute power faster than memory bandwidth improves, this gap continues to widen.
For engineers: Memory-bandwidth bound = GPU compute units sit idle while waiting for HBM data. Compute-bound = GPU is fully utilized. Diffusion addresses this by sharing a single KV-cache load across N tokens.

03 / 08 — The Solution
How Diffusion Removes the Bottleneck
A diffusion model generates multiple drafts of N tokens at once. All N tokens within a block share the same KV-cache — one cache load regardless of block size. This shifts the workload from memory-bandwidth bound to compute-bound.
Autoregressive: 1 token per pass; separate KV-cache per user; memory-bandwidth bound; low GPU utilization
Diffusion (ZAYA1): 16 tokens per pass; shared KV-cache per block; compute-bound; up to 7.7x speedup

04 / 08 — Training Pipeline
How the Model Was Converted
Training from scratch is difficult and provides no advantage since training is already compute-bound. The bottleneck only manifests during inference. Zyphra performs the conversion via mid-training using the TiDAR recipe, leveraging the existing pretraining infrastructure.
ZAYA1-8B-base checkpoint: Pretrained autoregressive MoE base model
Diffusion mid-training (600B tokens @ 32k): TiDAR recipe applied to convert to discrete diffusion
Context extension (500B tokens @ 128k): Natively extends context length to 128k tokens
Diffusion SFT phase: Supervised fine-tuning in diffusion mode
Total: 1.1 trillion tokens of additional mid-training on top of ZAYA1-8B pretraining.

05 / 08 — Inference
Two Samplers: Speed vs. Quality
The model drafts 16 tokens per step. A portion are accepted based on a sampling criterion, similar to speculative decoding, but the same model serves as both speculator and verifier in a single forward pass — no separate draft model is required, unlike EAGLE or dFlash.
Lossless Sampler: 4.6x speedup; no systematic eval loss; acceptance rule min(1, p(x)/q(x))
Logit-Mixing Sampler: 7.7x speedup; some quality trade-off; mixes AR + diffusion logits
Note: On rejection in the lossless sampler, the next token is sampled from the residual distribution, that is, the normalized positive part of p(x) − q(x). The speed/quality trade-off is selectable at runtime.

06 / 08 — Architecture
Architecture Details
A single-step speculative diffusion model using order-constrained generation — it only generates tokens in a contiguous subsequence starting from the prefix. This improves training stability compared to unconstrained mask diffusion or set block decoding.
Attention: Zyphra’s CCA attention, which reduces prefill FLOPs and enables more parallel tokens before hitting the compute limit
CCGQA: 4:1 query-to-key heads; 2x compression; avoids MLA’s high arithmetic intensity
MI300x (bf16): ~3 block-sized proposals per forward pass
MI355x: ~5 block-sized proposals per forward pass

07 / 08 — Results
Benchmark Results & Comparisons
Minimal evaluation degradation compared to the base AR checkpoint. Gains on benchmarks including LCB-v6, attributed to improved mid-training datasets and greater expressivity of diffusion-style within-block non-causal inference.
Reported speedups: ZAYA1 Diffusion 4.6x to 7.7x; MTP lower; EAGLE3 lower; dFlash lower net speedup.
Important: Evaluations use pass@k metrics, not standard accuracy benchmarks, because this is a base mid-train checkpoint that has not yet undergone RL training. Do not compare directly to standard benchmark scores from other models.

08 / 08 — Implications
Why This Matters for AI Engineers
The deeper implication is for RL training: on-policy rollouts — model-generated sequences used during reinforcement learning — are expensive. Faster, compute-optimal generation lowers rollout cost, making RL and test-time compute scaling more practical.
For MLEs: Compute-bound inference = better GPU utilization at serving time
For RL teams: Cheaper on-policy rollouts = more RL iterations at same hardware budget
For architects: CCA + CCGQA co-designed for diffusion from the start, not bolted on
Access: ZAYA1-8B-base on Hugging Face (Zyphra). Diffusion inference stack is early-stage.

Key Takeaways

Zyphra converted its existing ZAYA1-8B autoregressive MoE model into a discrete diffusion model using the TiDAR recipe, with 1.1 trillion tokens of additional mid-training
The model performs a single-step transformation from mask to token per block, generating 16 tokens simultaneously, achieving 4.6x speedup with a lossless sampler and 7.7x with the logit-mixing sampler
This is the first MoE diffusion model converted from an autoregressive LLM and the first diffusion-language model trained on AMD GPUs
Evaluation figures are pass@k metrics on a base mid-train checkpoint that has not yet undergone RL training
Faster diffusion inference lowers the cost of on-policy RL rollouts, making test-time compute scaling more practical
