Zyphra, the San Francisco-based AI lab behind the ZAYA1 model family, has released ZAYA1-8B-Diffusion-Preview — an early look at its work in diffusion-language models. The release shows that an existing autoregressive language model can be transformed into a discrete diffusion model without any significant drop in evaluation performance, while achieving major inference speedups on AMD hardware.

The Problem With Autoregressive Decoding
To grasp why this is important, it helps to first understand how most language models produce text today. Standard large language models use autoregressive decoding: they generate one token at a time in order. For each new token, the attention mechanism must look back at all previously generated tokens and load their stored representations — known as the KV-cache — from GPU memory. Importantly, because every user in a batch has a unique history of tokens, each user’s KV-cache must be loaded individually and cannot be shared between requests.
This creates a performance bottleneck. When the GPU spends more time fetching data from memory than doing actual computation, the system becomes limited by memory bandwidth rather than compute power. This restricts how effectively modern GPU hardware — which has been increasing compute FLOPs faster than memory bandwidth — can be utilized during inference.
Diffusion provides a different approach. Rather than generating one token at a time, a diffusion model creates multiple drafts of N tokens at once and repeats this drafting process several times. Since all N tokens in the block share the same KV-cache, the operation shifts from being memory-bandwidth limited to compute-limited, allowing the GPU to be used more efficiently. In ZAYA1-8B-Diffusion-Preview specifically, the model performs a single-step transformation from mask to token for each token in the block — meaning it directly predicts the unmasked token in one step rather than through iterative denoising.
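The shift from memory-bound to compute-bound can be illustrated with a back-of-the-envelope calculation. The sketch below is not Zyphra's code: it assumes a single attention head, bf16 KV entries, and counts only KV-cache reads. The function names, the head dimension, and the hardware figures are hypothetical and do not describe any specific GPU.

```python
def attention_intensity(block_tokens, context_len, head_dim=128, bytes_per_elem=2):
    """Approximate FLOPs per byte of KV-cache read for one attention head.

    Per cached position, each query token spends about 2*head_dim FLOPs on the
    QK^T dot product and about 2*head_dim FLOPs on the weighted sum over V.
    The cached K and V entries are read once and shared by the whole block.
    """
    flops = block_tokens * context_len * 4 * head_dim
    kv_bytes = context_len * 2 * head_dim * bytes_per_elem
    return flops / kv_bytes


def is_memory_bound(intensity, peak_flops_per_s, bytes_per_s):
    """Roofline test: below the ridge point, memory bandwidth is the limit."""
    ridge_point = peak_flops_per_s / bytes_per_s
    return intensity < ridge_point


# Hypothetical accelerator figures, chosen only for illustration.
HYPOTHETICAL_PEAK_FLOPS = 1.0e15   # FLOP/s
HYPOTHETICAL_BANDWIDTH = 5.0e12    # bytes/s

for block_tokens in (1, 16):
    intensity = attention_intensity(block_tokens, context_len=4096)
    bound = is_memory_bound(intensity, HYPOTHETICAL_PEAK_FLOPS, HYPOTHETICAL_BANDWIDTH)
    print(f"block={block_tokens:2d}  FLOPs/byte={intensity:.1f}  memory_bound={bound}")
```

Under these assumptions, the work done per byte of cache read grows linearly with the number of tokens that share the cache, which is the effect the paragraph above describes.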
Converting Autoregression to Diffusion Without Training From Scratch
Training a diffusion language model from the ground up is technically challenging, and there are few well-established methods for doing so. That is the first of two reasons the Zyphra team gives for choosing conversion over training from scratch. The second is that training in diffusion mode offers no benefit, because training is already compute-limited: the memory-bandwidth bottleneck that diffusion addresses only occurs at inference time. All the advantages of diffusion are therefore inference-time advantages, and an existing pretraining pipeline can be reused without modification.
Building on the TiDAR recipe, Zyphra took the ZAYA1-8B-base checkpoint and carried out an additional 600 billion tokens of diffusion-conversion mid-training at a 32k context length, followed by 500 billion tokens of native context extension to 128k, and then a diffusion supervised fine-tuning (SFT) phase.
ZAYA1-8B-Diffusion-Preview is the first MoE diffusion model converted from an autoregressive LLM, and the first diffusion-language model to be trained on AMD GPUs. Zyphra reports minimal evaluation degradation compared to the base autoregressive checkpoint, with improvements on some benchmarks such as LCB-v6. They attribute this partly to better mid-training datasets and partly to the greater expressiveness of diffusion-style within-block non-causal inference compared to causal autoregression.
How the Diffusion Sampler Works
During inference, ZAYA1-8B-Diffusion-Preview generates a draft of 16 tokens at the same time. A portion of these tokens are accepted based on a sampling criterion borrowed from speculative decoding. The key benefit here is that the same model acts as both speculator and verifier within a single forward pass, which eliminates the overhead of running two separate models as in traditional methods like EAGLE or dFlash. In heavily memory-bandwidth-bound regimes, almost all accepted tokens generated via diffusion represent a free speedup over traditional autoregressive decoding: the forward pass is dominated by memory loads that have to happen anyway, so the extra tokens require very little additional computation.
The Zyphra team introduces two sampling methods, each offering a different balance between speed and output quality:
- Lossless diffusion sampler: This method applies the standard speculative decoding acceptance rule: a drafted token x is accepted with probability min(1, p(x)/q(x)), where p is the autoregressive model’s output distribution and q is the diffusion model’s distribution. If a token is rejected, a replacement is sampled from the residual distribution, which is proportional to max(0, p(x) − q(x)). This approach delivers a 4.6× speedup with no measurable drop in evaluation performance.
- Logit-mixing sampler: Here, the logits from both the diffusion speculator and the autoregressive model are blended first, and the combined distribution is used for verification. This boosts acceptance rates because the verification logits more closely match the diffusion logits, though it slightly affects output quality. This method achieves a 7.7× speedup. Users can adjust the speed-quality trade-off at runtime.
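The two samplers described above can be sketched for a single drafted token as follows. This is a minimal illustration of the standard speculative decoding rule, not Zyphra's implementation. The function names are hypothetical, and the linear interpolation of logits with a weight `alpha` is an assumption: the announcement says only that the logits are blended and that the trade-off is adjustable at runtime.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def acceptance_probability(token, p, q):
    """Standard speculative decoding rule: min(1, p(x) / q(x))."""
    return min(1.0, p[token] / q[token])


def residual_distribution(p, q):
    """Normalized max(0, p - q), used to resample after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        # p and q are identical, so a rejection cannot occur; fall back to p.
        return list(p)
    return [d / total for d in diff]


def verify_token(token, p, q, rng):
    """Return (accepted, token_to_emit) for one token drafted from q."""
    if rng.random() < acceptance_probability(token, p, q):
        return True, token
    weights = residual_distribution(p, q)
    replacement = rng.choices(range(len(p)), weights=weights)[0]
    return False, replacement


def mix_logits(ar_logits, diffusion_logits, alpha):
    """Assumed blending rule: alpha=0 is lossless, larger alpha favors the draft."""
    return [(1.0 - alpha) * a + alpha * d for a, d in zip(ar_logits, diffusion_logits)]


# Usage with made-up logits over a three-token vocabulary.
rng = random.Random(0)
ar_logits = [2.0, 1.0, 0.1]
diffusion_logits = [1.0, 2.0, 0.1]
q = softmax(diffusion_logits)
draft = rng.choices(range(len(q)), weights=q)[0]

p_lossless = softmax(ar_logits)
p_mixed = softmax(mix_logits(ar_logits, diffusion_logits, alpha=0.5))
print(verify_token(draft, p_lossless, q, rng))
print(verify_token(draft, p_mixed, q, rng))
```

In this sketch, moving `alpha` toward 1 pulls the verification distribution toward the draft distribution, which raises the acceptance probability at the cost of drifting away from the autoregressive model's own distribution.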
An important note about these results: ZAYA1-8B-Diffusion-Preview is a base mid-train checkpoint that has not yet undergone RL training. Therefore, Zyphra uses pass@ evaluations instead of standard accuracy benchmarks to better reflect the model’s potential after RL training. Readers should keep this in mind when comparing these numbers to benchmarks reported for other models.
The Zyphra team also highlights that diffusion-based speedups exceed those achieved by alternative techniques such as multi-token prediction (MTP) and various speculative decoding methods like EAGLE3. Because TiDAR-style diffusion models require only a single forward pass, acceptance rates comparable to dFlash still translate into significant speedups.


Architecture Details
ZAYA1-8B-Diffusion-Preview is a single-step speculative diffusion model that uses order-constrained generation, meaning the diffusion model can only produce tokens in a contiguous subsequence starting from the prefix. This constraint greatly improves training stability compared to unconstrained mask diffusion objectives or set block decoding, and was a key reason Zyphra adopted the TiDAR recipe.
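One way to picture the order constraint is that only the leading run of accepted draft tokens is committed, so the output always extends the prefix contiguously. The helper below is a hypothetical illustration of that idea, assuming per-position accept/reject decisions are already available. It is not taken from the TiDAR recipe or from Zyphra's code.

```python
def commit_accepted_prefix(draft_tokens, accept_flags):
    """Keep draft tokens up to, but not including, the first rejection."""
    committed = []
    for token, accepted in zip(draft_tokens, accept_flags):
        if not accepted:
            break  # later tokens are discarded even if individually accepted
        committed.append(token)
    return committed


# Usage: the fourth token is accepted in isolation but follows a rejection.
print(commit_accepted_prefix([10, 11, 12, 13], [True, True, False, True]))
```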
The model leverages ZAYA1-8B’s existing CCA attention variant from Zyphra. CCA significantly reduces prefill FLOPs in attention, which directly benefits diffusion since diffusion transforms decoding into a prefill-like operation. This allows the model to diffuse more tokens in parallel before reaching compute limits.
More specifically, the architecture employs CCGQA with a 4:1 ratio of query heads to key heads. One design decision was to avoid MLA (Multi-Head Latent Attention), whose high arithmetic intensity was considered a poor fit compared to CCGQA. Since block diffusion accesses the same cache, arithmetic intensity scales with block size and the number of blocks per forward pass. On AMD MI300x hardware in bf16, the system supports roughly three block-sized proposals per single forward pass; on MI355x, this increases to approximately five. CCGQA also operates at 2× compression, which allowed Zyphra to accommodate the additional training FLOPs associated with TiDAR mid-training. The larger VRAM capacity of AMD GPU hardware further enabled more efficient diffusion training overall.
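As a rough worked example of what those proposal counts mean, the snippet below multiplies the 16-token block size from the sampler description by the approximate number of block-sized proposals per forward pass. The names are illustrative. The results count drafted tokens, not accepted ones, and inherit the "roughly" of the figures above.

```python
BLOCK_SIZE = 16  # draft length reported for ZAYA1-8B-Diffusion-Preview

# Approximate block-sized proposals per forward pass in bf16, as reported.
PROPOSALS_PER_PASS = {"MI300x": 3, "MI355x": 5}


def drafted_tokens_per_pass(block_size, proposals):
    """Upper bound on tokens drafted in one forward pass."""
    return block_size * proposals


for gpu, proposals in PROPOSALS_PER_PASS.items():
    print(gpu, drafted_tokens_per_pass(BLOCK_SIZE, proposals))
```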
In practice, achieving the theoretical speedups is more difficult because diffusion introduces additional operational overhead, and the inference stack for diffusion models is far less optimized than the mature tooling available for autoregressive inference.
Key Takeaways
- Zyphra converted its existing ZAYA1-8B autoregressive MoE model into a discrete diffusion model using the TiDAR recipe, with 1.1 trillion tokens of additional mid-training
- The model predicts each masked token in a block in a single step and drafts 16 tokens simultaneously, achieving a 4.6× speedup with the lossless sampler and 7.7× with the logit-mixing sampler
- This is the first MoE diffusion model converted from an autoregressive LLM and the first diffusion-language model trained on AMD GPUs
- Evaluation figures are pass@ metrics on a base mid-train checkpoint — the model has not yet undergone RL training
- Faster diffusion inference lowers the cost of on-policy RL rollouts, making test-time compute scaling more practical



