In brief
- Google has unveiled Multi-Token Prediction (MTP) drafters for Gemma 4, boosting inference speeds by up to 3x while maintaining full output quality.
- This method, known as speculative decoding, employs a compact “drafter” model to forecast multiple tokens simultaneously, which the primary model then verifies in parallel, sidestepping the traditional one-token bottleneck.
- The MTP drafters are accessible on Hugging Face, Kaggle, and Ollama under the Apache 2.0 license, compatible with frameworks like vLLM, MLX, and SGLang.
Running an AI model locally sounds ideal—until it isn’t.
The main appeal is clear: enhanced privacy, zero subscription costs, and complete data control. But for most users, the reality is staring at a blinking cursor, waiting painfully long between generated sentences.
This delay comes down to inference speed, which has nothing to do with how intelligent the model is. It is fundamentally a hardware problem, and specifically a memory-bandwidth one. Standard AI models generate text one piece, or token, at a time. For every single token, the hardware must move billions of parameters from memory to the processors, which makes generation inherently slow, especially on consumer-grade equipment.
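A rough back-of-the-envelope sketch of that bottleneck, assuming generation is limited purely by memory bandwidth and that every parameter is read once per token. The function name and the precision and bandwidth figures are illustrative assumptions, not measurements or numbers from Google.

```python
def max_tokens_per_second(params_billion: float,
                          bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when every weight is read once per token."""
    # Billions of parameters times bytes per parameter gives model size in GB.
    model_gb = params_billion * bytes_per_param
    # Each token needs one full read of the weights.
    return bandwidth_gb_s / model_gb


# Hypothetical setup: a 31B-parameter model stored at 2 bytes per parameter,
# on hardware with 1000 GB/s of memory bandwidth (assumed values).
print(round(max_tokens_per_second(31, 2, 1000), 1), "tokens/s at best")
```

The bound ignores compute time, caching, and batching, so real throughput is lower. What it shows is that the ceiling is set by how fast weights can be read, not by how fast the chip can do arithmetic.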
Many people try to fix this by running smaller, less capable models or using quantized versions that trade some accuracy for quicker performance. Both options are compromises—they make things faster but force you to use a model that doesn’t fully meet your needs.
Google offers an alternative. The company recently launched Multi-Token Prediction (MTP) drafters for its Gemma 4 open models, which it says can speed up generation by as much as 3x without any loss in output quality or reasoning ability.
The underlying technique, speculative decoding, isn’t new: Google researchers first published the core research in 2022. However, it has only recently gained traction, as advances in model architecture have made it work efficiently at a larger scale.
Here’s a simplified breakdown: instead of relying solely on the large, powerful model, you pair it with a smaller, lightweight “drafter.” The drafter is fast and cheap to run: it can predict several tokens in less time than the main model takes to generate one. The large model then checks all of those predictions in a single parallel pass. If they match, it accepts the entire sequence in one step and can even add an extra token of its own. If a prediction misses, the large model keeps the tokens up to that point, substitutes its own token, and drafting resumes from there.
According to Google, “when the main model’s predictions align with the draft, it accepts the whole sequence in a single pass—and may even generate an additional
Nothing is lost in the process: the large model, such as Gemma 4’s 31B dense variant, still validates every token, so the output quality stays exactly the same. You’re simply putting to work compute that would otherwise sit idle while the processor waits for model weights to arrive from memory.
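The draft-then-verify loop can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Google's implementation: both "models" are hypothetical deterministic functions, the drafter is simulated by deliberately perturbing the target's answer at some positions, and only greedy decoding is covered. In a real system the verification step is a single batched forward pass of the large model.

```python
from typing import List, Tuple


def target_model(ctx: List[int]) -> int:
    """Stand-in for the large model: a deterministic toy rule (hypothetical)."""
    return (ctx[-1] * 3 + len(ctx)) % 11


def draft_model(ctx: List[int]) -> int:
    """Stand-in for the drafter: wrong whenever the context length is a multiple of 4."""
    guess = target_model(ctx)
    return (guess + 1) % 11 if len(ctx) % 4 == 0 else guess


def greedy_decode(model, prompt: List[int], n_new: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used as the reference output."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target, draft, prompt: List[int], n_new: int,
                       k: int = 4) -> Tuple[List[int], int]:
    """Return the generated sequence and the number of target verification passes."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1. The drafter proposes k tokens, each conditioned on its own earlier guesses.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks every drafted position; counted here as one pass.
        passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
            accepted.append(tok)
        else:
            # All k drafts matched, so the target contributes one extra token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal], passes


prompt = [1, 2, 3]
spec_out, n_passes = speculative_decode(target_model, draft_model, prompt, 20)
print(spec_out == greedy_decode(target_model, prompt, 20), n_passes)
```

Because every kept token is one the target itself would have chosen, the speculative output is identical to plain greedy decoding. The only thing that changes is how many passes of the large model it takes to get there.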
Google explains that the drafter models share the target model’s KV cache—a memory structure that holds previously processed context—so they avoid redundant calculations the larger model has already completed. For the compact edge models built for phones and Raspberry Pi devices, the team also developed an efficient clustering method to further trim generation time.
This isn’t the AI industry’s only effort to parallelize text generation. Diffusion-based language models—such as Mercury from Inception Labs—took a fundamentally different route: Rather than predicting one token at a time, they begin with random noise and progressively refine the entire output. That sounds fast in theory, but diffusion LLMs have had difficulty matching the quality of conventional transformer models, keeping them more as a research novelty than a production-ready solution.
Speculative decoding takes a different path because it leaves the underlying model completely unchanged. It’s an optimization at the serving layer, not an architectural overhaul. The same Gemma 4 you were already running simply gets faster.
The real-world benefits are tangible. A Gemma 4 26B model running on an Nvidia RTX Pro 6000 desktop GPU achieves roughly double the tokens per second with the MTP drafter turned on, based on Google’s own benchmarks. On Apple Silicon, handling batches of 4 to 8 requests delivers around 2.2x speedups. That falls short of the 3x peak Google cites, but it’s still a meaningful leap from “barely usable” to “genuinely fast enough to work with.”
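Why measured gains land below the headline figure can be illustrated with the standard back-of-the-envelope model for speculative decoding. The sketch below assumes each drafted token is accepted independently with the same probability and that a verification pass costs about as much as a normal single-token pass. The function names, acceptance rates, and cost ratio are hypothetical values chosen for illustration, not Google's benchmark methodology.

```python
def expected_tokens_per_pass(acceptance_rate: float, draft_len: int) -> float:
    """Average tokens produced per pass of the large model.

    Assumes every drafted token is accepted independently with the same probability.
    """
    if acceptance_rate >= 1.0:
        # Every draft is accepted, plus the one extra token from the large model.
        return float(draft_len + 1)
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


def expected_speedup(acceptance_rate: float, draft_len: int,
                     draft_cost_ratio: float) -> float:
    """Speedup over plain decoding.

    draft_cost_ratio is the time of one drafter step divided by the time of
    one large-model step (an assumed figure, not a measured one).
    """
    tokens = expected_tokens_per_pass(acceptance_rate, draft_len)
    return tokens / (draft_len * draft_cost_ratio + 1)


for rate in (0.5, 0.7, 0.9):
    print(rate, round(expected_speedup(rate, draft_len=4, draft_cost_ratio=0.05), 2))
```

Under these assumptions the speedup depends heavily on how often the drafter guesses right, which in turn varies with the prompt, the task, and the hardware. That is one plausible reason a benchmark can show 2x in one setup and more in another.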
The broader context matters here. When Chinese AI lab DeepSeek stunned the market in January 2025, wiping nearly $600 billion off Nvidia’s market cap in a single day, the key takeaway was that efficiency breakthroughs can matter more than raw computing power. Working smarter beats throwing more hardware at the problem. Google’s MTP drafters follow the same philosophy, though they are aimed squarely at the consumer side of the market.
The entire AI industry right now revolves around three pillars: inference, training, and memory. A breakthrough in any one of these areas tends to ripple through and shake up the whole ecosystem. DeepSeek’s training method (building powerful models with less expensive hardware) was one example, while Google’s TurboQuant (compressing AI memory without sacrificing quality) was another. Both sent markets into a tailspin as companies scrambled to reassess their strategies.
Google says the drafter enables “improved responsiveness: drastically reduce latency for near real-time chat, immersive voice applications and agentic workflows”—the kinds of tasks where low latency is essential to feeling useful at all.
The use cases become immediately clear: A local coding assistant that doesn’t stutter; a voice interface that replies before you lose your train of thought; an agentic workflow that doesn’t force you to wait three seconds between each step. All of this, running on hardware you already have.
The MTP drafters are available today on Hugging Face, Kaggle, and Ollama, under the Apache 2.0 license. They work seamlessly with vLLM, MLX, SGLang, and Hugging Face Transformers right out of the box.