In brief
- Mercury 2, developed by Inception Labs, produces around 1,000 tokens per second and scored 90% on the AIME 2026 benchmark.
- Google’s DiffusionGemma reaches comparable speeds but trails Mercury 2 on quality benchmarks.
- DiffusionGemma is freely available as an open-weight model on Hugging Face, whereas Mercury 2 is a closed-weight, paid API offering.
On Thursday, Inception Labs unveiled Mercury 2, billing it as the world’s fastest reasoning language model. According to the company’s release, it generates approximately 1,000 tokens per second (tokens are the fragments of text an AI system processes and generates), compared with roughly 89 tokens per second for Anthropic’s Claude Haiku 4.5 Reasoning and 71 for OpenAI’s GPT-5 Mini.
That puts it in the same speed range that Google would later claim for DiffusionGemma.
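To put those throughput figures in practical terms, here is a minimal Python sketch that converts tokens per second into wall-clock generation time. The three rates are the ones quoted above. The 2,000-token response length is a hypothetical chosen for illustration, and the calculation assumes a steady rate with no network or queueing delay.

```python
# Throughput figures quoted in the article, in tokens per second.
THROUGHPUT_TPS = {
    "Mercury 2": 1000,
    "Claude Haiku 4.5 Reasoning": 89,
    "GPT-5 Mini": 71,
}


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to emit num_tokens at a steady rate (no network or queue delay)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


def speedup(fast_tps: float, slow_tps: float) -> float:
    """How many times faster the first rate is than the second."""
    return fast_tps / slow_tps


if __name__ == "__main__":
    response_tokens = 2000  # hypothetical response length, not from the article
    for name, tps in THROUGHPUT_TPS.items():
        print(f"{name}: {generation_seconds(response_tokens, tps):.1f} s")
```

Under those assumptions, a 2,000-token answer would take about 2 seconds at Mercury 2’s quoted rate, versus roughly 22 and 28 seconds at the other two.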
Welcome to the diffusion era.
We bet on parallel generation years ago, when it was a contrarian idea. It’s great to see the industry arrive.
Mercury 2 continues to lead the Pareto frontier for quality, speed, and cost among publicly available diffusion LLMs. pic.twitter.com/qSHuiR7vmH
— Inception (@_inception_ai) June 18, 2026
Both models achieve this by abandoning the conventional word-by-word approach to writing. A typical chatbot generates one token (roughly a word), takes into account everything it has produced so far, then generates the next, repeating the cycle until the response is complete. Diffusion models instead fill a block of text with random placeholder tokens and strip away the noise over several parallel passes, the same technique that turns static into a coherent image in tools like Stable Diffusion, until the entire block resolves into a complete response at once.
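The difference in the number of model passes can be illustrated with a toy Python sketch. This is not how Mercury 2 or DiffusionGemma is implemented, which the article does not detail. The "model" here is a stand-in that simply copies from a known target sentence, and the mask-and-fill schedule is an assumption chosen to keep the example small. It only shows why filling many positions per pass needs far fewer passes than emitting one token at a time.

```python
import random

MASK = "<mask>"  # placeholder token used by the toy diffusion loop


def autoregressive_decode(target):
    """Emit one token per simulated model call, left to right."""
    output, calls = [], 0
    for token in target:
        output.append(token)  # stand-in for predicting the next token
        calls += 1
    return output, calls


def diffusion_decode(target, num_passes, seed=0):
    """Start from a fully masked block and fill several positions per pass."""
    if num_passes < 1:
        raise ValueError("num_passes must be at least 1")
    rng = random.Random(seed)
    block = [MASK] * len(target)
    calls = 0
    for step in range(num_passes):
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if not masked:
            break
        remaining_passes = num_passes - step
        # Ceiling division, so the final pass always clears what is left.
        fill_count = -(-len(masked) // remaining_passes)
        for i in rng.sample(masked, fill_count):
            block[i] = target[i]  # stand-in for denoising this position
        calls += 1
    return block, calls


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog again and again".split()
    print(autoregressive_decode(words)[1], "calls, one token each")
    print(diffusion_decode(words, num_passes=4)[1], "parallel passes")
```

In a real system each parallel pass costs more compute than a single next-token step, so the pass count alone does not determine speed. The sketch captures only the structural difference.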
Where the two differ is in the quality that emerges from that process. On AIME 2026, which is built from actual American Invitational Mathematics Examination problems and graded by the percentage solved correctly, Mercury 2 achieved 90%. Google evaluated DiffusionGemma on the same test, where it managed 69.1%, while the standard, non-diffusion Gemma 4 reached 88.3%.
On GPQA, a PhD-level science benchmark scored the same way, the two models are closer: Mercury 2 at 77% versus DiffusionGemma’s 73.2%. However, Google’s own developer documentation advises using standard Gemma 4 for tasks requiring peak quality, acknowledging that DiffusionGemma lags behind it across all metrics.
The speed claims hold up beyond controlled testing as well. Augment Code, an AI coding-agent firm, replaced Anthropic’s Claude Opus 4.7 with Mercury 2 on its context-compaction subagent and observed an 82% reduction in latency and a 90% decrease in cost, while maintaining equivalent output quality, as detailed in a joint case study.
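As a rough illustration of what those percentages mean, the following Python sketch applies an 82% latency reduction and a 90% cost reduction to a baseline. The baseline values (10 seconds and $100) are hypothetical and do not come from the case study. Only the two percentages are from the article.

```python
def apply_reduction(baseline: float, reduction_pct: float) -> float:
    """Return the value left after cutting baseline by reduction_pct percent."""
    if not 0 <= reduction_pct <= 100:
        raise ValueError("reduction_pct must be between 0 and 100")
    return baseline * (100 - reduction_pct) / 100


if __name__ == "__main__":
    baseline_latency_s = 10.0   # hypothetical, not from the case study
    baseline_cost_usd = 100.0   # hypothetical, not from the case study
    print(f"latency: {apply_reduction(baseline_latency_s, 82):.1f} s")
    print(f"cost: ${apply_reduction(baseline_cost_usd, 90):.2f}")
```

With those assumed baselines, a 10-second step would drop to about 1.8 seconds and a $100 bill to about $10.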
Inception was built on research by its founder, Stefano Ermon, a Stanford professor who co-authored several of the score-based diffusion methods that drive today’s image generators. The startup’s $50 million funding round attracted investment from Nvidia’s venture division and individual backers Andrew Ng and Andrej Karpathy.
For everyday users, the most noticeable difference is one you feel rather than see: the “flow.” Conventional models make you wait between exchanges during an extended conversation. Diffusion models like Mercury 2 make the AI feel like it is keeping pace with you, with instant autocomplete, swift iterations on code or plans, and sub-agents that handle repetitive high-volume tasks without bogging everything down.
That subagent layer is where the more interesting architectural shift is happening. Sophisticated AI systems are no longer a single massive model. They are ensembles of specialized assistants: one dedicated to deep reasoning, and several handling quick summarization, routing, tool lookups, output verification, and so on. Sequential models make those utility calls costly and slow. Parallel diffusion models make them cheap and fast enough to use liberally.
Practical caveats for typical users: these models are still best suited to the speed-critical, high-throughput parts of a workflow rather than the most demanding frontier reasoning tasks, where the largest autoregressive (AR) models may still hold a slight edge for now. Mercury 2 is not open-weight, so it is accessible only via API or cloud at this point. And as with Google’s offering, the broader ecosystem of local runtimes and agent frameworks is still maturing and does not yet deliver a seamless experience everywhere.
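The division of labor described here can be sketched as a simple dispatcher in Python. Everything in it is hypothetical: the tier names are placeholders rather than real model identifiers, and sending every utility task to the fast tier is an assumption made for illustration, not a description of any vendor’s system.

```python
from dataclasses import dataclass

# Hypothetical tier labels, not real model names or endpoints.
FAST_TIER = "fast-diffusion-model"
DEEP_TIER = "large-reasoning-model"

# Utility task kinds named in the article; routing them to the fast tier is an assumption.
UTILITY_TASKS = {"summarization", "routing", "tool_lookup", "output_verification"}


@dataclass(frozen=True)
class Task:
    kind: str
    prompt: str


def pick_tier(task: Task) -> str:
    """Send quick utility work to the fast tier and everything else to the deep tier."""
    return FAST_TIER if task.kind in UTILITY_TASKS else DEEP_TIER


def plan(tasks):
    """Group prompts by the tier that would handle them."""
    assignments = {FAST_TIER: [], DEEP_TIER: []}
    for task in tasks:
        assignments[pick_tier(task)].append(task.prompt)
    return assignments


if __name__ == "__main__":
    tasks = [
        Task("deep_reasoning", "Design the migration strategy"),
        Task("summarization", "Condense the last 50 messages"),
        Task("tool_lookup", "Find the config file path"),
    ]
    for tier, prompts in plan(tasks).items():
        print(tier, "->", prompts)
```

The cheaper and faster the utility tier becomes, the more of these small calls a system can afford to make per user request.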
Use cases that stand out immediately: real-time rapid programming and “vibe coding” where the model keeps up with your changes, multi-agent coding or support platforms where numerous fast sub-calls occur, voice interfaces that feel responsive rather than delayed, and any latency-sensitive autocomplete or next-action prediction. At scale, the cost and energy savings from higher throughput on standard hardware accumulate quickly.
The figures Inception presents (along with independent evaluations) make the argument compellingly: Mercury 2 occupies the “fast and capable” sweet spot for diffusion models, bringing performance that once demanded specialized hardware down to commodity GPUs.