Mercury 2 AI From Inception Labs Outperforms Google's DiffusionGemma In A Head-to-Head Showdown

In brief

Mercury 2, developed by Inception Labs, generates around 1,000 tokens per second and scored 90% on the AIME 2026 benchmark.
Google’s DiffusionGemma reaches comparable speeds but trails on quality benchmarks.
DiffusionGemma is freely available as an open-weight model on Hugging Face, whereas Mercury 2 is a closed-weight, paid API offering.

On Thursday, Inception Labs unveiled Mercury 2, billing it as the world’s fastest reasoning language model. According to the company’s release, it generates approximately 1,000 tokens per second (tokens are the fragments of text an AI system processes and generates), compared with roughly 89 tokens per second for Anthropic’s Claude Haiku 4.5 Reasoning and 71 for OpenAI’s GPT-5 Mini.
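To make those throughput figures concrete, here is a minimal sketch that converts them into wall-clock time for a single response. The 2,000-token response length is an assumption chosen for illustration, and the calculation ignores network, queueing, and time-to-first-token delays.

```python
# Throughput figures (tokens per second) as quoted in the company's release.
THROUGHPUT_TPS = {
    "Mercury 2": 1000,
    "Claude Haiku 4.5 Reasoning": 89,
    "GPT-5 Mini": 71,
}

# Hypothetical response length, chosen only for illustration.
RESPONSE_TOKENS = 2000


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady rate, ignoring all other delays."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


for name, tps in THROUGHPUT_TPS.items():
    seconds = generation_seconds(RESPONSE_TOKENS, tps)
    print(f"{name}: {seconds:.1f} s for {RESPONSE_TOKENS} tokens")
```

Under these assumptions, the same response takes about 2 seconds at 1,000 tokens per second, versus roughly 22 and 28 seconds at the two slower rates.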

That places it in the same speed range that Google later claimed for DiffusionGemma.

Welcome to the diffusion era.
We bet on parallel generation years ago, when it was a contrarian idea. It’s great to see the industry arrive.
Mercury 2 continues to lead the Pareto frontier for quality, speed, and cost among publicly available diffusion LLMs. pic.twitter.com/qSHuiR7vmH
— Inception (@_inception_ai) June 18, 2026

Both models achieve this by abandoning the conventional token-by-token writing method. A typical chatbot generates one token, takes into account everything it has produced so far, then moves on to the next, repeating the cycle until the response is complete. Diffusion models, on the other hand, populate a block of text with random placeholder tokens and strip away the noise over several parallel passes (the same technique that turns static into a coherent image in tools like Stable Diffusion) until the entire block resolves into a complete response at once.
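The difference in step counts can be illustrated with a toy sketch. This is not Mercury 2’s or DiffusionGemma’s actual algorithm: the known `TARGET` sentence stands in for a model’s predictions, and `tokens_per_pass` and the random choice of positions are assumptions made purely for illustration. It only shows why resolving several positions per pass needs fewer steps than emitting one token at a time.

```python
import random

# Stand-in for what a trained model would predict; a real model has no such oracle.
TARGET = "diffusion models refine a whole block of text in parallel".split()
NOISE = "<noise>"  # hypothetical placeholder token


def autoregressive_decode(target):
    """Emit one token per step, left to right."""
    output, steps = [], 0
    for token in target:
        output.append(token)
        steps += 1
    return output, steps


def toy_diffusion_decode(target, tokens_per_pass=4, seed=0):
    """Start from an all-noise block and resolve several positions per pass."""
    rng = random.Random(seed)
    block = [NOISE] * len(target)
    passes = 0
    while NOISE in block:
        noisy = [i for i, tok in enumerate(block) if tok == NOISE]
        # A real model would favor the positions it is most confident about;
        # here the positions are picked at random.
        for i in rng.sample(noisy, min(tokens_per_pass, len(noisy))):
            block[i] = target[i]
        passes += 1
    return block, passes


ar_output, ar_steps = autoregressive_decode(TARGET)
diff_output, diff_passes = toy_diffusion_decode(TARGET)
print(f"sequential steps: {ar_steps}, parallel passes: {diff_passes}")
```

With a 10-token block and four positions resolved per pass, the toy decoder finishes in 3 passes instead of 10 sequential steps.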

Where the two differ is in the quality that emerges from that process. On AIME 2026—constructed from actual American Invitational Mathematics Examination problems and graded by the percentage solved correctly—Mercury 2 achieved 90%. Google evaluated DiffusionGemma on the identical test, where it managed 69.1%, while the standard, non-diffusion Gemma 4 reached 88.3% on the same assessment.

On GPQA, a PhD-level science benchmark scored the same way (by percentage answered correctly), the two models are much closer: Mercury 2 at 77% versus DiffusionGemma’s 73.2%. However, Google’s own developer documentation advises using standard Gemma 4 for tasks requiring peak quality, acknowledging that DiffusionGemma lags behind it across all metrics.

The speed claims hold up beyond controlled testing as well. Augment Code, an AI coding-agent firm, replaced Anthropic’s Claude Opus 4.7 with Mercury 2 on its context-compaction subagent and observed an 82% reduction in latency and a 90% decrease in cost, while maintaining equivalent output quality, as detailed in a joint case study.
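As a quick worked example of what those percentages mean, the sketch below converts a percentage reduction into an improvement factor. The function names are illustrative; only the 82% and 90% figures come from the case study as reported above.

```python
def remaining_fraction(reduction_pct: float) -> float:
    """Fraction of the original latency or cost left after a reduction."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 - reduction_pct / 100.0


def improvement_factor(reduction_pct: float) -> float:
    """How many times faster (or cheaper) the new setup is than the old one."""
    return 1.0 / remaining_fraction(reduction_pct)


for label, pct in [("latency", 82), ("cost", 90)]:
    print(f"{label}: {pct}% lower is about {improvement_factor(pct):.1f}x better")
```

An 82% latency reduction leaves 18% of the original latency, or roughly 5.6 times faster; a 90% cost reduction leaves 10% of the original cost, or about 10 times cheaper.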

Inception grew out of research by its founder, Stefano Ermon, a Stanford professor who co-authored several of the score-based diffusion methods that drive today’s image generators. The startup’s $50 million funding round attracted investment from Nvidia’s venture division and individual backers Andrew Ng and Andrej Karpathy.

For everyday users, the most noticeable difference is something you don’t see until you experience it—the “flow.” Conventional models force you to pause between exchanges during an extended conversation. Diffusion models like this one make the AI feel like it’s matching your pace—instant autocomplete, swift iterations on code or plans, and sub-agents that tackle repetitive high-volume tasks without bogging everything down.

That subagent layer represents the intriguing architectural evolution. Sophisticated AI systems are no longer a single massive intelligent model. They’re ensembles of specialized assistants: one dedicated to deep reasoning, several handling quick summarization, routing, tool lookups, output verification, and so on. Sequential models make those utility calls costly and sluggish. Parallel diffusion models make them affordable and fast enough to deploy generously.
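A minimal sketch of that division of labor might look like the following. Everything here is hypothetical: the `Orchestrator` class, the task names, and the two stand-in functions are illustrations, not any vendor’s API. In practice each callable would wrap a real model client.

```python
from dataclasses import dataclass
from typing import Callable

# Utility task kinds of the sort the article lists; the names are illustrative.
UTILITY_TASKS = {"summarize", "route", "tool_lookup", "verify"}


@dataclass
class Orchestrator:
    """Sends utility calls to a fast model and everything else to a deep one."""

    fast_model: Callable[[str], str]
    deep_model: Callable[[str], str]

    def run(self, task_kind: str, prompt: str) -> str:
        model = self.fast_model if task_kind in UTILITY_TASKS else self.deep_model
        return model(prompt)


# Stand-ins for real model clients, so the sketch runs without network access.
def fake_fast_model(prompt: str) -> str:
    return f"[fast] {prompt}"


def fake_deep_model(prompt: str) -> str:
    return f"[deep] {prompt}"


orchestrator = Orchestrator(fake_fast_model, fake_deep_model)
print(orchestrator.run("summarize", "Compress this conversation history."))
print(orchestrator.run("deep_reasoning", "Plan the refactor."))
```

The design point is that the routing decision is cheap and explicit: the faster and cheaper the utility model, the more of these small calls a system can afford to make.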

Practical caveats for typical users: These models are still best suited for speed-critical, high-throughput segments of workflows rather than the most demanding frontier reasoning tasks (where the largest autoregressive, or AR, models may still hold a slight edge for the time being). Mercury 2 is not open-weight, so it’s accessible only via API/cloud at this point. And similar to Google’s offering, the broader ecosystem (local runtimes, agent frameworks) is still maturing to deliver a seamless experience everywhere.

Use cases that stand out immediately: real-time rapid programming and “vibe coding” where the model keeps up with your changes, multi-agent coding or support platforms where numerous fast sub-calls occur, voice interfaces that feel responsive rather than delayed, and any latency-sensitive autocomplete or next-action prediction. At scale, the cost and energy savings from higher throughput on standard hardware accumulate quickly.

The figures Inception presents (along with independent evaluations) make the argument compellingly: Mercury 2 occupies the “fast and capable” sweet spot for diffusion models, bringing performance that once demanded specialized hardware down to commodity GPUs.
