Xiaomi MiMo And TileRT Achieve Breakthrough: 1-Trillion-Parameter Model Exceeds 1000 Tokens Per Second On Standard GPUs

Inference speed is quickly emerging as a key battleground for large language models. Xiaomi’s MiMo team has introduced MiMo-V2.5-Pro-UltraSpeed, developed alongside the TileRT systems team. This new release pushes a 1-trillion-parameter model past 1,000 tokens per second during decoding—a milestone the company describes as unprecedented at this scale. In demonstrations, generation rates have even touched 1,200 tokens per second. What makes this noteworthy is the hardware setup: it all runs on off-the-shelf GPUs, rather than on proprietary chips.

What is MiMo-V2.5-Pro-UltraSpeed

UltraSpeed is a high-performance inference mode layered on top of the existing MiMo-V2.5-Pro model. The underlying foundation is built on a Mixture-of-Experts (MoE) setup with 1 trillion parameters. UltraSpeed focuses on pushing output generation velocity rather than altering the model’s core abilities. The boost is achieved through three tightly integrated techniques spanning both the model and the serving infrastructure. Xiaomi refers to this strategy as a deep model-system codesign. Importantly, the full pipeline fits comfortably on a single, standard 8-GPU server.

The Speed Case: Three Layers Working Together

The first layer employs FP4 quantization. At the trillion-parameter scale, FP16 or even FP8 weights put significant strain on memory capacity and bandwidth. Decoding is largely limited by how fast weights can be read from memory, so a smaller weight footprint translates directly into faster decoding. Xiaomi uses the MXFP4 format and applies it only to the MoE Experts. Other components stay at higher precision (FP8, according to TileRT). Since the Experts account for the bulk of the parameters and tolerate lower precision well, the trade-off is favorable. Quantization-Aware Training (QAT) keeps the model’s overall performance virtually identical to the original.
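To make the MXFP4 idea concrete, here is a minimal pure-Python sketch of block quantization as the format is commonly defined in the OCP Microscaling specification: 32 elements share one power-of-two scale, and each element is stored as a 4-bit E2M1 value. It is an illustration only, not Xiaomi's or TileRT's kernel code; the function names are hypothetical, and the tie-breaking rule is a simplification.

```python
import math

# Magnitudes representable by one FP4 E2M1 element (the sign bit is separate).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # elements sharing one scale in MXFP4
E2M1_EMAX = 2     # exponent of the largest E2M1 magnitude (6 = 1.5 * 2**2)


def quantize_block(block):
    """Return (shared_exponent, codes) for one block of floats."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Power-of-two scale chosen so the largest value lands near the top of E2M1.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    codes = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_VALUES[-1])  # saturate at 6
        # Nearest representable magnitude; ties go to the smaller value
        # (a simplification of the rounding used by real implementations).
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        codes.append(math.copysign(nearest, x))
    return shared_exp, codes


def dequantize_block(shared_exp, codes):
    """Reconstruct approximate floats from the shared exponent and codes."""
    scale = 2.0 ** shared_exp
    return [c * scale for c in codes]


def bits_per_weight(block_size=BLOCK_SIZE):
    """4 bits per element plus one 8-bit scale amortized over the block."""
    return 4 + 8 / block_size
```

Under these assumptions, `bits_per_weight()` evaluates to 4.25, roughly half the storage of FP8 and a quarter of FP16 for the quantized Experts. Values that already sit on the E2M1 grid survive a round trip through `quantize_block` and `dequantize_block` unchanged; everything else is rounded to the nearest grid point.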

The second layer is the DFlash speculative decoding engine, which is explained in detail further below. The third layer is TileRT, the software framework responsible for orchestrating all the computational work on the GPU. No single ingredient is sufficient on its own; reaching 1,000 tokens per second (TPS) requires all three layers to work together.

DFlash: Parallel Drafting Without a Serial Bottleneck

Conventional speculative decoding relies on a lightweight draft model to predict upcoming tokens, which the larger model then validates together in a single pass. Rejection sampling guarantees that the final output follows the same distribution as standard decoding, preserving quality. The catch is that the traditional draft model produces its tokens sequentially. DFlash, a newer technique from the research space, bypasses this limitation. It predicts blocks of masked tokens in parallel within a single forward pass.
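The verification step can be sketched with the standard speculative sampling rule: a drafted token is accepted with probability min(1, p/q), where p and q are the target and draft probabilities, and on rejection a replacement is drawn from the residual distribution. The sketch below is a generic illustration under that assumption; the function names are hypothetical, and the article does not confirm that DFlash's verifier is implemented exactly this way.

```python
import random


def verify_draft_token(token, target_probs, draft_probs, rng):
    """Accept or replace one drafted token.

    target_probs and draft_probs are probability lists over the same vocabulary.
    Returns (accepted, emitted_token).
    """
    p = target_probs[token]
    q = draft_probs[token]
    # Accept with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return True, token
    # On rejection, resample from the residual distribution max(0, p - q).
    residual = [max(0.0, pt - qd) for pt, qd in zip(target_probs, draft_probs)]
    weights = residual if sum(residual) > 0 else target_probs
    emitted = rng.choices(range(len(weights)), weights=weights)[0]
    return False, emitted


def verify_block(draft_tokens, target_probs_per_pos, draft_probs_per_pos, rng):
    """Verify a drafted block left to right and stop at the first rejection."""
    emitted = []
    for tok, p, q in zip(draft_tokens, target_probs_per_pos, draft_probs_per_pos):
        accepted, out = verify_draft_token(tok, p, q, rng)
        emitted.append(out)
        if not accepted:
            break
    return emitted
```

When the draft distribution matches the target, every drafted token is accepted; the more the two diverge, the earlier a block is cut short. That is why the quality of the draft model directly determines how many tokens each cycle yields.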

Xiaomi refined DFlash using the Muon second-order optimizer and model self-distillation. The draft model relies exclusively on Sliding Window Attention (SWA), consistent with the MiMo-V2 architecture. This design keeps computation per prediction constant, regardless of context length. The block size is restricted to 8 tokens to manage verification overhead and maximize concurrency.
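The constant-cost property of Sliding Window Attention can be shown with a few lines of Python. This is a simplified sketch of a causal sliding window; the window size of 128 used in the usage note is a placeholder, since the article does not state the draft model's actual window.

```python
def attended_positions(query_pos, window):
    """Key positions visible to a causal sliding-window query at query_pos."""
    start = max(0, query_pos - window + 1)
    return range(start, query_pos + 1)


def keys_per_query(context_len, window):
    """Number of keys the newest token attends to; capped at the window size."""
    return len(attended_positions(context_len - 1, window))
```

With a hypothetical window of 128, `keys_per_query(100, 128)` is 100, while `keys_per_query(10_000, 128)` and `keys_per_query(1_000_000, 128)` are both 128. Once the context exceeds the window, the attention work per drafted token stops growing.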

Acceptance length reflects the average number of draft tokens validated successfully in each cycle.

Scenario	Acceptance Length
Coding	6.30
Math / Reasoning	5.56
Agent	4.29

For coding tasks, six to seven out of eight proposed tokens pass validation each cycle. Certain test cases peaked at 7.14 accepted tokens.
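A rough way to read these numbers is to ask how much time one draft-plus-verify cycle may take at a given target rate. The sketch below assumes that each cycle emits exactly the acceptance length in tokens and ignores any extra token produced by the verifier; both are simplifications, and the function name is hypothetical.

```python
# Acceptance lengths from the table above.
ACCEPTANCE_LENGTH = {"Coding": 6.30, "Math / Reasoning": 5.56, "Agent": 4.29}


def cycle_budget_ms(target_tps, acceptance_length):
    """Longest allowed draft-plus-verify cycle, in ms, to sustain target_tps."""
    cycles_per_second = target_tps / acceptance_length
    return 1000.0 / cycles_per_second


for scenario, length in ACCEPTANCE_LENGTH.items():
    print(f"{scenario}: {cycle_budget_ms(1000, length):.2f} ms per cycle")
```

Under this model, sustaining 1,000 tokens per second on coding leaves about 6.3 ms per cycle, while agent workloads leave only about 4.3 ms. Put differently, at a fixed cycle time the agent scenario would decode at roughly 68% of the coding rate (4.29 / 6.30).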

TileRT: Squeezing the Microseconds

At a rate of 1000 TPS, individual GPU operations execute in mere microseconds. Conventional frameworks dispatch operations individually, and each dispatch introduces overhead. These interruptions disrupt the processing flow and become the performance constraint. TileRT eliminates this by maintaining a Persistent Engine Kernel that remains active on the GPU. Through Warp Specialization, it assigns dedicated roles for data handling, computation, and communication. At these speeds, even minor operations like RMSNorm, RoPE, and KV cache writes can stifle performance. The system was architected in tandem with the FP4 and DFlash strategies, rather than bolted on later.
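The effect of per-operation dispatch overhead can be illustrated with a simple cost model. All numbers in the usage note are hypothetical placeholders chosen for illustration, not measurements of TileRT or of any particular framework, and the function names are made up for this sketch.

```python
def step_time_us(num_ops, op_time_us, dispatch_overhead_us):
    """Time for one decoding step when every op pays a fixed dispatch cost."""
    return num_ops * (op_time_us + dispatch_overhead_us)


def overhead_fraction(op_time_us, dispatch_overhead_us):
    """Share of the step spent on dispatch rather than useful work."""
    return dispatch_overhead_us / (op_time_us + dispatch_overhead_us)
```

For example, with 500 operations of 3 microseconds each and an assumed 5 microseconds of dispatch cost per operation, `step_time_us(500, 3, 5)` gives 4,000 microseconds, of which 62.5% is overhead. Setting the dispatch cost to zero, as an idealized persistent kernel would, brings the same step down to 1,500 microseconds. The shorter the individual operations, the more the fixed cost dominates.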

Use Cases

The release is tailored for scenarios where latency directly impacts the user experience:

Parallel reasoning: executing numerous Best-of-N or tree-search strategies within tight time budgets.
Coding agents: shortening the delay between agent decision-making cycles.
Real-time decision loops: algorithmic trading signals, fraud detection, and conversational AI.
Interactive prototyping: demonstrations show creating a Snake game in roughly 10 seconds and a macOS interface in about 60 seconds.

These are essentially throughput-driven tasks where token generation speed is the primary limitation.

How It Compares

The first table outlines the two distinct paths to high-speed decoding: custom silicon, and model-system codesign on commodity GPUs.

Approach	Hardware	How speed is achieved
Cerebras	Wafer-Scale integration (custom)	Scale on a single custom wafer
Groq	Custom architecture	Pure on-chip SRAM
MiMo × TileRT	Commodity GPUs (8-GPU node)	Model-system codesign: FP4 + DFlash + TileRT

The second table contrasts the standard model features with the UltraSpeed mode.

Dimension	MiMo-V2.5-Pro	MiMo-V2.5-Pro-UltraSpeed
Decode speed	Baseline	~10× faster (1000+ TPS)
Price	1×	3×
Weight precision	Standard	FP4 MoE Experts via QAT
Decoding	Standard autoregressive	DFlash speculative decoding
Access	Standard model plans	API only, application-based trial
Token Plan	Supported	Not supported

Access, Pricing, and Open Source

UltraSpeed is available initially through a restricted, application-only window. The API trial runs from June 9 to June 23, 2026. Pricing is set at 3 times the standard MiMo-V2.5-Pro rate, in exchange for roughly 10 times the generation speed. It is exclusively offered via API, and subscription Token Plans are not available. Approved users additionally receive complimentary Chat access during the trial period, subject to usage limits: 10 queue entries per day, 30-minute session lengths, and automatic session termination after 5 minutes of inactivity. Xiaomi released the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face. TileRT has opened select components on GitHub.
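The price-versus-speed trade can be worked through with normalized numbers. The sketch below assumes a baseline price of 1.0 per token and a baseline speed of 100 tokens per second (implied by "roughly 10 times" against 1,000 TPS, but not stated in the article); actual prices and speeds will differ.

```python
def job_cost_and_time(num_tokens, price_per_token, tokens_per_second):
    """Return (total cost, wall-clock seconds) for generating num_tokens."""
    return num_tokens * price_per_token, num_tokens / tokens_per_second


BASE_PRICE = 1.0   # normalized unit price; real pricing is not given here
BASE_TPS = 100.0   # assumed baseline speed, inferred from the ~10x claim

standard = job_cost_and_time(10_000, BASE_PRICE, BASE_TPS)
ultra = job_cost_and_time(10_000, 3 * BASE_PRICE, 10 * BASE_TPS)
print("standard:", standard)
print("ultraspeed:", ultra)
```

For a 10,000-token generation, these assumptions give 100 seconds at 1× cost for the standard model and 10 seconds at 3× cost for UltraSpeed. Whether that is worthwhile depends on how much the workload values the 90 seconds saved.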

Strengths and Limitations

Strengths

- 1000+ TPS on a 1T model without custom silicon.
- Lossless decoding through rejection sampling in DFlash.
- FP4 applied only to the MoE Experts, where tolerance for low precision is highest, so output quality stays close to the original.
- An open checkpoint allows the community to verify the performance claims.

Limitations

- Initial access is restricted, brief, and requires approval.
- Cost per token is three times that of the standard model.
- Acceptance length is lowest on open-ended agent tasks (4.29, versus 6.30 for coding), which limits the speedup there.
- Independent third-party benchmarks for speed are not yet publicly available.

Key Takeaways

- Xiaomi’s MiMo and TileRT teams achieved decoding speeds exceeding 1000 tokens per second on a 1-trillion-parameter model using standard GPUs.
- The acceleration is driven by three components: FP4 quantization, DFlash speculative decoding, and the TileRT runtime engine.
- FP4 (MXFP4) quantization is applied exclusively to MoE Experts, with Quantization-Aware Training preserving nearly identical performance.
- DFlash speculatively predicts an entire masked block in a single forward pass, achieving an average acceptance length of 6.30 in coding tasks.
- UltraSpeed runs on a single 8-GPU node and is available via an application-based API trial from June 9 to June 23, 2026.

Marktechpost’s Visual Explainer

What It Is

Developed by Xiaomi’s MiMo team in collaboration with the TileRT systems group.
Decodes more than 1000 tokens per second on a model with 1 trillion parameters.
Live demonstrations show generation rates approaching 1200 tokens per second.
Runs entirely on commodity hardware: a single standard 8-GPU node.
Launched on June 8, 2026.

1000+ tokens / second
1T parameters (MoE)
8 commodity GPUs

Three Layers Working Together

FP4 quantization compresses model weights and reduces memory bandwidth demands.
DFlash speculative decoding forecasts multiple tokens simultaneously.
TileRT orchestrates the full pipeline at microsecond-level latency.
Xiaomi describes this strategy as extreme model-system codesign.
No single technique suffices; all three must work in concert.

Layer 1: FP4 Quantization

Leverages the MXFP4 format to cut memory footprint and bandwidth requirements.
Applied only to the MoE Experts, not the entire model.
Remaining modules retain higher precision (FP8, as managed by TileRT).
Experts contain the majority of parameters and handle quantization most gracefully.
Quantization-Aware Training ensures performance stays virtually unchanged from the original.

Layer 2: DFlash Speculative Decoding

A technique from the research community that uses block-level masked parallel prediction.
The draft model generates an entire block of tokens in a single forward pass.
It employs Sliding Window Attention with a maximum block size of 8.
Rejection sampling ensures the final output is lossless.

Scenario	Acceptance Length
Coding	6.30
Math / Reasoning	5.56
Agent	4.29

Layer 3: TileRT Runtime

At 1000 TPS, each operator executes in just a few microseconds.
A Persistent Engine Kernel remains loaded on the GPU at all times.
Warp Specialization separates data movement, computation, and communication tasks.
Smaller operations like RMSNorm and RoPE become performance bottlenecks at this scale.
The runtime was designed hand-in-hand with the FP4 and DFlash strategies.

Where It Fits

Parallel reasoning: running multiple Best-of-N or tree-search paths simultaneously.
Coding agents: shorter delays between agent actions.
Real-time applications: trading signals, fraud detection, live conversations.
Interactive prototyping: building a Snake game in roughly 10 seconds.

Standard vs UltraSpeed

Dimension	MiMo-V2.5-Pro	UltraSpeed
Decode speed	Baseline	~10× (1000+ TPS)
Price	1×	3×
Weights	Standard	FP4 MoE Experts (QAT)
Decoding	Autoregressive	DFlash speculative
Access	Standard plans	API only, by application

Access, Pricing & Open Source

API trial is open from June 9 to June 23, 2026 (Beijing time).
Pricing is 3× the standard rate in exchange for roughly 10× faster throughput.
API access only; the Token Plan option is not available.
Model checkpoint open-sourced: MiMo-V2.5-Pro-FP4-DFlash on Hugging Face.
TileRT has released select modules as open source on GitHub.