Inference speed is quickly emerging as a key battleground for large language models. Xiaomi’s MiMo team has introduced MiMo-V2.5-Pro-UltraSpeed, developed alongside the TileRT systems team. This new release pushes a 1-trillion-parameter model past 1,000 tokens per second during decoding, a milestone the company describes as unprecedented at this scale. In demonstrations, generation rates have even touched 1,200 tokens per second. What makes this noteworthy is the hardware setup: it all runs on off-the-shelf GPUs rather than on proprietary chips.
What is MiMo-V2.5-Pro-UltraSpeed
UltraSpeed is a high-performance inference mode layered on top of the existing MiMo-V2.5-Pro model. The underlying foundation is built on a Mixture-of-Experts (MoE) setup with 1 trillion parameters. UltraSpeed focuses on pushing output generation velocity rather than altering the model’s core abilities. The boost is achieved through three tightly integrated techniques spanning both the model and the serving infrastructure. Xiaomi refers to this strategy as a deep model-system codesign. Importantly, the full pipeline fits comfortably on a single, standard 8-GPU server.
The Speed Case: Three Layers Working Together
The first layer employs FP4 quantization. At the trillion-parameter level, using FP8 or FP16 precision puts significant strain on memory and bandwidth. A smaller weight footprint means fewer bytes have to be read from memory for each decoding step, which translates directly into faster decoding. Xiaomi implements the MXFP4 format, specifically targeting the MoE Experts. Other components maintain higher precision, noted as FP8 by TileRT. Since the Experts account for the bulk of parameters and handle lower precision well, the trade-off is favorable. Quantization-Aware Training (QAT) keeps the quantized model’s performance virtually identical to the original’s.
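To make the format concrete, here is a minimal sketch of block-wise MXFP4 rounding as defined by the OCP Microscaling (MX) format: 4-bit E2M1 elements that share one 8-bit power-of-two scale per block of 32 values, or 4.25 bits per weight on average. The function names are illustrative, and the article does not describe Xiaomi's actual quantizer or QAT recipe, so this should be read as an illustration of the number format only.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign is stored separately).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # MX formats share one 8-bit scale per 32 elements


def quantize_block_mxfp4(block):
    """Return (shared_exponent, fp4_values) for one block of floats."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Pick a power-of-two scale so the largest magnitude lands near the top of
    # the FP4 range. The largest E2M1 value is 6 = 1.5 * 2**2, hence the "- 2".
    exponent = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** exponent
    values = []
    for x in block:
        target = abs(x) / scale
        # Round to the nearest representable magnitude (saturates at 6).
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        values.append(math.copysign(nearest, x))
    return exponent, values


def dequantize_block(exponent, values):
    """Reconstruct approximate floats from the shared exponent and FP4 values."""
    return [v * 2.0 ** exponent for v in values]


def weight_bytes(num_params, bits_per_param):
    """Storage needed if every parameter used the given precision."""
    return num_params * bits_per_param / 8


if __name__ == "__main__":
    exponent, values = quantize_block_mxfp4([0.9, -0.31, 0.07, 0.0])
    print(exponent, values, dequantize_block(exponent, values))
    # Upper-bound footprints for 1e12 parameters; in the release only the
    # MoE Experts are FP4, so the real total sits between the FP8 and FP4 rows.
    for name, bits in [("FP16", 16), ("FP8", 8), ("MXFP4", 4 + 8 / BLOCK_SIZE)]:
        print(name, weight_bytes(10**12, bits) / 1e12, "TB")
```

The footprint loop treats all parameters as if they used one precision, which is a simplification: the article does not state what fraction of the 1 trillion parameters belongs to the Experts.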
The second layer is the DFlash speculative decoding engine, which is explained in detail further below. The third layer is TileRT, the software framework responsible for orchestrating all the computational work on the GPU. No single ingredient is sufficient on its own; hitting 1000 TPS requires the precise alignment of all three layers.
DFlash: Parallel Drafting Without a Serial Bottleneck
Conventional speculative decoding relies on a lightweight draft model to predict upcoming tokens, which the larger model then validates simultaneously. Rejection sampling guarantees the final output is indistinguishable from standard decoding, preserving quality. The catch is that the traditional draft model produces tokens sequentially. DFlash, a newer technique from the research space, bypasses this limitation. It predicts blocks of masked tokens in parallel within a single forward pass.
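The verification step can be sketched with the standard speculative-sampling acceptance rule: a draft token is accepted with probability min(1, p/q), where p and q are the target and draft probabilities, and on the first rejection a replacement is sampled from the renormalized residual max(p - q, 0). This is the generic textbook scheme rather than DFlash's confirmed implementation; the function name and the dictionary-based distributions are assumptions made for illustration.

```python
import random


def verify_draft(draft_tokens, draft_probs, target_probs, rng=random):
    """Accept a prefix of the drafted block; return the tokens emitted this cycle.

    draft_probs[i] and target_probs[i] map token -> probability at position i
    under the draft model and the target model, respectively.
    """
    emitted = []
    for token, q, p in zip(draft_tokens, draft_probs, target_probs):
        # The draft sampled `token`, so q[token] > 0.
        accept_prob = min(1.0, p.get(token, 0.0) / q[token])
        if rng.random() < accept_prob:
            emitted.append(token)
            continue
        # Rejected: sample a replacement from the residual distribution,
        # then stop, because later draft tokens depended on the rejected one.
        residual = {t: max(p[t] - q.get(t, 0.0), 0.0) for t in p}
        tokens, weights = zip(*residual.items())
        emitted.append(rng.choices(tokens, weights=weights)[0])
        break
    return emitted


if __name__ == "__main__":
    q = [{"a": 0.8, "b": 0.2}, {"c": 0.5, "d": 0.5}]
    p = [{"a": 0.6, "b": 0.4}, {"c": 0.9, "d": 0.1}]
    print(verify_draft(["a", "c"], q, p))
```

For brevity the sketch omits the extra token that the target model can contribute when every draft token is accepted.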
Xiaomi refined DFlash using the Muon second-order optimizer and model self-distillation. The draft model relies exclusively on Sliding Window Attention (SWA), consistent with the MiMo-V2 architecture. This design keeps computation per prediction constant, regardless of context length. The block size is restricted to 8 tokens to manage verification overhead and maximize concurrency.
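The constant-cost property of Sliding Window Attention can be illustrated by counting the key positions a single query attends to. The window size of 128 below is a placeholder chosen for the example; the article does not state the value used by the draft model.

```python
def swa_attended_positions(query_pos, window):
    """Key positions visible to a causal sliding-window query."""
    start = max(0, query_pos - window + 1)
    return range(start, query_pos + 1)


def full_attended_positions(query_pos):
    """Key positions visible to a causal full-attention query."""
    return range(0, query_pos + 1)


if __name__ == "__main__":
    WINDOW = 128  # hypothetical window size, not taken from the article
    for pos in (100, 10_000, 1_000_000):
        swa = len(swa_attended_positions(pos, WINDOW))
        full = len(full_attended_positions(pos))
        print(f"position {pos}: SWA keys = {swa}, full-attention keys = {full}")
```

Once the context exceeds the window, the number of keys per query stops growing, whereas full attention keeps scaling with the context length.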
Acceptance length reflects the average number of draft tokens validated successfully in each cycle.
| Scenario | Acceptance Length |
|---|---|
| Coding | 6.30 |
| Math / Reasoning | 5.56 |
| Agent | 4.29 |
For coding tasks, six to seven out of eight proposed tokens pass validation each cycle. Certain test cases peaked at 7.14 accepted tokens.
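These figures translate into a per-cycle time budget. As a rough sketch, assume each draft-and-verify cycle emits exactly the acceptance length in tokens (ignoring any extra token contributed by the target model); the function name is illustrative. The longest cycle that still sustains a target rate is then:

```python
ACCEPTANCE_LENGTH = {"Coding": 6.30, "Math / Reasoning": 5.56, "Agent": 4.29}


def max_cycle_ms(target_tps, tokens_per_cycle):
    """Longest draft-and-verify cycle (in ms) that still sustains target_tps."""
    return tokens_per_cycle / target_tps * 1000.0


if __name__ == "__main__":
    for scenario, accepted in ACCEPTANCE_LENGTH.items():
        budget = max_cycle_ms(1000, accepted)
        print(f"{scenario}: {budget:.2f} ms per cycle for 1000 tokens/s")
```

Under this assumption, the lower the acceptance length, the less time each cycle is allowed to take, which is why agent workloads are the hardest case for holding the same decode rate.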
TileRT: Squeezing the Microseconds
At a rate of 1000 TPS, individual GPU operations execute in mere microseconds. Conventional frameworks dispatch operations individually, and each dispatch introduces overhead. These interruptions disrupt the processing flow and become the performance constraint. TileRT eliminates this by maintaining a Persistent Engine Kernel that remains active on the GPU. Through Warp Specialization, it assigns dedicated roles for data handling, computation, and communication. At these speeds, even minor operations like RMSNorm, RoPE, and KV cache writes can stifle performance. The system was architected in tandem with the FP4 and DFlash strategies, rather than bolted on later.
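A back-of-the-envelope model shows why per-operation dispatch becomes the constraint. Every number in the example below (dispatch count, overhead per dispatch, cycle budget) is a hypothetical placeholder, not a measurement from TileRT or from the article.

```python
def dispatch_budget_share(num_dispatches, overhead_us, cycle_budget_us):
    """Fraction of one decode cycle's time budget spent on dispatch overhead."""
    return num_dispatches * overhead_us / cycle_budget_us


if __name__ == "__main__":
    CYCLE_BUDGET_US = 6300.0  # hypothetical: a 6.3 ms budget per decode cycle
    OVERHEAD_US = 5.0         # hypothetical cost of one separate dispatch

    # Hypothetical: 1,000 separately dispatched operations per cycle.
    per_op = dispatch_budget_share(1000, OVERHEAD_US, CYCLE_BUDGET_US)
    # A persistent kernel is dispatched once and then stays resident.
    persistent = dispatch_budget_share(1, OVERHEAD_US, CYCLE_BUDGET_US)

    print(f"per-op dispatch: {per_op:.1%} of the budget")
    print(f"persistent kernel: {persistent:.2%} of the budget")
```

The point of the model is qualitative: when the whole cycle must fit in a few milliseconds, overhead that scales with the number of dispatches can consume most of the budget, while a single resident kernel keeps that cost close to zero.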
Use Cases
The release is tailored for scenarios where latency directly impacts the user experience:
- Parallel reasoning: executing numerous Best-of-N or tree-search strategies within tight time budgets.
- Coding agents: shortening the delay between agent decision-making cycles.
- Real-time decision loops: algorithmic trading signals, fraud detection, and conversational AI.
- Interactive prototyping: demonstrations show creating a Snake game in roughly 10 seconds and a macOS interface in about 60 seconds.
These are essentially latency-bound tasks where token generation speed is the primary limitation.
How It Compares
The first table outlines the two distinct paths to high-speed decoding.
| Approach | Hardware | How speed is achieved |
|---|---|---|
| Cerebras | Wafer-Scale integration (custom) | Scale on a single custom wafer |
| Groq | Custom architecture | Pure on-chip SRAM |
| MiMo × TileRT | Commodity GPUs (8-GPU node) | Model-system codesign: FP4 + DFlash + TileRT |
The second table contrasts the standard model features with the UltraSpeed mode.
| Dimension | MiMo-V2.5-Pro | MiMo-V2.5-Pro-UltraSpeed |
|---|---|---|
| Decode speed | Baseline | ~10× faster (1000+ TPS) |
| Price | 1× | 3× |
| Weight precision | Standard | FP4 MoE Experts via QAT |
| Decoding | Standard autoregressive | DFlash speculative decoding |
| Access | Standard model plans | API only, application-based trial |
| Token Plan | Supported | Not supported |
Access, Pricing, and Open Source
UltraSpeed is available initially through a restricted, application-only window. The API trial runs from June 9 to June 23, 2026. Pricing is set at 3 times the standard MiMo-V2.5-Pro rate, in exchange for roughly 10 times the generation speed. It is exclusively offered via API, and subscription Token Plans are not available. Approved users additionally receive complimentary Chat access during the trial period, subject to usage limits: 10 queue entries per day, 30-minute session lengths, and automatic session termination after 5 minutes of inactivity. Xiaomi released the MiMo-V2.5-Pro-FP4-DFlash checkpoint on Hugging Face. TileRT has opened select components on GitHub.
Strengths and Limitations
Strengths
- 1000+ TPS on a 1T model without custom silicon.
- Lossless decoding through rejection sampling in DFlash.
- FP4 applied only where tolerance for low precision is high, so output quality remains as high as possible.
- An open checkpoint allows the community to verify the performance claims.
Limitations
- Initial access is restricted, brief, and requires approval.
- Cost per token is three times higher than the standard model.
- Acceptance length drops on less predictable workloads: 4.29 for agent tasks versus 6.30 for coding.
- Independent third-party benchmarks for speed are not yet publicly available.
Key Takeaways
- Xiaomi’s MiMo and TileRT teams achieved decoding speeds exceeding 1000 tokens per second on a 1-trillion-parameter model using standard GPUs.
- The acceleration is driven by three components: FP4 quantization, DFlash speculative decoding, and the TileRT runtime engine.
- FP4 (MXFP4) quantization is applied exclusively to MoE Experts, with Quantization-Aware Training preserving nearly identical performance.
- DFlash speculatively predicts an entire masked block in a single forward pass, achieving an average acceptance length of 6.30 in coding tasks.
- UltraSpeed runs on a single 8-GPU node and is available via an application-based API trial running from June 9 to June 23, 2026.