In brief
- Xiaomi and its inference partner TileRT have achieved over 1,000 tokens per second on a model with 1 trillion parameters—a first at this scale—using a standard 8-GPU commodity node rather than custom hardware.
- The breakthrough relies on FP4 quantization applied to the model’s expert layers and DFlash speculative decoding, which generates an entire block of tokens in a single pass instead of one token at a time.
- A limited API trial runs from June 9 through June 23, priced at 3× standard MiMo rates for roughly 10× the generation speed.
Most people know Xiaomi as the Chinese phone brand that also makes affordable electric scooters and air purifiers. It’s not exactly the company you’d expect to shatter a major AI inference speed record on a Monday morning.
And yet, that’s precisely what happened. Xiaomi has just launched MiMo-V2.5-Pro-UltraSpeed, a serving mode for its trillion-parameter flagship that exceeds 1,000 tokens per second—reaching peaks near 1,200 in demonstrations.
Parameters are the internal numerical weights that define how a model processes information—the more parameters, the more complex the patterns it can recognize. Tokens are the chunks of text the model reads and writes, with each token representing roughly three-quarters of a word on average.
Xiaomi accomplished this on a single 8-GPU commodity node using standard hardware—no custom chips involved. That fundamentally changes the equation for who can realistically deploy this level of speed in production environments.
To put that figure in perspective: according to Artificial Analysis, GPT-5.5—the model most ChatGPT users are actually interacting with—runs at 68 tokens per second. Claude Opus 4.6 reaches around 71, with the smaller Haiku variant hitting 98 tokens per second. Gemini Flash achieves 192 tokens per second. MiMo-V2.5-Pro-UltraSpeed delivers 1,000, on a model that matches Opus on coding benchmarks.
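To make the gap concrete, here is a rough back-of-the-envelope sketch that converts the quoted speeds into wall-clock time for a single response. The 2,000-token response length is a hypothetical figure chosen for illustration, and the calculation assumes a steady decode rate and ignores time to first token.

```python
# Decode speeds in tokens per second, as quoted in the article.
SPEEDS_TPS = {
    "GPT-5.5": 68,
    "Claude Opus 4.6": 71,
    "Claude Haiku": 98,
    "Gemini Flash": 192,
    "MiMo-V2.5-Pro-UltraSpeed": 1000,
}


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to stream num_tokens at a steady decode rate.

    Simplification: ignores time to first token and any network overhead.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    response_tokens = 2000  # hypothetical response length, not from the article
    for name, tps in SPEEDS_TPS.items():
        seconds = generation_seconds(response_tokens, tps)
        print(f"{name:26s} {seconds:6.1f} s")
```

Under these assumptions, a response that takes about half a minute at 68 tokens per second arrives in about two seconds at 1,000.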
Cerebras and Groq have built entire businesses around solving this problem. Cerebras engineered a wafer-scale chip the size of a dinner plate, packing 44GB of on-chip memory to eliminate the bandwidth bottleneck that slows down GPU inference. It reached 969 tokens per second on Meta’s Llama 3.1 405B—an impressive feat, but that’s a 405-billion-parameter model, less than half the size of MiMo-V2.5-Pro. Groq’s custom Language Processing Unit architecture tops out around 300–750 tokens per second depending on the model.
Neither of these solutions runs on hardware you can rent from AWS tonight.
Xiaomi achieved its result on commodity GPUs through software alone—a combination of model-level optimizations and a purpose-built inference engine called TileRT.
What’s actually going on under the hood
Two techniques drive the speed. The first is FP4 quantization: instead of running the entire model at 8-bit or 16-bit precision, Xiaomi compresses the expert layers, which account for the bulk of the 1 trillion parameters, down to 4-bit. This reduces memory usage, eases bandwidth demands, and boosts speed. The usual trade-off is a slight drop in quality. Xiaomi’s approach is selective: only the expert layers are compressed, while everything else remains at higher precision. With this method, quality loss is said to be nearly nonexistent.
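For readers who want to see what 4-bit compression looks like in practice, the following is a minimal sketch of block-wise FP4 quantization. It assumes the common E2M1 layout (one sign bit, two exponent bits, one mantissa bit) with one shared scale per block of weights. The article does not say which FP4 variant, block size, or scaling scheme Xiaomi uses, so the function names and the scheme here are illustrative assumptions, not Xiaomi’s method.

```python
# Magnitudes representable by a 4-bit E2M1 float (assumed format, see lead-in).
FP4_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block(weights):
    """Map one block of weights to FP4 values plus a single shared scale.

    The scale is chosen so the largest weight lands on the largest FP4
    magnitude (6.0); every other weight snaps to the nearest FP4 value.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / FP4_MAGNITUDES[-1] if max_abs > 0 else 1.0
    quantized = []
    for w in weights:
        target = abs(w) / scale
        nearest = min(FP4_MAGNITUDES, key=lambda m: abs(m - target))
        quantized.append(-nearest if w < 0 else nearest)
    return quantized, scale


def dequantize_block(quantized, scale):
    """Recover approximate weights from FP4 values and the block scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    block = [0.12, -0.03, 0.6, -0.3]  # hypothetical expert-layer weights
    q, s = quantize_block(block)
    print("FP4 values:", q)
    print("scale:     ", s)
    print("restored:  ", dequantize_block(q, s))
```

Each weight is stored in 4 bits instead of 16, a fourfold reduction for the compressed layers, at the cost of the rounding error visible in the restored values.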
The second innovation is DFlash speculative decoding. Traditional speculative decoding uses a small draft model to predict the next few tokens, which the main model then verifies in parallel. DFlash eliminates the sequential drafting step entirely: it fills an entire block of masked positions in a single forward pass. In coding benchmarks, the main model accepts an average of 6.3 out of 8 proposed tokens per verification cycle. That means roughly six tokens are confirmed per cycle instead of one.
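The verification step can be sketched in a few lines. This toy example assumes greedy acceptance, the simplest form of speculative decoding: the main model keeps the longest run of proposed tokens that matches what it would have produced, then supplies one token of its own. The `toy_target` function is a hypothetical stand-in for the main model, and the draft block is hard-coded rather than produced by a DFlash-style drafter, which the article does not describe in detail. A real system scores all positions in one parallel forward pass, whereas the loop below checks them one by one for clarity.

```python
from typing import Callable, List, Sequence

Token = int
TargetFn = Callable[[Sequence[Token]], Token]


def verify_block(prefix: List[Token], draft: Sequence[Token],
                 target_next: TargetFn) -> List[Token]:
    """Return the tokens committed in one verification cycle (greedy acceptance)."""
    committed: List[Token] = []
    for proposed in draft:
        expected = target_next(prefix + committed)
        if proposed != expected:
            # First wrong guess: keep the target's token and discard the rest.
            committed.append(expected)
            return committed
        committed.append(proposed)
    # Every guess matched, so the target contributes one extra token.
    committed.append(target_next(prefix + committed))
    return committed


def toy_target(context: Sequence[Token]) -> Token:
    """Hypothetical main model: always predicts the previous token plus one."""
    return (context[-1] + 1) % 100 if context else 0


if __name__ == "__main__":
    prefix = [1, 2, 3]
    draft = [4, 5, 6, 7, 8, 9, 99, 11]  # block of 8; the seventh guess is wrong
    print(verify_block(prefix, draft, toy_target))
```

In this run six of the eight guesses are accepted and the main model corrects the seventh, so one cycle commits seven tokens where ordinary decoding would commit one.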
TileRT unifies the system. It keeps the entire compute pipeline continuously loaded on the GPU—eliminating per-operator launch delays and execution gaps.
Xiaomi refers to this strategy as “extreme model-system codesign,” and the description is fitting: no single technique alone achieves 1,000 tokens per second, but the combined effect of all methods working together does.
MiMo-V2.5-Pro is a top-tier model. We reported on the V2.5 Pro release in April—it performs on par with Claude Opus across most coding benchmarks and costs approximately $0.43 per million input tokens and $0.87 per million output tokens. Opus charges $5 per million input tokens and $25 per million output tokens.
UltraSpeed accelerates the exact MiMo V2.5 Pro model, not a reduced or simplified version.
Inference at this speed transforms how a model can be used. You can run multiple reasoning paths simultaneously instead of waiting for a single response. Fraud detection, trading signal generation, and real-time agent workflows—all of these require low-latency performance that 60 tokens per second simply can’t deliver. At 1,000 tokens per second, they become feasible.
Xiaomi is charging 3 times the standard MiMo-V2.5-Pro rate for roughly 10 times the throughput. The API trial runs from June 9 to 23, requires an application, and prioritizes enterprise and professional developers. The FP4-DFlash checkpoint is already available open-source on Hugging Face for community evaluation.
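The pricing trade-off is easy to check with a quick calculation. This sketch uses the per-million-token rates quoted above and assumes the 3× multiplier applies equally to input and output tokens, which the article does not spell out. The job size is hypothetical.

```python
# Prices in US dollars per million tokens, as quoted in the article.
MIMO_STANDARD = {"input": 0.43, "output": 0.87}
OPUS = {"input": 5.00, "output": 25.00}
ULTRASPEED_MULTIPLIER = 3  # assumed to apply to both input and output rates


def job_cost(rates, input_tokens, output_tokens, multiplier=1.0):
    """Dollar cost of a job given per-million-token rates and a price multiplier."""
    per_million = rates["input"] * input_tokens + rates["output"] * output_tokens
    return multiplier * per_million / 1_000_000


if __name__ == "__main__":
    tokens_in, tokens_out = 1_000_000, 1_000_000  # hypothetical workload
    print(f"MiMo standard:   ${job_cost(MIMO_STANDARD, tokens_in, tokens_out):.2f}")
    print(f"MiMo UltraSpeed: ${job_cost(MIMO_STANDARD, tokens_in, tokens_out, ULTRASPEED_MULTIPLIER):.2f}")
    print(f"Claude Opus:     ${job_cost(OPUS, tokens_in, tokens_out):.2f}")
```

Under these assumptions the UltraSpeed tier would still cost a fraction of the Opus rate for the same workload.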



