Xiaomi's MiMo-V2.5-Pro Challenges Claude Opus with Hours-Long Autonomous Coding, Compiling Projects in Under Five Hours

Xiaomi's newly unveiled open-weight model, MiMo-V2.5-Pro, has demonstrated the ability to write a complete compiler in under five hours. Internal tests indicate its performance on coding benchmarks is comparable to Anthropic's Claude Opus 4.6, while consuming significantly fewer tokens than competing Western models.

MiMo-V2.5-Pro is architected as a Mixture-of-Experts (MoE) model, in which only a subset of the total parameters is activated for any given input. It comprises 1.02 trillion parameters in total, of which 42 billion are active at a time. The MiMo team optimized this iteration for long-running tasks that require hours of execution and thousands of tool calls.
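The sparse-activation idea behind MoE models can be illustrated with a minimal routing sketch. This is not Xiaomi's implementation, and all numbers and function names below are hypothetical; it only shows the general mechanism by which a router selects a few experts per token, keeping active parameters far below the total count.

```python
import math

def softmax(xs):
    """Convert raw router logits into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(router_logits, top_k=2):
    """Pick the top_k experts by router score; return (expert_index, weight) pairs."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    total = sum(probs[i] for i in chosen)
    # Renormalize so the selected experts' weights sum to 1.
    return [(i, probs[i] / total) for i in chosen]

# Hypothetical example: 8 experts, 2 activated per token, so only a
# quarter of the expert parameters participate in this forward pass.
assignment = route_token([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3], top_k=2)
print(assignment)
```

The same principle, scaled up, is how a 1.02-trillion-parameter model can run with only 42 billion parameters active at a time.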

Its context window is among the largest in the industry: the main version can process up to one million tokens in a single context, while a base version, which forgoes additional long-context retraining, supports up to 256,000 tokens.

Xiaomi highlighted MiMo-V2.5-Pro's advancements through three distinct demonstrations. In the first, the model was tasked with building a complete compiler project derived from a Peking University course, a feat Xiaomi notes typically spans several weeks for a computer science student.

The model completed the compiler project in 4.3 hours, progressing through four phases and utilizing 672 tool calls. Xiaomi emphasized its methodology: the model first scaffolded the entire development pipeline and then iteratively developed each stage. Its inaugural compile run passed 137 of the 233 tests in the hidden suite, roughly 59 percent; by the end, it achieved a perfect 233 out of 233. Notably, during a subsequent refactoring phase, the model autonomously diagnosed and corrected a regression it had introduced.
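The reported pass rates are internally consistent, which a quick calculation confirms. The figures below are taken directly from the article; this is plain arithmetic, not a reproduction of the benchmark.

```python
# Figures reported by Xiaomi for the compiler demo.
first_run = 137 / 233   # tests passed on the first compile
final_run = 233 / 233   # tests passed after iterative development

# 137/233 rounds to 59 percent, matching the initial figure cited.
print(f"first compile: {first_run:.0%}, final: {final_run:.0%}")
```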

For the second demonstration, MiMo-V2.5-Pro generated a desktop video editor, comprising approximately 8,000 lines of code, based on minimal prompts. This autonomous process spanned 11.5 hours and involved around 1,870 tool calls.

The third demo involved connecting MiMo-V2.5-Pro to a circuit simulator via Claude Code, tasking it with designing a voltage regulator. Within one hour, the design met all six technical specifications simultaneously, with four of these outperforming the model's initial draft by approximately an order of magnitude.

Xiaomi primarily positions MiMo-V2.5-Pro based on its favorable performance-to-token ratio. On Xiaomi's internal ClawEval agent benchmark, the model achieved a 64 percent score using approximately 70,000 tokens per task run. This represents a 40-60 percent reduction in token consumption compared to Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 for equivalent performance levels.
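The token-efficiency claim can be sanity-checked with back-of-the-envelope arithmetic, assuming the "40-60 percent reduction" is measured relative to the competitors' token usage (the article does not specify the baseline):

```python
mimo_tokens = 70_000  # reported average tokens per ClawEval task run

# If mimo = competitor * (1 - reduction), then competitor = mimo / (1 - reduction).
for reduction in (0.40, 0.60):
    competitor = mimo_tokens / (1 - reduction)
    print(f"{reduction:.0%} reduction implies competitors use ~{competitor:,.0f} tokens")
```

Under that reading, the competing models would consume roughly 117,000 to 175,000 tokens for comparable task performance.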

In coding benchmarks, MiMo-V2.5-Pro performed nearly on par with Claude Opus 4.6 on SWE-bench Pro and slightly surpassed it on Terminal-Bench 2.0, scoring 78.9 on SWE-bench Verified, 57.2 on SWE-bench Pro, and 68.4 on Terminal-Bench 2.0.
