Xiaomi's MiMo Code Claims to Outperform Claude Code in Long-Horizon Tasks

A coding agent that scaffolds a working app over lunch will routinely stall around 30 steps into a production refactor. It locks onto a hypothesis early and keeps patching a wrong assumption, so small errors compound until the run comes apart.

This is the failure mode that Xiaomi’s MiMo AI team aims to address. The team has open-sourced MiMo Code, a terminal-native harness that the company claims outperforms Anthropic’s Claude Code on agentic tasks running beyond 200 steps. The claim is self-reported, drawn from Xiaomi’s own beta and a survey of 576 developers, but it signals a shift in how coding agents compete.

The number matters less than the axis of competition: **long-horizon reliability**. Holding a task together across hundreds of dependent steps is the new frontier in coding agents. The field is only now learning to measure how far an agent gets before it loses the thread, a shortfall known as the **endurance gap**.

Where do long-horizon agents break? Three failures recur: early hypothesis hardening, compounding errors (where step 40 inherits mistakes from step 12), and context drift. The result resembles a long batch job run without checkpoints, where a single crash forces a complete restart.
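The compounding-error failure can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes, purely for illustration, that every step succeeds independently with the same probability. The 99 percent per-step figure is hypothetical, not a number from Xiaomi or the Berkeley benchmark, and real agent errors are correlated, so treat this as an intuition aid rather than a model of any actual agent.

```python
def run_success_probability(per_step_success: float, steps: int) -> float:
    """Chance that all `steps` succeed, assuming independent, identical steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Hypothetical agent that gets 99 percent of individual steps right.
for steps in (30, 200):
    probability = run_success_probability(0.99, steps)
    print(f"{steps} steps: {probability:.3f}")
# 30 steps: 0.740
# 200 steps: 0.134
```

Under these assumptions, an agent that looks nearly flawless per step still finishes a 200-step task cleanly only about one time in seven, which is why the step count matters more than single-step accuracy.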

To grade this gap objectively, researchers at UC Berkeley’s RDI lab, led by Dawn Song and Yiyou Sun, created the Agents' Last Exam benchmark. Shaped by over 250 industry experts, it tests agents on real-world shipped projects and exposes steep performance drops: even the most powerful model configurations scored below 50 percent on the easy tiers.

[AgentUpdate Depth Analysis] Xiaomi's MiMo Code highlights a critical shift in the AI agent ecosystem from simple code-generation demos to high-endurance, long-horizon execution. While popular tools like Cursor and Devin excel at short-span tasks, they inevitably degrade under "hypothesis lock-in" during deep refactoring. Xiaomi's emphasis on conquering the 200-step barrier targets the most painful bottleneck in real-world software engineering. However, as UC Berkeley’s RDI lab demonstrated with "Agents' Last Exam," current agent architectures still struggle to surpass a 50% success rate on real, complex environments. The path forward for coding agents lies not just in wrapper-level scaffolding, but in integrating test-time compute (like RL-driven reasoning) with robust agent state-restoration mechanisms. For AI agents to achieve true production readiness, they must possess the metacognitive ability to self-correct, roll back faulty assumptions, and manage state transitions gracefully across hundreds of execution steps.
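The state-restoration idea can be sketched in a few lines. This is a minimal illustration, not MiMo Code's or any other tool's implementation: `run_with_checkpoints`, `is_valid`, and the toy `edits` state are hypothetical names invented here. The runner snapshots the state every few steps and, when a validation check fails, returns the last good snapshot instead of forcing a restart from scratch.

```python
import copy
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Step = Callable[[State], None]


def run_with_checkpoints(
    state: State,
    steps: List[Step],
    is_valid: Callable[[State], bool],
    checkpoint_every: int = 5,
) -> Tuple[State, int]:
    """Apply steps in order, snapshotting the state periodically.

    Returns (state, completed). If a step leaves the state invalid, the
    last snapshot is returned together with the number of steps it covers.
    """
    checkpoint = copy.deepcopy(state)
    completed = 0
    for index, step in enumerate(steps, start=1):
        step(state)
        if not is_valid(state):
            # Roll back: discard everything since the last good snapshot.
            return checkpoint, completed
        if index % checkpoint_every == 0:
            checkpoint = copy.deepcopy(state)
            completed = index
    return state, len(steps)


def add_edit(name: str) -> Step:
    """Build a toy step that records an edit in the shared state."""
    def step(state: State) -> None:
        state["edits"].append(name)
    return step


steps = [add_edit("a"), add_edit("b"), add_edit("c"), add_edit("BROKEN"), add_edit("d")]
final_state, completed = run_with_checkpoints(
    {"edits": []},
    steps,
    is_valid=lambda s: "BROKEN" not in s["edits"],
    checkpoint_every=2,
)
print(final_state, completed)  # {'edits': ['a', 'b']} 2
```

In this toy run the fourth step corrupts the state, so the runner falls back to the snapshot taken after step 2 and reports that two steps are safely kept. A real agent would need a far richer validity check (tests, type checks, build status) and a cheaper snapshot mechanism than a deep copy, such as version-control commits.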

