τ0-WM: The Largest Open-Source Embodied World Model Pre-Trained on 17,800h Robot Data

After nearly two years of rapid development in Embodied AI, the field has reached a milestone in using large-scale, real-world physical data for pre-training. A team led by Jianlan Luo, Associate Professor at Shanghai Innovation Institute and Chief Scientist at Agibot, has released τ0-World Model (τ0-WM), described as the world's largest open-source pre-trained embodied world model. The model has 5 billion parameters, and its pre-training dataset totals nearly 30,000 hours, with a record-setting 17,800 hours of real-robot teleoperation data at its core. This challenges the convention that real-robot data can only be used for late-stage fine-tuning.

Traditional robot perception and control rely heavily on reactive end-to-end policies, in which the neural network outputs actions immediately upon receiving visual inputs. While this "reflex-like" approach works well on standard tasks, it falls short in contact-rich, long-horizon, or highly occluded manipulation, where a single error can propagate irreversibly. To address this, τ0-WM incorporates Test-Time Computation (TTC): before execution, the robot runs parallel simulations and compares multiple action paths in a "virtual sandbox," allowing for deliberate "slow thinking" and active error correction.

Online inference in τ0-WM runs in three phases. First, in "Proposal," the Video-Action Model (VAM) samples multiple candidate actions and generates blurry future frames from multi-view observations and language prompts. Second, in "Rollout," the action-conditioned video simulator generates fine-grained, multi-view future frames for each candidate to tackle occlusions. Third, in "Evaluation & Correction," the system scores the candidate actions using the Re-denoising Consistency Score (RCS); if the scores are insufficient, it triggers Low-quality Action Rectification (LAR), which selects the most progress-aligned future frame and regenerates actions. Unlike traditional world models that discard future-prediction modules during deployment, τ0-WM retains explicit future imagination during inference to steer final decisions.
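The three phases can be pictured as a simple control loop. The Python sketch below is only an illustration of that flow under stated assumptions, not the released implementation: the function name plan_step, the threshold value, and every toy_* helper are hypothetical stand-ins (a 1-D toy world replaces the VAM, the video simulator, RCS, and LAR).

```python
from typing import Callable, List, Tuple

GOAL = 1.0  # toy 1-D target position (illustrative only)


def plan_step(
    propose: Callable[[int], List[float]],
    rollout: Callable[[float], float],
    score: Callable[[float, float], float],
    progress: Callable[[float], float],
    regenerate: Callable[[float], float],
    num_candidates: int = 4,
    threshold: float = 0.5,  # hypothetical acceptance threshold
) -> Tuple[float, bool]:
    """Return (chosen_action, was_rectified) for one planning step."""
    # Phase 1, Proposal: sample several candidate actions.
    candidates = propose(num_candidates)
    # Phase 2, Rollout: imagine the future that each candidate leads to.
    futures = [rollout(a) for a in candidates]
    # Phase 3, Evaluation: score every (action, future) pair.
    scores = [score(a, f) for a, f in zip(candidates, futures)]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    if scores[best] >= threshold:
        return candidates[best], False
    # Correction: pick the most progress-aligned imagined future and
    # regenerate an action from it.
    target = max(futures, key=progress)
    return regenerate(target), True


# Toy stand-ins, all hypothetical: actions and futures are 1-D positions.
def toy_rollout(action: float) -> float:
    return action  # the imagined future position equals the commanded one


def toy_score(action: float, future: float) -> float:
    return 1.0 - abs(future - GOAL)  # placeholder score, not the real RCS


def toy_progress(future: float) -> float:
    return -abs(future - GOAL)  # closer to the goal means more progress


def toy_regenerate(target: float) -> float:
    return target + 0.5 * (GOAL - target)  # step halfway toward the goal


def weak_proposals(n: int) -> List[float]:
    return [0.1 * i for i in range(n)]  # all candidates far from the goal


def strong_proposals(n: int) -> List[float]:
    return [0.9] * n  # candidates already close to the goal
```

With weak_proposals, no candidate clears the threshold, so the correction branch runs; with strong_proposals, the best candidate is accepted as is. The point of the sketch is the control flow: imagined futures are kept at inference time and used both to rank actions and to repair them.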

Architecturally, τ0-WM consists of two components that share a video diffusion backbone: the VAM (built on the Wan2.2-5B video generation model) and the action-conditioned video simulator. The pre-training dataset comprises three parts: 17,800 hours of real-robot teleoperation data (providing precise action supervision), 6,500 hours of Universal Manipulation Interface (UMI) data (enhancing behavioral diversity), and 3,000 hours of egocentric human interaction data (covering long-tail scenarios). These heterogeneous sources are harmonized during training using modality-specific supervision masks.
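The three listed sources sum to 27,300 hours, of which real-robot teleoperation accounts for roughly 65%. The Python sketch below illustrates one way modality-specific supervision masks could work: each source switches individual loss terms on or off, and each term is averaged only over the samples that supervise it. The mask values are assumptions made for illustration (in particular, that egocentric human video carries no robot action labels while UMI data does); the article does not specify them, and names such as SUPERVISION and masked_loss are hypothetical.

```python
from typing import Dict, List, Tuple

# Hours per source, as reported in the article.
HOURS: Dict[str, int] = {
    "real_robot_teleop": 17_800,
    "umi": 6_500,
    "egocentric_human": 3_000,
}

# ASSUMPTION: which loss heads each source supervises (1.0 = on, 0.0 = off).
# These values are illustrative and are not taken from the article.
SUPERVISION: Dict[str, Dict[str, float]] = {
    "real_robot_teleop": {"video": 1.0, "action": 1.0},
    "umi": {"video": 1.0, "action": 1.0},
    "egocentric_human": {"video": 1.0, "action": 0.0},
}


def data_shares() -> Dict[str, float]:
    """Fraction of the total pre-training hours contributed by each source."""
    total = sum(HOURS.values())
    return {name: hours / total for name, hours in HOURS.items()}


def masked_loss(batch: List[Tuple[str, float, float]]) -> float:
    """Combine per-sample losses, skipping heads a source does not supervise.

    Each batch item is (source, video_loss, action_loss). Every head is
    averaged over the samples whose mask enables it, then the heads are summed.
    """
    sums = {"video": 0.0, "action": 0.0}
    weights = {"video": 0.0, "action": 0.0}
    for source, video_loss, action_loss in batch:
        mask = SUPERVISION[source]
        for head, value in (("video", video_loss), ("action", action_loss)):
            sums[head] += mask[head] * value
            weights[head] += mask[head]
    return sum(sums[h] / weights[h] for h in sums if weights[h] > 0)
```

Under these assumed masks, an egocentric human sample contributes to the video term only, so a meaningless action loss on that sample never reaches the gradient.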

Evaluation results show that τ0-WM significantly outperforms baselines such as π0.5 and Fast-WAM on four long-horizon fine manipulation tasks (Toolbox, School Bag, Badminton, and Faucet), indicating strong generalization and robustness in complex physical environments.

[AgentUpdate Depth Analysis] The release of τ0-WM marks a pivotal shift for Embodied AI from reactive policies to proactive planning, mirroring the transition from System 1 (fast, reactive) to System 2 (slow, deliberative) thinking in LLMs. By combining 17,800 hours of real-robot pre-training data with Test-Time Computation (TTC), τ0-WM addresses a major bottleneck in robot manipulation: the propagation of irreversible errors in long-horizon tasks. While traditional world models discard future-prediction modules during deployment to boost speed, τ0-WM retains "explicit future imagination" to evaluate and rectify actions in real time. This methodology offers a useful paradigm for the broader AI Agent ecosystem: agents operating in the physical world need closed-loop self-correction capabilities. Because τ0-WM is open source, it lowers the barrier to embodied agent development, paving the way for highly adaptive, general-purpose robotics.