HACRL: Collaborative Reinforcement Learning for Heterogeneous Agents

Researchers have introduced Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), a new problem setting within Reinforcement Learning with Verifiable Rewards (RLVR) that targets the inefficiency of optimizing multiple agents on-policy in isolation from one another. HACRL pairs collaborative optimization during training with independent execution at inference time: heterogeneous agents share verified rollouts while training so that each improves from the others' samples, yet each is deployed on its own.

HACRL differs from typical LLM-based multi-agent reinforcement learning (MARL) in that it does not require coordinated deployment. It also differs from on-policy and off-policy distillation: instead of a one-directional teacher-to-student transfer, it enables bidirectional mutual learning among heterogeneous agents.

To implement this framework, the team proposed HACPO (Heterogeneous Agent Collaborative Policy Optimization), a principled algorithm that enables rollout sharing to maximize sample utilization and cross-agent knowledge transfer. To mitigate the challenges of capability discrepancies and policy distribution shifts, HACPO incorporates four specialized mechanisms with theoretical guarantees on unbiased advantage estimation.
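The idea of rollout sharing can be illustrated with a minimal sketch. This is not the HACPO algorithm: the summary does not specify its four mechanisms, so the code below only shows the naive baseline of pooling verified rollouts from two agents and computing group-normalized advantages over the shared pool. The names `Rollout` and `pooled_advantages`, the agent labels, and the toy prompt and rewards are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    agent: str      # which agent generated the response (hypothetical label)
    prompt: str
    response: str
    reward: float   # verifiable reward, e.g. 1.0 if the answer checks out


def pooled_advantages(rollouts, eps=1e-6):
    """Group-normalized advantages over a shared pool for one prompt.

    Assumption: plain mean/std normalization across all agents' rollouts.
    No correction is applied for differences between the agents' policies.
    """
    rewards = [r.reward for r in rollouts]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r.reward - mu) / (sigma + eps) for r in rollouts]


# Toy pool for a single prompt: two rollouts generated by each agent.
pool = [
    Rollout("agent_a", "2+2?", "4", 1.0),
    Rollout("agent_a", "2+2?", "5", 0.0),
    Rollout("agent_b", "2+2?", "4", 1.0),
    Rollout("agent_b", "2+2?", "22", 0.0),
]
advantages = pooled_advantages(pool)

for rollout, adv in zip(pool, advantages):
    print(f"{rollout.agent:8s} reward={rollout.reward:.1f} advantage={adv:+.3f}")
```

In this toy setup each agent generates two rollouts but can train on all four, which is the sample-utilization benefit the article describes. The sketch deliberately ignores the capability discrepancies and policy distribution shifts that HACPO's mechanisms are said to address; naive pooling of another agent's samples is exactly where those problems arise.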

Extensive experiments across various heterogeneous model configurations and reasoning benchmarks show that HACPO consistently improves all participating agents. HACPO outperformed the GSPO baseline by an average of 3.6% while using only half the rollout cost; that is, GSPO was given twice as many rollouts and still fell short. The result suggests that shared, verified rollouts are an effective way to train groups of heterogeneous agents.

