ARA: AI Agent System Revolutionizes Scientific Peer-Review with Scalable Reproducibility Assessment

Scientific peer review is increasingly strained by the scale and complexity of modern research output, particularly when it comes to assessing reproducibility. Evaluating reproducibility requires reconstructing experimental dependencies, methodological choices, data flows, and result-generating procedures, a task that often exceeds the capacity of human reviewers.

To address this, Agentic Reproducibility Assessment (ARA) formalizes reproducibility assessment as a structured reasoning task over scientific documents. Given a research paper, ARA uses AI agents to extract a directed workflow graph that links sources, methods, experiments, and outputs. It then evaluates how reconstructable this workflow is, using both structural and content-based scores to produce a reproducibility assessment.
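To make the idea of a workflow graph with structural and content-based scores concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not ARA's actual implementation: the `Workflow` class, the four node kinds as string labels, and both scoring functions are hypothetical. The structural score here is assumed to be the fraction of output nodes reachable from at least one source node, and the content score is assumed to be the fraction of nodes that carry a non-empty description.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """Hypothetical directed workflow graph extracted from a paper."""
    kinds: dict = field(default_factory=dict)    # node id -> "source" | "method" | "experiment" | "output"
    details: dict = field(default_factory=dict)  # node id -> extracted description (may be empty)
    edges: dict = field(default_factory=dict)    # node id -> list of downstream node ids

    def add_node(self, node_id, kind, detail=""):
        self.kinds[node_id] = kind
        self.details[node_id] = detail
        self.edges.setdefault(node_id, [])

    def add_edge(self, src, dst):
        self.edges[src].append(dst)


def reachable_from_sources(wf):
    """Breadth-first search starting from every source node."""
    starts = [n for n, k in wf.kinds.items() if k == "source"]
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in wf.edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def structural_score(wf):
    """Assumed metric: share of outputs that trace back to some source."""
    outputs = [n for n, k in wf.kinds.items() if k == "output"]
    if not outputs:
        return 0.0
    reached = reachable_from_sources(wf)
    return sum(n in reached for n in outputs) / len(outputs)


def content_score(wf):
    """Assumed metric: share of nodes with a non-empty description."""
    if not wf.kinds:
        return 0.0
    described = sum(1 for n in wf.kinds if wf.details[n].strip())
    return described / len(wf.kinds)


def example_workflow():
    """Toy paper: one output is fully traceable, one figure is orphaned."""
    wf = Workflow()
    wf.add_node("dataset", "source", "Public CSV, version 1.2")
    wf.add_node("train", "method", "SGD, learning rate 0.01")
    wf.add_node("eval", "experiment", "")  # paper gives no evaluation details
    wf.add_node("table1", "output", "Accuracy table")
    wf.add_node("fig2", "output", "Loss curve")
    wf.add_edge("dataset", "train")
    wf.add_edge("train", "eval")
    wf.add_edge("eval", "table1")
    return wf
```

For the toy workflow, `structural_score(example_workflow())` is 0.5, because only one of the two outputs can be traced back to a source, and `content_score(example_workflow())` is 0.8, because four of the five nodes have a description. A real system would need far richer criteria; the point is only that the two kinds of score capture different gaps: missing links versus missing details.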

The generalizability and consistency of ARA were demonstrated through experiments on 213 ReScience C articles, which represent the largest cross-domain benchmark of human-validated computational reproducibility studies to date. The system showed consistent workflow reconstruction and assessment across various large language models (LLMs), model temperatures, and scientific domains. ARA achieved approximately 61% accuracy on three benchmarks. Notably, it reported the highest accuracy on ReproBench (60.71% versus 36.84%) and GoldStandardDB (61.68% versus 43.56%). These results highlight its substantial potential to complement human review at scale and to support the evolution of next-generation peer review processes.