Scientific peer review is increasingly challenged by the scale and complexity of modern research output, particularly in its ability to assess reproducibility. Evaluating reproducibility requires reconstructing experimental dependencies, methodological choices, data flows, and result-generating procedures, a task that often exceeds the capacity of human reviewers.
To address this, Agentic Reproducibility Assessment (ARA) formalizes reproducibility assessment as a structured reasoning task over scientific documents. Given a research paper, ARA leverages AI agents to extract a directed workflow graph that links sources, methods, experiments, and outputs. It then evaluates the reconstructability of this workflow using both structural and content-based scores, yielding a comprehensive reproducibility assessment.
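To make the structure of such an assessment concrete, the following minimal Python sketch shows one way a directed workflow graph over sources, methods, experiments, and outputs could be represented, with a simple reachability-based structural score. The node categories, class names, and scoring rule here are illustrative assumptions for exposition, not ARA's actual implementation or its content-based scoring.

```python
# Illustrative sketch of a workflow graph and a structural score.
# Node types and the scoring rule are assumptions, not ARA's exact method.
from dataclasses import dataclass, field

NODE_TYPES = {"source", "method", "experiment", "output"}

@dataclass
class WorkflowGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> node type
    edges: set = field(default_factory=set)    # directed (src, dst) pairs

    def add_node(self, node_id: str, node_type: str) -> None:
        assert node_type in NODE_TYPES
        self.nodes[node_id] = node_type

    def add_edge(self, src: str, dst: str) -> None:
        self.edges.add((src, dst))

    def structural_score(self) -> float:
        """Fraction of non-source nodes reachable from some source node,
        a stand-in for how reconstructable the workflow is."""
        adj = {}
        for s, d in self.edges:
            adj.setdefault(s, []).append(d)
        frontier = [n for n, t in self.nodes.items() if t == "source"]
        reached = set(frontier)
        while frontier:
            nxt = []
            for n in frontier:
                for m in adj.get(n, []):
                    if m not in reached:
                        reached.add(m)
                        nxt.append(m)
            frontier = nxt
        targets = [n for n, t in self.nodes.items() if t != "source"]
        if not targets:
            return 0.0
        return sum(n in reached for n in targets) / len(targets)

# Example: a dataset feeding a training method, an evaluation, and a result.
g = WorkflowGraph()
g.add_node("dataset", "source")
g.add_node("train", "method")
g.add_node("eval", "experiment")
g.add_node("table1", "output")
g.add_edge("dataset", "train")
g.add_edge("train", "eval")
g.add_edge("eval", "table1")
print(f"structural score: {g.structural_score():.2f}")  # 1.00 when fully linked
```

A fragmented workflow, e.g. a reported result with no extracted path back to any data source, would score below 1.0 under this sketch, which is the intuition behind penalizing workflows that cannot be reconstructed end to end.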
The generalizability and consistency of ARA were demonstrated through experiments on 213 ReScience C articles, the largest cross-domain benchmark of human-validated computational reproducibility studies to date. The system produced consistent workflow reconstruction and assessment across Large Language Models (LLMs), model temperatures, and scientific domains, and achieved approximately 61% accuracy across three benchmarks. Notably, it attained the highest accuracy on ReproBench (60.71% versus 36.84%) and GoldStandardDB (61.68% versus 43.56%), highlighting its potential to complement human review at scale and to support next-generation peer-review processes.