AI research agents are accelerating machine learning research by automating hypothesis generation, experiment execution, and empirical refinement. However, existing agent strategies span a wide spectrum, from simple greedy hill-climbing to complex tree search and evolutionary optimization, and it remains unclear which strategic choices actually drive performance. Evaluating these strategies has been difficult because existing benchmarks conflate an agent's search topology with its execution infrastructure (e.g., code editors), which makes it impossible to isolate the source of performance gains. These benchmarks also lack process-level metrics for analyzing exploration behavior.
To address this gap, researchers have introduced FML-bench, a novel benchmark comprising 18 fundamental ML research tasks across 10 distinct domains. FML-bench successfully separates agent strategy from execution infrastructure and defines 12 detailed process-level behavioral metrics to analyze search dynamics.
An evaluation of six representative agents produced several counter-intuitive findings:
First, strategy complexity alone does not guarantee superior performance. Surprisingly, a simple greedy hill-climber achieved near parity with the best-performing tree-search agent, and both significantly outperformed the remaining four, more complex agents.
Second, this pattern is highly correlated with the "improvement opportunity structure". Greedy search is highly effective when improvement opportunities are dense, whereas tree-search and evolutionary strategies excel when opportunities are sparse. Leveraging this insight, the team designed an adaptive agent that dynamically shifts to broader exploration upon detecting stagnation. This adaptive agent outperformed all six baseline agents.
Third, process-level analysis indicates that early convergence and directionally focused exploration are strongly correlated with final performance, while solution diversity and compute costs show no significant association.
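To make the third finding concrete, the sketch below shows one way a process-level metric could be tested for association with final performance across runs. It is an illustration only: the summary does not say which statistic the study used, so Spearman rank correlation is an assumption here, and the function names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman(metric: Sequence[float], performance: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(metric), average_ranks(performance))
```

Each run would contribute one value of the metric (for example, the step at which the agent's best score stopped changing) and one final score. A coefficient near zero would match the reported lack of association for solution diversity and compute cost.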
The FML-bench repository is now publicly available, offering a standardized platform for designing and evaluating future AI research agents.
[AgentUpdate Depth Analysis] AI agent evaluations have long struggled to separate two confounded factors: the agent's reasoning strategy and its execution environment. Benchmarks like SWE-bench conflate the "cognitive" performance of LLMs with their "physical" tool-handling capabilities. FML-bench's decision to decouple search dynamics from infrastructure is a crucial step forward. By isolating search topology, it provides a clean sandbox for studying how agents navigate complex decision spaces. The success of the adaptive agent also highlights the need for meta-cognitive layers in future agent architectures. Instead of relying on static prompt chains or fixed tree-search budgets, next-generation AI agents must actively monitor their own progress, recognize stagnation, and dynamically adjust their cognitive depth. This shift from static execution to dynamic resource allocation will be pivotal for scaling LLM-based autonomous scientists.
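The stagnation-triggered switch described above can be sketched on a toy one-dimensional objective. This is a minimal illustration, not the agent from the study: the summary gives no implementation details, so the function `adaptive_search`, the `patience` threshold, the step sizes, and the branching factor are all assumptions.

```python
import random
from typing import Callable, List, Tuple


def adaptive_search(
    evaluate: Callable[[float], float],
    start: float,
    steps: int = 200,
    patience: int = 5,        # hypothetical: non-improving steps tolerated
    narrow_step: float = 0.1,  # hypothetical: greedy proposal radius
    wide_step: float = 2.0,    # hypothetical: exploration proposal radius
    branch: int = 8,           # hypothetical: candidates per exploration step
    seed: int = 0,
) -> Tuple[float, float, List[str]]:
    """Maximize `evaluate`, widening the search when progress stalls.

    Returns the best point, its score, and the mode used at each step.
    """
    rng = random.Random(seed)
    best_x, best_score = start, evaluate(start)
    stale = 0  # consecutive steps without improvement
    modes: List[str] = []

    for _ in range(steps):
        if stale < patience:
            # Greedy mode: a single small perturbation of the incumbent.
            mode = "greedy"
            candidates = [best_x + rng.uniform(-narrow_step, narrow_step)]
        else:
            # Stagnation detected: sample several candidates over a wider range.
            mode = "explore"
            candidates = [
                best_x + rng.uniform(-wide_step, wide_step)
                for _ in range(branch)
            ]
        modes.append(mode)

        top_score, top_x = max((evaluate(c), c) for c in candidates)
        if top_score > best_score:
            best_x, best_score = top_x, top_score
            stale = 0  # any improvement returns the agent to greedy mode
        else:
            stale += 1

    return best_x, best_score, modes
```

The stagnation counter stands in for the progress monitoring the commentary calls for. In a real research agent, each candidate would be a proposed code or experiment change rather than a number, and `evaluate` would be a full training and evaluation run.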