“1% was written by me, and 99% was written by the Agent.”
Deli Chen, a prominent researcher at DeepSeek, recently published a survey of Autonomous Research Agents on his personal blog. The 46-page paper, with 7 figures, 4 tables, and 103 verified references, was written almost entirely by his custom-built agent skill, DeliAutoResearch.
Using DeepSeek-V4-Pro for reasoning and writing and GPT-Image2 for generating diagrams, the pipeline took the paper through six major revisions over six days. It ran approximately 108 Agent loops, consumed 648k tokens, and generated 2,234 lines of LaTeX code. Chen estimated that his own contribution amounted to less than two hours of total "human CPU time", for a workload that would typically take an academic researcher at least a month of intense labor.
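To put those totals in proportion, the short calculation below derives per-loop and per-day averages from the reported figures. It is only a back-of-the-envelope sketch: the loop count is approximate, and the averages assume work was spread evenly across loops, which the article does not claim.

```python
# Figures as reported in the article.
agent_loops = 108        # "approximately 108 Agent loops"
total_tokens = 648_000   # "648k tokens"
latex_lines = 2_234
days = 6

# Simple averages; real loops almost certainly varied in size.
tokens_per_loop = total_tokens / agent_loops
lines_per_loop = latex_lines / agent_loops
loops_per_day = agent_loops / days

print(f"{tokens_per_loop:.0f} tokens per loop")       # 6000 tokens per loop
print(f"{lines_per_loop:.1f} LaTeX lines per loop")   # 20.7 LaTeX lines per loop
print(f"{loops_per_day:.0f} loops per day")           # 18 loops per day
```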
The paper introduces a structured taxonomy of L1–L5 autonomy levels for Research Agents, modeled on the SAE autonomous driving levels, to bring order to the rapidly evolving AI Agent ecosystem:
L1 (Auto-completion): Basic predictive assistance, typified by early GitHub Copilot, which suggests the next line of code based on context.
L2 (Task Execution): Interactive assistants like standard ChatGPT or Claude equipped with tool-use. They can decompose tasks but require step-by-step human intervention and approval.
L3 (Multi-step Execution): Advanced developer agents such as Claude Code or Cursor Agent, capable of executing 10 to 100 sequential steps autonomously, prompting human review only at critical junctures.
L4 (Bounded Autonomous Execution): Goal-driven agents that can design and execute multi-step experiments, write code, and compile papers within a bounded research domain. At this stage, human users only provide the initial objective and evaluate the final output. Current frontier systems are only beginning to reach this level.
L5 (Fully Autonomous Research): The ultimate vision of self-directed agents that can independently establish research agendas, allocate computing budgets, continuously accumulate cross-disciplinary knowledge, and conduct long-term research. L5 remains unachieved, constrained primarily by two bottlenecks: continuous knowledge retention and reliable self-evaluation.
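The five levels can be written down as a small ordered data structure, which makes the shrinking human role explicit. This is a minimal sketch based only on the summary above: the enum member names, the HUMAN_ROLE wording, and the needs_per_step_approval helper are illustrative assumptions, not definitions taken from the survey.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """L1-L5 autonomy levels as summarized in the article (names are illustrative)."""
    AUTO_COMPLETION = 1
    TASK_EXECUTION = 2
    MULTI_STEP_EXECUTION = 3
    BOUNDED_AUTONOMOUS_EXECUTION = 4
    FULLY_AUTONOMOUS_RESEARCH = 5


# Human role at each level, paraphrased from the level descriptions.
HUMAN_ROLE = {
    AutonomyLevel.AUTO_COMPLETION: "does the work; the agent suggests the next line",
    AutonomyLevel.TASK_EXECUTION: "intervenes and approves step by step",
    AutonomyLevel.MULTI_STEP_EXECUTION: "reviews only at critical junctures",
    AutonomyLevel.BOUNDED_AUTONOMOUS_EXECUTION: "sets the objective and evaluates the final output",
    AutonomyLevel.FULLY_AUTONOMOUS_RESEARCH: "none required; the agent sets its own agenda",
}


def needs_per_step_approval(level: AutonomyLevel) -> bool:
    """Assumption: at L1 and L2 a human accepts or approves every individual step."""
    return level <= AutonomyLevel.TASK_EXECUTION


for level in AutonomyLevel:
    print(f"L{level.value}: {HUMAN_ROLE[level]}")
```

Using an ordered IntEnum lets levels be compared directly, so a check such as `level >= AutonomyLevel.BOUNDED_AUTONOMOUS_EXECUTION` reads the same way the taxonomy does.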
The survey also highlights four fundamental architectural paradigms that power modern agents:
1. Single-Agent Loop: Guided by frameworks like ReAct, Reflexion, LATS, and Tree of Thoughts (ToT). These simple loops are efficient but struggle with complex, open-ended tasks.
2. Multi-Agent Collaboration: Exemplified by CAMEL, AutoGen, and MetaGPT. These leverage specialized roles and cross-verification to minimize errors, though they often suffer from high token consumption and coordination overhead.
3. Hierarchical Planning: Deployed in cutting-edge agents like Devin and Claude Code. They employ top-down decomposition to orchestrate long-horizon, complex engineering or research objectives.
4. Tool-Augmented Execution: Led by systems like SWE-Agent, where the Agent-Computer Interface (ACI) and sandboxed environments (such as web browsers and code execution environments) strictly define the limits of agent capability.
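The first and fourth paradigms can be illustrated together with a toy think-act-observe loop in the spirit of ReAct. This is a simplified sketch, not code from the survey or from any of the named frameworks: scripted_policy is a hypothetical stand-in for an LLM call, and the TOOLS registry plays the role of a very small Agent-Computer Interface, since the agent can do nothing that is not registered there.

```python
from typing import Callable, Dict, List, Tuple

# Tool registry: the only actions available to the agent (a toy interface).
TOOLS: Dict[str, Callable[[str], str]] = {
    "word_count": lambda text: str(len(text.split())),
    "upper": lambda text: text.upper(),
}

Step = Tuple[str, str, str]  # (thought, action, argument)


def scripted_policy(task: str, observations: List[str]) -> Step:
    """Hypothetical stand-in for an LLM: chooses the next step from what it has seen."""
    if not observations:
        return ("I should count the words first.", "word_count", task)
    return ("I have the count, so I can answer.", "finish", observations[-1])


def run_agent(task: str, policy: Callable[[str, List[str]], Step], max_steps: int = 5) -> str:
    """Single-agent loop: think, act through a tool, observe, repeat until 'finish'."""
    observations: List[str] = []
    for _ in range(max_steps):
        _thought, action, argument = policy(task, observations)
        if action == "finish":
            return argument
        if action in TOOLS:
            observation = TOOLS[action](argument)
        else:
            # Unknown tools are reported back to the policy instead of crashing.
            observation = f"error: unknown tool {action!r}"
        observations.append(observation)
    raise RuntimeError("step budget exhausted without an answer")


print(run_agent("survey of autonomous research agents", scripted_policy))  # prints 5
```

The max_steps budget is one simple guard against a loop that never terminates; multi-agent and hierarchical designs add more structure around the same basic cycle.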
In its concluding sections, the paper addresses six critical open challenges impeding progress toward L5 autonomy: cognitive loop traps, context window constraints in long-horizon tasks, the lack of automated frameworks to measure scientific novelty, reproducibility issues stemming from LLM stochasticity, safety/ethical dual-use risks, and the compounding costs of high-token workflows.
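One common reason token costs compound in long-horizon workflows, offered here as an assumption rather than as the survey's own analysis, is that each step re-sends the entire accumulated context. Under that assumption, input tokens grow roughly with the square of the number of steps. The sketch below uses hypothetical numbers and ignores caching, summarization, and output tokens.

```python
def cumulative_input_tokens(steps: int, tokens_added_per_step: int) -> int:
    """Total input tokens when step i re-sends everything added in steps 1..i."""
    total = 0
    context = 0
    for _ in range(steps):
        context += tokens_added_per_step  # context grows by a fixed amount each step
        total += context                  # the whole context is sent again
    return total


def closed_form(steps: int, tokens_added_per_step: int) -> int:
    """Same quantity as an arithmetic series: k * n * (n + 1) / 2."""
    return tokens_added_per_step * steps * (steps + 1) // 2


# Hypothetical workload: 1,000 new tokens of context per step.
print(cumulative_input_tokens(10, 1_000))    # 55000
print(cumulative_input_tokens(100, 1_000))   # 5050000
```

In this toy model, ten times as many steps costs about 92 times as many input tokens, which is why context management matters so much for long runs.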
[AgentUpdate Depth Analysis] Deli Chen's workflow points to a shift in which AI Agents move from simple execution assistants to genuine co-researchers. Tools like Claude Code and Cursor have demonstrated L3/L4 task execution in software engineering, but applying autonomous agents to scientific research sets a much higher cognitive bar. By combining DeepSeek-V4-Pro with specialized research skills, this workflow shows, at least for a survey paper, that hierarchical planning and hybrid agent architectures can orchestrate long-horizon, multi-step academic pipelines. The bottleneck for future AI Agent ecosystems is shifting from raw LLM reasoning capability to the design of more robust Agent-Computer Interfaces (ACI) and self-correction feedback loops. On the path to L5 autonomy, the primary challenges will be continuous knowledge accumulation and objective self-evaluation, and solving them would redefine the division of labor between human cognitive initiative and agentic execution.