DeepSeek researcher Chen Deli recently published a survey paper on his personal blog, revealing that "1% was written by me, and 99% was written by the Agent." He said the entire paper took less than two hours of human "CPU time," compared with the month or so such a paper would typically take.
The paper, co-authored by Chen Deli and DeepSeek-V4-Pro, was researched and written primarily with his self-developed skill, DeliAutoResearch, with GPT-Image2 supplying the illustrations. It went through 6 iterations (4 for V1, 1 for V2, 1 for V3) over 6 days. During this process, the agent was called approximately 108 times, consuming 648,000 tokens and generating 2,234 lines of LaTeX code. The result is a 46-page document (538KB) with 103 verified references, 7 figures, and 4 tables.
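To put the reported figures in proportion, the short calculation below derives a few averages from them. It is only back-of-the-envelope arithmetic on the numbers quoted above: the call count is approximate, and the article does not say whether the token total covers input, output, or both.

```python
# Figures as reported in the article; the call count is approximate.
agent_calls = 108
total_tokens = 648_000
latex_lines = 2_234
pages = 46
iterations = {"V1": 4, "V2": 1, "V3": 1}

tokens_per_call = total_tokens / agent_calls   # average tokens per agent call
lines_per_page = latex_lines / pages           # average LaTeX lines per page
total_iterations = sum(iterations.values())    # should match the stated 6

print(f"tokens per call: {tokens_per_call:.0f}")      # 6000
print(f"LaTeX lines per page: {lines_per_page:.1f}")  # 48.6
print(f"iterations: {total_iterations}")              # 6
```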
The paper's core contribution is an L1-L5 autonomy classification system for automated research agents. It also examines four major architectural patterns, comparing them on dimensions such as scalability, cost, and reliability. In addition, it analyzes 17 mainstream agent systems against a six-dimensional feature matrix and outlines six open problems along with corresponding research directions.
Chen Deli suggests that foundation models are shifting AI tools from research assistants to autonomous research agents. To address the current lack of a unified framework, the confused terminology, and the inconsistent evaluation standards, he and his AI co-author developed the L1-L5 autonomy hierarchy. This framework aims to provide a clear taxonomy for the AI Agent field, similar to the SAE levels for autonomous driving.
The classification system categorizes agent autonomy as follows:
- L1: Autocompletion. Represented by early tools like GitHub Copilot, predicting the next line of code.
- L2: Task Execution. Examples include ChatGPT/Claude chatbots integrated with various tools, capable of task decomposition but requiring human approval at each step.
- L3: Multi-step Execution. Examples include Claude Code and Cursor Agent, which can autonomously execute 10 to 100 steps, requesting human review only at critical junctures.
- L4: Full Autonomy in a Restricted Domain. Humans provide research objectives and evaluate final outcomes, while the agent autonomously handles multi-step experiments, coding, and paper writing, though it cannot autonomously select research problems. This level represents the current frontier of the industry.
- L5: Fully Self-Defined Research Agenda. The agent can independently choose research topics, allocate resources, continuously accumulate knowledge, and conduct sustained cross-domain research. This is an ideal state not yet achieved, with core bottlenecks identified as continuous knowledge accumulation, reliable self-evaluation, and architectural scalability.
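One way to make the five levels concrete is to treat them as an ordered enumeration. The sketch below is an illustration only: the `Capabilities` flags and the `classify` rule are hypothetical simplifications of the descriptions above, not a rubric taken from the paper.

```python
from dataclasses import dataclass
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """The five levels as summarized in the article."""
    L1_AUTOCOMPLETION = 1
    L2_TASK_EXECUTION = 2
    L3_MULTI_STEP_EXECUTION = 3
    L4_RESTRICTED_DOMAIN_AUTONOMY = 4
    L5_SELF_DEFINED_AGENDA = 5


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical capability flags (an assumption, not from the paper)."""
    decomposes_tasks: bool = False
    runs_many_steps_unreviewed: bool = False
    completes_research_end_to_end: bool = False
    selects_own_research_problems: bool = False


def classify(caps: Capabilities) -> AutonomyLevel:
    """Check the most demanding capability first, then fall through."""
    if caps.selects_own_research_problems:
        return AutonomyLevel.L5_SELF_DEFINED_AGENDA
    if caps.completes_research_end_to_end:
        return AutonomyLevel.L4_RESTRICTED_DOMAIN_AUTONOMY
    if caps.runs_many_steps_unreviewed:
        return AutonomyLevel.L3_MULTI_STEP_EXECUTION
    if caps.decomposes_tasks:
        return AutonomyLevel.L2_TASK_EXECUTION
    return AutonomyLevel.L1_AUTOCOMPLETION


# A coding agent that runs many steps between human reviews.
coding_agent = Capabilities(decomposes_tasks=True, runs_many_steps_unreviewed=True)
print(classify(coding_agent).name)
```

Using `IntEnum` keeps the levels comparable, so a check such as `level >= AutonomyLevel.L4_RESTRICTED_DOMAIN_AUTONOMY` reads naturally.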
Beyond autonomy levels, the paper also summarizes four mainstream architectural patterns for agents:
- Single Agent Loop: Exemplified by early research like ReAct, Reflexion, LATS, and Tree of Thoughts. It involves a single model iterating through reasoning-action-observation, offering simplicity and efficiency but limited capabilities for complex tasks.
- Multi-Agent Collaboration: Represented by frameworks such as CAMEL, AutoGen, and MetaGPT. It is characterized by task division and multi-perspective error correction, at the price of higher cost and more complex communication between agents.
- Hierarchical Planning: Seen in Claude Code and Devin, featuring layered planning and task decomposition, suitable for long-duration complex research.
- Tool-Augmented Execution: Illustrated by SWE-Agent, whose core involves integrating tools like code execution environments, web browsing, APIs/databases, and multimodal tools. The design of the Agent-Computer Interface (ACI) directly impacts its performance.
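The reasoning-action-observation cycle of the single-agent loop can be sketched in a few lines. This is a minimal illustration, not code from the paper or from any of the named systems: the `scripted_policy` function stands in for a language model, and the `multiply` tool and all other names are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

# The policy stands in for the model: given the history so far, it returns
# (thought, action_name, action_input). A real system would call an LLM here.
Policy = Callable[[List[str]], Tuple[str, str, str]]
Tool = Callable[[str], str]


def run_agent_loop(
    policy: Policy, tools: Dict[str, Tool], task: str, max_steps: int = 5
) -> Tuple[Optional[str], List[str]]:
    """Iterate reason -> act -> observe until 'finish' or the step budget ends."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        thought, action, arg = policy(history)
        history.append(f"Thought: {thought}")
        if action == "finish":
            return arg, history
        tool = tools.get(action)
        observation = tool(arg) if tool else f"unknown tool: {action}"
        history.append(f"Action: {action}[{arg}]")
        history.append(f"Observation: {observation}")
    return None, history  # step budget exhausted without an answer


def multiply(expr: str) -> str:
    """Toy tool: multiply two integers written as 'a * b'."""
    left, right = expr.split("*")
    return str(int(left) * int(right))


def scripted_policy(history: List[str]) -> Tuple[str, str, str]:
    """Hard-coded stand-in for a model: call the tool once, then finish."""
    prefix = "Observation: "
    observations = [h for h in history if h.startswith(prefix)]
    if not observations:
        return ("I need to compute the product.", "multiply", "6 * 7")
    return ("I have the result.", "finish", observations[-1][len(prefix):])


answer, trace = run_agent_loop(scripted_policy, {"multiply": multiply}, "What is 6 times 7?")
print(answer)
print("\n".join(trace))
```

The `max_steps` budget is the only stopping rule besides `finish`, which hints at why this pattern is simple and cheap but struggles once a task needs longer plans or parallel work.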
The paper emphasizes that none of these four patterns is inherently superior; the choice should follow from the requirements of the specific task. In practice, hybrid architectures that combine the advantages of several patterns are common. Through a comparative analysis of 17 mainstream autonomous research agents, the paper traces the field's evolution from early general-purpose, fragile prototypes to specialized L4 systems, with code agents showing the highest maturity and scientific agents beginning to produce verifiable new discoveries.
Finally, the paper identifies six critical open problems confronting AI Agent research:
- Cognitive Loop Traps: Agents may get stuck in repetitive, ineffective strategies, lacking self-termination capabilities.
- Context Window Limitations: Fixed context windows (4K-1M tokens) are insufficient for supporting long-duration, complex research.
- Innovation Assessment: The absence of automated methods to measure research originality and value.
- Reproducibility: Model randomness and prompt sensitivity lead to difficulties in reproducing results.
- Safety and Ethics: Concerns include dual-use risks, autonomous self-improvement risks, and academic integrity risks.
- Cost Issues: High per-task costs exacerbate inequalities in scientific research.
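As an illustration of the first problem in the list, a loop guard can watch for an agent repeating the same action and stop it. The sketch below uses a deliberately crude heuristic (exact-match counting with a hypothetical `max_repeats` threshold); it is an assumption made for illustration, not a method proposed in the paper.

```python
from collections import Counter
from typing import Iterable, Optional


def first_repeated_step(actions: Iterable[str], max_repeats: int = 3) -> Optional[int]:
    """Return the index at which any action has occurred `max_repeats` times.

    Returns None if no action repeats that often. Exact string matching is a
    simplification: paraphrased but equivalent actions would not be caught.
    """
    counts: Counter = Counter()
    for index, action in enumerate(actions):
        counts[action] += 1
        if counts[action] >= max_repeats:
            return index
    return None


# Hypothetical action log in which the agent keeps retrying one search.
log = ["search: x", "read: a", "search: x", "run: t", "search: x", "search: x"]
stop_at = first_repeated_step(log)
print(stop_at)  # 4, the third occurrence of "search: x"
```

A guard like this only addresses the self-termination half of the problem; deciding what the agent should try instead is a separate question.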
Chen Deli also shared his personal experience, noting that the assistance of AI agents allowed him to resume blogging and writing activities that had been shelved due to intensive work. This shift transformed his role from an "executor" to an "initiator," significantly boosting his productivity.