News

New Breakthrough: LLM Agents Inherently Know When to Use Tools, Even Without Explicit Reasoning

Large Language Model (LLM) agents often invoke external tools indiscriminately, even in scenarios where the model could answer directly. Each superfluous tool call adds unnecessary API fees and latency, yet there has been no systematic benchmark designed to evaluate when a tool call is genuinely required.

To address this gap, researchers have introduced When2Tool, a novel benchmark comprising 18 environments (15 single-hop and 3 multi-hop tasks). These environments are categorized across three dimensions of tool necessity: computational scale, knowledge boundaries, and execution reliability. When2Tool is designed with controlled difficulty levels that establish clear decision boundaries between tasks that necessitate tool use and those that do not.
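
To make the notion of a controlled decision boundary concrete, here is a hypothetical pair of tasks along the computational-scale dimension; the exact items and field names are illustrative, not drawn from the benchmark itself.

```python
# Hypothetical items illustrating the "computational scale" dimension:
# small arithmetic a capable model can do in-context vs. large arithmetic
# that realistically requires a calculator tool. Field names are assumed.
no_tool_needed = {"task": "What is 17 * 24?", "requires_tool": False}
tool_needed = {
    "task": "What is 48291730456 * 99173824551?",
    "requires_tool": True,
}
```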

The study evaluated two families of training-free baseline methods: Prompt-only, which attempts to discourage unnecessary calls by varying the prompt, and Reason-then-Act, which requires the model to explicitly reason about tool necessity before acting. Both baselines demonstrated significant limitations. Prompt-only suppressed necessary calls alongside unnecessary ones, while Reason-then-Act incurred a disproportionately high accuracy cost on more challenging tasks.
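
The paper's exact prompts are not reproduced here, but the two baseline styles can be sketched roughly as follows; both templates are assumptions for illustration.

```python
# Rough, assumed templates for the two training-free baseline families.
PROMPT_ONLY = (
    "{task}\n\n"
    "Only call a tool if you truly cannot answer on your own; "
    "otherwise, answer directly."
)
REASON_THEN_ACT = (
    "{task}\n\n"
    "First, reason step by step about whether a tool is strictly necessary "
    "for this task. Then either answer directly or emit a tool call."
)
```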

To investigate why these baselines fail, the researchers probed the models' hidden states. A critical finding emerged: tool necessity is linearly decodable from the model's pre-generation hidden states, achieving an AUROC (Area Under the Receiver Operating Characteristic curve) of 0.89-0.96 across six different models. This significantly outperforms the models' own verbalized reasoning, revealing that models inherently "know" when tools are needed but fail to act on this internal knowledge during response generation.
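
As a rough sketch of what such a probe looks like in practice, the snippet below trains a logistic-regression classifier on the hidden state of the final prompt token, before any output token is generated. The model name, layer choice, and dataset variables (`prompts`, `labels`) are assumptions, not details from the paper.

```python
# A minimal probing sketch, assuming a HuggingFace-style causal LM and a
# When2Tool-style dataset of (prompt, requires_tool) pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", output_hidden_states=True
).eval()

def pre_generation_features(prompt: str, layer: int = -1) -> torch.Tensor:
    """Hidden state of the last prompt token, before any token is generated."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1]  # shape: (hidden_dim,)

# prompts: list[str]; labels: 1 if the task genuinely requires a tool, else 0.
X = torch.stack([pre_generation_features(p) for p in prompts]).float().numpy()
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3,
                                          random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```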

Building on this insight, the team proposed Probe&Prefill, a new method that uses a lightweight linear probe to read the hidden-state signal and then prefills the model's response with a steering sentence. Across all tested models, Probe&Prefill reduced tool calls by 48% while incurring only a 1.7% accuracy loss. By contrast, the best-performing baseline at a comparable accuracy level reduced tool calls by only 6%, or achieved a similar reduction at a five-fold higher accuracy loss. The code for this research is publicly available.
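
Conceptually, the inference loop is simple, as the sketch below illustrates (reusing the probe, model, and feature function from the previous snippet). The steering sentence, decision threshold, and prefill mechanics are assumptions; the paper's exact wording and values may differ.

```python
# A minimal Probe&Prefill sketch, reusing `probe`, `tokenizer`, `model`, and
# `pre_generation_features` from the snippet above.
STEER = "I can answer this directly without calling any tool. "

def answer(prompt: str, threshold: float = 0.5) -> str:
    feats = pre_generation_features(prompt).float().numpy().reshape(1, -1)
    p_tool = probe.predict_proba(feats)[0, 1]  # probability a tool is needed
    # If the probe says no tool is needed, prefill the response with the
    # steering sentence so generation starts from a "direct answer" stance;
    # otherwise, leave the agent free to call its tools as usual.
    prefix = prompt if p_tool >= threshold else prompt + STEER
    inputs = tokenizer(prefix, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=256)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```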

↗ Read original source