LLM agents equipped with tool-calling capabilities frequently fail when user instructions are ambiguous or incomplete, leading to erroneous tool invocations and task failures. Current approaches typically operate in unstructured language space, generating clarifying questions via prompting strategies that lack principled criteria for deciding which questions to ask and when to stop asking.
To address this challenge, new research introduces a principled formulation of "structured uncertainty" that operates directly over tool parameters and their domains. The framework cleanly separates "specification uncertainty" (what the user genuinely intends) from "model uncertainty" (what the LLM predicts). It uses the "Expected Value of Perfect Information" (EVPI) to quantify the disambiguation value of each candidate clarifying question, balanced against an "aspect-based cost model" designed to prevent redundant inquiries.
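To make the EVPI idea concrete, here is a minimal sketch of how it could be computed over tool parameters. It assumes each unfilled parameter carries a categorical belief distribution over its domain and a 0/1 utility for filling the parameter correctly; under those assumptions, EVPI is the utility with a perfect answer (1) minus the utility of committing to the current mode. The function names and the specific utility model are illustrative, not taken from the paper.

```python
def evpi(param_dist):
    """Expected Value of Perfect Information for clarifying one parameter.

    param_dist: dict mapping candidate values to probabilities.
    With 0/1 utility, acting without asking yields max_v P(v) (commit to
    the most likely value), while a perfect answer yields utility 1.
    EVPI is the expected gain from asking.
    """
    return 1.0 - max(param_dist.values())

def select_question(param_dists, cost):
    """Pick the parameter whose clarification has the highest expected
    gain; return None when no question is worth its cost (stop asking)."""
    best = max(param_dists, key=lambda p: evpi(param_dists[p]))
    return best if evpi(param_dists[best]) > cost else None
```

For example, a parameter split 50/50 between two values has EVPI 0.5 and is worth a question at moderate cost, while one already at 90% confidence (EVPI 0.1) is not.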
The versatility of this formulation is demonstrated through two applications. First, SAGE-Agent uses structured uncertainty for inference-time question selection: compared with strong prompting and existing uncertainty-based baselines, it achieves 7-39% higher coverage on ambiguous tasks while asking 1.5-2.7x fewer clarification questions.
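An inference-time selection loop of this kind could look like the following sketch: repeatedly ask the highest-value question, update beliefs with the answer, and stop once no question's expected gain exceeds its cost. The aspect penalty (doubling the cost of re-asking about an already-covered parameter) is one plausible rendering of aspect-based cost modeling, not the paper's exact formulation; `answer_fn` stands in for the user's reply.

```python
def clarify_loop(param_dists, answer_fn, base_cost=0.2, max_turns=3):
    """Illustrative ask/update/stop loop over per-parameter beliefs."""
    asked = set()
    for _ in range(max_turns):
        def cost(p):
            # redundant inquiries about an already-covered aspect cost more
            return base_cost * (2.0 if p in asked else 1.0)
        # expected gain of clarifying each parameter (0/1-utility EVPI)
        gains = {p: 1.0 - max(d.values()) for p, d in param_dists.items()}
        best = max(gains, key=gains.get)
        if gains[best] <= cost(best):
            break  # no remaining question is worth its cost: stop asking
        value = answer_fn(best)            # user resolves the parameter
        param_dists[best] = {value: 1.0}   # collapse that uncertainty
        asked.add(best)
    # commit to the most likely value for every parameter
    return {p: max(d, key=d.get) for p, d in param_dists.items()}
```

On a task with one genuinely ambiguous parameter and one near-certain parameter, this loop asks a single question and then proceeds with the tool call.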
Second, the research shows that structured uncertainty provides an effective training signal. Uncertainty-guided reward modeling via uncertainty-weighted GRPO training substantially boosts When2Call accuracy, from 36.5% to 65.2% for a 3B model and from 36.7% to 62.9% for a 7B model, demonstrating improved sample efficiency for reinforcement learning in tool-calling agents.
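As a rough illustration of uncertainty-weighted GRPO, the sketch below computes standard group-relative advantages (each rollout's reward normalized against its group's mean and standard deviation) and scales them by a per-prompt uncertainty score. Upweighting ambiguous prompts so they contribute more to the policy gradient is one plausible weighting scheme under these assumptions; the paper's exact weighting may differ.

```python
import statistics

def grpo_advantages(rewards, uncertainty, eps=1e-6):
    """Group-relative advantages (GRPO-style) with an uncertainty weight.

    rewards: scalar rewards for one prompt's group of sampled rollouts.
    uncertainty: scalar in [0, 1] for the prompt; higher values make
    ambiguous prompts weigh more in the policy update (an assumption
    made for illustration).
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [uncertainty * (r - mu) / (sigma + eps) for r in rewards]
```

The normalization keeps the advantages zero-mean within each group, so the uncertainty weight rescales the update's magnitude without biasing its direction.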
To enable comprehensive evaluation, the researchers also present ClarifyBench, the first multi-turn, dynamic tool-calling disambiguation benchmark. Together, these results establish structured uncertainty as a principled framework that improves both inference-time interaction efficiency and training-time sample efficiency in tool-augmented agents.