Boost AI Agent Tool-Calling Accuracy with SFT and DPO on SageMaker

AI agents can autonomously handle complex, multi-step tasks, but their effectiveness depends heavily on calling the right tools to retrieve information or take action. When an agent picks the wrong tool, formats parameters incorrectly, or breaks a workflow chain, task completion times grow, error rates rise, support costs increase, and user experiences degrade. As more organizations move agentic applications from pilot to production, having agents that select the right tool for each request is essential for reliable automation.

In this post, you learn how to use Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) together to improve the tool-calling accuracy of a small language model (SLM). The example uses Amazon SageMaker AI training jobs, so you can focus on training code instead of managing your own training infrastructure. You also learn how to evaluate tool-calling accuracy and compare a base model to several fine-tuned variants, so you can make data-driven decisions about model quality.
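
One simple way to score tool-calling accuracy is exact match on the tool name and its arguments. The sketch below assumes the model emits each call as a JSON object with `name` and `arguments` keys; those key names, and the function names `tool_call_matches` and `tool_calling_accuracy`, are illustrative assumptions rather than anything defined in this post.

```python
import json


def tool_call_matches(predicted: dict, expected: dict) -> bool:
    """Return True when the tool name and the arguments both match exactly."""
    return (
        predicted.get("name") == expected.get("name")
        and predicted.get("arguments") == expected.get("arguments")
    )


def tool_calling_accuracy(predictions: list[str], references: list[dict]) -> float:
    """Fraction of examples whose predicted call matches the reference call.

    `predictions` are raw model outputs (assumed to be JSON strings);
    output that is not a valid JSON object counts as a miss.
    """
    if not references:
        return 0.0
    correct = 0
    for raw, expected in zip(predictions, references):
        try:
            predicted = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output is scored as incorrect
        if isinstance(predicted, dict) and tool_call_matches(predicted, expected):
            correct += 1
    return correct / len(references)
```

Running the same function over the base model and each fine-tuned variant on a held-out set gives directly comparable numbers. Exact match is strict; a looser metric (for example, scoring the tool name and the arguments separately) may be more informative when arguments contain free text.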

Fine-tuning methodologies

Supervised fine-tuning (SFT) involves curating a high-quality dataset that aligns closely with the model’s intended function, providing explicit examples of how the model should perform certain tasks or interact with specific tools. This method is particularly effective for teaching the model to recognize the nuances of tool-specific language, commands, and constraints.

Direct Preference Optimization (DPO) refines these interactions by incorporating human feedback or predefined objectives directly into the training loop. DPO aligns the model’s output more closely with target outcomes by emphasizing a preference for certain types of responses or behaviors over others. Each DPO training example encodes a “like this, not like that” preference: a prompt paired with a preferred and a non-preferred response. Training on these pairs pursues the same goal as reinforcement learning from preference data, without a separate reward function or reward model. This approach reduces resource requirements and training time while preserving output quality.
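
To make the "no reward model" point concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes you already have the summed log-probabilities of the preferred and non-preferred responses under both the model being trained and a frozen reference model. The function name `dpo_loss`, the argument names, and the default `beta=0.1` are illustrative assumptions, not the code of any particular library.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    The margin is how much more the trained model favors the chosen
    response over the rejected one, relative to the reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the trained model and the reference model agree, the margin is zero and the loss equals log 2. The loss falls as the trained model shifts probability toward the chosen response and away from the rejected one, which is the signal a separate reward model would otherwise supply.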

For example, the Hugging Face TRL library for DPO takes training samples in the following format:

{
    "prompt": ["<array of input samples>"],
    "chosen": "<complete preferred response (j)>",
    "rejected": "<complete non-preferred response (k)>"
}
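
For tool calling, a preference record of this shape might pair a correct call with a plausible mistake, such as the right tool with a wrong argument. The sketch below is an assumption-laden illustration: the helper `build_preference_record`, the `get_weather` tool, and the choice to serialize each call as a JSON string are all hypothetical, and the exact schema your TRL version expects should be checked against its dataset-format documentation.

```python
import json


def build_preference_record(
    user_request: str, chosen_call: dict, rejected_call: dict
) -> dict:
    """Build one preference record in the prompt/chosen/rejected shape.

    Tool calls are serialized as JSON strings (an illustrative choice).
    """
    if chosen_call == rejected_call:
        raise ValueError("chosen and rejected calls must differ")
    return {
        "prompt": [user_request],
        "chosen": json.dumps(chosen_call, sort_keys=True),
        "rejected": json.dumps(rejected_call, sort_keys=True),
    }


# Hypothetical example: same tool, but the rejected call uses the wrong city.
record = build_preference_record(
    "What is the weather in Paris right now?",
    chosen_call={"name": "get_weather", "arguments": {"city": "Paris"}},
    rejected_call={"name": "get_weather", "arguments": {"city": "London"}},
)
print(json.dumps(record, indent=2))
```

Rejected responses that are close to correct (wrong argument, wrong tool from the same family) tend to carry more useful signal than obviously unrelated output, because they target the mistakes the model is actually likely to make.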

This feedback-driven approach allows for iterative improvement of the model’s tool-interaction capabilities based on real-world usage patterns in the training data. Together, SFT and DPO form a robust framework for fine-tuning language models to interface with a wide range of digital tools. By using these techniques, you can build AI systems that understand and generate human-like text and that perform complex tasks by autonomously interacting with external applications, broadening the scope and utility of AI in both consumer and enterprise environments.

[AgentUpdate Depth Analysis] While Supervised Fine-Tuning (SFT) establishes the baseline format and syntactic understanding for tool calling, it often suffers from exposure bias in multi-step agentic workflows. Direct Preference Optimization (DPO) addresses this by providing negative constraints ("this, not that") without the infrastructure overhead of traditional RLHF. Applying this combined SFT+DPO pipeline to small language models (SLMs) could matter a great deal for enterprise AI agent deployments. It suggests that sub-10B-parameter models, when carefully aligned, can approach the tool-calling reliability of proprietary frontier models. That would reduce API costs and inference latency, and open a viable path to deploying responsive, cost-effective, and privacy-preserving autonomous agents at scale.