Claude API Cost Optimization: Caching Strategies That Cut Production Token Costs by 60%

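When you build autonomous AI agents on Anthropic's Claude API, token costs tend to grow quickly. One developer shared how they cut the per-session token cost of their production AI agent (Atlas) by 60%. Three techniques did most of the work: prompt caching, response batching, and aggressive context pruning. This article explains in detail how the caching mechanism works.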

1. Prompt Caching

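Anthropic's prompt caching lets you mark specific parts of a prompt as cacheable. If the same cached content appears in a later request within the TTL (time to live: 5 minutes by default, refreshed each time the cache is hit, with a 1-hour option available at a higher write price), those tokens are billed at 10% of the normal input-token price.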

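The key is how you structure the prompt: put static content (system prompt, tool definitions, large documents) first, and dynamic content (user messages, conversation history) last.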

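from datetime import date

import anthropic

client = anthropic.Anthropic()
today = date.today().isoformat()  # referenced in the user message below

# Static content goes in the system prompt, marked with cache_control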
SYSTEM_PROMPT = """You are Atlas, an autonomous AI agent managing whoffagents.com.
[... 2,000 words of static context, product details, rules ...]
"""

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"}  # cache this block
        }
    ],
    messages=[
        {"role": "user", "content": f"Execute morning session. Date: {today}"}
    ]
)

# Check cache performance
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")

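The first call writes the cache, and cache writes are billed at a premium over the normal input price (1.25x for the default 5-minute TTL). On later calls within the TTL, cache_read_input_tokens shows how many tokens were read from the cache at 10% of the normal price. For example, take a 2,000-token system prompt called 10 times an hour, and assume every call lands before the cache entry expires. The nine calls after the first read 18,000 tokens from the cache instead of paying full price for them, so those reads cost the equivalent of 1,800 full-price tokens, a 10x reduction. Counting the initial write, the cached portion costs about 4,300 token equivalents instead of 20,000, roughly 4.7x cheaper overall.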
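The arithmetic above can be sketched as a small calculator. This is a minimal illustration, not a billing tool: the function name cached_prefix_cost is hypothetical, the 1.25x write and 0.10x read multipliers are assumptions taken from the default 5-minute cache pricing, and the model assumes every call after the first hits the cache.

```python
# Price multipliers relative to the base input-token price.
# Assumption: 5-minute cache writes bill at 1.25x, cache reads at 0.10x.
CACHE_WRITE_MULT = 1.25
CACHE_READ_MULT = 0.10


def cached_prefix_cost(prefix_tokens, calls):
    """Return (uncached, cached) cost in full-price token equivalents.

    Models `calls` requests that share one static prefix, assuming the
    first call writes the cache and every later call reads from it.
    """
    uncached = prefix_tokens * calls
    write_cost = prefix_tokens * CACHE_WRITE_MULT
    read_cost = prefix_tokens * CACHE_READ_MULT * (calls - 1)
    return uncached, write_cost + read_cost


uncached, cached = cached_prefix_cost(2000, 10)
print(f"uncached: {uncached}, cached: {cached:.0f}, "
      f"ratio: {uncached / cached:.2f}x")
```

If calls are spaced further apart than the TTL, each one pays the write premium again and caching costs more than it saves, so the call cadence matters as much as the prefix size.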

2. Tool Definition Caching

Tool definitions are often large, especially when you have 40 or more tools with detailed descriptions. They can be cached too:

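TOOLS = [
    {"name": "read_file", "description": "...", "input_schema": {...}},
    # ... 40 more tools
]

# Setting cache_control on the last tool caches the entire tools array:
# a breakpoint covers everything before it, up to and including the marked item.
TOOLS[-1]["cache_control"] = {"type": "ephemeral"}

# SYSTEM and messages are assumed to be defined by the surrounding application.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    tools=TOOLS,
    system=[{"type": "text", "text": SYSTEM, "cache_control": {"type": "ephemeral"}}],
    messages=messages
)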
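Once tools and system blocks both carry markers, it becomes easy to lose track of how many breakpoints a request uses. The following is a minimal sketch of a pre-flight check, assuming plain dict payloads shaped like the ones above. The helper count_cache_breakpoints, the constant MAX_CACHE_BREAKPOINTS, and the write_file tool are hypothetical names introduced for illustration; they are not part of the anthropic SDK.

```python
MAX_CACHE_BREAKPOINTS = 4  # per-request limit (assumed from Anthropic's docs)


def count_cache_breakpoints(tools=None, system=None, messages=None):
    """Count blocks that carry a cache_control marker in a request payload."""
    count = sum(1 for tool in tools or [] if "cache_control" in tool)
    # system may be a plain string (no marker possible) or a list of blocks.
    if isinstance(system, list):
        count += sum(1 for block in system if "cache_control" in block)
    for message in messages or []:
        content = message.get("content")
        # Message content may be a string or a list of content blocks.
        if isinstance(content, list):
            count += sum(
                1 for block in content
                if isinstance(block, dict) and "cache_control" in block
            )
    return count


# Illustrative payloads; write_file is a made-up second tool.
tools = [
    {"name": "read_file", "description": "...", "input_schema": {"type": "object"}},
    {"name": "write_file", "description": "...", "input_schema": {"type": "object"},
     "cache_control": {"type": "ephemeral"}},
]
system = [{"type": "text", "text": "static rules",
           "cache_control": {"type": "ephemeral"}}]
messages = [{"role": "user", "content": "Execute morning session."}]

used = count_cache_breakpoints(tools, system, messages)
assert used <= MAX_CACHE_BREAKPOINTS, f"too many cache breakpoints: {used}"
print(f"cache breakpoints used: {used}")
```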

Anthropic allows up to 4 cache breakpoints per request. When structuring content, place the largest static blocks ...