When running autonomous agents with the Claude API, token bills can escalate rapidly. One developer shared how they cut per-session token costs by 60% for their AI agent, Atlas, in a production environment. This was achieved using three techniques: prompt caching, response batching, and aggressive context pruning. This article will detail the caching mechanisms.
1. Prompt Caching
Anthropic's prompt caching allows you to mark specific sections of your prompt as cacheable. If the same cached content appears in a subsequent request within the cache's TTL (Time To Live – 5 minutes by default, refreshed each time the cache is read, with an optional 1-hour TTL available), you pay only 10% of the normal input token price for those tokens. Writing the cache costs a premium: 25% over the base input price for the default 5-minute TTL.
The key is to structure your prompts so that static content (e.g., system prompt, tool definitions, large documents) comes first, and dynamic content (e.g., user message, conversation history) comes last.
import anthropic
client = anthropic.Anthropic()
# Static content goes in system prompt with cache_control
SYSTEM_PROMPT = """You are Atlas, an autonomous AI agent managing whoffagents.com.
[... 2,000 words of static context, product details, rules ...]
"""
from datetime import date
today = date.today().isoformat()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # Cache this block
        }
    ],
    messages=[
        {"role": "user", "content": f"Execute morning session. Date: {today}"}
    ],
)
# Check cache performance
usage = response.usage
print(f"Input tokens: {usage.input_tokens}")
print(f"Cache read tokens: {usage.cache_read_input_tokens}")
print(f"Cache write tokens: {usage.cache_creation_input_tokens}")
On the first call, you pay the cache-write premium (25% over the base input price for the 5-minute TTL). On subsequent calls within the TTL, cache_read_input_tokens shows how many tokens were served from cache at 10% of the base price. For a 2,000-token system prompt called 10 times per hour with gaps under the rolling 5-minute TTL (each cache read refreshes it), one call writes the cache (billed as roughly 2,500 token-equivalents) and nine calls read it (18,000 tokens billed as roughly 1,800), so the cached portion costs about 4,300 token-equivalents instead of 20,000 – roughly a 4–5x reduction.
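The arithmetic above can be wrapped in a small helper that turns the usage block into an estimated dollar cost. The per-million-token price below is a placeholder for illustration – check Anthropic's current pricing for your model; the 1.25x write and 0.1x read multipliers apply to the default 5-minute cache.

```python
# Assumed base price for illustration only; substitute your model's
# actual per-million-token input price.
BASE_INPUT = 3.00                  # $/M input tokens (assumption)
CACHE_WRITE = BASE_INPUT * 1.25    # 5-minute cache writes cost 25% extra
CACHE_READ = BASE_INPUT * 0.10     # cache reads cost 10% of base

def input_cost_usd(usage) -> float:
    """Estimate the input-side cost of one response from its usage block."""
    return (
        usage.input_tokens * BASE_INPUT
        + usage.cache_creation_input_tokens * CACHE_WRITE
        + usage.cache_read_input_tokens * CACHE_READ
    ) / 1_000_000
```

Logging this per call makes it easy to confirm the cache is actually being hit: if cache_read_input_tokens stays at zero, your prompt prefix is changing between requests and invalidating the cache.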
2. Tool Definition Caching
Tool definitions are often substantial, especially if you have 40+ tools with detailed descriptions. These can also be cached:
TOOLS = [
    {"name": "read_file", "description": "...", "input_schema": {...}},
    # ... 40 more tools
]

# Mark the last tool with cache_control to cache the entire tools array
# (cache_control on the last item caches everything up to and including it)
TOOLS[-1]["cache_control"] = {"type": "ephemeral"}

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    tools=TOOLS,
    system=[{"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}}],
    messages=messages,
)
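Mutating the tools list in place is easy to get wrong if the same list is reused across agents with different caching needs. A small non-mutating helper (a sketch, not part of the Anthropic SDK) keeps the marker out of your shared tool definitions:

```python
import copy

def with_cached_tools(tools: list[dict]) -> list[dict]:
    """Return a copy of a tools list with cache_control on the final
    definition, so the whole array up to that point is cacheable.
    Leaves the original list untouched."""
    cached = copy.deepcopy(tools)
    cached[-1]["cache_control"] = {"type": "ephemeral"}
    return cached
```

Pass the result as the tools= argument; the original TOOLS list stays clean for callers that don't want caching.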
Anthropic supports up to 4 cache breakpoints per request. Structure your content so the largest static blocks sit in front of a breakpoint, ordered from most stable to least stable, with fully dynamic content (the user message, recent conversation turns) after the final breakpoint.
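A sketch of that ordering with two breakpoints (the block contents are placeholders; the stable-first ordering is the point):

```python
# Placeholder content for illustration.
CORE_RULES = "[agent rules that almost never change]"
PRODUCT_CATALOG = "[product data refreshed daily]"

system_blocks = [
    # Breakpoint 1: caches just the core rules.
    {"type": "text", "text": CORE_RULES,
     "cache_control": {"type": "ephemeral"}},
    # Breakpoint 2: caches rules + catalog. When the catalog changes,
    # the prefix up to breakpoint 1 can still be served from cache.
    {"type": "text", "text": PRODUCT_CATALOG,
     "cache_control": {"type": "ephemeral"}},
]
```

Because each marker caches everything from the start of the prompt up to and including its block, a change to a later block only invalidates the cache from that point onward – earlier breakpoints remain hits.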