Episode 17: Q&A - Cache Mechanisms & Token Optimization
This Episode: Cache is the core of saving money. These 7 questions help you fully understand the caching mechanism.
Q8: Is Prompt Cache the same as browser cache?
No. Browser cache stores web resources (images, JS, CSS). Prompt Cache stores the already-processed prefix of an API request, so that a later request starting with the same prefix can reuse it.
Principle: If two consecutive API requests have identical opening sections, the repeated part is read from cache, which saves ~90% of the input cost for those cached tokens.
Key limitation: A cache hit requires an identical, continuous prefix from the very start of the request. Any change in the middle invalidates everything from that point onward.
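The prefix rule can be illustrated with a small sketch. This is not how any provider implements caching; it only models the rule stated above, treating each request as a list of hypothetical string segments and counting how many leading segments two requests share.

```python
from typing import Sequence


def shared_prefix_len(previous: Sequence[str], current: Sequence[str]) -> int:
    """Count the leading segments that are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # the first mismatch ends the reusable prefix
        count += 1
    return count


# Hypothetical request segments, in the order they are sent.
first = ["system prompt", "tool definitions", "user: hello", "assistant: Hello!"]
appended = first + ["user: next question"]  # only adds to the end
edited = ["system prompt", "tool definitions v2", "user: hello", "assistant: Hello!"]

print(shared_prefix_len(first, appended))  # 4: the whole earlier request is reusable
print(shared_prefix_len(first, edited))    # 1: everything after the change is reprocessed
```

Appending to the end keeps the full earlier prefix intact, while editing the second segment leaves only the first segment reusable, even though the later segments are unchanged.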
Q9: When does the 5-minute cache countdown start?
From the most recent time the LLM returned a response, i.e., lastAssistantResponseAt.
Not when you sent the message, but when the AI finished its reply. Send another message within 5 minutes → cache hit → ~90% off the input cost of the cached tokens.
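As a rough sketch of the countdown, assuming the 5-minute TTL above and a timestamp equivalent to lastAssistantResponseAt (the function name and the sample timestamps are made up for illustration):

```python
from datetime import datetime, timedelta

CACHE_TTL = timedelta(minutes=5)  # the 5-minute window described above


def cache_time_left(last_assistant_response_at: datetime, now: datetime) -> timedelta:
    """Time remaining before the cache expires; zero once it has expired."""
    remaining = last_assistant_response_at + CACHE_TTL - now
    return max(remaining, timedelta(0))


replied_at = datetime(2025, 1, 1, 12, 0, 0)  # when the reply finished (made-up value)
print(cache_time_left(replied_at, datetime(2025, 1, 1, 12, 0, 48)))  # 0:04:12
print(cache_time_left(replied_at, datetime(2025, 1, 1, 12, 6, 0)))   # 0:00:00
```

The clock is measured from the end of the assistant's reply, so time spent typing the next message counts against the window, but time the model spent generating does not.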
Q10: Is "assistant response" the same as LLM response?
Yes. "assistant" is the role name in the API message format:
[
{ "role": "user", "content": "hello" }, ← Your message
{ "role": "assistant", "content": "Hello!" } ← LLM's reply
]
Q11: How much do cache hits save?
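With 100K input tokens, 80K of them cache hits, at $3/MTok for uncached input and $0.30/MTok for cache reads (Opus list prices are higher, but the percentage saved is the same because it depends only on the 10× price ratio):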
| Scenario | Calculation | Cost |
|---|---|---|
| No cache | 100K × $3/MTok | $0.30 |
| 80K cache hit | 80K × $0.30/MTok + 20K × $3/MTok | $0.084 |
Saves 72%. Over a long session (20 turns), good cache vs bad cache means a 3-5× cost difference.
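The arithmetic in the table can be reproduced with a short sketch. It counts only input-token cost, as the table does, and the default prices are simply the rates used in the table; the function name is illustrative.

```python
def input_cost(total_tokens: int, cached_tokens: int,
               input_price: float = 3.00, cache_read_price: float = 0.30) -> float:
    """Input cost in dollars; prices are in dollars per million tokens."""
    fresh_tokens = total_tokens - cached_tokens
    return (cached_tokens * cache_read_price + fresh_tokens * input_price) / 1_000_000


no_cache = input_cost(100_000, 0)
with_cache = input_cost(100_000, 80_000)
saving = 1 - with_cache / no_cache
print(f"${no_cache:.3f} vs ${with_cache:.3f}, saving {saving:.0%}")
# $0.300 vs $0.084, saving 72%
```

Because 20% of the tokens still pay full price, an 80% hit rate yields a 72% saving rather than the 90% that applies to the cached tokens alone.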
Q12: Does cache survive /clear?
No. /clear empties the entire conversation history, so the cache prefix changes completely and no old cache entry can be hit.
The first turn after /clear is the most expensive: the fixed section is fully reprocessed. Caching recovers from turn 2 onward.
Q13: How to keep cache from expiring?
Simplest method: send a message before the cache expires; any message works. Once the LLM responds, the 5-minute TTL resets.
Cache TTL: 4m 12s → No rush
Cache TTL: 0m 30s → About to expire! Send a message to extend
Cache TTL: -expired → Expired, next request reprocesses everything
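A status line like the ones above could be produced along these lines. This is a minimal sketch, not the actual display code: the function name and the 60-second warning threshold are assumptions, and the wording of the hints is simplified.

```python
def ttl_status(seconds_left: int, warn_below: int = 60) -> str:
    """Render a TTL status line; warn_below is an assumed threshold in seconds."""
    if seconds_left <= 0:
        return "Cache TTL: expired"
    minutes, seconds = divmod(seconds_left, 60)
    line = f"Cache TTL: {minutes}m {seconds:02d}s"
    if seconds_left < warn_below:
        line += " (about to expire, send a message to extend)"
    return line


for left in (252, 30, 0):
    print(ttl_status(left))
# Cache TTL: 4m 12s
# Cache TTL: 0m 30s (about to expire, send a message to extend)
# Cache TTL: expired
```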
Q14: Why does cache hit rate suddenly drop?
Three reasons (by probability):
- Auto compression triggered (context > 95%) → prefix changes, cache invalidates from the compression point
- /compact executed → same effect
- TTL expired → 5 minutes without a new request
The first two are "structural invalidation" and are unavoidable. The third can be proactively managed (see Q13).
Next Episode: Episode 18, the final Q&A covering HUD debugging, Memory, and advanced topics.