Large Language Models (LLMs) ship with increasingly large context windows: Claude Sonnet 4.6 offers 200k tokens and GPT-4o 128k. While these capacities seem enormous, they often prove insufficient when building complex Retrieval-Augmented Generation (RAG) applications, where document context, extensive conversation history, detailed system prompts, and multiple tool definitions all compete for the same budget and can exhaust it quickly. Running out of context window mid-conversation is an unrecoverable failure, making effective token management a critical engineering discipline in production AI applications.
Counting Tokens
Accurate token counting is the foundational step for context window management. Different LLM providers offer distinct strategies:
- Anthropic: The official API provides a dedicated endpoint for token counting. Developers can call the `anthropic.messages.countTokens` method to precisely calculate input tokens for a set of messages, including the system prompt.
- OpenAI: Token counting is typically handled locally using the `tiktoken` library, which encodes text with the tokenizer for a specified model (e.g., 'gpt-4o') and returns the token count without making an additional API call.
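As a sketch of both strategies (assuming the `tiktoken` and `anthropic` Python packages are installed; the helper function names are illustrative, and the Anthropic path requires an API key):

```python
# Sketch: local counting with tiktoken vs. Anthropic's count-tokens endpoint.
# Imports are lazy so either path can be used on its own. Helper names are
# illustrative assumptions; the SDK calls follow the current Python clients.

def count_openai_tokens(text: str, model: str = "gpt-4o") -> int:
    """Count tokens locally with tiktoken -- no API call needed."""
    import tiktoken
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

def count_anthropic_tokens(messages: list, model: str) -> int:
    """Count tokens via Anthropic's dedicated endpoint (needs an API key)."""
    import anthropic
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
    response = client.messages.count_tokens(model=model, messages=messages)
    return response.input_tokens

if __name__ == "__main__":
    print(count_openai_tokens("How many tokens is this sentence?"))
```

The local `tiktoken` path is free and fast, so it suits per-message budgeting in a loop; the Anthropic endpoint is authoritative for that provider's models but costs a network round trip.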
Conversation History Management: Truncation Strategy
A naive approach of simply appending every message to the context indefinitely will inevitably lead to token overflow. A basic yet effective strategy is to truncate older messages.
This approach involves defining a maximum context token budget (e.g., 100,000 tokens) and reserving a portion for the system prompt, the new user message, and the model's response. Within the remaining budget, the algorithm iterates through the conversation history in reverse chronological order (most recent first). If adding a message would exceed the budget, the process stops, and older messages are dropped. This "keep most recent, drop oldest" method efficiently controls token usage, preventing context overflow.
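The loop above can be sketched in a few lines. The `count_tokens` helper here is a crude chars/4 heuristic standing in for a real tokenizer call, and the function names are illustrative:

```python
# "Keep most recent, drop oldest" truncation sketch.
def count_tokens(text: str) -> int:
    # Crude approximation (~4 chars per token); swap in a real tokenizer.
    return max(1, len(text) // 4)

def truncate_history(messages: list, budget: int, reserved: int = 0) -> list:
    """Walk the history newest-first, keeping messages until the budget
    (minus tokens reserved for system prompt, new message, and response)
    is exhausted; return the survivors in chronological order."""
    remaining = budget - reserved
    kept = []
    for msg in reversed(messages):
        cost = count_tokens(msg["content"])
        if cost > remaining:
            break  # this message and everything older is dropped
        kept.append(msg)
        remaining -= cost
    return list(reversed(kept))
```

Breaking on the first message that no longer fits (rather than skipping it and continuing) keeps the retained window contiguous, so the model never sees a history with holes in the middle.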
Conversation History Management: Summarization Strategy
Beyond simple truncation, a more advanced strategy involves summarizing older parts of the conversation. Instead of merely dropping messages, this method leverages a less expensive LLM (e.g., Anthropic's Claude Haiku) to compress historical information.
In practice, a segment of the conversation history (e.g., the oldest half) is formatted and sent to a dedicated summarization LLM. This LLM generates a concise summary, preserving key facts and decisions from the interaction. This summary then acts as a "condensed" historical record, which can be included in the new context window, significantly reducing token consumption while retaining crucial information. This strategy is particularly useful for AI applications requiring long-term memory within strict context length limitations.
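A minimal sketch of this flow, with the summarizer injected as a plain callable (in practice, a call to a cheap model such as Claude Haiku). The function name, the oldest-half split, and the wrapper text for the condensed message are all illustrative assumptions:

```python
# Summarization sketch: compress the oldest half of the history into a
# single synthetic message, keeping the recent half verbatim.
def summarize_history(messages: list, summarize) -> list:
    """`summarize` is any callable text -> text, e.g. a cheap-LLM call
    prompted to preserve key facts and decisions."""
    half = len(messages) // 2
    old, recent = messages[:half], messages[half:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    condensed = {
        "role": "user",  # assumed convention for carrying the summary
        "content": "[Summary of earlier conversation]\n" + summarize(transcript),
    }
    return [condensed] + recent
```

This pairs naturally with truncation: summarize when the history first exceeds the budget, and fall back to dropping messages if even the condensed history grows too large.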