Episode 30 | Full-Stack Retrospective and Future Outlook: Advancing to Higher-Level AI Applications
Subtitle: Say Goodbye to "Goldfish Memory" – Building a Context-Aware Support Assistant
Welcome back, developers, to the LangChain Masterclass. I'm your host.
In previous sessions, we built the foundational Q&A skeleton for our Intelligent Support Copilot. Many of you reached out in the community saying: "I followed the tutorial, and the bot answers questions, but it acts like an idiot—wait, no, like a goldfish!"
Why? Because when you ask, "Why hasn't my order 12345 shipped yet?" it replies, "Checking that for you." But if you immediately follow up with, "Then just cancel and refund it," it gets completely confused and asks, "What would you like to refund?"
Large Language Models (LLMs) are inherently stateless. They are like super-brains with zero memory. Every time you speak to them, it's a "first meeting." In a real-world customer support scenario, if users had to repeat their order number and context with every single message, they would probably smash their keyboards.
In today's lesson, we're going to solve this core pain point. I'll take you deep into LangChain's Memory mechanism. Using the modern LCEL (LangChain Expression Language) architecture, we'll implant a "hippocampus" into our support copilot, transforming it into a truly context-aware assistant.
🎯 Learning Objectives
- Understand the Essence of "Memory": Learn how to overcome the stateless nature of LLMs and understand exactly how Chat History works.
- Master Modern LangChain Memory Architecture: Ditch the legacy ConversationChain and master the production-grade RunnableWithMessageHistory.
- Implement Multi-turn Support Conversations: Build multi-turn ticket processing based on in-memory and simulated persistent storage within our Copilot project.
- Token Cost Optimization Strategies: Learn to use sliding windows and summary memory to prevent "memory bloat" from causing OOM (Out of Memory) errors or bankrupting your API budget.
📖 Principle Analysis
Before diving into the code, let's establish our architectural mindset. Many beginners think that "adding memory to an LLM" involves flipping some magical switch inside the model. Wrong! LLM memory relies entirely on "force-feeding" context.
Since the LLM has no memory, every time we ask a question, we bundle the past chat history and the current question (Human Input) together and send them to the LLM all at once. Seeing this context, the LLM "pretends" to remember.
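Before LangChain automates this bookkeeping for us, here is a minimal hand-rolled sketch of the idea. The order details and wording are invented for illustration, and it assumes your OPENAI_API_KEY is configured; the point is simply that every request re-sends the accumulated message list.

# Minimal illustration of "memory = re-sending the history on every call".
# Assumes OPENAI_API_KEY is set; the conversation content is made up.
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo")

history = [
    SystemMessage(content="You are an e-commerce support assistant."),
    HumanMessage(content="Why hasn't my order 12345 shipped yet?"),
    AIMessage(content="Order 12345 has not shipped yet; I'm checking on it for you."),
]

# The follow-up only makes sense because the old messages travel with it.
history.append(HumanMessage(content="Then just cancel and refund it."))
reply = llm.invoke(history)      # the entire list is sent on every call
print(reply.content)             # the model can now resolve "it" to order 12345
history.append(reply)            # append the answer so the next turn sees it too

Everything that follows in this lesson (MessagesPlaceholder, RunnableWithMessageHistory, session stores) is just a structured, reusable way of doing exactly this bookkeeping.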
In LangChain, this process is abstracted into highly reusable components. Let's look at the architecture diagram below:
sequenceDiagram
participant U as 🙎♂️ User
participant R as 🤖 RunnableWithMessageHistory (LangChain)
participant M as 🗄️ MessageHistory (Storage)
participant P as 📝 Prompt Template
participant L as 🧠 LLM
U->>R: 1. "Then just cancel and refund it" (Session ID: user_001)
activate R
R->>M: 2. Fetch chat history via Session ID
M-->>R: 3. Return: [User: "Has order 12345 shipped?", AI: "Not yet"]
R->>P: 4. Assemble: History + System Prompt + Current Question
P-->>R: 5. Generate complete Prompt
R->>L: 6. Submit request to LLM
L-->>R: 7. Return: "Understood, processing the refund for order 12345."
R->>M: 8. Append this Q&A pair to storage
R-->>U: 9. Return final answer
deactivate R

Core Concepts Explained:
- Session ID: In a support system, thousands of users might be online simultaneously. We must use a session_id to distinguish whether this is Alice's ticket or Bob's ticket.
- MessageHistory: This is an abstraction of the storage medium. During development and testing, we store it in memory (ChatMessageHistory); in production, we must store it in Redis or a database (e.g., RedisChatMessageHistory).
- Prompt Injection: LangChain uses the MessagesPlaceholder to dynamically inject the fetched historical messages between the System Prompt and the Human Prompt right before sending the request to the LLM.
💻 Practical Code Drill
Enough talk, show me the code. We will use the latest LangChain Core interfaces (LCEL) to refactor our Support Copilot.
Environment Setup: Please ensure you have the latest libraries installed:
pip install langchain-core langchain-community langchain-openai
Step 1: Build a Memory-Enabled Support Chain
In this demo, we'll simulate an after-sales refund scenario.
import os
from typing import Dict
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
# Ensure your API KEY is configured
# os.environ["OPENAI_API_KEY"] = "your-api-key"
# 1. Simulate a production in-memory database to store sessions for different users
# The dictionary Key is session_id, and the Value is the history object
store: Dict[str, BaseChatMessageHistory] = {}
def get_session_history(session_id: str) -> BaseChatMessageHistory:
"""
This is a core callback function.
When a user sends a message, LangChain calls this function to fetch the corresponding history.
If the session_id doesn't exist, it creates a new one.
"""
if session_id not in store:
# In actual production, this should read history from Redis or MySQL
store[session_id] = ChatMessageHistory()
print(f"[System Log] Created a new memory store for session {session_id}.")
return store[session_id]
# 2. Define the System Prompt for the Support Copilot
# Note the MessagesPlaceholder here; it's the "slot" for memory injection
prompt = ChatPromptTemplate.from_messages([
("system", "You are a top-tier e-commerce support assistant named 'Ash'. Your attitude must be extremely polite and professional. "
"If a user mentions a refund, you must first confirm their order number."),
MessagesPlaceholder(variable_name="chat_history"), # 🌟 Core: Placeholder for historical messages
("human", "{question}") # The current user's question
])
# 3. Instantiate the LLM
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.3)
# 4. Assemble the base Chain using LCEL
chain = prompt | llm
# 5. 🌟 Core Magic: Wrap the base Chain with RunnableWithMessageHistory
# It automatically intercepts input, fetches memory, injects the prompt, and writes the output back to memory
copilot_with_memory = RunnableWithMessageHistory(
chain,
get_session_history,
    input_messages_key="question",        # Tells the wrapper which dictionary key holds the user's input
    history_messages_key="chat_history"   # Tells the wrapper which prompt variable receives the history
)
# ==========================================
# 🎬 Simulate a Real Support Scenario
# ==========================================
print("=== 👩🦰 User A (Session: user_A_101) Connects ===")
response1 = copilot_with_memory.invoke(
{"question": "Hi, the mechanical keyboard I bought is broken, I want to return it."},
config={"configurable": {"session_id": "user_A_101"}} # Pass the session ID
)
print(f"🤖 Ash: {response1.content}\n")
# Expected output: Asks for the order number
print("=== 👩🦰 User A Replies ===")
# Note: We don't mention "return" or "keyboard" here, just the order number
response2 = copilot_with_memory.invoke(
{"question": "The order number is KB20231024."},
config={"configurable": {"session_id": "user_A_101"}}
)
print(f"🤖 Ash: {response2.content}\n")
# Expected output: Based on context, Ash knows this order number is for returning the mechanical keyboard
print("=== 👨🦱 User B (Session: user_B_999) Suddenly Connects ===")
# Test session isolation: User B should know nothing about User A's situation
response3 = copilot_with_memory.invoke(
{"question": "I remembered the order number wrong earlier, it's KB20231025!"},
config={"configurable": {"session_id": "user_B_999"}}
)
print(f"🤖 Ash: {response3.content}\n")
# Expected output: Ash will be confused because in user_B_999's memory, the previous conversation doesn't exist
Anatomy of the Execution Principle:
The most elegant design in this code is the synergy between get_session_history and RunnableWithMessageHistory.
As architects, we must understand the importance of decoupling. LangChain completely separates "how to handle the LLM" from "how to store memory." Today, you can test with an in-memory dictionary store = {}. Tomorrow, when the system goes live, you only need to swap the logic inside get_session_history with code that connects to Redis (e.g., using RedisChatMessageHistory). Not a single line of your core logic needs to change! That is the beauty of advanced architecture.
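To make that swap concrete, here is a hedged sketch of what the production variant might look like, assuming a Redis server reachable on localhost and the RedisChatMessageHistory class from langchain-community (the connection URL is a placeholder you would replace with your own):

# Hypothetical production variant: only get_session_history changes.
# Assumes `pip install langchain-community redis` and a running Redis server.
from langchain_community.chat_message_histories import RedisChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory

REDIS_URL = "redis://localhost:6379/0"  # placeholder connection string

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    # Messages now live in Redis, shared by every worker and pod, so the load
    # balancer can route a user to any node without losing their session.
    return RedisChatMessageHistory(session_id=session_id, url=REDIS_URL)

# The chain itself stays exactly as before:
# copilot_with_memory = RunnableWithMessageHistory(
#     chain, get_session_history,
#     input_messages_key="question", history_messages_key="chat_history",
# )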
🚧 Pitfalls and Best Practices
Throughout my career, I've seen too many production incidents caused by mishandled "memory." For support bots, longer memory isn't always better. Here are three fatal pitfalls you must avoid:
Pitfall 1: Infinite Memory Growth Leading to Token Bankruptcy (OOM)
Symptom: The support bot chats perfectly at first, but around the 20th turn, it suddenly throws a context_length_exceeded error, or your OpenAI bill explodes at the end of the month.
Cause: ChatMessageHistory appends messages indefinitely by default. As mentioned earlier, memory is "force-fed": the more you chat, the longer the text sent to the LLM becomes, so per-request token consumption keeps climbing with every turn. If each turn adds roughly 200 tokens of history, by turn 20 every single request is already carrying around 4,000 tokens of history on top of the question itself.
Advanced Solution: In a production environment, you must never use infinite-length memory. You need to introduce a "Sliding Window" or "Summary Memory."
- Sliding Window (Window Memory): Retain only the most recent N conversation turns. For support scenarios, keeping the last 5-10 turns is usually sufficient.
- Summary Memory: Run a small background LLM task to periodically compress the previous 20 turns into a summary (e.g., "User bought a keyboard, is requesting a return, order number confirmed"), and then send this summary to the LLM as context.
(Note: In the LCEL architecture, a simple sliding window only requires trimming the stored message list, for example keeping just the last 10 messages, inside get_session_history before the history object is returned; a sketch follows below.)
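Here is a minimal sketch of that idea. It mirrors the get_session_history from our demo; the window size of 10 messages (roughly 5 turns) is an arbitrary assumption you should tune for your own support scenarios.

# Sliding-window variant of get_session_history: same signature as before, but
# the stored history is trimmed to the most recent MAX_MESSAGES before returning.
from typing import Dict
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.chat_history import BaseChatMessageHistory

store: Dict[str, BaseChatMessageHistory] = {}  # same role as in the demo above
MAX_MESSAGES = 10  # ~5 user/AI turns; tune per scenario

def get_session_history(session_id: str) -> BaseChatMessageHistory:
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    history = store[session_id]
    if len(history.messages) > MAX_MESSAGES:
        recent = history.messages[-MAX_MESSAGES:]  # keep only the newest slice
        history.clear()
        for message in recent:
            history.add_message(message)
    return history

Summary memory follows the same pattern: instead of throwing the older slice away, you would pass it through a cheap LLM call that produces a one-paragraph summary, and store that summary back as a single message ahead of the recent turns.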
Pitfall 2: Memory Leaks and Stateless Deployment Conflicts
Symptom: It runs perfectly locally, but once deployed to K8s or a Serverless platform, the bot occasionally suffers from amnesia or crosses wires with other users.
Cause: Beginners love using a global variable like store = {} to store memory, just like in the demo. However, production environments are typically multi-node and multi-process (e.g., Gunicorn running 4 workers). A user's first message hits Worker A and is saved in its memory, but their next message is load-balanced to Worker B. Worker B's memory has absolutely no record of this session_id.
Advanced Solution: Separate compute and storage. Never store state in the application container's memory. Always use Redis, PostgreSQL, or MongoDB to persist ChatMessageHistory.
Pitfall 3: "System Prompt Amnesia" Due to Long Memory
Symptom: The support bot is strictly instructed to "never use profanity." But after a user sends 30 consecutive abusive messages, the bot's defenses break down and it starts swearing back.
Cause: The attention mechanism of LLMs has a bias; it usually remembers the "beginning" and "end" of a prompt best. If an incredibly long chat history is inserted in the middle, the weight of the System Prompt at the very top gets diluted, causing the LLM to "forget who it is."
Advanced Solution:
- Strictly ensure the System Prompt is at the very beginning, just like in our code.
- Adopt a System Prompt Reminder strategy: Right before the final Human Message, insert a brief System Message reminder (e.g., "[System Reminder: Please maintain a polite customer service attitude]") to forcefully pull the model's attention back.
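As a rough illustration of the reminder strategy (reusing the prompt from our demo; the reminder wording is just an example), the reminder is simply one more system message placed between the history placeholder and the current question:

# System Prompt Reminder sketch: the persona instructions are restated right
# before the latest human message so they stay near the end of the context.
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

reminder_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a top-tier e-commerce support assistant named 'Ash'. "
               "Your attitude must be extremely polite and professional."),
    MessagesPlaceholder(variable_name="chat_history"),  # possibly very long
    ("system", "[System Reminder: Please maintain a polite customer service attitude.]"),
    ("human", "{question}"),
])

# Drop-in replacement for the original prompt:
# chain = reminder_prompt | llm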
📝 Lesson Summary
In today's lesson, we took a deep dive into the art of giving LLMs memory.
We clarified the essence of "memory as context injection," discarded legacy code that easily generates technical debt, and breathed life into our Intelligent Support Copilot using RunnableWithMessageHistory, which perfectly aligns with modern LangChain philosophy. At the same time, from an architect's perspective, we examined three major production pitfalls: token consumption, distributed storage, and attention dilution.
Now, your support assistant Ash is no longer a goldfish with a seven-second memory. It can calmly handle multi-turn follow-up questions from users and process complex return and exchange contexts.
But is having memory alone enough? If a user asks, "What is your latest Black Friday return policy?" Ash might remember who the user is, but its brain doesn't contain your company's latest internal documents. It will either spout nonsense (hallucinate) or simply apologize.
How do we equip the support bot with "domain expertise"? How do we enable it to read your company's PDFs, Word documents, and internal Wikis? In the next session, we will enter the most exciting chapter of LangChain: the collision of RAG (Retrieval-Augmented Generation) and Vector Databases.
See you next time! Stay passionate and keep coding!