Part 5 | Memory: Giving LLMs the Ability to Remember (EN)

⏱ Est. reading time: 20 min · Updated on 5/7/2026

🎯 Learning Objectives for This Session

Hey there, future LangChain full-stack masters! Welcome to Part 5 of the LangChain Masterclass. Today, we're tackling a notorious bottleneck that almost every conversational AI faces: Memory. If your Intelligent Support Copilot treats every single message like a first-time encounter, the user experience will be an absolute disaster. In this session, we will:

  • Uncover the root of AI "Amnesia": Understand why Large Language Models (LLMs) are inherently stateless and why memory is crucial for fluid conversations.
  • Master the LangChain Memory Family: Dive deep into the various memory types provided by LangChain, such as Buffer, Window, and Summary, and understand their core mechanisms and ideal use cases.
  • Inject "Long-Term Memory" into our Copilot: Through hands-on coding, learn how to integrate different Memory modules into our "Intelligent Support Knowledge Base" project, giving our assistant true context awareness.
  • Navigate Memory Management Pitfalls: Explore the challenges of managing memory in production environments—such as token limits, costs, persistence, and multi-user concurrency—and discover advanced solutions.

📖 Core Concepts

AI "Amnesia": Why are LLMs Inherently Stateless?

As we all know, Large Language Models (LLMs) are the brightest stars in today's AI landscape. They can generate stunning text, answer complex questions, and even write creatively. But there's a "little secret" you might not have noticed: LLMs themselves are stateless.

What does this mean? Simply put, every time you send a request to an LLM, it treats it as a brand-new, isolated event. It doesn't "remember" what you said in your last request, nor does it know how many turns of conversation have already occurred. Every API call is a one-off transaction.

Imagine talking to someone who instantly loses their memory after every sentence, completely forgetting what you just discussed. If you tell customer support, "I want to reset my password," and they reply, "Sure, what would you like to reset?" then you say, "My username," and they ask, "What username?"—wouldn't you want to flip the table?
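To make this concrete, here's a minimal sketch (assuming the langchain-openai package and an OPENAI_API_KEY environment variable; the model name is only an illustration). Two separate calls share no state, so the second call knows nothing about the first unless we resend the earlier messages ourselves:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")  # model name is illustrative

llm.invoke("我的用户名是 alice_wang,请记住它。")
print(llm.invoke("我的用户名是什么?").content)
# The second call has no idea -- nothing from the first request was carried over.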

In the context of LangChain, every time we invoke a Chain or an Agent, we pass a complete Prompt to the LLM. If we want the LLM to remember the conversation history, we must bundle that history into the current Prompt. This is the fundamental reason the Memory module exists. It acts as a "memory center" responsible for:

  1. Store: Recording the dialogue between the user and the AI.
  2. Retrieve: Extracting relevant historical dialogue at the start of a new conversation turn.
  3. Inject: Formatting the extracted history and injecting it as context into the Prompt sent to the LLM.
  4. Update: Appending the latest exchange to the memory after a conversation turn is completed.
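Here is a minimal sketch of that cycle using the two core methods every LangChain memory class exposes, save_context (store/update) and load_memory_variables (retrieve), before any Chain is involved:

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="history")

# Store / Update: record one completed exchange between the user and the AI
memory.save_context({"input": "我想重置密码"}, {"output": "好的,请先提供您的注册邮箱。"})

# Retrieve: pull the history back out so it can be injected into the next Prompt
print(memory.load_memory_variables({}))
# -> {'history': 'Human: 我想重置密码\nAI: 好的,请先提供您的注册邮箱。'}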

LangChain Memory Architecture Overview: Building the Copilot's "Brain"

LangChain's Memory module was born to cure LLM amnesia. It provides a suite of out-of-the-box tools that grant your AI applications context awareness. The core idea is to dynamically pass the conversation history to the LLM as part of the input.

The diagram below illustrates the workflow of the Memory module within our Intelligent Support Copilot project:

graph TD
    subgraph User Interaction
        User[User Request: I want to check my order] --> FrontEnd(Copilot Frontend/API)
    end

    subgraph LangChain Core Processing
        FrontEnd --> LC_Chain{LangChain Agent/Chain}

        LC_Chain -- 1. Get latest user input --> MemModule[Memory Module]
        MemModule -- 2. Retrieve conversation history --> LC_Chain
        LC_Chain -- 3. Combine current input & history to build full Prompt --> LLM_API(LLM API)
        LLM_API -- 4. Generate response --> LC_Chain
        LC_Chain -- 5. Store current conversation turn --> MemModule
        MemModule -- 6. Update memory state --> DB(Persistent Storage, Optional)
        LC_Chain -- 7. Return response --> FrontEnd
    end

    FrontEnd --> UserResponse[Copilot Response: Please provide your order number]

Workflow Breakdown:

  1. User Request: The user sends a message via the frontend interface.
  2. LangChain Chain/Agent Receives: The core logic of the support copilot (encapsulated by a LangChain Chain or Agent) receives the request.
  3. Memory Module Intervenes: Before sending the request to the LLM, the Chain/Agent interacts with the Memory module.
  4. Retrieve History: The Memory module retrieves relevant conversation history based on the current session ID.
  5. Build Prompt: The current user input and the retrieved history are combined to form a Prompt containing the full context.
  6. Call LLM: The constructed Prompt is sent to the LLM API.
  7. LLM Generates Response: The LLM generates an intelligent reply based on the context.
  8. Update Memory: After receiving the LLM's response, the Chain/Agent saves both the user's input and the LLM's reply into the Memory module, updating the history.
  9. Optional Persistence: If persistence is configured, the Memory module saves the state to a database, ensuring memory is retained across sessions or server restarts.
  10. Return Response: The copilot returns the reply to the user.

Through this workflow, our Intelligent Support Copilot gains "memory," allowing it to understand the context of multi-turn conversations and provide a more natural, coherent service.

Core LangChain Memory Types: Which "Brain Circuit" Should Your Copilot Use?

LangChain offers several Memory types, each with unique advantages and use cases. Understanding how they work is key to choosing the right "brain circuit."

1. ConversationBufferMemory: The Simple and Intuitive "Memory Buffer"

  • Mechanism: It acts like an infinitely large notebook, recording every single word from the user and the AI exactly as spoken, passing the entire history directly to the LLM.
  • Pros: Simple to implement, retains complete information, and loses zero details.
  • Cons: As the conversation grows, the history becomes longer, quickly exceeding the LLM's token limit and driving up API costs.
  • Ideal for: Short conversations, testing phases, or scenarios where context completeness is paramount and the number of turns is strictly controlled.

2. ConversationBufferWindowMemory: The "Sliding Memory" with a Limited Window

  • Mechanism: Similar to ConversationBufferMemory, but it only retains the last k turns of the conversation. When a new message arrives, the oldest is "pushed out" of the window, acting like a sliding window.
  • Pros: Effectively controls history length, prevents token overflow, and reduces costs.
  • Cons: Early conversations beyond the k limit are completely forgotten, potentially losing important context.
  • Ideal for: Most customer support scenarios where context is needed, token costs must be managed, and users typically only care about the most recent exchanges.

3. ConversationSummaryMemory: The "Memory Summary" that Simplifies Complexity

  • Mechanism: Instead of storing raw dialogue, it uses an LLM to condense the history into a running summary, which is refreshed after each exchange. New inputs are sent to the LLM alongside this summary.
  • Pros: Drastically compresses history length, saving significant tokens. Great for long conversations requiring long-term memory.
  • Cons: The summarization process itself consumes LLM tokens, and minor details might be lost. The quality of the summary heavily depends on the LLM's performance.
  • Ideal for: Long-running sessions, token-sensitive environments, and support scenarios where exact phrasing isn't critical (e.g., tracking a complex, long-term support ticket).

4. ConversationSummaryBufferMemory: The "Hybrid Memory" of Summaries and Windows

  • Mechanism: Combines the strengths of ConversationBufferWindowMemory and ConversationSummaryMemory. It keeps the raw dialogue for the most recent exchanges (buffer memory) while folding older ones into an LLM-generated summary (summary memory). Rather than counting turns, it prunes and summarizes whenever the buffered dialogue exceeds a specified token limit (max_token_limit).
  • Pros: Balances the granular detail of recent chats with the conciseness of long-term history. An ideal choice for complex scenarios.
  • Cons: Implementation is slightly more complex as it manages two memory modes.
  • Ideal for: Complex support workflows that require attention to recent details while keeping track of the early conversational background.

5. VectorStoreRetrieverMemory: The "External Knowledge Base Memory" via Semantic Matching

  • Mechanism: This type doesn't store text directly. Instead, it embeds conversation turns into vectors and stores them in a VectorStore. When memory is needed, it performs a semantic search based on the current input to find the most relevant historical snippets.
  • Pros: Can handle massive amounts of historical data, retrieving only the most relevant context to avoid token explosions. Achieves true "long-term" and "selective" memory.
  • Cons: Requires additional vector database infrastructure. Retrieval quality depends on the embedding model and vector DB performance. Implementation is more complex.
  • Ideal for: Scenarios where the copilot needs to retrieve information from massive chat histories, user documents, or product manuals to provide highly accurate assistance. (This is exactly the direction we will explore later in our "Intelligent Support Knowledge Base" project!)
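As a taste of what this looks like, here is a conceptual sketch. It assumes the faiss-cpu and langchain-openai packages; FAISS and OpenAIEmbeddings are stand-ins for whichever vector store and embedding model your project actually uses:

from langchain.memory import VectorStoreRetrieverMemory
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_texts(["用户上周咨询过 A 功能的配置方法"], embedding=embeddings)

vector_memory = VectorStoreRetrieverMemory(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 2})  # fetch the 2 most relevant snippets
)
vector_memory.save_context({"input": "我的订单还没发货"}, {"output": "已为您催促仓库处理。"})

# Only the snippets semantically closest to the new input are injected as memory
print(vector_memory.load_memory_variables({"input": "A 功能怎么配置?"}))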

In this session, we will focus primarily on the first three fundamental and commonly used Memory types to build a solid foundation.

💻 Hands-On Coding (Application in the Copilot Project)

Alright, theory sounds cool, but code is king! Now, let's apply these memory modules to our "Intelligent Support Knowledge Base" project. Imagine a user chatting with our copilot, asking about product features and troubleshooting.

We will use Python and LangChain to build these memory-enabled conversation chains. For demonstration purposes, we'll use a MockLLM to simulate the behavior of a Large Language Model, so you can run the code without configuring a real API key. Of course, in an actual project, you would simply replace this with ChatOpenAI or another real LLM instance.

import os
import time

from langchain.memory import (
    ConversationBufferMemory,
    ConversationBufferWindowMemory,
    ConversationSummaryMemory,
    ConversationSummaryBufferMemory
)
from langchain_core.language_models.llms import LLM
from langchain_core.prompts import PromptTemplate
from langchain.chains import LLMChain, ConversationChain
from langchain_core.messages import AIMessage, HumanMessage

# For demonstration, we use a mock LLM so you don't need to configure a real API key.
# In production, you would replace this with ChatOpenAI or another LLM.
# Subclassing langchain_core's LLM base class lets ConversationChain and the
# summary memory classes accept MockLLM like any other model.
class MockLLM(LLM):
    """
    A simple mock LLM used to demonstrate LangChain Memory modules.
    It returns canned replies based on keywords in the prompt and simulates summarization.
    """
    response_delay: float = 0.0  # Simulated latency in seconds

    @property
    def _llm_type(self) -> str:
        return "mock"

    def get_num_tokens(self, text: str) -> int:
        # Crude character-based estimate so token-limited memories work without a real tokenizer
        return len(text)

    def _call(self, prompt: str, stop=None, run_manager=None, **kwargs) -> str:
        """
        Simulate an LLM call, returning a preset reply based on the prompt content.
        """
        time.sleep(self.response_delay)  # Simulate network latency or computation time

        # Check if the prompt contains history, usually injected via memory_key
        if "历史对话:" in prompt:
            # Simply extract the history part to avoid overly verbose replies
            history_start = prompt.find("历史对话:")
            history_end = prompt.find("\n当前用户:")
            if history_start != -1 and history_end != -1 and history_start < history_end:
                history = prompt[history_start:history_end].strip()
            else:
                history = "无历史对话"
        else:
            history = "无历史对话"

        if "总结以下对话:" in prompt:
            # Simulate summarization behavior
            return "这是一段关于用户咨询产品功能和故障排除的对话总结。"
        elif "你好" in prompt or "Hello" in prompt:
            return "你好!我是你的智能客服小助手,很高兴为你服务。有什么可以帮你的吗?"
        elif "重置密码" in prompt:
            return "重置密码请访问我们的官方网站,点击'忘记密码'链接,按照指示操作即可。"
        elif "用户名" in prompt:
            return "找回用户名需要您提供注册时的邮箱或手机号进行验证,请问您方便提供吗?"
        elif "产品功能" in prompt:
            return "我们的产品主要有A、B、C三大核心功能,您具体想了解哪一个呢?"
        elif "故障" in prompt or "出问题" in prompt:
            return "很抱歉给您带来不便。请问您遇到了什么具体的问题?我将尝试为您排查。"
        elif "谢谢" in prompt or "感谢" in prompt:
            return "不客气!很高兴能帮到您。还有其他问题吗?"
        else:
            # Default reply, including simple feedback on the current input
            return f"我收到了你的消息:'{prompt.split('当前用户:')[-1].strip() if '当前用户:' in prompt else prompt}'。基于我们之前的交流({history}),请问还有什么可以为您解答的?"

# Instantiate our mock LLM
llm = MockLLM()

# --- 1. ConversationBufferMemory: The most direct memory approach ---
print("--- 演示 ConversationBufferMemory ---")
# Prompt template containing a placeholder to inject history
template_buffer = """
你是一个友好的智能客服小助手,请根据历史对话和当前用户提问,给出专业且有帮助的回复。

历史对话:
{history}
当前用户: {input}
智能客服:
"""
prompt_buffer = PromptTemplate.from_template(template_buffer)

# Instantiate the memory module
# memory_key already defaults to 'history'; we set it explicitly here for clarity
buffer_memory = ConversationBufferMemory(memory_key="history")

# Build a ConversationChain, which automatically handles prompt and memory
# verbose=True prints the detailed execution process of the Chain for debugging
buffer_conversation = ConversationChain(
    llm=llm,
    memory=buffer_memory,
    prompt=prompt_buffer,
    verbose=True
)

# Simulate multi-turn conversation
print("\n--- 第一轮对话 ---")
response1 = buffer_conversation.invoke({"input": "你好,我想咨询一下你们的产品功能。"})
print(f"客服回复: {response1['response']}")
# Check memory content
print(f"\n当前记忆内容:\n{buffer_memory.load_memory_variables({})}")

print("\n--- 第二轮对话 ---")
response2 = buffer_conversation.invoke({"input": "主要有哪些核心功能呢?"})
print(f"客服回复: {response2['response']}")
print(f"\n当前记忆内容:\n{buffer_memory.load_memory_variables({})}")

print("\n--- 第三轮对话 (模拟无关问题,看记忆是否完整) ---")
response3 = buffer_conversation.invoke({"input": "我最近电脑有点卡,该怎么办?"}) # This is a question unrelated to product features
print(f"客服回复: {response3['response']}")
print(f"\n当前记忆内容:\n{buffer_memory.load_memory_variables({})}")
# As you can see, buffer_memory records all conversations completely, including unrelated ones.
# As conversation turns increase, 'history' will get longer and longer.

# --- 2. ConversationBufferWindowMemory: Window memory, controlling length ---
print("\n\n--- 演示 ConversationBufferWindowMemory ---")
# The prompt template can be reused because only the memory strategy changes; the injected history format remains the same
template_window = """
你是一个友好的智能客服小助手,请根据最近的对话历史和当前用户提问,给出专业且有帮助的回复。

最近对话:
{history}
当前用户: {input}
智能客服:
"""
prompt_window = PromptTemplate.from_template(template_window)

# Instantiate window memory module, keeping only the last 2 turns (4 messages: 2 User + 2 AI)
window_memory = ConversationBufferWindowMemory(memory_key="history", k=2)

window_conversation = ConversationChain(
    llm=llm,
    memory=window_memory,
    prompt=prompt_window,
    verbose=True
)

# Simulate multi-turn conversation
print("\n--- 第一轮对话 ---")
window_conversation.invoke({"input": "你好,我遇到一个产品登录问题。"})
print(f"当前记忆内容:\n{window_memory.load_memory_variables({})}")

print("\n--- 第二轮对话 ---")
window_conversation.invoke({"input": "我输入了正确的用户名和密码,但一直提示错误。"})
print(f"当前记忆内容:\n{window_memory.load_memory_variables({})}")

print("\n--- 第三轮对话 (超出窗口,最旧的被移除) ---")
window_conversation.invoke({"input": "请问是网络问题还是账户被锁定?"})
print(f"当前记忆内容:\n{window_memory.load_memory_variables({})}")
# Notice that the first turn has been removed from memory; only the last two turns are kept.

print("\n--- 第四轮对话 (继续超出窗口) ---")
window_conversation.invoke({"input": "那我要怎么排查呢?"})
print(f"当前记忆内容:\n{window_memory.load_memory_variables({})}")
# Only the last two turns remain in memory.

# --- 3. ConversationSummaryMemory: Summary memory, saving Tokens ---
print("\n\n--- 演示 ConversationSummaryMemory ---")
# Summary memory requires an LLM to generate summaries, so our MockLLM needs to handle "summarize" requests
template_summary = """
你是一个友好的智能客服小助手,请根据对话的总结和当前用户提问,给出专业且有帮助的回复。

对话总结:
{history}
当前用户: {input}
智能客服:
"""
prompt_summary = PromptTemplate.from_template(template_summary)

# Instantiate summary memory module, requiring an LLM for summarization
summary_memory = ConversationSummaryMemory(llm=llm, memory_key="history")

summary_conversation = ConversationChain(
    llm=llm,
    memory=summary_memory,
    prompt=prompt_summary,
    verbose=True
)

# Simulate multi-turn conversation
print("\n--- 第一轮对话 ---")
summary_conversation.invoke({"input": "你好,我的订单状态显示已发货,但我还没收到。"})
print(f"当前记忆内容:\n{summary_memory.load_memory_variables({})}")
# ConversationSummaryMemory re-summarizes after every turn, so 'history' already holds a short summary of this first exchange.

print("\n--- 第二轮对话 ---")
summary_conversation.invoke({"input": "订单号是 ABC123456789。"})
print(f"当前记忆内容:\n{summary_memory.load_memory_variables({})}")
# The summary has been refreshed again and now covers both turns.

print("\n--- 第三轮对话 (触发总结) ---")
# ConversationSummaryMemory calls the LLM after every turn to fold the latest exchange into
# the running summary, so load_memory_variables always returns a summary rather than raw dialogue.
summary_conversation.invoke({"input": "请帮我查询一下物流信息。"})
print(f"当前记忆内容:\n{summary_memory.load_memory_variables({})}")
# Observe the history; it is no longer raw dialogue but a summary generated by the LLM.
# A real ConversationSummaryMemory updates the summary after every conversation turn.

# --- 4. ConversationSummaryBufferMemory: Combining window and summary ---
print("\n\n--- 演示 ConversationSummaryBufferMemory ---")
# The template is similar to summary because it ultimately presents history in a summarized form
template_summary_buffer = """
你是一个友好的智能客服小助手,请根据对话的总结和最近的对话历史,结合当前用户提问,给出专业且有帮助的回复。

对话总结:
{history}
当前用户: {input}
智能客服:
"""
prompt_summary_buffer = PromptTemplate.from_template(template_summary_buffer)

# Instantiate summary buffer memory module; max_token_limit controls when summarization is triggered
# Once max_token_limit is exceeded, the oldest complete dialogues are summarized to free up space.
summary_buffer_memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=100, # 100 tokens here is a mock value; in reality, it's calculated based on the LLM's tokenizer
    memory_key="history"
)

summary_buffer_conversation = ConversationChain(
    llm=llm,
    memory=summary_buffer_memory,
    prompt=prompt_summary_buffer,
    verbose=True
)

# Simulate multi-turn conversation
print("\n--- 第一轮对话 ---")
summary_buffer_conversation.invoke({"input": "你好,我想了解一下你们的服务条款。"})
print(f"当前记忆内容:\n{summary_buffer_memory.load_memory_variables({})}")
# At this point, it might still be complete dialogue

print("\n--- 第二轮对话 ---")
summary_buffer_conversation.invoke({"input": "特别是关于退款政策的部分。"})
print(f"当前记忆内容:\n{summary_buffer_memory.load_memory_variables({})}")

print("\n--- 第三轮对话 (可能触发总结或部分总结) ---")
summary_buffer_conversation.invoke({"input": "如果我在购买后7天内申请退款,能全额退吗?"})
print(f"当前记忆内容:\n{summary_buffer_memory.load_memory_variables({})}")
# Observe the history: you'll find it keeps recent complete dialogues while older ones are summarized.
# If the conversation continues and the complete dialogue portion exceeds max_token_limit, the oldest complete dialogue will be summarized by the LLM into shorter text,
# and added to the summary section, thereby freeing up space for new complete dialogues.

Code Breakdown & Copilot Application:

  • MockLLM: We used this to simulate real LLM behavior, including responding to specific keywords and handling "summarize" requests. In your actual project, this would be ChatOpenAI(temperature=0.7) or another LLM (a short swap-in sketch follows this breakdown).
  • ConversationBufferMemory: The most basic memory. Suitable for the copilot handling simple, short-term inquiries. For example: "What are your business hours?" "Monday to Friday, 9 AM to 6 PM." "Thanks." In this scenario, even the complete memory won't consume too many tokens.
  • ConversationBufferWindowMemory: One of the most commonly used memory types. For a support copilot, users typically only care about the last few turns. For instance, if a user asks about product features and then asks about pricing, the copilot only needs to remember the recent feature discussion, not an inquiry from a month ago. k=2 means keeping only the last 2 turns (User asks + AI answers).
  • ConversationSummaryMemory: When a user engages in lengthy, complex troubleshooting with the copilot—like a diagnostic process spanning hours or days—raw memory inflates rapidly. This is where ConversationSummaryMemory shines. It periodically condenses previous chats into a short text block, ensuring that no matter how long the conversation lasts, the historical context sent to the LLM remains at a manageable length, drastically saving token costs.
  • ConversationSummaryBufferMemory: The smart combination of the previous two. When troubleshooting a complex issue, the user might need the copilot to remember the exact details of the last few steps (window memory) while also understanding the general background of the problem (summary memory). This module perfectly balances granular detail with high-level summaries, making it a powerful tool for building advanced support copilots.

Through these hands-on exercises, you should now have a much more intuitive understanding of LangChain's Memory modules. Choosing the right memory strategy depends on the type of conversations your copilot will handle, the expected conversation length, and your token cost considerations.
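As mentioned in the breakdown above, swapping the MockLLM for a real model is a small change. Here's a minimal sketch, assuming the langchain-openai package and an OPENAI_API_KEY environment variable (the model name is only an illustration):

from langchain_openai import ChatOpenAI
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationChain

real_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)

copilot = ConversationChain(
    llm=real_llm,
    memory=ConversationBufferWindowMemory(memory_key="history", k=3),
    verbose=True  # uses ConversationChain's built-in prompt; our custom templates also work
)
print(copilot.invoke({"input": "你好,我想咨询一下你们的产品功能。"})["response"])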

Pitfalls and How to Avoid Them

As an experienced architect, I must warn you: while Memory is powerful, it is not without its "pitfalls." If you aren't careful, you could easily fall into a cost black hole, hit performance bottlenecks, or even trigger data leaks.

1. Token Limits and Costs: Beware of "Memory Overflow"

  • The Pitfall: Using ConversationBufferMemory without limits, or setting the k value too high in ConversationBufferWindowMemory, will cause the historical token count to inflate rapidly. This not only risks exceeding the LLM's context window limit (causing broken conversations or degraded performance) but also sharply increases your API costs. LLMs charge by the token; the longer the history, the more expensive every single call becomes.
  • How to Avoid:
    • Choose the Right Memory Type: For most support scenarios, ConversationBufferWindowMemory is a great starting point; setting k to 2-5 turns is usually sufficient. For long conversations, prioritize ConversationSummaryMemory or ConversationSummaryBufferMemory.
    • Monitor Token Usage: During development and testing, integrate a token counter (like tiktoken) to estimate the token count per call, and combine this with your LLM provider's pricing model to forecast costs (a small sketch follows this list).
    • Limit Single-Turn Length: When designing Prompts, explicitly instruct the LLM to keep replies concise, or use Agent logic to cap the output length.
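Here's a rough sketch of that kind of token check, assuming the tiktoken package and an OpenAI-style tokenizer (other providers ship their own tokenizers); buffer_memory is the instance from the hands-on demo above:

import tiktoken

def estimate_tokens(text: str, model: str = "gpt-4o-mini") -> int:
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        encoding = tiktoken.get_encoding("cl100k_base")  # generic fallback encoding
    return len(encoding.encode(text))

history_text = buffer_memory.load_memory_variables({})["history"]
print(f"Approximate tokens about to be sent as history: {estimate_tokens(history_text)}")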

2. Memory Persistence: Does Memory Survive a Server Restart?

  • The Pitfall: LangChain's default Memory modules (like ConversationBufferMemory) are merely in-memory objects. If your Python process restarts, the server crashes, or the user switches devices (e.g., from web to mobile), all conversation history is lost. This severely degrades the user experience.
  • How to Avoid:
    • Integrate External Storage: In production, you almost always need to persist memory to external storage. LangChain provides various options for persisting Memory, such as PostgresChatMessageHistory, RedisChatMessageHistory, etc.
    • Implement Session ID Management: Generate a unique ID for each user or session and link it to the chat records stored in the database. This way, when a user returns, you can reload the conversation history based on their session ID.
    • Example (Conceptual code, requires actual DB connection):
      # from langchain_community.chat_message_histories import RedisChatMessageHistory
      # from langchain.memory import ConversationBufferWindowMemory
      #
      # # Assuming you have configured a Redis connection
      # session_id = "user_123_session_abc" # Unique ID for each user/session
      # message_history = RedisChatMessageHistory(session_id=session_id, url="redis://localhost:6379/0")
      #
      # persisted_memory = ConversationBufferWindowMemory(
      #     memory_key="history",
      #     k=5,
      #     chat_memory=message_history # Integrate Redis history into Memory
      # )