Episode 27 | Troubleshooting in Production: Handling Overlong Responses and Retry Strategies (EN)
🎯 Learning Objectives for this Session
Hey, future AI masters! I'm your old friend, a 10-year veteran in the AI trenches, and your most enthusiastic mentor. Welcome to Part 1 of the LangChain Full-Stack Masterclass! This isn't just some high-level theoretical course; we are going to build a production-grade "Intelligent Support Knowledge Base" from scratch, step by step. Today, we're diving straight into the core secret of this intelligent copilot—its brain, the Large Language Model (LLM).
By the end of this session, you will:
- Understand the essence of LLMs: Go beyond the buzzwords and truly grasp the "brain" role LLMs play in intelligent support and their core working principles.
- Master LLM integration with LangChain: Learn how to use LangChain as an "operating system" to easily connect and invoke various mainstream LLMs, instantly giving your support copilot the ability to think.
- Understand LLM parameter tuning: Learn how key parameters like `temperature` affect LLM behavior, and formulate initial strategies for your support scenarios to make responses more accurate and human-like.
- Build the foundation for AI applications: Lay a solid groundwork for developing more complex AI applications later, knowing how to bootstrap your intelligent system from the ground up.
📖 Deep Dive into the Concepts
Alright, enough chit-chat, let's get down to business!
The "Brain" of Intelligent Support: What exactly is an LLM?
Think about what makes an excellent customer support representative: listening, understanding, thinking, and providing accurate, helpful answers, right? Now, if we "digitize" these capabilities, we get an LLM.
LLM (Large Language Model), as the name suggests, is a "large" language model. This "largeness" is reflected in two aspects:
- Massive Data: They are trained on colossal amounts of text data encompassing almost all human knowledge, expressions, and logical relationships.
- Massive Parameters: The number of internal parameters is staggering, often reaching billions or even trillions, enabling them to capture incredibly complex and subtle patterns in language.
In our "Intelligent Support Knowledge Base" project, the LLM is the central brain responsible for "thinking" and "decision-making". When a user throws a question at it, it first needs to understand the user's intent (e.g., asking about an order, after-sales service, or product features), then search the knowledge base based on that understanding, and finally formulate an answer in human-readable language.
Transformer: The Secret Weapon Behind LLMs' Power
You might ask, "We've had various Natural Language Processing (NLP) models before, why are LLMs suddenly so powerful?" The answer is the Transformer architecture.
Before Transformers, mainstream models (like RNNs and LSTMs) processed text like short-sighted workers—looking at one word at a time and remembering only a tiny bit of previous information, making them prone to "forgetting" when dealing with long texts. However, in customer support scenarios, user queries can be lengthy, detailed, or even disjointed. If a model can't grasp the "big picture", it can't understand the context.
The core innovation of the Transformer is the "Self-Attention" mechanism. You can think of it like this: when an LLM processes a word, it doesn't just look at the word itself; it simultaneously "scans" the entire sentence or even paragraph, assigning an "attention weight" to each word. Words more closely related to the current word receive higher attention.
For example: A user asks, "My order number is XYZ123, placed yesterday, has it shipped yet?"
When the LLM processes the word "shipped", the self-attention mechanism immediately focuses on key information like "order number XYZ123" and "placed yesterday", thereby understanding that this is a query about a specific order's status. This mechanism allows LLMs to efficiently capture long-range dependencies and truly "read" the context.
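If you'd like to see the mechanism in miniature, here is a toy NumPy sketch of scaled dot-product attention; the four-token "sentence", the random vectors, and the dimension sizes are purely illustrative, not how any production LLM is configured:

```python
# Toy sketch of scaled dot-product attention (illustrative only, not production code).
import numpy as np

rng = np.random.default_rng(0)
tokens = ["order", "XYZ123", "shipped", "yet"]  # a tiny illustrative "sentence"
d = 8                                           # toy embedding dimension
X = rng.normal(size=(len(tokens), d))           # one random vector per token

# In a real Transformer, Wq, Wk, Wv are learned projection matrices.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d)                   # pairwise "relatedness" of tokens
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
context = weights @ V                           # each token becomes a weighted mix of all tokens

print(np.round(weights, 2))                     # row i = how much token i attends to each token
```

Each row of `weights` sums to 1; with trained (rather than random) matrices, the row for "shipped" would put most of its mass on "order" and "XYZ123".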
LangChain: The "Operating System" for LLMs
Now that we understand the power of LLMs, what role does LangChain play? LangChain is not an LLM itself; it's more like a tailor-made "operating system" and "toolkit" for LLMs. It provides a standardized set of interfaces, components, and chain structures, allowing you to:
- Unified Integration: Whether it's OpenAI's GPT, Google's Gemini, or open-source models on HuggingFace, LangChain lets you invoke them using a unified approach.
- Context Management: Enables the LLM to remember multi-turn conversation history, acting like a support agent with a memory.
- Tool Integration: Empowers the LLM to not just "talk" but also "act", such as calling external APIs to check orders, searching knowledge bases, sending emails, etc.
- Complex Workflow Construction: Chains together multiple LLM calls, tool usages, and data processing steps to form complex intelligent workflows.
In our intelligent support project, LangChain is the "nervous system" that connects the LLM brain, external knowledge bases, user interfaces, order systems, and all other components.
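As a taste of that "unified integration", here is a minimal sketch: swapping providers is a one-line change because every chat model exposes the same `.invoke()` interface. The Gemini line assumes you have installed `langchain-google-genai` and set a Google API key, and the model names are just examples:

```python
from langchain_openai import ChatOpenAI
# from langchain_google_genai import ChatGoogleGenerativeAI  # alternative provider

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.3)
# llm = ChatGoogleGenerativeAI(model="gemini-pro", temperature=0.3)  # same interface

# The calling code never changes, no matter which provider sits behind `llm`.
response = llm.invoke("How do I check my order status?")
print(response.content)
```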
Simplified Diagram of the LLM Workflow in Intelligent Support
The Mermaid diagram below provides a simplified view of the LLM's position and workflow within our intelligent support project:
```mermaid
graph TD
    A[User Query] --> B{LangChain Framework}
    B --> C["LLM (e.g., OpenAI GPT)"]
    C -- "Understand Intent & Generate Response" --> D[Intelligent Support Copilot Core]
    D --> E[Reply to User]
    subgraph Simplified Internal LLM Workflow
        C --> C1[Tokenizer: Text Segmentation]
        C1 --> C2[Embedding: Semantic Vectorization]
        C2 --> C3[Transformer Blocks: Core Inference]
        C3 --> C4[Output Layer: Text Generation]
        C4 --> C
    end
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#f9f,stroke:#333,stroke-width:2px
    style C fill:#ccf,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style D fill:#ddf,stroke:#333,stroke-width:2px
```

The user's question is passed to the LLM via LangChain. Inside the LLM, it goes through tokenization, vectorization, and complex Transformer inference to finally generate a response, which is then returned to the user via LangChain. At this foundational stage, the LLM acts like a "know-it-all" capable of understanding and answering questions.
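You can inspect the first of those internal steps yourself. This short sketch uses OpenAI's `tiktoken` library (`pip install tiktoken`) to show how a query is split into tokens before any inference happens:

```python
# Sketch: peeking at the tokenizer step with tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
query = "My order number is XYZ123, has it shipped yet?"
token_ids = enc.encode(query)

print(token_ids)                             # the integer IDs the model actually sees
print([enc.decode([t]) for t in token_ids])  # the corresponding text pieces
print(f"{len(token_ids)} tokens")            # token count is also what you are billed for
```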
💻 Hands-On Coding: Practical Application in the Support Project
Enough theory, it's time to make our "Intelligent Support" move! In this session, we'll build the most basic "brain" for our support copilot—directly invoking an LLM to answer questions.
We will use the langchain-openai package to connect to OpenAI's models. If you want to use other models, LangChain provides similar interfaces, such as langchain-google-genai, langchain-huggingface, etc.
Prerequisites:
- Ensure you have an OpenAI API Key. If not, register and get one from the OpenAI Official Website.
- Strongly recommended: Set your API Key as an environment variable rather than hardcoding it in your code to ensure security.
- Linux/macOS: `export OPENAI_API_KEY='your_api_key'`
- Windows (CMD): `set OPENAI_API_KEY=your_api_key`
- Windows (PowerShell): `$env:OPENAI_API_KEY='your_api_key'`

Alternatively, you can create a `.env` file in your project root directory containing `OPENAI_API_KEY='your_api_key'`, and use `python-dotenv` (Python) or `dotenv` (Node.js) to load it.
Python Hands-On Code
First, install the necessary libraries:
```bash
pip install langchain-openai python-dotenv
```
Then, create a `basic_llm_copilot.py` file:
```python
import os

from dotenv import load_dotenv
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

# 1. Load environment variables (ensure the .env file is in the project root)
load_dotenv()

# Check that the API Key is set
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY is not set; check your .env file or system environment variables.")

# 2. Initialize the LLM
# We use ChatOpenAI because it is better suited for conversational scenarios.
# model_name can be "gpt-3.5-turbo", "gpt-4", "gpt-4o", etc.
# The temperature parameter controls the randomness/creativity of the output.
# For support scenarios we usually want accurate, stable responses, so we keep temperature low.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.3)

print("Intelligent Support Copilot started! Type 'exit' to end the conversation.")

# 3. Simulate the intelligent support conversation loop
def run_copilot():
    while True:
        user_query = input("\n[You]: ")
        if user_query.lower() == 'exit':
            print("[Copilot]: Thank you for using our service. Goodbye!")
            break

        print("[Copilot]: Thinking...")
        try:
            # Build the message list. SystemMessage sets the assistant's role and behavior;
            # HumanMessage carries the user's input.
            messages = [
                SystemMessage(
                    content=(
                        "You are a professional intelligent support assistant focused on "
                        "questions about products, orders, and after-sales service. "
                        "Reply in a friendly, concise, professional tone."
                    )
                ),
                HumanMessage(content=user_query),
            ]
            # Invoke the LLM to generate a response
            response = llm.invoke(messages)
            # Print the LLM's response
            print(f"[Copilot]: {response.content}")
        except Exception as e:
            print(f"[Copilot]: Sorry, an error occurred while handling your request: {e}")
            print("Please check your network connection and that your API Key is valid.")

if __name__ == "__main__":
    run_copilot()
```
Run the Python code:
```bash
python basic_llm_copilot.py
```
Example Conversation:
```text
Intelligent Support Copilot started! Type 'exit' to end the conversation.

[You]: What are the features of your product?
[Copilot]: Thinking...
[Copilot]: Hello! The main features of our product include high performance, strong ease of use, and excellent stability. It is designed to enhance user experience and can meet your needs in XXX. Which specific feature are you interested in?

[You]: My order number is X123456789, has it shipped yet?
[Copilot]: Thinking...
[Copilot]: Hello! Regarding order number X123456789, the system currently shows it is processing and is expected to ship within 24 hours. You will receive an SMS notification once shipped, and you can check the logistics information on the order details page. Please wait patiently.

[You]: exit
[Copilot]: Thank you for using our service. Goodbye!
```
TypeScript Hands-On Code
First, create a new Node.js project and install the necessary libraries:
```bash
mkdir langchain-copilot-ts && cd langchain-copilot-ts
npm init -y
npm install @langchain/openai @langchain/core langchain dotenv readline-sync
npm install -D typescript @types/node @types/readline-sync
npx tsc --init
```
Modify `tsconfig.json` to include settings like:
- `"target": "es2021"`
- `"module": "commonjs"` (or `esnext`, depending on your project config)
- `"outDir": "./dist"`
- `"esModuleInterop": true`
- `"forceConsistentCasingInFileNames": true`
- `"strict": true`
Then, create a src/basic_llm_copilot.ts file:
```typescript
import 'dotenv/config'; // Load .env config; keep this before other imports
import { ChatOpenAI } from '@langchain/openai';
import { HumanMessage, SystemMessage } from '@langchain/core/messages';
import * as readlineSync from 'readline-sync'; // Used to read user input synchronously

// 1. Check that the API Key is set
if (!process.env.OPENAI_API_KEY) {
  throw new Error("OPENAI_API_KEY is not set; check your .env file or system environment variables.");
}

// 2. Initialize the LLM
// ChatOpenAI is better suited for conversational scenarios.
// modelName can be "gpt-3.5-turbo", "gpt-4", "gpt-4o", etc.
// The temperature parameter controls the randomness/creativity of the output.
// Support scenarios usually require stable, accurate responses, so we keep temperature low.
const llm = new ChatOpenAI({
  modelName: "gpt-3.5-turbo",
  temperature: 0.3,
});

console.log("Intelligent Support Copilot started! Type 'exit' to end the conversation.");

// 3. Simulate the intelligent support conversation loop
async function runCopilot() {
  while (true) {
    const userQuery = readlineSync.question("\n[You]: ");
    if (userQuery.toLowerCase() === 'exit') {
      console.log("[Copilot]: Thank you for using our service. Goodbye!");
      break;
    }

    console.log("[Copilot]: Thinking...");
    try {
      // Build the message list. SystemMessage sets the assistant's role and behavior;
      // HumanMessage carries the user's input.
      const messages = [
        new SystemMessage(
          "You are a professional intelligent support assistant focused on questions about " +
          "products, orders, and after-sales service. Reply in a friendly, concise, professional tone."
        ),
        new HumanMessage(userQuery),
      ];
      // Invoke the LLM to generate a response
      const response = await llm.invoke(messages);
      // Print the LLM's response
      console.log(`[Copilot]: ${response.content}`);
    } catch (e: any) {
      console.error(`[Copilot]: Sorry, an error occurred while handling your request: ${e.message}`);
      console.log("Please check your network connection and that your API Key is valid.");
    }
  }
}

// Run the copilot
runCopilot();
```
Run the TypeScript code:
```bash
npx ts-node src/basic_llm_copilot.ts
```
Or compile first, then run:
```bash
npx tsc
node dist/basic_llm_copilot.js
```
Example Conversation: identical to the Python version above.
Through this code, we've successfully given our intelligent support copilot a "brain" capable of understanding and providing basic answers to user questions! Although it can't access a real order system yet, it already demonstrates the powerful language understanding and generation capabilities of LLMs.
Pitfalls & Best Practices
As an architect with ten years of experience, I've seen too many beginners stumble here. Don't worry, I've dug up these pitfalls for you so you can avoid them!
API Key Security: Hardcoding is the root of all evil!
- Pitfall: Directly writing `OPENAI_API_KEY = "sk-xxxxxxxx"` in your code and pushing it to GitHub. Congratulations, your key will be stolen quickly, and your wallet will bleed.
- Best Practice: Always use environment variables or configuration files to manage sensitive information. Our code's use of `load_dotenv()` and `process.env.OPENAI_API_KEY` follows this practice. In production environments, you should use a cloud provider's secret management service (like AWS Secrets Manager or Azure Key Vault).
Model Selection: More expensive isn't always better, nor is newer!
- Pitfall: Blindly rushing to the newest, most powerful model (like `gpt-4o`), thinking it will definitely yield the best results.
- Best Practice: Different models have different capabilities, costs, and latencies:
  - `gpt-3.5-turbo`: the king of cost-effectiveness; fast, and suitable for most daily support Q&A and text-summarization scenarios.
  - `gpt-4` / `gpt-4o`: more capable, excelling at logical reasoning and complex tasks, but more expensive with relatively higher latency.
- Recommendation: Start testing with `gpt-3.5-turbo`. If the results don't meet your needs, then consider upgrading to a more powerful model. Choose the best "compute-cost" balance point for your support scenario (see the sketch after this list).
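One pragmatic pattern is to make the model name configurable, so you can A/B-test cost against quality without touching code. A minimal sketch (`SUPPORT_MODEL` is a hypothetical variable name, not a LangChain convention):

```python
import os
from langchain_openai import ChatOpenAI

# SUPPORT_MODEL is a hypothetical env var; default to the cost-effective choice.
model_name = os.getenv("SUPPORT_MODEL", "gpt-3.5-turbo")
llm = ChatOpenAI(model_name=model_name, temperature=0.3)
```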
The `temperature` Parameter: Support isn't poetry; it needs to be rigorous!
- Pitfall: Setting `temperature` randomly, or setting it very high for the sake of "innovation".
- Best Practice: `temperature` controls the randomness and creativity of the model's output:
  - `temperature = 0` (or close to 0): output is the most deterministic and conservative, suitable for scenarios requiring precise facts and avoiding hallucinations (e.g., checking order status, knowledge base Q&A).
  - `temperature = 0.7` or higher: output is more diverse and creative, suitable for content creation, brainstorming, etc.
- Support Scenario: Our goal is to provide accurate, consistent answers. Therefore, `temperature` is usually best set between `0.0` and `0.5` to prevent the LLM from improvising and fabricating non-existent information (a quick sketch follows this list).
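In practice you might even keep two differently tuned instances side by side, one per task type. A quick sketch:

```python
from langchain_openai import ChatOpenAI

# Deterministic instance for factual tasks (order status, knowledge base answers).
factual_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.0)

# Looser instance for drafting varied, friendly phrasings (use with care in support).
creative_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.8)
```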
Hallucination: LLMs can "talk nonsense with a straight face"!
- Pitfall: Blindly trusting all LLM outputs, thinking it knows everything.
- Best Practice: LLMs predict the next word based on probabilities; they don't truly "understand" facts. When lacking factual information, they will "fabricate" content that sounds reasonable but is actually wrong based on patterns in their training data.
- Initial Mitigation:
  - Clear System Instructions (SystemMessage): Strictly define the LLM's role and behavior, such as "You are a support agent; you can only answer based on the provided information, do not guess."
  - Low `temperature`: Reduce randomness.
  - Future Solution (RAG): This is the ultimate weapon against hallucinations. We will dive deep into how to combine external knowledge bases to provide accurate information in upcoming lessons, so stay tuned! (A minimal preview follows this list.)
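Until we get to RAG, here is a minimal preview of the first mitigation: pinning the model to facts you supply. The `known_facts` string is purely illustrative; in later lessons it will come from a real knowledge base:

```python
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.0)

# Illustrative only: in later lessons these facts will come from a real knowledge base.
known_facts = "Order X123456789: status = processing, expected to ship within 24 hours."

messages = [
    SystemMessage(content=(
        "You are a support agent. Answer ONLY from the facts below. "
        "If the facts do not cover the question, say you don't know.\n\n"
        f"Facts:\n{known_facts}"
    )),
    HumanMessage(content="Has order X123456789 shipped yet?"),
]
print(llm.invoke(messages).content)
```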
Token Limits and Costs: Make every penny count!
- Pitfall: Ignoring the number of input and output tokens, leading to errors when conversations get long, or a massive bill.
- Best Practice: LLMs have a maximum context window (an input token limit); exceeding it causes errors. Also, every API call is billed by the token, and both input and output tokens count.
- Recommendation:
- Monitor token usage: LangChain provides callback mechanisms to monitor token usage (see the sketch after this list).
- Concise Prompts: Avoid redundant information; get straight to the core.
- Summarize conversation history: In multi-turn conversations, periodically summarize the history, keeping only key information to reduce token consumption. We will explore this deeply in the upcoming "Memory Management" lesson.
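On the first recommendation above, here is a minimal sketch using LangChain's OpenAI callback to count tokens and estimate cost (the exact import path has moved between LangChain versions, so adjust to yours):

```python
from langchain_community.callbacks import get_openai_callback  # import path varies by LangChain version
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.3)

with get_openai_callback() as cb:
    response = llm.invoke("Summarize our return policy in one sentence.")

print(f"Prompt tokens:     {cb.prompt_tokens}")
print(f"Completion tokens: {cb.completion_tokens}")
print(f"Total tokens:      {cb.total_tokens}")
print(f"Estimated cost:    ${cb.total_cost:.6f}")
```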
📝 Session Summary
Congratulations on taking the first step towards building a LangChain full-stack AI application! In this session, we demystified the core principles of LLMs as the "brain" of intelligent support, especially how the Transformer architecture grants them powerful language understanding capabilities. We also learned how to use LangChain to quickly integrate LLMs, giving our support copilot its most basic "thinking" ability.
Most importantly, we previewed the most common pitfalls in development and mastered initial best practices to avoid them—these are lessons I learned the hard way with my own money!
Recapping the core of this session:
- LLMs are the "brain" of intelligent support: Understanding user intent and generating responses.
- Transformers are the foundation of LLM power: The core is the self-attention mechanism, enabling the understanding of long contexts.
- LangChain is the "operating system" for LLMs: Providing unified interfaces to connect everything.
- Hands-on: Used `ChatOpenAI` to quickly build an intelligent support copilot capable of basic conversation.
- Pitfalls: API Key security, model selection, `temperature` tuning, hallucinations, and token costs.
Looking Ahead to the Next Session: While our support copilot can talk now, it has the "memory of a goldfish"—it forgets the previous sentence as soon as you ask the next one. This is far from enough in real-world support scenarios! In the next session, we will dive deep into LangChain's Memory module, giving our intelligent support true "long-term memory" so it can remember the context of multi-turn conversations and become a smarter, more coherent conversational partner!
Are you ready? See you in the next session!