第 07 期 | 切分艺术：Text Splitters 与上下文保全 (EN) — LangChain Masterclass: Zero to Production AI Applications

🎯 Learning Objectives for This Session

Welcome back to the "LangChain Full-Stack Masterclass," future AI architects! In the last session, we explored the skeleton of LangChain. In this session, we'll start fleshing out that skeleton with real "intelligence." Imagine an excellent customer support agent—what is their core competency? It's not just being articulate; it's "knowing a lot, remembering well, and finding answers quickly." In this session, we are going to equip our intelligent support copilot with exactly these capabilities!

By the end of this session, you will:

Deeply understand the core philosophy of RAG (Retrieval-Augmented Generation), and its critical role in solving pain points like LLM "hallucinations" and "outdated knowledge."
Master the use of LangChain Document Loaders, enabling you to efficiently import unstructured data in various formats (PDFs, web pages, local files, etc.) into our intelligent support system.
Learn to utilize LangChain Retrievers to accurately extract relevant information from massive knowledge bases, ensuring the support copilot can always find the best-matching answers to user queries.
Seamlessly integrate document loading with retrievers to build a scalable, highly efficient knowledge retrieval foundation for intelligent support, giving your AI true "memory" and "search" capabilities.

📖 Concept Breakdown

RAG: Curing LLM "Hallucinations" and "Amnesia"

Remember what we discussed earlier? LLMs are powerful, but they have their own "Achilles' heel":

Hallucination: They might confidently make up facts.
Knowledge Cutoff: Their training data has an expiration date, leaving them unaware of the latest information.
Proprietary Data Missing: They know nothing about your company's internal regulations, product manuals, or historical support tickets.

For our "Intelligent Support Knowledge Base" project, this is fatal! If a support agent just spouts nonsense or knows nothing about their own products, users will definitely complain.

RAG (Retrieval-Augmented Generation) is the "silver bullet" for these problems. Its core idea is simple: Before the LLM generates an answer, it first goes to an external, authoritative, real-time knowledge base to find the most relevant reference materials. It then feeds these materials along with the user's question to the LLM, asking the LLM to generate an answer based on those references.

Imagine your intelligent support copilot:

When a user asks: "How do I use the new XX feature of your product?"

An LLM without RAG might answer: "I'm sorry, I don't know about the XX feature." Or worse, "The XX feature is used to brew coffee." (Total fabrication!)

What would an LLM with RAG do?

Retrieve: The copilot first searches the company's product manuals, FAQ pages, and technical docs for all materials related to the "XX feature."
Extract: It finds a few of the most relevant document snippets.
Generate: Then, it hands these snippets and the user's question to the LLM: "The user is asking how to use the XX feature. Here are the relevant materials I found. Please generate a concise and clear answer based on them."

This way, the LLM has an "open-book exam." Its answers will be more accurate, timely, and aligned with our company's actual situation.

The RAG Workflow in the Intelligent Support Project

In our "Intelligent Support Knowledge Base" project, the complete RAG workflow can be broken down into the following core stages. In this session, we will focus on the first two: Data Ingestion and Retrieval.

graph TD
    subgraph Ingestion
        A[Raw Knowledge Base: PDF, DOCX, Web Pages, Databases] --> B(LangChain Document Loaders)
        B --> C(Documents)
        C --> D(Text Splitters)
        D --> E(Text Chunks)
        E --> F[Embedding Model]
        F --> G[Vector Store]
    end

    subgraph Retrieval & Generation
        H[User Query] --> I(Embedding Model)
        I --> J(Query Vector)
        J --> K[LangChain Retrievers]
        K -- Retrieve Similar Vectors --> G
        G -- Return Top-K Relevant Text Chunks --> K
        K --> L(Top-K Relevant Document Chunks)
        L --> M[LLM]
        M -- Generate Answer based on Query & Chunks --> N(Final Answer)
        N --> H
    end

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style C fill:#ccf,stroke:#333,stroke-width:2px
    style D fill:#ddf,stroke:#333,stroke-width:2px
    style E fill:#eef,stroke:#333,stroke-width:2px
    style F fill:#ffc,stroke:#333,stroke-width:2px
    style G fill:#cfc,stroke:#333,stroke-width:2px

    style H fill:#f9f,stroke:#333,stroke-width:2px
    style I fill:#ffc,stroke:#333,stroke-width:2px
    style J fill:#eef,stroke:#333,stroke-width:2px
    style K fill:#bbf,stroke:#333,stroke-width:2px
    style L fill:#ddf,stroke:#333,stroke-width:2px
    style M fill:#fcc,stroke:#333,stroke-width:2px
    style N fill:#efe,stroke:#333,stroke-width:2px

As you can see from the diagram above, Document Loaders are responsible for taking our scattered knowledge (PDF product manuals, HTML FAQ pages, Markdown technical docs, or even database records) and uniformly loading them into Document objects recognized by LangChain. These Document objects contain page_content (the actual text) and metadata (source, page number, etc.).

The loaded documents are usually too large to be fed directly into an LLM, so they need to be sliced into smaller, more manageable Text Chunks using Text Splitters.

Next, these text chunks are converted into Vectors by an Embedding Model and stored in a Vector Store. This step is crucial; it encodes the semantic information of the text into numerical form, laying the groundwork for subsequent semantic retrieval.

When a user asks a question, their query is also converted into a Query Vector by the embedding model. Enter the Retrievers! They take this query vector and search the vector store for the most semantically similar text chunks. This is what we mean by "finding answers quickly and accurately."

Finally, these retrieved relevant text chunks, along with the user's original question, are fed as context to the Large Language Model (LLM) to generate the final answer.

Deep Dive into LangChain Document Loaders

LangChain provides a massive array of DocumentLoaders. Their purpose is to provide a unified interface to convert data from different sources and formats into LangChain Document objects. Think of them as super translators: whether your knowledge is a cryptic PDF or a sprawling web page, they translate it into "plain text" that the LLM can understand.

Commonly used loaders include:

PyPDFLoader: Loads documents from PDF files.
WebBaseLoader: Loads content from web URLs.
DirectoryLoader: Loads various files from a local directory (combining Glob patterns and specific Loader types).
CSVLoader, JSONLoader: Loads structured data.
EvernoteLoader, NotionLoader, ConfluenceLoader: Loads from various note-taking or collaboration platforms.

Their common method is load(), which returns a List[Document].

Deep Dive into LangChain Retrievers

The retriever is the "eyes" and "hands" of RAG. It is responsible for "seeing" and "grabbing" the most relevant materials from your knowledge base based on the user's question.

The most core and commonly used retriever is the VectorStoreRetriever. Here is how it works:

Receives a user query string.
Converts this query string into a vector using an embedding model.
Uses a vector similarity search algorithm (like cosine similarity) in the underlying VectorStore to find the k most similar document chunks to the query vector.
Returns these document chunks.

Key Parameters:

vectorstore: Must be linked to a VectorStore instance that already stores document vectors.
search_type: The retrieval type, defaulting to similarity (similarity search). It can also be mmr (Maximal Marginal Relevance, used to increase the diversity of retrieval results while maintaining relevance).
search_kwargs: Additional parameters passed to the underlying VectorStore's search method. The most common is k (how many document chunks to return).

💻 Practical Code Drill (Application in the Support Copilot Project)

Alright, theory is great, but rolling up our sleeves and coding is better! Now, we will apply these concepts to our "Intelligent Support Knowledge Base" project.

Scenario Setup: Our intelligent support copilot needs to answer questions about company products. This product information is scattered across:

Product Manual (PDF): product_manual.pdf
Frequently Asked Questions (FAQ Web Page): https://www.example.com/faq (We'll use a simulated URL here)

We will demonstrate how to load these documents, split them, store them in a vector database, and finally retrieve relevant information using a retriever.

1. Environment Setup and Dependency Installation

First, ensure your Python environment is ready and install the necessary libraries.

pip install -q langchain-community langchain-openai pypdf beautifulsoup4 chromadb tiktoken

langchain-community: Contains various Document Loaders and Vector Stores.
langchain-openai: Used for OpenAI's embedding models and LLMs.
pypdf: Used for processing PDF files.
beautifulsoup4: Used for parsing HTML web content (required by WebBaseLoader).
chromadb: A lightweight local vector database, perfect for learning and prototyping.
tiktoken: OpenAI's token counting tool.

2. Set the OpenAI API Key

Ensure you have set the OPENAI_API_KEY environment variable.

import os
# Please replace with your actual API key, or ensure the environment variable is set
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

3. Create Mock Documents (or Use Real Ones)

For demonstration purposes, let's first create a mock PDF file. For the web page, we'll directly use a public URL.

a. Create product_manual.pdf

You can manually create a simple PDF file with the following content:

# product_manual.pdf (Content Example)
## 产品 A 使用指南

产品 A 是一款创新的智能家居设备，旨在提升您的生活品质。

### 安装步骤
1. 打开包装，取出产品 A 主机及配件。
2. 将产品 A 放置在平稳的表面。
3. 连接电源线，并确保指示灯亮起。
4. 下载并安装“智能管家”App。
5. 按照 App 指引完成设备配对。

### 常见问题
- **Q: 产品 A 无法开机怎么办？**
  A: 请检查电源连接是否牢固，或尝试更换电源插座。如果问题依旧，请联系客服。
- **Q: 如何重置产品 A？**
  A: 在设备通电状态下，长按设备背面的重置按钮 5 秒，直到指示灯闪烁。

## 产品 B 功能介绍

产品 B 是一款高效的办公助手，提升您的工作效率。

### 主要功能
- 智能日程管理
- 会议纪要自动生成
- 任务提醒与协作

### 故障排除
- **Q: 产品 B 无法连接网络？**
  A: 检查网络设置，确保 Wi-Fi 密码正确。重启设备和路由器后重试。

Save the above content as product_manual.txt, then use any text-to-PDF tool to convert it to product_manual.pdf, and make sure to place it in the same directory as your Python script.

b. Mock FAQ Web Page (Using a real public page as an example)

We will use a page from the official LangChain documentation as an example to demonstrate the capabilities of WebBaseLoader. faq_url = "https://www.langchain.com/blog/rag-is-all-you-need" (This page has rich content, suitable for demonstration)

4. Practical Code: Document Loading, Splitting, Embedding, and Retrieval

from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

import os

# Ensure your OpenAI API key is set
# os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY 环境变量未设置。请设置你的API密钥。")

print("--- 1. 文档加载阶段 ---")

# --- 1.1 Load PDF documents using PyPDFLoader ---
pdf_path = "product_manual.pdf"
try:
    pdf_loader = PyPDFLoader(pdf_path)
    pdf_docs = pdf_loader.load()
    print(f"成功加载 {len(pdf_docs)} 页 PDF 文档。")
    # Print preview of the first page content
    if pdf_docs:
        print(f"PDF 第一页内容预览: \n{pdf_docs[0].page_content[:200]}...")
except Exception as e:
    print(f"加载 PDF 文档失败: {e}")
    # If the PDF file does not exist, create an empty Document list to avoid subsequent errors
    pdf_docs = []
    # For demonstration, if the PDF does not exist, we create a mock Document
    if not os.path.exists(pdf_path):
        print(f"警告：'{pdf_path}' 不存在，将使用模拟 PDF 内容。")
        pdf_docs.append(Document(page_content="""
        ## 产品 A 使用指南
        产品 A 是一款创新的智能家居设备，旨在提升您的生活品质。
        ### 安装步骤
        1. 打开包装，取出产品 A 主机及配件。
        2. 将产品 A 放置在平稳的表面。
        3. 连接电源线，并确保指示灯亮起。
        4. 下载并安装“智能管家”App。
        5. 按照 App 指引完成设备配对。
        ### 常见问题
        - Q: 产品 A 无法开机怎么办？A: 请检查电源连接是否牢固，或尝试更换电源插座。
        ## 产品 B 功能介绍
        产品 B 是一款高效的办公助手，提升您的工作效率。
        ### 主要功能
        - 智能日程管理
        - 会议纪要自动生成
        - 任务提醒与协作
        ### 故障排除
        - Q: 产品 B 无法连接网络？A: 检查网络设置，确保 Wi-Fi 密码正确。重启设备和路由器后重试。
        """, metadata={"source": "simulated_product_manual.pdf"}))


# --- 1.2 Load web page documents using WebBaseLoader ---
faq_url = "https://www.langchain.com/blog/rag-is-all-you-need" # Example URL
try:
    web_loader = WebBaseLoader(faq_url)
    web_docs = web_loader.load()
    print(f"成功加载 {len(web_docs)} 个网页文档。")
    # Print preview of the first web page content
    if web_docs:
        print(f"网页内容预览: \n{web_docs[0].page_content[:200]}...")
except Exception as e:
    print(f"加载网页文档失败: {e}")
    web_docs = []
    # If loading fails, also create a mock Document
    print(f"警告：无法从 '{faq_url}' 加载，将使用模拟网页内容。")
    web_docs.append(Document(page_content="""
    智能客服系统 FAQ:
    Q: 如何联系客服？A: 您可以通过电话 400-123-4567 或在线聊天联系我们。
    Q: 订单状态查询？A: 请登录您的账户，在“我的订单”中查询。
    Q: 退换货政策？A: 购买后7天内可无理由退换货，详情请参考官网政策。
    """, metadata={"source": "simulated_faq_webpage"}))


# Merge all loaded documents
all_docs = pdf_docs + web_docs
if not all_docs:
    print("没有可用于处理的文档，请检查PDF文件和网络连接。")
    exit() # Exit directly if there are no documents

print("\n--- 2. 文本分割阶段 ---")

# --- 2.1 Initialize text splitter ---
# RecursiveCharacterTextSplitter tries to split recursively by different characters (like paragraphs, sentences, words), which yields better results
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Maximum length of each text chunk
    chunk_overlap=200,    # Overlap length between text chunks, helps preserve context
    length_function=len   # Length calculation function, defaults to len
)

# --- 2.2 Split documents ---
chunks = text_splitter.split_documents(all_docs)
print(f"原始文档被分割成 {len(chunks)} 个文本块。")
if chunks:
    print(f"第一个文本块内容预览: \n{chunks[0].page_content[:200]}...")


print("\n--- 3. 嵌入模型与向量存储阶段 ---")

# --- 3.1 Initialize embedding model ---
# We use OpenAI's text-embedding-ada-002 model
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# --- 3.2 Initialize Chroma vector store and add text chunks ---
# The persist_directory parameter persists the vector store to the local file system, so it can be loaded directly next time
persist_directory = './chroma_db_for_copilot'
# If the directory does not exist, Chroma will create it automatically
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory=persist_directory
)
# Persist storage
vectorstore.persist()
print(f"文本块已成功嵌入并存储到 Chroma 向量数据库 ({persist_directory})。")


print("\n--- 4. 检索器应用阶段 ---")

# --- 4.1 Initialize retriever ---
# Create a retriever from the vector store
# search_kwargs={"k": 3} means retrieving the 3 most similar document chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
print("检索器已初始化，将返回最相似的 3 个文档块。")

# --- 4.2 Simulate user queries and perform retrieval ---
user_query_1 = "产品 A 怎么安装？"
retrieved_docs_1 = retriever.invoke(user_query_1)
print(f"\n用户查询: '{user_query_1}'")
print(f"检索到 {len(retrieved_docs_1)} 个相关文档:")
for i, doc in enumerate(retrieved_docs_1):
    print(f"--- 文档 {i+1} (来源: {doc.metadata.get('source', '未知')}, 页面: {doc.metadata.get('page', '未知')}) ---")
    print(doc.page_content[:300] + "...") # Print partial content

user_query_2 = "如何联系客服或者查询订单状态？"
retrieved_docs_2 = retriever.invoke(user_query_2)
print(f"\n用户查询: '{user_query_2}'")
print(f"检索到 {len(retrieved_docs_2)} 个相关文档:")
for i, doc in enumerate(retrieved_docs_2):
    print(f"--- 文档 {i+1} (来源: {doc.metadata.get('source', '未知')}, 页面: {doc.metadata.get('page', '未知')}) ---")
    print(doc.page_content[:300] + "...")

user_query_3 = "RAG是什么？"
retrieved_docs_3 = retriever.invoke(user_query_3)
print(f"\n用户查询: '{user_query_3}'")
print(f"检索到 {len(retrieved_docs_3)} 个相关文档:")
for i, doc in enumerate(retrieved_docs_3):
    print(f"--- 文档 {i+1} (来源: {doc.metadata.get('source', '未知')}, 页面: {doc.metadata.get('page', '未知')}) ---")
    print(doc.page_content[:300] + "...")

# --- 5. (Optional) Combine with LLM to form a simple RAG chain ---
# This step is the focus of the next session; here is just a simple preview
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

print("\n--- 5. (可选) 结合 LLM 进行生成 ---")

# Define an LLM
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Define a prompt template
template = """
你是一个专业的智能客服助手。请根据提供的上下文信息，简洁、准确地回答用户的问题。
如果上下文没有提到相关信息，请礼貌地说明你无法回答。

上下文信息:
{context}

用户问题:
{question}

答案:
"""
prompt = ChatPromptTemplate.from_template(template)

# Build the RAG chain
# retriever | format_docs | prompt | llm | output_parser
# The format_docs helper function converts Document objects to strings
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(f"\n--- 使用 RAG 链回答问题: '{user_query_1}' ---")
response_1 = rag_chain.invoke(user_query_1)
print(response_1)

print(f"\n--- 使用 RAG 链回答问题: '{user_query_2}' ---")
response_2 = rag_chain.invoke(user_query_2)
print(response_2)

print(f"\n--- 使用 RAG 链回答问题: '{user_query_3}' ---")
response_3 = rag_chain.invoke(user_query_3)
print(response_3)

Code Analysis:

PyPDFLoader and WebBaseLoader: Load local PDF files and remote web page content, respectively. They both return a list of Document objects, where each Document contains page_content (the actual text) and metadata (file path, URL, page number, etc.). Note that we added error handling and mock content in case the file or network is unavailable.
RecursiveCharacterTextSplitter: This is a powerful tool for text splitting. It attempts to split text recursively using multiple characters (like \n\n, \n, ) until each chunk is smaller than chunk_size. The chunk_overlap parameter is crucial; it ensures overlapping sections between adjacent text chunks, which helps the LLM avoid losing information when processing context that spans across chunk boundaries.
OpenAIEmbeddings: We use OpenAI's text-embedding-ada-002 model to convert text chunks into high-dimensional vectors. This model excels at semantic understanding and is the cornerstone of building RAG.
Chroma.from_documents: This is the critical step that connects the text chunks, the embedding model, and the vector store. It iterates through all text chunks, uses the embeddings model to generate their vectors, and then stores these vectors along with the original text chunks into the Chroma vector database. persist_directory ensures the data can be saved locally, eliminating the need to regenerate it next time.
vectorstore.as_retriever(search_kwargs={"k": 3}): Creates a retriever from the Chroma vector store. k=3 means that for every query, the retriever will return the 3 most semantically similar document chunks. This k value needs to be adjusted based on your actual application scenario; too large might introduce noise, while too small might lack sufficient information.
retriever.invoke(user_query): This is the core method of the retriever. When you pass in a user query, it executes the vector similarity search described above and returns a list of Document objects. These are the "reference materials" we found from the knowledge base.
RAG Chain (Optional): Finally, we briefly demonstrated how to combine the retriever with an LLM. RunnablePassthrough() allows the user's question to be passed directly to the question slot. The retriever | format_docs part first retrieves relevant documents via the retriever