Issue 28 | Extracting Complex Data (Structured Outputs)
Fellow architects, welcome back to our "LangGraph Multi-Agent Expert Course". It's your old friend again.
After the hard-fought battles of the previous 27 episodes, our "AI Content Agency" has begun to take shape. The Planner strategizes, the Researcher gathers mountains of information, and the Writer writes furiously. Watching the endless text scroll by in the terminal, do you get the illusion that AI has already conquered the world?
But as a developer with 10 years of experience, I must throw some cold water on that: if your AI system can only output a blob of plain text, it is practically unusable in engineering terms.
Think about what your downstream systems (the company's CMS, a database, a front-end visualization dashboard) actually need: a precise Title, a structured Outline, Citations in array form, and an integer Word Count.
If you are still using regular expressions (Regex) to parse the strings output by large models, or begging the large model in the Prompt: "Please, absolutely, swear to only output JSON format, do not include any extra nonsense"... then this episode is here to save your code and your hairline.
Today, we will inject a "soul contract" into the final checkpoint of our Agency—the Editor Agent: Structured Outputs. We want to transform the final product of the graph from uncontrollable natural language into a data structure with a strict JSON Schema.
🎯 Learning Objectives for this Episode
After taking this class, I require you to master the following points and be able to apply them immediately to your production environment:
- Cognitive Upgrade: Thoroughly understand the underlying logical differences between Prompt Engineering constraints, JSON Mode, and Tool Calling (Structured Outputs).
- Core API Mastery: Proficiently use the `.with_structured_output()` method in LangChain/LangGraph.
- Architecture Refactoring: Introduce Pydantic data validation to the Editor Agent of the AI Content Agency, outputting a final Payload containing strict fields such as title, outline, citations, and word count.
- State Graph Integration: Seamlessly write structured data into the State of LangGraph, perfectly connecting with the API interfaces of traditional software.
📖 Principle Analysis
Before writing code, let's discuss the "philosophy". Why is extracting complex data so difficult?
The essence of Large Language Models (LLMs) is a "probability prediction machine". It naturally likes to improvise. However, our traditional software engineering emphasizes "Determinism". Structured Outputs is a bridge built between probability and determinism.
The industry has gone through three stages to make LLMs output structured data:
- Stage 1: Prompt Constraints (slash-and-burn). Writing `Output as JSON only` in the prompt. The LLM often replies: "Sure, here is your JSON: ...", crashing the JSON parser outright.
- Stage 2: JSON Mode (semi-automatic rifle). Major APIs provide `response_format={"type": "json_object"}`. This guarantees that the output is valid JSON, but not that the fields inside are correct. You want `title`; it might output `heading`.
- Stage 3: Function/Tool Calling with a Forced Schema (modern warfare). We define the required JSON structure as a "function signature". To call this function, the LLM must fill in the data strictly according to the parameter types we defined (a Pydantic schema). This is currently the most stable and elegant solution!
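To see what the provider actually receives in Stage 3, you can inspect the JSON Schema that Pydantic generates; the API wraps something like this into a function/tool definition. A minimal sketch, assuming Pydantic v2 (the `ArticleMeta` model here is a hypothetical mini-schema, not the one we build later):

```python
from pydantic import BaseModel, Field

class ArticleMeta(BaseModel):
    """Hypothetical mini-schema, just to inspect the generated JSON Schema."""
    title: str = Field(description="Article title")
    word_count: int = Field(description="Approximate word count")

# This is (roughly) the "function signature" the provider is forced to honor
schema = ArticleMeta.model_json_schema()
print(schema["properties"]["title"]["description"])  # Article title
print(schema["required"])  # ['title', 'word_count']
```

Notice that the field descriptions travel inside the schema: that is how the model "reads" your intent for each slot.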
How does this mechanism operate in the business flow of our AI Content Agency? Please look at the architecture diagram below:
```mermaid
graph TD
    subgraph "LangGraph State Flow"
        A[State: Draft generated by Writer] --> B(Editor Agent Node)
        R[State: Reference Links provided by Researcher] --> B
    end
    subgraph "Editor Agent Internal Logic"
        B -->|1. Assemble Prompt and Context| C{LLM with Structured Output}
        C -.->|2. Underlying conversion to Tool Calling| D[OpenAI / Anthropic API]
        D -.->|3. Return JSON conforming to Schema| E[Pydantic Validation]
    end
    E -->|Validation Failed Auto-retry| C
    E -->|Validation Successful| F[State: Final Article Payload]
    F --> G[(Downstream Systems: CMS / Database / API)]

    classDef state fill:#e1f5fe,stroke:#01579b,stroke-width:2px;
    classDef agent fill:#fff3e0,stroke:#e65100,stroke-width:2px;
    classDef core fill:#e8f5e9,stroke:#1b5e20,stroke-width:2px;
    class A,R,F state;
    class B agent;
    class C,E core;
```

Diagram Explanation:
- The Editor Agent receives the plain-text draft and scattered reference materials.
- We no longer call `llm.invoke()` directly; instead we call `llm.with_structured_output(ArticleSchema)`, bound with a Pydantic model.
- The underlying large model is forced to fill the extracted content into the slots specified by `ArticleSchema`.
- Pydantic performs strong validation locally (Are the types correct? Are the required fields present?).
- Finally, it outputs a perfect Python object, saves it into the graph State, and makes it directly available for downstream use.
💻 Practical Code Drill
Enough talk, Show me the code.
We will use Python, LangGraph, and Pydantic to refactor the Editor node. Please ensure you have installed `langchain-openai`, `langgraph`, and `pydantic`.
Step 1: Define a Strict Data Contract (Pydantic Schema)
This is the core of structured outputs. You need to treat the large model as a "form-filling machine"; the more rigorously this form is designed, the better the large model will fill it out.
```python
from pydantic import BaseModel, Field
from typing import List

# Define the structure we expect the Editor to ultimately output
class ArticlePayload(BaseModel):
    """The article data structure finally delivered to the CMS system"""
    title: str = Field(
        description="The final title of the article, required to be catchy, SEO-compliant, and under 20 words"
    )
    outline: List[str] = Field(
        description="The outline hierarchy of the article, extracting all H2 and H3 headings to form an array"
    )
    citations: List[str] = Field(
        description="Citation links or literature sources extracted from the original text and reference materials, or an empty array if none"
    )
    word_count: int = Field(
        description="Approximate word count of the main body (integer)"
    )
    final_content: str = Field(
        description="The final Markdown formatted body content after polishing by the editor"
    )
    seo_keywords: List[str] = Field(
        description="Extract 3-5 core SEO keywords"
    )
```
Instructor's Note: Pay attention to the description here! In traditional development, comments are for humans to read; but in large model development, Pydantic's description is part of the Prompt for the AI to read! The clearer you write here, the more accurate the data extracted by the AI will be.
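And the validation is real: if the model hands back a wrong type, Pydantic rejects it locally before anything reaches downstream systems. A quick self-contained demo, assuming Pydantic v2 (the model here is a trimmed-down copy for illustration, not the full `ArticlePayload`):

```python
from typing import List
from pydantic import BaseModel, Field, ValidationError

class ArticlePayloadDemo(BaseModel):
    """Trimmed-down copy of ArticlePayload, just for this demo."""
    title: str = Field(description="Final title")
    word_count: int = Field(description="Approximate word count")
    citations: List[str] = Field(description="Citation links")

try:
    # word_count is deliberately not a number: Pydantic rejects it here,
    # before anything reaches a database or CMS
    ArticlePayloadDemo(title="Test", word_count="eighty-five", citations=[])
except ValidationError as e:
    print(e.errors()[0]["loc"])  # ('word_count',)
```

This is the "contract" half of the soul contract: even if the model misbehaves, bad data never leaves the node silently.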
Step 2: Define the State of LangGraph
We need to reserve a place for this structured object in the global state.
```python
from typing import TypedDict, Optional, List
from langgraph.graph import StateGraph, END

class AgencyState(TypedDict):
    writer_draft: str            # Draft passed from the Writer
    research_links: List[str]    # Reference materials passed from the Researcher
    # 👇 Here is the core of this episode: we store the structured object in the State
    final_delivery: Optional[ArticlePayload]
```
Step 3: Write the Editor Node Logic
This is the moment to witness a miracle. We will use the .with_structured_output() method.
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

def editor_node(state: AgencyState):
    print("--- 👨‍⚖️ Editor Agent starts working: Extracting and refactoring complex data ---")
    draft = state.get("writer_draft", "")
    links = state.get("research_links", [])

    # 1. Instantiate the LLM (use a model with good Tool Calling support, such as GPT-4o)
    # Note: temperature=0, because we need deterministic data extraction, not divergent creation
    llm = ChatOpenAI(model="gpt-4o", temperature=0)

    # 2. Bind structured output (the core magic)
    # Under the hood, this converts ArticlePayload into OpenAI's Function Calling schema
    structured_llm = llm.with_structured_output(ArticlePayload)

    # 3. Write the Editor's prompt
    prompt = ChatPromptTemplate.from_messages([
        ("system", """You are a strict top-tier AI editor-in-chief.
Your task is to review the Writer's draft, combine it with reference materials, and output the final structured article data.
Please strictly follow the requirements: extract the title, outline, citations, word count, and SEO keywords, and perform the final polish on the body text."""),
        ("user", "[Draft Content]:\n{draft}\n\n[Reference Materials]:\n{links}")
    ])

    # 4. Assemble the chain and execute
    chain = prompt | structured_llm

    # Note: the result is no longer a string, but a native ArticlePayload object!
    result: ArticlePayload = chain.invoke({
        "draft": draft,
        "links": "\n".join(links)
    })

    print(f"✅ Extraction complete! Title: {result.title}, Word Count: {result.word_count}")

    # 5. Update the State
    return {"final_delivery": result}
```
Step 4: Assemble the Graph and Simulate Execution
Let's run the graph and see the effect.
```python
# Build the graph
workflow = StateGraph(AgencyState)
workflow.add_node("Editor", editor_node)

# For simplicity, we set the Editor as both the entry and exit point
workflow.set_entry_point("Editor")
workflow.add_edge("Editor", END)

app = workflow.compile()

# === Simulated Execution Demo ===
if __name__ == "__main__":
    # Simulate upstream data passed from the Writer and Researcher
    mock_state = {
        "writer_draft": """
# Why do we need multi-agent systems?
In today's AI field, monolithic large models have encountered bottlenecks. Multi-Agent Systems greatly enhance the ability to solve complex tasks through division of labor and collaboration.
## Core Advantages
1. Distributed Computing
2. Role Specialization
In short, multi-agent is the future.
""",
        "research_links": [
            "https://arxiv.org/abs/1234.5678",
            "https://github.com/langchain-ai/langgraph"
        ]
    }

    # Run the graph
    final_state = app.invoke(mock_state)

    # Verify the output result
    delivery = final_state["final_delivery"]
    print("\n" + "=" * 40)
    print("🚀 Final output JSON structure (simulated delivery to the CMS):")
    print("=" * 40)
    # Because delivery is a Pydantic object, we can call model_dump_json() directly
    print(delivery.model_dump_json(indent=2))
```
Expected Console Output:
```json
{
  "title": "Multi-Agent Systems: The Future Path to Breaking Monolithic Large Model Bottlenecks",
  "outline": [
    "Why do we need multi-agent systems?",
    "Core Advantages"
  ],
  "citations": [
    "https://arxiv.org/abs/1234.5678",
    "https://github.com/langchain-ai/langgraph"
  ],
  "word_count": 85,
  "final_content": "# Why do we need multi-agent systems?\n\nIn today's AI field, monolithic Large Language Models (LLMs) are gradually revealing bottlenecks when handling extremely complex business logic. Multi-Agent Systems have emerged as the times require, greatly raising the ceiling for systems to solve complex tasks by introducing a division of labor and collaboration mechanism.\n\n## Core Advantages\n\n1. **Distributed Computing and Reasoning**: Breaking down complex problems to be processed in parallel by different Agents.\n2. **Role Specialization**: Similar to a human team, Planner, Researcher, and Writer each have their own duties, reducing hallucinations.\n\nIn summary, the multi-agent architecture is undoubtedly an important cornerstone towards AGI.",
  "seo_keywords": [
    "Multi-Agent Systems",
    "Multi-Agent",
    "Large Model Bottleneck",
    "AI Architecture"
  ]
}
```
Look! It is no longer messy plain text, but a perfectly typed JSON data that can be directly stored in the database and directly rendered by the front-end! This is what an industrial-grade AI application should look like.
Pitfalls and Avoidance Guide (Troubleshooting Experience from a High-Level Perspective)
As your mentor, I cannot just teach you how to write beautiful Demos. In real production environments, structured outputs often hide hidden dangers. Here is an avoidance guide I summarized through blood and tears:
💣 Pitfall 1: The Large Model "Acting Smart" Leads to Schema Validation Failure
Sometimes the large model will wrap the JSON in ```json ... ```, or if it feels it cannot find certain required fields, it simply won't return them, causing Pydantic to throw a ValidationError.
🛡️ Avoidance Strategy:
- Use OpenAI's Strict Mode: If you are using the latest OpenAI models, LangChain natively supports OpenAI's Structured Outputs (strict=True) under the hood. This guarantees 100% compliance with the Schema at the API level.
- Fault Tolerance and Retry Mechanism: In LangGraph, you can catch the `ValidationError` and route it back to the Editor node via an edge, feeding the error message to the large model as a prompt: "The JSON you just generated failed validation because the title field is missing. Please correct it and output it again."
💣 Pitfall 2: Stuffing Too Much Logic into One Big Schema
I have seen students define a massive Pydantic model with over 50 fields nested 4 layers deep, expecting the large model to deconstruct a tens-of-thousands-of-words article into that structure in one shot. The result is not only slow; the hallucinations are extremely severe.
🛡️ Avoidance Strategy: Divide and Conquer. Do not let the Editor do everything at once. Split the nodes:
- `Extract_Meta_Node`: only responsible for extracting SEO keywords and word count.
- `Format_Content_Node`: only responsible for formatting Markdown.
Keeping the Schema flat is the secret to stable output from large models.
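To make the divide-and-conquer idea concrete, here is a sketch with two small, flat schemas whose results are merged afterwards. The field names are borrowed from our payload, but the two-node split and the dict merge are simplifications of what the graph's State would do:

```python
from typing import List
from pydantic import BaseModel, Field

# Two flat schemas, each small enough for one focused extraction pass
class ArticleMeta(BaseModel):
    seo_keywords: List[str] = Field(description="3-5 core SEO keywords")
    word_count: int = Field(description="Approximate word count")

class ArticleBody(BaseModel):
    final_content: str = Field(description="Polished Markdown body")

# Each node binds its own schema; the graph's State merges the results
meta = ArticleMeta(seo_keywords=["Multi-Agent Systems"], word_count=85)
body = ArticleBody(final_content="# Why multi-agent?\n...")
merged = {**meta.model_dump(), **body.model_dump()}
print(sorted(merged))  # ['final_content', 'seo_keywords', 'word_count']
```

Each pass gives the model a smaller form to fill out, which is exactly why flat schemas hallucinate less.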
💣 Pitfall 3: Not All Models Support .with_structured_output()
If you have deployed a weaker open-source model locally (such as Llama-2-7B), calling this method may throw an error, or output garbage data that completely fails to meet expectations.
🛡️ Avoidance Strategy:
For models that do not support native Tool Calling, LangChain provides fallback options such as `include_raw=True` or `JsonOutputParser`. But speaking frankly: when handling complex structured outputs, please use top-tier models like GPT-4o or Claude 3.5 Sonnet. When it comes to data contracts, saving money on tokens often leads to massive engineering maintenance costs.
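For intuition, here is a rough, stdlib-only approximation of the problem the fallback parsers solve: a weak model wraps its JSON in a markdown fence and adds chatter, so you have to dig the object out before parsing. (LangChain's `JsonOutputParser` handles this far more robustly; this sketch only illustrates the failure mode.)

```python
import json
import re

def parse_loose_json(text: str) -> dict:
    # Find the outermost {...} span, ignoring any fence or chatter around it,
    # then parse it. A crude approximation of what fallback parsers do.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    payload = match.group(0) if match else text
    return json.loads(payload)

fence = "`" * 3  # build the markdown fence dynamically to keep this snippet readable
raw = f"Sure, here is your JSON:\n{fence}json\n" + '{"title": "Hello"}' + f"\n{fence}"
print(parse_loose_json(raw)["title"])  # Hello
```

The fact that you need code like this at all is the argument for native structured outputs: with tool calling, the chatter never appears in the first place.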
📝 Episode Summary
Today, we completed the most engineering-valuable refactoring in the AI Content Agency architecture.
We learned:
- Why structured outputs are the bridge connecting AI and traditional software.
- How to use Pydantic to define rigorous data contracts (`ArticlePayload`).
- How to force LLMs to output structured data through LangChain's `.with_structured_output()` magic.
- How to integrate this process into LangGraph's State to form a closed loop.
Now, our Editor Agent is no longer just a chatbot that only "talks nonsense", but a core microservice capable of producing standard data interfaces.
Teaser for Next Episode: Although the data structure is perfect, what if the boss is not satisfied with the title extracted by the Editor? AI ultimately requires human supervision. In Episode 29, we will introduce one of LangGraph's most fascinating features: Human-in-the-loop. We will make the graph pause after the Editor outputs data, waiting for the editor-in-chief (you) to click "Approve" or make modifications before continuing the flow.
Fellow architects, type out today's code and experience the thrill of having data precisely under control. See you next episode!