Issue 22 | Observability System: Seamlessly Attaching LangSmith Probes to Graph
🎯 Learning Objectives
Hey, future AI architects! Welcome to Issue 22 of the "LangGraph Multi-Agent Expert Course". Reaching this stage, your "AI Universal Content Creation Agency" must have taken shape, with Agents performing their respective duties and operating quite well under LangGraph's orchestration. However, I bet you have definitely encountered this dilemma:
A certain Agent suddenly starts "talking nonsense", the workflow gets stuck at some node, or the output is completely off-topic, yet you have no idea which step went wrong or which LLM slacked off behind which Edge. Does it feel like defusing a bomb blindfolded? Don't panic: this lesson hands you a pair of "X-ray glasses"!
In this issue, we will delve into the art of "observation" and "debugging" for complex LangGraph workflows, focusing specifically on that invaluable treasure in the LangChain ecosystem—LangSmith.
After completing this issue, you will:
- Thoroughly understand: Why observation and debugging are the lifeline of development efficiency in complex systems driven by multi-agents and LLMs.
- Master the core: The basic concepts and working principles of LangSmith, and its seamless integration mechanism within LangGraph.
- Practical drill: Mount LangSmith probes for our "AI Universal Content Creation Agency" project to track every decision and action of Agents like Planner, Researcher, Writer, and Editor in real-time.
- Efficient troubleshooting: Learn to use LangSmith's visual Trace feature to locate collaboration issues between Agents, LLM Prompt defects, or tool calling errors in seconds, bidding farewell to the era of "blind box debugging".
📖 Principle Analysis
In the LLM Era, Are You Still Debugging with "Print"? Wake Up!
In traditional software development, we have various IDEs, Debuggers, and logging systems to help us understand the execution flow of code. But in the LLM era, especially when building multi-agent systems, traditional debugging methods suddenly become pale and powerless. Why?
- Black Box Effect: The internal reasoning process of an LLM is highly opaque. You give it a Prompt, and it spits out a Response; it is very difficult to directly peek into what happened in between.
- Non-determinism: LLM outputs are often not 100% deterministic. Even with the same Prompt, subtle differences may occur at different times or under different temperature parameters. This makes reproducing issues difficult.
- Complex Chains: Multi-agent systems orchestrated by LangGraph often involve multiple LLM calls, multiple tool usages, and state transfers and decisions among multiple Agents. A tiny error or misunderstanding can be amplified along the chain, ultimately leading to a complete system crash or output deviation.
- State Transitions: The core of LangGraph is a state machine, where state flows and mutates between nodes. If you cannot clearly see what state each node received and what state it outputted, debugging is simply a nightmare.
Imagine your Planner Agent plans an unreliable title, so the Researcher fails to find relevant material, the Writer makes things up, and the Editor dutifully polishes the fabricated content. You end up with an outrageous article, and without a good observation tool you would not even know the problem originated in the Planner stage. You might keep tweaking the Writer's Prompt instead, treating the symptom while the root cause sits upstream.
LangSmith: The "X-ray Machine" for Your Multi-Agent System
LangSmith was born to solve these pain points. It is a developer platform officially launched by LangChain, designed to help you debug, test, evaluate, and monitor LLM-based applications. For complex multi-agent orchestration like LangGraph, LangSmith is simply a "match made in heaven".
Its core idea is: capture and visualize every "Run" of your LLM application. Whether it's a single LLM call, a Chain, a Tool, or our complex LangGraph process, LangSmith can break it down into a series of traceable events and clearly display them in a tree structure (or Trace).
How does LangSmith collaborate with LangChain/LangGraph?
The LangChain/LangGraph libraries have an integrated Tracing mechanism internally. Once you set the corresponding environment variables, any LLM call, Tool call, or Chain execution initiated through the LangChain library will be automatically "intercepted" and sent to the LangSmith backend service. This means you can gain powerful observation capabilities with almost no modifications to your business logic code!
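As a minimal sketch of that switch (the key below is a placeholder, and setting `os.environ` programmatically is simply an alternative to a `.env` file), the variables must be in place before any LangChain components are created:

```python
import os

# Set these BEFORE creating any LangChain/LangGraph objects, because the
# tracing configuration is read when components are constructed.
os.environ["LANGCHAIN_TRACING_V2"] = "true"                    # enable V2 tracing
os.environ["LANGCHAIN_API_KEY"] = "sk-YOUR_LANGSMITH_API_KEY"  # placeholder, not a real key
os.environ["LANGCHAIN_PROJECT"] = "AI Content Agency - Dev"    # groups runs in the UI
```

With these three lines in place, every subsequent LLM, Tool, and Graph call in the process is traced without touching business logic.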
What can it show you?
- Complete Call Chain (Trace): From user input to final output, every LLM call, Agent decision, and tool usage is recorded.
- Input/Output: The detailed input Prompt and LLM output for each step.
- Intermediate Steps: The Agent's thought process, tool call parameters, and results.
- Latency and Cost: The execution time of each step, as well as the estimated token usage and cost.
- Error Information: If a step fails, LangSmith will highlight it and provide the error stack trace.
This is like installing countless miniature cameras and microphones in your "AI Universal Content Creation Agency", keeping every Agent's "thoughts" and "actions" in plain sight.
Mermaid Diagram: How LangSmith "Monitors" Your Agency
Let's use a Mermaid diagram to intuitively see how LangSmith penetrates your multi-agent workflow:
graph TD
subgraph "AI Universal Content Creation Agency Workflow"
start[User Request] --> Planner(Planner Agent);
Planner -- Plan Result --> Research(Researcher Agent);
Research -- Research Report --> Write(Writer Agent);
Write -- Draft --> Edit(Editor Agent);
Edit -- Final Draft --> end_node[Final Content Output];
end
subgraph "LangSmith Observation System"
direction LR
A[LangSmith UI] --> B{Real-time Observation & Analysis}
B --> C(Performance Bottleneck Localization)
B --> D(Prompt Optimization Suggestions)
B --> E(Agent Behavior Understanding)
B --> F(Cost and Latency Tracking)
B --> G(Rapid Error Troubleshooting)
end
style A fill:#f9f,stroke:#333,stroke-width:2px,color:#333
style B fill:#e0e0e0,stroke:#333,stroke-width:1px,color:#333
style C fill:#e0e0e0,stroke:#333,stroke-width:1px,color:#333
style D fill:#e0e0e0,stroke:#333,stroke-width:1px,color:#333
style E fill:#e0e0e0,stroke:#333,stroke-width:1px,color:#333
style F fill:#e0e0e0,stroke:#333,stroke-width:1px,color:#333
style G fill:#e0e0e0,stroke:#333,stroke-width:1px,color:#333
start -- Trigger --> LS_Tracer_Start(LangSmith Tracer);
Planner -- Call/Execute --> LS_Tracer_Planner(LangSmith Tracer);
Research -- Call/Execute --> LS_Tracer_Research(LangSmith Tracer);
Write -- Call/Execute --> LS_Tracer_Write(LangSmith Tracer);
Edit -- Call/Execute --> LS_Tracer_Edit(LangSmith Tracer);
end_node -- End --> LS_Tracer_End(LangSmith Tracer);
LS_Tracer_Start -- Record Trace --> LangSmith_Backend(LangSmith Backend Service);
LS_Tracer_Planner -- Record Trace --> LangSmith_Backend;
LS_Tracer_Research -- Record Trace --> LangSmith_Backend;
LS_Tracer_Write -- Record Trace --> LangSmith_Backend;
LS_Tracer_Edit -- Record Trace --> LangSmith_Backend;
LS_Tracer_End -- Record Trace --> LangSmith_Backend;
LangSmith_Backend -- Data Display --> A;
As seen from the diagram, every Agent execution and every LangChain component call (including the entire LangGraph run) is captured by the LangSmith Tracer, and these events are sent to LangSmith's backend service. On the LangSmith Web UI you can then inspect the detailed execution trajectory of the whole workflow through an intuitive graphical interface. This is the transparency upgrade for your "AI Universal Content Creation Agency"!
💻 Practical Code Drill
Now, let's truly integrate LangSmith into our "AI Universal Content Creation Agency" project. For the sake of demonstration simplicity, we will build a simplified LangGraph workflow: the user provides a topic, the Planner Agent plans the content outline, and the Writer Agent writes the draft based on the outline.
1. LangSmith Account and API Key Configuration
First, you need to go to the LangSmith official website to register an account. After registering, you can find your API Key on your personal settings page.
To let LangChain know where to send the Trace, you need to set a few environment variables. The simplest way is to set them in your .env file (if you use python-dotenv) or directly in the environment where you run the script:
# .env file content example
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY="sk-YOUR_LANGSMITH_API_KEY"
LANGCHAIN_PROJECT="AI Content Agency - Dev" # Optional, used to organize your projects in the LangSmith UI
Important Note: LANGCHAIN_TRACING_V2=true is the key to enabling the LangChain V2 Tracing mechanism. LANGCHAIN_API_KEY is your LangSmith API key. LANGCHAIN_PROJECT is an optional name used to categorize all your runs under this project in the LangSmith UI for easy management. It is strongly recommended to set this, otherwise your LangSmith interface will be a mess.
2. Install Necessary Libraries
Ensure that libraries like LangChain, LangGraph, OpenAI, as well as langsmith itself, are installed in your environment:
pip install -qU langchain langgraph langchain_openai langsmith python-dotenv
3. Build a Simplified Agency Graph
We will simulate a simple content creation process: Planner -> Writer.
import os
from dotenv import load_dotenv
from typing import TypedDict, Annotated, List
import operator
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END

# Load environment variables (including the LangSmith tracing settings)
load_dotenv()

# ====================================================================================
# 1. Define the state of our AI content creation agency (AgencyState)
# ====================================================================================
class AgencyState(TypedDict):
    """
    Structure representing the current state of the AI content creation agency.
    It tracks all key information from the user request to the final content output.
    """
    topic: str    # The original topic requested by the user for content creation
    plan: str     # The content outline/plan generated by the Planner Agent
    draft: str    # The content draft written by the Writer Agent
    messages: Annotated[List[BaseMessage], operator.add]  # Message history for inter-Agent communication

# ====================================================================================
# 2. Initialize the Large Language Model (LLM)
# ====================================================================================
# GPT-4o or GPT-4-turbo is recommended, as they follow complex instructions better.
# If you don't have access to these models, try gpt-3.5-turbo, but results may fall short.
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)

# ====================================================================================
# 3. Define Agent node functions
# Each function receives the AgencyState and returns a partial state update.
# ====================================================================================
def planner_agent(state: AgencyState) -> dict:
    """
    Planner Agent: generates a detailed content outline based on the user's topic.
    """
    print("---Planner Agent: Planning content outline---")
    topic = state["topic"]

    # Planner's Prompt
    planner_prompt = PromptTemplate.from_template(
        """
        You are an experienced content planner. Your task is to generate a detailed and logically clear outline for a long article based on the given topic.
        Topic: {topic}
        Please ensure the outline includes the following elements:
        1. Catchy title
        2. Introduction
        3. At least 3-5 main section headings
        4. 2-3 subsection headings under each main section
        5. Conclusion
        6. Call to Action (optional)
        Please output the outline content directly without any additional explanation.
        """
    )
    # Build and invoke the planner chain to generate the outline
    planner_chain = planner_prompt | llm | StrOutputParser()
    plan_output = planner_chain.invoke({"topic": topic})

    # Update state: save the plan and append the planner's output to the message history
    return {
        "plan": plan_output,
        "messages": [AIMessage(content=f"Planner has completed the plan:\n{plan_output}")]
    }

def writer_agent(state: AgencyState) -> dict:
    """
    Writer Agent: writes the article draft based on the planner's outline.
    """
    print("---Writer Agent: Writing article draft---")
    topic = state["topic"]
    plan = state["plan"]

    # Writer's Prompt
    writer_prompt = PromptTemplate.from_template(
        """
        You are a professional writer. Your task is to write a high-quality, informative, and engaging article draft based on the following topic and detailed outline.
        Topic: {topic}
        Outline:
        {plan}
        Please follow the outline structure, expand on each section and subsection, and make the article coherent and insightful.
        Word count requirement: At least 800 words.
        Please output the article draft directly without any additional explanation.
        """
    )
    # Build and invoke the writer chain to generate the draft
    writer_chain = writer_prompt | llm | StrOutputParser()
    draft_output = writer_chain.invoke({"topic": topic, "plan": plan})

    # Update state: save the draft and append the writer's output to the message history
    return {
        "draft": draft_output,
        "messages": [AIMessage(content=f"Writer has completed the draft:\n{draft_output}")]
    }

# ====================================================================================
# 4. Build the LangGraph workflow
# ====================================================================================
def create_agency_workflow():
    """
    Create and compile the LangGraph workflow for the AI content creation agency.
    """
    workflow = StateGraph(AgencyState)
    # Add nodes
    workflow.add_node("planner", planner_agent)
    workflow.add_node("writer", writer_agent)
    # Set entry point
    workflow.set_entry_point("planner")
    # Define edges: Planner hands over to Writer, then the process ends
    workflow.add_edge("planner", "writer")
    workflow.add_edge("writer", END)
    # Compile the graph
    return workflow.compile()

# ====================================================================================
# 5. Run the workflow and observe it in LangSmith
# ====================================================================================
if __name__ == "__main__":
    print("---AI Content Creation Agency started, LangSmith observation enabled---")
    app = create_agency_workflow()

    # Define initial state: the user's requested topic
    initial_state = {
        "topic": "Applications and Challenges of Artificial Intelligence in Healthcare",
        "plan": "",
        "draft": "",
        "messages": [HumanMessage(content="Please help me write an article about the application of artificial intelligence in healthcare.")]
    }

    # Run the workflow; LangChain automatically captures and sends the Trace to LangSmith
    final_state = app.invoke(initial_state)

    print("\n---Workflow execution completed---")
    print("\nFinal article draft:")
    print(final_state["draft"])
    print("\nPlease visit the LangSmith UI (https://smith.langchain.com/) to view the detailed Trace of this run.")
    print(f"You can find the project named '{os.getenv('LANGCHAIN_PROJECT', 'default')}' under 'Projects'.")
4. Run the Code and Observe the LangSmith UI
- Ensure your `.env` file is configured correctly and contains your LangSmith API Key.
- Run the Python script above: `python your_script_name.py`.
- The script will start executing, and you will see console output like `---Planner Agent: Planning content outline---` and `---Writer Agent: Writing article draft---`.
- Open your browser and visit the LangSmith UI at `https://smith.langchain.com/`.
- In the left navigation bar, click "Projects". You should see the `LANGCHAIN_PROJECT` name you set (e.g., `AI Content Agency - Dev`). Click it to enter the project.
- You will see a list containing the Trace you just ran. Each Trace represents one `app.invoke()` call.
What will you see in the LangSmith UI?
Click on the latest Trace, and you will enter a detailed view:
- Timeline View: clearly displays the execution order and time consumed for each component (the entire LangGraph run, the `planner` node, the `writer` node, and the `ChatOpenAI` LLM calls inside each node).
- Graph View: for LangGraph, this view is particularly cool. It graphically displays your Agent process and highlights the path taken by the current Trace.
- Detailed Steps:
  - LangGraph Run: the top-level Run shows the input (`topic`, `messages`) and final output (`draft`, `messages`) of the entire Graph.
  - Node Runs: clicking the `planner` or `writer` node shows the input and output state dictionaries of that node when it was called as a function.
  - LLM Calls: inside each node you will see the `ChatOpenAI` call. Clicking it reveals the complete Prompt sent to the LLM (the `messages` array) and the raw Response it returned (`content`), along with token usage and estimated cost.
Where is the value of LangSmith?
Suppose the article written by your Writer Agent is not good enough. In LangSmith, you can:
- Check the Planner Agent's output: Is there a problem with the outline provided by the Planner itself? Did it misunderstand the topic?
- Check the Writer Agent's input Prompt: Did the Writer's Prompt fail to correctly utilize the Planner's outline? Are the Prompt instructions unclear?
- Check the actual input/output of the Writer Agent calling the LLM: Is the LLM performing poorly under a specific Prompt?
Through this visual, in-depth inspection, you can quickly locate the problem—whether the Planner's Prompt needs optimization, the Writer's logic is flawed, or it's an issue with the LLM itself. Say goodbye to blind guessing and embrace scientific debugging!
Pitfalls and Avoidance Guide
As a senior mentor, I have seen too many students fall into pitfalls in this "treasure land" of LangSmith. Come, let me point out a few "bright paths" for you to avoid detours:
"If the environment is dead, everything stops": Environment variables are the lifeline!
- Pitfall: The most common problem is not correctly setting
LANGCHAIN_TRACING_V2=trueandLANGCHAIN_API_KEY. Many people only set the API Key but forget to enable V2 Tracing, resulting in a blank LangSmith interface. - Avoidance: Make absolutely sure that both environment variables are correctly set and effective in your script's running environment. Using
python-dotenvis a good habit, but also ensure the.envfile is loaded correctly. Addingprint(os.getenv("LANGCHAIN_TRACING_V2"))at the beginning of the script to check is a good method.
- Pitfall: The most common problem is not correctly setting
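To make that sanity check systematic, a tiny startup guard helps; the helper name below is illustrative, not part of LangSmith:

```python
import os

def tracing_configured() -> bool:
    """Return True only when both tracing environment variables look sane."""
    return (
        os.getenv("LANGCHAIN_TRACING_V2", "").lower() == "true"
        and bool(os.getenv("LANGCHAIN_API_KEY"))
    )

# Fail fast at startup instead of silently producing a blank LangSmith UI:
# if not tracing_configured():
#     raise RuntimeError("LangSmith tracing is not configured -- check your .env file")
```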
"Project names flying around, interface in a mess": Make good use of
LANGCHAIN_PROJECT.- Pitfall: Not setting
LANGCHAIN_PROJECT, or using a different project name for every run. The LangSmith UI will default to putting all Traces under thedefaultproject, or create a bunch of scattered projects. When you have many projects, finding a specific Trace is like finding a needle in a haystack. - Avoidance: Set a fixed and meaningful
LANGCHAIN_PROJECTname for each of your independent projects (like our "AI Universal Content Creation Agency"). When developing different feature branches, consider usingLANGCHAIN_PROJECT="AI Content Agency - FeatureX"to differentiate them.
- Pitfall: Not setting
"Data security is paramount": Never upload sensitive information.
- Pitfall: During development, accidentally sending Prompts or LLM outputs containing sensitive information like user privacy or company secrets to LangSmith.
- Avoidance: LangSmith is a powerful debugging tool, but it is a cloud service. Be extremely cautious in production environments or when handling sensitive data. For highly sensitive information, consider redacting or anonymizing it before it is sent to LangSmith. LangSmith also offers a self-hosted option (LangSmith On-premise), but for most developers the cloud service is more convenient.
"Traces too long, dazzling the eyes": Learn to filter and search.
- Pitfall: When your workflow is very complex or runs too many times, a single Trace might contain hundreds of steps, making it very painful to view on the LangSmith UI.
- Avoidance: The LangSmith UI provides powerful filtering, searching, and grouping features. You can filter by Agent name, LLM type, status, time range, and more; learning these features greatly improves your debugging efficiency. You can also attach `tags` to your Runs for easier lookup later.
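Tags and metadata travel in the standard `RunnableConfig` dictionary that a compiled LangGraph app accepts as the second argument to `invoke()`; the tag and metadata values below are examples, not required names:

```python
# Tags/metadata attached here show up on the resulting Trace in the
# LangSmith UI, where they can be used for filtering and grouping.
run_config = {
    "tags": ["agency", "dev", "issue-22"],
    "metadata": {"topic_source": "user_request", "git_branch": "feature-x"},
}

# Pass it alongside the initial state when running the graph:
# final_state = app.invoke(initial_state, config=run_config)
```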
"Where do custom components go?": Manually integrate Tracer.
- Pitfall: If you use completely custom Python functions in LangGraph nodes instead of the
Runnablecomponents provided by LangChain, the sub-steps inside these custom logics (e.g., a custom external API call) might not be automatically captured by LangSmith. - Avoidance: For your completely custom components that do not inherit from
Runnable, if you want their internal operations to be tracked as well, you need to manually introduceLangChainTracer. For example, you can usewith get_tracer().start_as_current_span("my_custom_tool_call") as span:to wrap your custom logic, and addinputsandoutputsto thespan. This requires a deeper understanding of LangChain'sCallbackManagermechanism.
- Pitfall: If you use completely custom Python functions in LangGraph nodes instead of the
"Performance impact must not be ignored": Traces also have overhead.
- Pitfall: Although the overhead of LangSmith Traces is usually very small, in scenarios with extremely high concurrency or extremely strict latency requirements, network transmission and data recording might introduce minor additional latency.
- Avoidance: When deploying in a production environment, evaluate the performance impact brought by LangSmith Tracing. Usually, it is enabled during the development and testing phases. In production environments, consider sampling and tracking only a portion of the requests, or dynamically enabling it when problems occur.
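A crude per-process sampling sketch, under the assumption that the tracing switch is decided once at worker startup (before any LangChain components are created); the 10% rate is arbitrary:

```python
import os
import random

TRACE_SAMPLE_RATE = 0.10  # arbitrary example: trace ~10% of worker processes

def decide_tracing(sample_rate: float, rng=None) -> bool:
    """Decide once, at process start, whether this worker sends traces."""
    rng = rng or random.Random()
    return rng.random() < sample_rate

# Must run before LangChain components are constructed, since the
# tracing flag is read at construction time.
os.environ["LANGCHAIN_TRACING_V2"] = "true" if decide_tracing(TRACE_SAMPLE_RATE) else "false"
```

This samples whole processes rather than individual requests, which is coarse but simple; finer-grained per-request sampling requires more involved callback plumbing.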
"Cost calculation is for reference only": Deviations from actual bills.
- Pitfall: The cost displayed by LangSmith is usually an estimation based on the number of tokens. It might have subtle differences from your actual LLM provider's bill (e.g., API call counts, concurrency discounts, precise billing methods for different models, etc.).
- Avoidance: Use LangSmith's cost data as a tool for quick reference and trend analysis, but the final financial accounting still needs to be based on the LLM provider