Issue 29 | Test Graph: How Should We Automate Acceptance Testing for Agent Performance?
🎯 Learning Objectives for This Episode
Future top-tier AI architects, welcome back to our "LangGraph Multi-Agent Expert Course". I am your instructor.
Our "AI Content Agency" project has advanced to Episode 29. Looking back at the previous 28 episodes, we built from scratch the strategizing Planner, the deep-digging Researcher, the eloquent Writer, and the nitpicking Editor. The entire LangGraph flow is silky smooth. Watching the green text scroll by in the terminal, do you feel the job is done and ready to package and deploy? Are you perhaps already fantasizing about a promotion and a raise?
Wake up!
In traditional software engineering, writing code without writing tests is considered "unprofessional"; in large language model (LLM) application development, building multi-agents without automated evaluation is called "planting landmines". The output of LLMs is non-deterministic. What tested perfectly yesterday might generate a bunch of illogical hallucinations today. How do you prove that your Researcher actually found the right information? How do you guarantee that your Writer isn't just making things up?
Guessing? Relying on feelings? Eyeballing it? That's mysticism, not engineering.
Today, we are going to solve this ultimate question: How do we perform automated acceptance testing on a LangGraph multi-agent workflow?
Through this episode, you will master the following core skills:
- Mindset Breakthrough: Understand the failure of traditional unit testing (assert a == b) in the Agent era, and master the core concept of "LLM as a Judge".
- Golden Standard Construction: Learn to define and build a "Golden Dataset" for testing in the AI Content Agency project.
- Introducing the RAGAS Evaluation Framework: Proficiently use the RAGAS framework to calculate "Context Precision/Recall" for the Researcher node, and "Faithfulness/Answer Relevance" for the Writer node.
- Automated Testing Pipeline: Write scripts to seamlessly integrate LangGraph execution with RAGAS evaluation, achieving one-click automated acceptance.
📖 Principle Analysis
Why Can't We Use Traditional Unit Testing?
In traditional code, testing an addition function is simple: assert add(1, 1) == 2.
But in our Agency, a user inputs: "Write an article about the application of quantum computing in the financial sector."
The article generated by the Writer is different every time—different word counts, different sentence structures. You cannot use == to assert.
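To make this concrete, here is a minimal, dependency-free illustration (the two sample sentences are invented). An exact `==` comparison fails on two semantically equivalent outputs, and even a surface-similarity score cannot tell a faithful answer from a fluent hallucination:

```python
from difflib import SequenceMatcher

# Two hypothetical runs of the same prompt: semantically equivalent,
# textually different. An exact-match assertion is useless here.
run_1 = "Quantum computing can accelerate portfolio optimization in finance."
run_2 = "In finance, portfolio optimization can be accelerated by quantum computing."

assert run_1 != run_2  # == would fail on every run; this is the core problem

# A character-level similarity score at least survives rewording,
# but it is blind to meaning: a confident lie can score just as high.
similarity = SequenceMatcher(None, run_1, run_2).ratio()
print(f"surface similarity: {similarity:.2f}")
```

This gap, scoring meaning rather than surface form, is exactly what the evaluation approach below is designed to fill.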
Therefore, we need to introduce the concept of Eval (Evaluation). One of the most mature approaches in the industry right now is utilizing the RAGAS (Retrieval Augmented Generation Assessment) framework. Although it has RAG in its name, its core metrics perfectly align with our multi-agent architecture.
Mapping RAGAS Core Metrics to Agency Roles
In our AI Content Agency, we can actually view the entire process as an advanced RAG variant:
- Researcher Node = Retriever. It is responsible for finding information online or in a local knowledge base.
- Writer/Editor Node = Generator. It is responsible for writing and polishing articles based on the retrieved information.
Correspondingly, we need to evaluate the following four core dimensions:
- Dimension 1: Context Precision - Evaluating the Researcher
- In plain terms: Is the information found by the Researcher actually useful? Did it stuff in a bunch of garbage information?
- Dimension 2: Context Recall - Evaluating the Researcher
- In plain terms: Did the Researcher find all the crucial information necessary to answer the user's question? Did it miss any core knowledge points?
- Dimension 3: Faithfulness - Evaluating the Writer
- In plain terms: Is the article written by the Writer entirely based on the information found by the Researcher? Did it make things up on its own (hallucination)?
- Dimension 4: Answer Relevance - Evaluating the Editor/Final Output
- In plain terms: Does the final submitted article truly address the goals initially set by the Planner and the user's needs? Is it answering a different question?
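Before we bring in RAGAS, it helps to see the shape of a faithfulness computation. The toy scorer below is NOT how RAGAS works internally: RAGAS uses an LLM to decompose the answer into claims and verify each one against the contexts. This sketch substitutes crude word overlap for the LLM judge, purely to make the decompose-then-verify idea tangible; the sentences and the 0.5 threshold are illustrative assumptions:

```python
# Toy stand-in for a faithfulness metric: split the answer into "claims"
# (sentences) and count how many are supported by at least one retrieved
# context, using word overlap instead of an LLM judge.
def toy_faithfulness(answer: str, contexts: list[str], threshold: float = 0.5) -> float:
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    if not claims:
        return 0.0
    supported = 0
    for claim in claims:
        claim_words = set(claim.lower().split())
        for ctx in contexts:
            ctx_words = set(ctx.lower().split())
            # a claim counts as "supported" if enough of its words appear in a context
            if len(claim_words & ctx_words) / len(claim_words) >= threshold:
                supported += 1
                break
    return supported / len(claims)

contexts = ["LangGraph manages state through StateGraph and TypedDict."]
grounded = "LangGraph manages state through StateGraph."
invented = "LangGraph was released in 1995 by a hardware vendor."
print(toy_faithfulness(grounded, contexts))  # -> 1.0 (every claim supported)
print(toy_faithfulness(invented, contexts))  # -> 0.0 (invented claim)
```

The real metric replaces the word-overlap check with an LLM verdict per claim, which is why it needs a judge model and costs tokens.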
Automated Acceptance Workflow (Mermaid Diagram)
To achieve automation, we need to build a test script. Its execution logic is as follows:
sequenceDiagram
    participant TestScript as Automated Test Script
    participant Dataset as Golden Dataset
    participant LangGraph as AI Content Agency (Graph)
    participant RAGAS as RAGAS Evaluation Engine
    participant Report as Test Report

    TestScript->>Dataset: 1. Load test cases (Question & Ground Truth)
    loop Loop through each test case
        TestScript->>LangGraph: 2. Pass Question to trigger workflow
        activate LangGraph
        LangGraph-->>LangGraph: Planner strategizes
        LangGraph-->>LangGraph: Researcher retrieves (generates Contexts)
        LangGraph-->>LangGraph: Writer/Editor generates (generates Answer)
        LangGraph-->>TestScript: 3. Return final State (contains Contexts and Answer)
        deactivate LangGraph
        TestScript->>TestScript: 4. Assemble evaluation data row
    end
    TestScript->>RAGAS: 5. Submit complete dataset for batch evaluation (LLM as a Judge)
    activate RAGAS
    RAGAS-->>RAGAS: Calculate Precision, Recall, Faithfulness, Relevance
    RAGAS-->>TestScript: 6. Return scoring matrix
    deactivate RAGAS
    TestScript->>Report: 7. Generate and output acceptance report

Do you see it? Our test script acts like a "ruthless overseer," holding a book of standard answers (the Golden Dataset), continuously assigning tasks to the Agency, and then packaging the intermediate results (Contexts) and final results (Answer) produced by the Agency, throwing them to another, more advanced LLM (the RAGAS Judge) for scoring.
💻 Practical Code Walkthrough
Next, we will implement this testing pipeline using Python. To allow everyone to run this code directly, I am using a simplified Mock function to replace the massive Graph code from the previous 28 episodes, but the interface and State structure are completely identical.
Preparation
You need to install the following dependencies:
pip install ragas langchain-openai datasets langgraph
Core Test Script agency_evaluator.py
import os
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
# ==========================================
# 0. Environment Variable Configuration (Ensure your API KEY is configured)
# ==========================================
os.environ["OPENAI_API_KEY"] = "sk-your-openai-api-key"
# ==========================================
# 1. Define the Golden Dataset
# ==========================================
# In actual engineering, this is usually a hand-crafted JSON/CSV file
# Containing edge-case test cases representing various business scenarios
golden_dataset = [
    {
        "question": "Please write a short article introducing the core state management mechanism of LangGraph.",
        # ground_truth is the core points of the standard answer written by an expert, used to calculate Recall
        "ground_truth": "LangGraph manages state through StateGraph. It requires defining a TypedDict as the State. Each node receives the current State and returns the State fields to be updated. State updates can be overwrites or appends (via Annotated and operator)."
    },
    {
        "question": "What is LLM hallucination? How do we avoid it in our Agency project?",
        "ground_truth": "LLM hallucination refers to the model generating seemingly plausible but actually incorrect or unfounded information. In the Agency project, this is avoided by introducing a Researcher node for RAG retrieval, requiring the Writer node to create strictly based on the retrieved Context, and using an Editor node for fact-checking."
    }
]
# ==========================================
# 2. Mock the LangGraph Agency we built in the previous 28 episodes
# ==========================================
# Here we use a Mock function to replace the actual Graph execution,
# but please note: its input and output structure is completely identical to the real LangGraph State!
def run_agency_graph(question: str) -> dict:
    """
    Mock the execution process of the AI Content Agency
    In a real scenario, this should be: return app.invoke({"question": question})
    """
    print(f"🤖 [Agency Graph] Processing task: {question}")
    # Mock the context materials (Contexts) found by the Researcher
    if "LangGraph" in question:
        contexts = [
            "LangGraph is a multi-agent framework launched by LangChain.",
            "In LangGraph, StateGraph is the core, defining the state structure through TypedDict.",
            "Node functions receive the state and return incremental updates to the state."
        ]
        # Mock the final article (Answer) generated by the Writer/Editor
        answer = "The core state management mechanism of LangGraph relies on StateGraph. Developers must first define a State structure based on TypedDict. During the execution of the graph, each node receives the current global State, executes its own logic, and returns a dictionary to update the global State. This mechanism ensures stable sharing and passing of information among multiple agents."
    else:
        contexts = [
            "LLM hallucination refers to the model talking nonsense in a serious manner.",
            "Retrieval-Augmented Generation (RAG) can effectively reduce hallucinations because the model has external knowledge references."
        ]
        answer = "LLM hallucination refers to the model generating false information. In our Agency project, we mainly avoid this through RAG technology. The Researcher will first find information, and then the Writer only writes the article based on the information, so it won't make things up."
    return {
        "question": question,
        "contexts": contexts,  # Researcher's output
        "answer": answer       # Editor's final output
    }
# ==========================================
# 3. Automated Testing and Evaluation Pipeline
# ==========================================
def run_evaluation_pipeline():
    print("🚀 Starting Agency automated acceptance testing...\n")
    evaluation_data = {
        "question": [],
        "answer": [],
        "contexts": [],
        "ground_truth": []
    }
    # Step 1: Iterate through the golden dataset, run the graph to collect data
    for item in golden_dataset:
        question = item["question"]
        ground_truth = item["ground_truth"]
        # Run the graph! Get the actual performance of the Agency
        final_state = run_agency_graph(question)
        # Assemble the data format required by RAGAS
        evaluation_data["question"].append(question)
        evaluation_data["answer"].append(final_state["answer"])
        evaluation_data["contexts"].append(final_state["contexts"])
        evaluation_data["ground_truth"].append(ground_truth)
    # Step 2: Convert to HuggingFace Dataset format
    dataset = Dataset.from_dict(evaluation_data)
    print("\n📊 Data collection complete, calling RAGAS for LLM scoring (LLM as a Judge)...")
    # Step 3: Configure the Judge LLM (for stability, the judge is usually at the GPT-4o level)
    judge_llm = ChatOpenAI(model="gpt-4o")
    judge_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    # Step 4: Execute evaluation
    # We focus on: Faithfulness (Writer), Answer Relevancy (Editor), Context Precision (Researcher), Context Recall (Researcher)
    result = evaluate(
        dataset=dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall,
        ],
        llm=judge_llm,
        embeddings=judge_embeddings
    )
    # Step 5: Output acceptance report
    print("\n✅ Test acceptance complete! The final evaluation report is as follows:")
    print("=" * 50)
    # result is dict-like, containing the average score of each metric
    for metric_name, score in result.items():
        print(f"🎯 {metric_name.capitalize().ljust(20)}: {score:.4f}")
    print("=" * 50)
    # Set passing score assertions (This is our CI/CD blocking logic)
    # If the article written by the Writer has too many hallucinations (Faithfulness < 0.8), the test fails immediately!
    assert result["faithfulness"] >= 0.8, "❌ Test Failed: The faithfulness of the Writer node is below 0.8, indicating a severe risk of hallucination!"
    assert result["context_recall"] >= 0.7, "❌ Test Failed: The recall rate of the Researcher node is insufficient, missing key information!"
    print("🎉 Congratulations! The Agency workflow has passed all automated acceptance metrics! Ready to merge into the main branch!")

if __name__ == "__main__":
    run_evaluation_pipeline()
Code Execution Analysis
When you run this code, you will see that it first calls run_agency_graph to get the mocked graph state. Then, RAGAS will use GPT-4o as the judge in the background to analyze the contexts, answer, and ground_truth you passed in.
The final output might look something like this:
🎯 Faithfulness : 0.9500 (Indicates the Writer is well-behaved and didn't make things up outside the context)
🎯 Answer_relevancy : 0.9200 (Indicates the Editor gatekept well, closely addressing the user's question)
🎯 Context_precision : 0.8800 (Indicates the information found by the Researcher has a high signal-to-noise ratio)
🎯 Context_recall : 0.8500 (Indicates the Researcher basically found all the required knowledge points)
With this script, every time you modify the Researcher's Prompt or adjust the Writer's LLM parameters in the future, you just need to run the script once. If the scores plummet, you know you broke the code.
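One practical upgrade before wiring this into CI: as the comment in the script notes, the golden dataset normally lives in a hand-maintained JSON/CSV file rather than in code. A minimal loader sketch, where the `golden_dataset.json` file name and the validation rules are my assumptions (the example writes a tiny file into a temp directory so it runs self-contained):

```python
import json
import tempfile
from pathlib import Path

# A tiny example dataset; in a real repo this JSON file is hand-maintained
# and checked into version control alongside the tests.
example_cases = [
    {
        "question": "What is LLM hallucination?",
        "ground_truth": "The model generating plausible but unfounded information.",
    }
]

def load_golden_dataset(path: Path) -> list[dict]:
    """Load the hand-crafted test cases and fail fast on malformed entries."""
    cases = json.loads(path.read_text(encoding="utf-8"))
    for case in cases:
        missing = {"question", "ground_truth"} - case.keys()
        assert not missing, f"Test case missing fields {missing}: {case}"
    return cases

tmp_dir = tempfile.mkdtemp()
dataset_path = Path(tmp_dir) / "golden_dataset.json"
dataset_path.write_text(json.dumps(example_cases, ensure_ascii=False, indent=2), encoding="utf-8")

golden_dataset = load_golden_dataset(dataset_path)
print(f"Loaded {len(golden_dataset)} golden test cases")
```

Failing fast on malformed cases matters: a silently dropped `ground_truth` field would otherwise surface as a confusing RAGAS error deep inside the evaluation run.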
Pitfalls and Best Practices
As your instructor, I have seen too many teams step into the same mud pits when introducing LLM automated testing. The following hard-won troubleshooting lessons are worth their weight in gold; I recommend committing them to memory:
1. The "Bias" Trap of the Judge (LLM Judge Bias)
Pitfall: You use GPT-3.5 to generate an article, and then use GPT-3.5 as the judge to score it. You will find the scores are surprisingly high! This is because the same model tends to favor its own generated text style.
Best Practice: The judge model must be a tier higher than the generation model, or at least from a different family. If your Agency uses Claude 3.5 Sonnet for generation, try to use GPT-4o as the Judge during evaluation. Maintain the independence and authority of the referee.
2. Timeouts and Bankruptcy Caused by Brute-Forcing Context
Pitfall: Your Researcher node is too aggressive, retrieving 20 web pages at once and returning 50,000 words of Context. You stuff these 50,000 words along with the question into the RAGAS evaluation library. The result? Either the evaluation times out and fails, or you have a heart attack when you see your OpenAI bill at the end of the month.
Best Practice: Before passing the State into the test flow, create an interceptor. Limit the token count of contexts (e.g., truncate to the first 5000 tokens). The purpose of evaluation is to see if the "core pipeline" works, not to test long-context limits at this stage.
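A sketch of such an interceptor. It uses a rough 4-characters-per-token heuristic instead of a real tokenizer, so it runs without extra dependencies; in production you would count tokens with the model's actual tokenizer (e.g. tiktoken). Both the heuristic and the 5000-token budget are assumptions for illustration:

```python
# Rough heuristic: ~4 characters per token for English text.
MAX_TOKENS = 5000
CHARS_PER_TOKEN = 4

def truncate_contexts(contexts: list[str], max_tokens: int = MAX_TOKENS) -> list[str]:
    """Cap total context size before it reaches the RAGAS judge."""
    budget = max_tokens * CHARS_PER_TOKEN  # budget in characters
    kept: list[str] = []
    for ctx in contexts:
        if budget <= 0:
            break
        piece = ctx[:budget]
        kept.append(piece)
        budget -= len(piece)
    return kept

# A Researcher gone wild: two 30,000-character "web pages"
huge_contexts = ["x" * 30_000, "y" * 30_000]
trimmed = truncate_contexts(huge_contexts)
print(sum(len(c) for c in trimmed))  # -> 20000, within the character budget
```

Call this on `final_state["contexts"]` before appending to the evaluation data, and both your wall-clock time and your API bill stay bounded.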
3. The "Schrödinger's State" of Test Results (Flaky Tests)
Pitfall: With the exact same test code, running faithfulness yesterday yielded 0.85 (Pass), but today it became 0.78 (Fail). CI/CD alarms go off constantly, and the team loses trust in the tests. LLMs are non-deterministic!
Best Practice:
- You must set the temperature of the Judge LLM to 0.
- Do not obsess over the absolute score of a single run. In CI/CD, the Golden Dataset should contain at least 20-50 test cases, and you should use the average score for assertions (like result["faithfulness"] >= 0.8 in the code). A large sample size can effectively smooth out the variance of single generations.
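The averaging idea can be sketched in a few lines (the per-case scores below are invented for illustration):

```python
from statistics import mean

# Per-case faithfulness scores from one evaluation run (numbers invented
# for illustration). Any single case may wobble between runs; the average
# over a few dozen cases is far more stable.
per_case_faithfulness = [0.92, 0.78, 0.85, 0.88, 0.81]

avg = mean(per_case_faithfulness)
print(f"mean faithfulness over {len(per_case_faithfulness)} cases: {avg:.3f}")

# Gate CI/CD on the average, not on any individual case
assert avg >= 0.8, "faithfulness regression: average below threshold"
```

Note that the second score (0.78) would fail the 0.8 gate on its own, yet the run as a whole passes; that is exactly the flakiness the averaging is meant to absorb.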
4. Don't Confuse "Testing" with "Monitoring"
Pitfall: Some students think RAGAS is so useful that they deploy it to the production environment, performing a RAGAS evaluation on every single online user request. As a result, system latency increases by 10 seconds, and all the users run away.
Best Practice: Remember that the title of this episode is "Automated Acceptance". RAGAS is an offline testing tool, run before you release a new version. Online real-time monitoring (which we will cover in the next episode) requires more lightweight solutions (like LangSmith tracing). You absolutely must not block the main flow online with real-time LLM scoring.
📝 Episode Summary
Today, we installed the final line of defense for our "AI Content Agency"—an automated acceptance pipeline based on RAGAS.
We broke through the mindset limitations of traditional unit testing and introduced the "LLM as a Judge" evaluation paradigm. By extracting the core states (contexts and answer) during the LangGraph execution process, we quantitatively scored the Researcher's Precision/Recall and the Writer/Editor's generation quality (Faithfulness/Relevancy).
You are now not just a developer who knows how to write Prompts and wire Graphs; you possess the quality control mindset of a high-level AI architect. Your code is finally no longer mysticism, but an engineering crystallization supported by data.
Spoiler Alert: The graph is written, and the tests have passed. In the next episode, which is the grand finale of our entire series (Episode 30)! We will draw our swords, deploy this massive Agency to the production environment, and integrate LangSmith for the ultimate practical combat of full-link real-time monitoring and Human-in-the-loop intervention.
Everyone, get today's test script running successfully, and I'll see you at the top! Class dismissed!