Issue 29 | Test Graph: How Should We Automate Acceptance Testing for Agent Performance?
🎯 Learning Objectives for This Episode
Future top-tier AI architects, welcome back to our "LangGraph Multi-Agent Expert Course". I am your instructor.
Our "AI Content Agency" project has advanced to Episode 29. Looking back at the previous 28 episodes, we built from scratch the strategizing Planner, the deep-digging Researcher, the eloquent Writer, and the nitpicking Editor. The entire LangGraph flow is silky smooth. Watching the green text scroll by in the terminal, do you feel the job is done and ready to package and deploy? Are you perhaps already fantasizing about a promotion and a raise?
Wake up!
In traditional software engineering, writing code without writing tests is considered "unprofessional"; in large language model (LLM) application development, building multi-agents without automated evaluation is called "planting landmines". The output of LLMs is non-deterministic. What tested perfectly yesterday might generate a bunch of illogical hallucinations today. How do you prove that your Researcher actually found the right information? How do you guarantee that your Writer isn't just making things up?
Guessing? Relying on feelings? Eyeballing it? That's mysticism, not engineering.
Today, we are going to solve this ultimate question: How do we perform automated acceptance testing on a LangGraph multi-agent workflow?
Through this episode, you will master the following core skills:
- Mindset Breakthrough: Understand the failure of traditional unit testing (assert a == b) in the Agent era, and master the core concept of "LLM as a Judge".
- Golden Standard Construction: Learn to define and build a "Golden Dataset" for testing in the AI Content Agency project.
- Introducing the RAGAS Evaluation Framework: Proficiently use the RAGAS framework to calculate "Context Precision/Recall" for the Researcher node, and "Faithfulness/Answer Relevance" for the Writer node.
- Automated Testing Pipeline: Write scripts to seamlessly integrate LangGraph execution with RAGAS evaluation, achieving one-click automated acceptance.
📖 Principle Analysis
Why Can't We Use Traditional Unit Testing?
In traditional code, testing an addition function is simple: assert add(1, 1) == 2.
But in our Agency, a user inputs: "Write an article about the application of quantum computing in the financial sector."
The article generated by the Writer is different every time—different word counts, different sentence structures. You cannot use == to assert.
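To make this concrete, here is a minimal, dependency-free illustration (the two sample sentences are invented). An exact `==` comparison fails on two semantically equivalent outputs, and even a surface-similarity score cannot tell a faithful answer from a fluent hallucination:

```python
from difflib import SequenceMatcher

# Two hypothetical runs of the same prompt: semantically equivalent,
# textually different. An exact-match assertion is useless here.
run_1 = "Quantum computing can accelerate portfolio optimization in finance."
run_2 = "In finance, portfolio optimization can be accelerated by quantum computing."

assert run_1 != run_2  # == would fail on every run; this is the core problem

# A character-level similarity score at least survives rewording,
# but it is blind to meaning: a confident lie can score just as high.
similarity = SequenceMatcher(None, run_1, run_2).ratio()
print(f"surface similarity: {similarity:.2f}")
```

This gap, scoring meaning rather than surface form, is exactly what the evaluation approach below is designed to fill.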
Therefore, we need to introduce the concept of Eval (Evaluation). One of the most mature approaches in the industry right now is utilizing the RAGAS (Retrieval Augmented Generation Assessment) framework. Although it has RAG in its name, its core metrics perfectly align with our multi-agent architecture.
Mapping RAGAS Core Metrics to Agency Roles
In our AI Content Agency, we can actually view the entire process as an advanced RAG variant:
- Researcher Node = Retriever. It is responsible for finding information online or in a local knowledge base.
- Writer/Editor Node = Generator. It is responsible for writing and polishing articles based on the retrieved information.
Correspondingly, we need to evaluate the following four core dimensions:
- Dimension 1: Context Precision - Evaluating the Researcher
- In plain terms: Is the information found by the Researcher actually useful? Did it stuff in a bunch of garbage information?
- Dimension 2: Context Recall - Evaluating the Researcher
- In plain terms: Did the Researcher find all the crucial information necessary to answer the user's question? Did it miss any core knowledge points?
- Dimension 3: Faithfulness - Evaluating the Writer
- In plain terms: Is the article written by the Writer entirely based on the information found by the Researcher? Did it make things up on its own (hallucination)?
- Dimension 4: Answer Relevance - Evaluating the Editor/Final Output
- In plain terms: Does the final submitted article truly address the goals initially set by the Planner and the user's needs? Is it answering a different question?
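Before we bring in RAGAS, it helps to see the shape of a faithfulness computation. The toy scorer below is NOT how RAGAS works internally: RAGAS uses an LLM to decompose the answer into claims and verify each one against the contexts. This sketch substitutes crude word overlap for the LLM judge, purely to make the decompose-then-verify idea tangible; the sentences and the 0.5 threshold are illustrative assumptions:

```python
# Toy stand-in for a faithfulness metric: split the answer into "claims"
# (sentences) and count how many are supported by at least one retrieved
# context, using word overlap instead of an LLM judge.
def toy_faithfulness(answer: str, contexts: list[str], threshold: float = 0.5) -> float:
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    if not claims:
        return 0.0
    supported = 0
    for claim in claims:
        claim_words = set(claim.lower().split())
        for ctx in contexts:
            ctx_words = set(ctx.lower().split())
            # a claim counts as "supported" if enough of its words appear in a context
            if len(claim_words & ctx_words) / len(claim_words) >= threshold:
                supported += 1
                break
    return supported / len(claims)

contexts = ["LangGraph manages state through StateGraph and TypedDict."]
grounded = "LangGraph manages state through StateGraph."
invented = "LangGraph was released in 1995 by a hardware vendor."
print(toy_faithfulness(grounded, contexts))  # -> 1.0 (every claim supported)
print(toy_faithfulness(invented, contexts))  # -> 0.0 (invented claim)
```

The real metric replaces the word-overlap check with an LLM verdict per claim, which is why it needs a judge model and costs tokens.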
Automated Acceptance Workflow (Mermaid Diagram)
To achieve automation, we need to build a test script. Its execution logic is as follows:
sequenceDiagram
    participant TestScript as Automated Test Script
    participant Dataset as Golden Dataset
    participant LangGraph as AI Content Agency (Graph)
    participant RAGAS as RAGAS Evaluation Engine
    participant Report as Test Report

    TestScript->>Dataset: 1. Load test cases (Question & Ground Truth)
    loop Loop through each test case
        TestScript->>LangGraph: 2. Pass Question to trigger workflow
        activate LangGraph
        LangGraph-->>LangGraph: Planner strategizes
        LangGraph-->>LangGraph: Researcher retrieves (generates Contexts)
        LangGraph-->>LangGraph: Writer/Editor generates (generates Answer)
        LangGraph-->>TestScript: 3. Return final State (contains Contexts and Answer)
        deactivate LangGraph
        TestScript->>TestScript: 4. Assemble evaluation data row
    end
    TestScript->>RAGAS: 5. Submit complete dataset for batch evaluation (LLM as a Judge)
    activate RAGAS
    RAGAS-->>RAGAS: Calculate Precision, Recall, Faithfulness, Relevance
    RAGAS-->>TestScript: 6. Return scoring matrix
    deactivate RAGAS
    TestScript->>Report: 7. Generate and output acceptance report

Do you see it? Our test script acts like a "ruthless overseer," holding a book of standard answers (the Golden Dataset), continuously assigning tasks to the Agency, and then packaging the intermediate results (Contexts) and final results (Answer) produced by the Agency, throwing them to another, more advanced LLM (the RAGAS Judge) for scoring.
💻 Practical Code Walkthrough
Next, we will implement this testing pipeline using Python. To allow everyone to run this code directly, I am using a simplified Mock function to replace the massive Graph code from the previous 28 episodes, but the interface and State structure are completely identical.
Preparation
You need to install the following dependencies:
pip install ragas langchain-openai datasets langgraph
Core Test Script agency_evaluator.py
import os
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
# ==========================================
# 0. Environment Variable Configuration (Ensure your API KEY is configured)
# ==========================================
os.environ["OPENAI_API_KEY"] = "sk-your-openai-api-key"
# ==========================================
# 1. Define the Golden Dataset
# ==========================================
# In actual engineering, this is usually a hand-crafted JSON/CSV file
# Containing edge-case test cases representing various business scenarios
golden_dataset = [
    {
        "question": "Please write a short article introducing the core state management mechanism of LangGraph.",
        # ground_truth is the core points of the standard answer written by an expert, used to calculate Recall
        "ground_truth": "LangGraph manages state through StateGraph. It requires defining a TypedDict as the State. Each node receives the current State and returns the State fields to be updated. State updates can be overwrites or appends (via Annotated and operator)."
    },
    {
        "question": "What is LLM hallucination? How do we avoid it in our Agency project?",
        "ground_truth": "LLM hallucination refers to the model generating seemingly plausible but actually incorrect or unfounded information. In the Agency project, this is avoided by introducing a Researcher node for RAG retrieval, requiring the Writer node to create strictly based on the retrieved Context, and using an Editor node for fact-checking."
    }
]
# ==========================================
# 2. Mock the LangGraph Agency we built in the previous 28 episodes
# ==========================================
# Here we use a Mock function to replace the actual Graph execution,
# but please note: its input and output structure is completely identical to the real LangGraph State!
def run_agency_graph(question: str) -> dict:
    """
    Mock the execution process of the AI Content Agency
    In a real scenario, this should be: return app.invoke({"question": question})
    """
    print(f"🤖 [Agency Graph] Processing task: {question}")
    # Mock the context materials (Contexts) found by the Researcher
    if "LangGraph" in question:
        contexts = [
            "LangGraph is a multi-agent framework launched by LangChain.",
            "In LangGraph, StateGraph is the core, defining the state structure through TypedDict.",
            "Node functions receive the state and return incremental updates to the state."
        ]
        # Mock the final article (Answer) generated by the Writer/Editor
        answer = "The core state management mechanism of LangGraph relies on StateGraph. Developers must first define a State structure based on TypedDict. During the execution of the graph, each node receives the current global State, executes its own logic, and returns a dictionary to update the global State. This mechanism ensures stable sharing and passing of information among multiple agents."
    else:
        contexts = [
            "LLM hallucination refers to the model talking nonsense in a serious manner.",
            "Retrieval-Augmented Generation (RAG) can effectively reduce hallucinations because the model has external knowledge references."
        ]
        answer = "LLM hallucination refers to the model generating false information. In our Agency project, we mainly avoid this through RAG technology. The Researcher will first find information, and then the Writer only writes the article based on the information, so it won't make things up."
    return {
        "question": question,
        "contexts": contexts,  # Researcher's output
        "answer": answer       # Editor's final output
    }
# ==========================================
# 3. Automated Testing and Evaluation Pipeline
# ==========================================
def run_evaluation_pipeline():
    print("🚀 Starting Agency automated acceptance testing...\n")
    evaluation_data = {
        "question": [],
        "answer": [],
        "contexts": [],
        "ground_truth": []
    }
    # Step 1: Iterate through the golden dataset, run the graph to collect data
    for item in golden_dataset:
        question = item["question"]
        ground_truth = item["ground_truth"]
        # Run the graph! Get the actual performance of the Agency
        final_state = run_agency_graph(question)
        # Assemble the data format required by RAGAS
        evaluation_data["question"].append(question)
        evaluation_data["answer"].append(final_state["answer"])
        evaluation_data["contexts"].append(final_state["contexts"])
        evaluation_data["ground_truth"].append(ground_truth)
    # Step 2: Convert to HuggingFace Dataset format
    dataset = Dataset.from_dict(evaluation_data)
    print("\n📊 Data collection complete, calling RAGAS for LLM scoring (LLM as a Judge)...")
    # Step 3: Configure the Judge LLM (for stability, the judge is usually at the GPT-4o level)
    judge_llm = ChatOpenAI(model="gpt-4o")
    judge_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
    # Step 4: Execute evaluation
    # We focus on: Faithfulness (Writer), Answer Relevancy (Editor), Context Precision (Researcher), Context Recall (Researcher)
    result = evaluate(
        dataset=dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall,
        ],
        llm=judge_llm,
        embeddings=judge_embeddings
    )
    # Step 5: Output acceptance report
    print("\n✅ Test acceptance complete! The final evaluation report is as follows:")
    print("=" * 50)
    # result is dict-like, containing the average score of each metric
    for metric_name, score in result.items():
        print(f"🎯 {metric_name.capitalize().ljust(20)}: {score:.4f}")
    print("=" * 50)
    # Set passing score assertions (This is our CI/CD blocking logic)
    # If the article written by the Writer has too many hallucinations (Faithfulness < 0.8), the test fails immediately!
    assert result["faithfulness"] >= 0.8, "❌ Test Failed: The faithfulness of the Writer node is below 0.8, indicating a severe risk of hallucination!"
    assert result["context_recall"] >= 0.7, "❌ Test Failed: The recall rate of the Researcher node is insufficient, missing key information!"
    print("🎉 Congratulations! The Agency workflow has passed all automated acceptance metrics! Ready to merge into the main branch!")

if __name__ == "__main__":
    run_evaluation_pipeline()
Code Execution Analysis
When you run this code, you will see that it first calls run_agency_graph to get the mocked graph state. Then, RAGAS will use GPT-4o as the judge in the background to analyze the contexts, answer, and ground_truth you passed in.
The final output might look something like this:
🎯 Faithfulness : 0.9500 (Indicates the Writer is well-behaved and didn't make things up outside the context)
🎯 Answer_relevancy : 0.9200 (Indicates the Editor gatekept well, closely addressing the user's question)
🎯 Context_precision : 0.8800 (Indicates the information found by the Researcher has a high signal-to-noise ratio)
🎯 Context_recall : 0.8500 (Indicates the Researcher basically found all the required knowledge points)
With this script, every time you modify the Researcher's Prompt or adjust the Writer's LLM parameters in the future, you just need to run the script once. If the scores plummet, you know you broke the code.
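One practical upgrade before wiring this into CI: as the comment in the script notes, the golden dataset normally lives in a hand-maintained JSON/CSV file rather than in code. A minimal loader sketch, where the `golden_dataset.json` file name and the validation rules are my assumptions (the example writes a tiny file into a temp directory so it runs self-contained):

```python
import json
import tempfile
from pathlib import Path

# A tiny example dataset; in a real repo this JSON file is hand-maintained
# and checked into version control alongside the tests.
example_cases = [
    {
        "question": "What is LLM hallucination?",
        "ground_truth": "The model generating plausible but unfounded information.",
    }
]

def load_golden_dataset(path: Path) -> list[dict]:
    """Load the hand-crafted test cases and fail fast on malformed entries."""
    cases = json.loads(path.read_text(encoding="utf-8"))
    for case in cases:
        missing = {"question", "ground_truth"} - case.keys()
        assert not missing, f"Test case missing fields {missing}: {case}"
    return cases

tmp_dir = tempfile.mkdtemp()
dataset_path = Path(tmp_dir) / "golden_dataset.json"
dataset_path.write_text(json.dumps(example_cases, ensure_ascii=False, indent=2), encoding="utf-8")

golden_dataset = load_golden_dataset(dataset_path)
print(f"Loaded {len(golden_dataset)} golden test cases")
```

Failing fast on malformed cases matters: a silently dropped `ground_truth` field would otherwise surface as a confusing RAGAS error deep inside the evaluation run.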
Pitfalls and Best Practices
As your instructor, I have seen too many teams step into the same mud pits when introducing LLM automated testing. The following hard-won troubleshooting lessons are worth their weight in gold; I recommend committing them to memory:
1. The "Bias" Trap of the Judge (LLM Judge Bias)
Pitfall: You use GPT-3.5 to generate an article, and then use GPT-3.5 as the judge to score it. You will find the scores are surprisingly high! This is because the same model tends to favor its own generated text style.
Best Practice: The judge model must be a tier higher than the generation model, or at least from a different family. If your Agency uses Claude 3.5 Sonnet for generation, try to use GPT-4o as the Judge during evaluation. Maintain the independence and authority of the referee.
2. Timeouts and Bankruptcy Caused by Brute-Forcing Context
Pitfall: Your Researcher node is too aggressive, retrieving 20 web pages at once and returning 50,000 words of Context. You stuff these 50,000 words along with the question into the RAGAS evaluation library. The result? Either the evaluation times out and fails, or you have a heart attack when you see your OpenAI bill at the end of the month.
Best Practice: Before passing the State into the test flow, create an interceptor. Limit the token count of contexts (e.g., truncate to the first 5000 tokens). The purpose of evaluation is to see if the "core pipeline" works, not to test long-context limits at this stage.
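A sketch of such an interceptor. It uses a rough 4-characters-per-token heuristic instead of a real tokenizer, so it runs without extra dependencies; in production you would count tokens with the model's actual tokenizer (e.g. tiktoken). Both the heuristic and the 5000-token budget are assumptions for illustration:

```python
# Rough heuristic: ~4 characters per token for English text.
MAX_TOKENS = 5000
CHARS_PER_TOKEN = 4

def truncate_contexts(contexts: list[str], max_tokens: int = MAX_TOKENS) -> list[str]:
    """Cap total context size before it reaches the RAGAS judge."""
    budget = max_tokens * CHARS_PER_TOKEN  # budget in characters
    kept: list[str] = []
    for ctx in contexts:
        if budget <= 0:
            break
        piece = ctx[:budget]
        kept.append(piece)
        budget -= len(piece)
    return kept

# A Researcher gone wild: two 30,000-character "web pages"
huge_contexts = ["x" * 30_000, "y" * 30_000]
trimmed = truncate_contexts(huge_contexts)
print(sum(len(c) for c in trimmed))  # -> 20000, within the character budget
```

Call this on `final_state["contexts"]` before appending to the evaluation data, and both your wall-clock time and your API bill stay bounded.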
3. The "Schrödinger's State" of Test Results (Flaky Tests)
Pitfall: With the exact same test code, running faithfulness yesterday yielded 0.85 (Pass), but today it became 0.78 (Fail). CI/CD alarms go off constantly, and the team loses trust in the tests. LLMs are non-deterministic!
Best Practice:
- You must set the temperature of the Judge LLM to 0.
- Do not obsess over the absolute score of a single run. In CI/CD, the Golden Dataset should contain at least 20-50 test cases, and you should use the average score for assertions (like result["faithfulness"] >= 0.8 in the code). A large sample size can effectively smooth out the variance of single generations.
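The averaging idea can be sketched in a few lines (the per-case scores below are invented for illustration):

```python
from statistics import mean

# Per-case faithfulness scores from one evaluation run (numbers invented
# for illustration). Any single case may wobble between runs; the average
# over a few dozen cases is far more stable.
per_case_faithfulness = [0.92, 0.78, 0.85, 0.88, 0.81]

avg = mean(per_case_faithfulness)
print(f"mean faithfulness over {len(per_case_faithfulness)} cases: {avg:.3f}")

# Gate CI/CD on the average, not on any individual case
assert avg >= 0.8, "faithfulness regression: average below threshold"
```

Note that the second score (0.78) would fail the 0.8 gate on its own, yet the run as a whole passes; that is exactly the flakiness the averaging is meant to absorb.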
4. Don't Confuse "Testing" with "Monitoring"
Pitfall: Some students think RAGAS is so useful that they deploy it to the production environment, performing a RAGAS evaluation on every single online user request. As a result, system latency increases by 10 seconds, and all the users run away.
Best Practice: Remember that the title of this episode is "Automated Acceptance". RAGAS is an offline testing tool, run before you release a new version. Online real-time monitoring (which we will cover in the next episode) requires more lightweight solutions (like LangSmith tracing). You absolutely must not block the main flow online with real-time LLM scoring.
📝 Episode Summary
Today, we installed the final line of defense for our "AI Content Agency"—an automated acceptance pipeline based on RAGAS.
We broke through the mindset limitations of traditional unit testing and introduced the "LLM as a Judge" evaluation paradigm. By extracting the core states (contexts and answer) during the LangGraph execution process, we quantitatively scored the Researcher's Precision/Recall and the Writer/Editor's generation quality (Faithfulness/Relevancy).
You are now not just a developer who knows how to write Prompts and wire Graphs; you possess the quality control mindset of a high-level AI architect. Your code is finally no longer mysticism, but an engineering crystallization supported by data.
Spoiler Alert: The graph is written, and the tests have passed. In the next episode, which is the grand finale of our entire series (Episode 30)! We will draw our swords, deploy this massive Agency to the production environment, and integrate LangSmith for the ultimate practical combat of full-link real-time monitoring and Human-in-the-loop intervention.
Everyone, get today's test script running successfully, and I'll see you at the top! Class dismissed!