Issue 14 | Time Machine! Building Checkpoints Using SqliteSaver/MemorySaver

Updated on 4/14/2026

🎯 Learning Objectives for this Issue

Welcome, future AI architects, to Issue 14 of the "LangGraph Multi-Agent Expert Course"! Do you remember the ambitious "AI Universal Content Creation Agency" we built previously? It can already plan, research, write, and even edit. But now, we need to give it a superpower—a "time machine" that allows it to "travel" back in time, or rather, resume from where it left off.

By the end of this lesson, you will be able to:

  1. Thoroughly understand the core value of LangGraph Checkpoints: Why it is the "killer feature" for building robust and efficient multi-agent systems, and its application scenarios in real-world projects.
  2. Master the configuration and usage of SqliteSaver and MemorySaver: Learn how to introduce these two checkpoint savers (one persistent, one in-memory) into your LangGraph applications and understand their respective pros and cons.
  3. Introduce persistent state storage to your "AI Content Agency": Ensure that even if the system crashes, the network disconnects, or the process restarts, our agents can seamlessly resume work from where they left off, avoiding redundant effort and wasted resources.
  4. Master practical skills for restoring workflows from historical state snapshots: Learn how to use the thread_id mechanism to accurately locate and restore the execution state of a specific session.

Ready? Let's embark on this LangGraph time machine journey together!

📖 Principle Analysis

Staying on Track: Why Do We Need Checkpoints?

Imagine your "AI Universal Content Creation Agency" is working around the clock to generate a 10,000-word article for you. The Planner has just finished a complex structural plan, the Researcher is navigating through massive amounts of data, and the Writer is halfway through writing... Suddenly, the server goes offline! A power outage occurs! Or your program crashes due to a minor bug!

If you don't have Checkpoints, congratulations, all the hard work your agents previously did—those expensive LLM calls, the time-consuming data processing—is completely down the drain. You have to start from scratch, wasting not only precious computing resources and LLM tokens (which cost real money!), but also a massive amount of time. This is simply a recipe for disaster!

This is the core value of LangGraph Checkpoints: It acts like an auto-save feature in a video game, silently saving a complete copy of your "game progress" at every critical node of your multi-agent workflow. No matter what unexpected situation occurs, you can "load the save" at any time and continue from where you last saved, so the workflow stays on track instead of failing at the critical moment.

The Core Working Mechanism of Checkpoints

LangGraph's Checkpoints mechanism is built upon its core State abstraction. In LangGraph, the execution state of the entire multi-agent workflow is encapsulated within a State object. Whenever an Agent (node) runs, it receives the current State, processes it according to its logic, and then returns a State update, which LangGraph merges into the State.

The working principle of Checkpoints is: after each step of the graph, once the State has been updated, the checkpointer automatically stores a snapshot of this latest State. When your application needs to recover, it loads the latest snapshot of a specific session (identified by thread_id) from storage, and the workflow can then resume execution from this State.

It's like shooting a movie: after every take (Agent execution), the director records the current scene, actor positions, and prop placements (State) (saving a snapshot). If filming is interrupted, the next time they start, they only need to refer to the records to accurately restore the state to exactly how it was when filming stopped.
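The save-and-resume idea can be sketched in plain Python. This is a toy model, not LangGraph's implementation: the `checkpoints` dictionary, the `run_with_checkpoints` helper, the `step` counter, and the `crash_before_step` parameter are all hypothetical names invented for illustration.

```python
import copy
from typing import Callable, Dict, List

# Hypothetical in-memory store: thread_id -> list of state snapshots (oldest first)
checkpoints: Dict[str, List[dict]] = {}


def run_with_checkpoints(
    thread_id: str,
    steps: List[Callable[[dict], dict]],
    initial_state: dict,
    crash_before_step: int = -1,
) -> dict:
    """Run steps in order, snapshotting after each one; resume if snapshots exist."""
    history = checkpoints.setdefault(thread_id, [])
    # Resume from the latest snapshot if this thread has run before
    state = copy.deepcopy(history[-1]) if history else dict(initial_state, step=0)
    while state["step"] < len(steps):
        if state["step"] == crash_before_step:
            raise RuntimeError("simulated crash")
        update = steps[state["step"]](state)
        # Merge the step's update into the state and advance the step counter
        state = {**state, **update, "step": state["step"] + 1}
        history.append(copy.deepcopy(state))  # snapshot after every completed step
    return state


steps = [
    lambda s: {"plan": "outline"},
    lambda s: {"research": "notes"},
    lambda s: {"draft": s["plan"] + " + " + s["research"]},
]

# First run: crashes before the third step, after two snapshots were saved
try:
    run_with_checkpoints("article-1", steps, {}, crash_before_step=2)
except RuntimeError:
    pass

# Second run with the same thread_id: only the third step is executed
final = run_with_checkpoints("article-1", steps, {})
```

The second call never repeats the first two steps, because it starts from the last snapshot stored under the same thread identifier.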

Two Common Savers: In-Memory and Persistent

LangGraph provides several implementations of BaseCheckpointSaver, among which the most commonly used and easiest to get started with are:

  1. MemorySaver (In-Memory Storage):
    • Characteristics: As the name suggests, it stores all state snapshots in the program's memory.
    • Pros: Extremely fast, simple to configure, suitable for development debugging, testing, or short-lived tasks that do not require persistence.
    • Cons: Non-persistent! Once the program process terminates, all stored states are lost. Just like a computer's RAM, it is cleared upon power loss.
  2. SqliteSaver (SQLite File Storage):
    • Characteristics: Stores state snapshots in a local SQLite database file. SQLite is a lightweight, embedded database that does not require a separate server process.
    • Pros: Persistent! States are written to a disk file, so even if the program process terminates, it can be restored from the file upon the next startup. Configuration is relatively simple, and performance is sufficient for most small to medium-sized applications.
    • Cons: For ultra-large-scale, high-concurrency, or distributed deployment scenarios, SQLite might not be the best choice. File I/O performance could become a bottleneck.

In addition, LangGraph also supports more powerful persistence solutions like PostgresSaver and RedisSaver, which are suitable for production environments with higher requirements for performance, scalability, and high availability. However, for our current "AI Content Agency" project, SqliteSaver is more than sufficient and perfectly demonstrates the core concepts of persistent storage.
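To make the difference between in-memory and file-backed storage concrete, here is a minimal sketch that uses only Python's standard sqlite3 module. It does not reproduce SqliteSaver's real schema or API: the `snapshots` table, the `open_store`, `save_snapshot`, and `load_latest` helpers, and the file name are assumptions made for this illustration.

```python
import json
import os
import sqlite3
import tempfile


def open_store(path: str) -> sqlite3.Connection:
    """Open a database and make sure the (hypothetical) snapshots table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, thread_id TEXT, state TEXT)"
    )
    return conn


def save_snapshot(conn: sqlite3.Connection, thread_id: str, state: dict) -> None:
    """Append one JSON-serialized state snapshot for the given thread."""
    conn.execute(
        "INSERT INTO snapshots (thread_id, state) VALUES (?, ?)",
        (thread_id, json.dumps(state)),
    )
    conn.commit()


def load_latest(conn: sqlite3.Connection, thread_id: str):
    """Return the most recent snapshot for the thread, or None if there is none."""
    row = conn.execute(
        "SELECT state FROM snapshots WHERE thread_id = ? ORDER BY id DESC LIMIT 1",
        (thread_id,),
    ).fetchone()
    return json.loads(row[0]) if row else None


db_path = os.path.join(tempfile.mkdtemp(), "demo_checkpoints.sqlite")

# "Process 1": write a snapshot, then close the connection (simulating process exit)
conn = open_store(db_path)
save_snapshot(conn, "session-1", {"current_task": "Brainstorm Topics"})
conn.close()

# "Process 2": reopen the same file and the snapshot is still there
conn = open_store(db_path)
restored = load_latest(conn, "session-1")
conn.close()

# A fresh in-memory database starts empty, so nothing from earlier runs survives
mem = open_store(":memory:")
lost = load_latest(mem, "session-1")
mem.close()
```

The file-backed store returns the saved snapshot after the connection is reopened, while the in-memory store has nothing to return. That is the same trade-off as choosing between SqliteSaver and MemorySaver.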

Mermaid Diagram: Checkpoints Workflow

Let's use a Mermaid diagram to visually see how Checkpoints work in our "AI Content Agency".

graph TD
    subgraph AI Content Agency Workflow
        A[Start Task e.g., Generate Social Media Content] --> B{Agent: Planner};
        B -- Update AgencyState --> C(Checkpoint: Save Current State Snapshot);
        C --> D{Agent: Researcher};
        D -- Update AgencyState --> E(Checkpoint: Save Current State Snapshot);
        E --> F{Agent: Writer};
        F -- Update AgencyState --> G(Checkpoint: Save Current State Snapshot);
        G --> H{Agent: Editor};
        H -- Update AgencyState --> I(Checkpoint: Save Current State Snapshot);
        I --> J[Task Completed];
    end

    subgraph Exception and Recovery Mechanism
        K[System Interruption/Error/Process Restart];
        K --> L{Restart Application};
        L -- Use the same thread_id --> M[Load AgencyState from the latest Checkpoint];
        M --> N[Resume Workflow Execution];
    end

    style C fill:#bbf,stroke:#333,stroke-width:2px,color:#000;
    style E fill:#bbf,stroke:#333,stroke-width:2px,color:#000;
    style G fill:#bbf,stroke:#333,stroke-width:2px,color:#000;
    style I fill:#bbf,stroke:#333,stroke-width:2px,color:#000;
    style M fill:#fcc,stroke:#333,stroke-width:2px,color:#000;
    style N fill:#afa,stroke:#333,stroke-width:2px,color:#000;

Diagram Explanation:

  • AI Content Agency Workflow: Our agents (Planner, Researcher, Writer, Editor) execute tasks sequentially. Whenever an Agent completes its work and updates the AgencyState, the checkpointer saves the current AgencyState in its entirety. The Checkpoint boxes in the diagram represent this automatic save; they are not nodes you add to the graph yourself.
  • Exception and Recovery Mechanism: If a system interruption or error (K) occurs during the execution of any Agent, when the application restarts (L), we can pass in the same thread_id to let LangGraph load (M) the latest AgencyState from the Checkpoint, provided the checkpoints were written by a persistent saver such as SqliteSaver. The workflow then resumes (N) from the last saved snapshot: Agents that already finished are not repeated, and only the Agent that was interrupted has to run again.

See that? Checkpoints are like giving your multi-agent system an "immortal body," greatly enhancing the system's resilience and reliability. This is an indispensable key feature for any complex, long-running AI application!
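In LangGraph, the thread_id travels inside the config dictionary passed to each run, under the "configurable" key. The sketch below only builds that dictionary; the `thread_config` helper and the example IDs are hypothetical, and the compiled graph `app` mentioned in the comments is assumed to exist elsewhere.

```python
def thread_config(thread_id: str) -> dict:
    """Build the config dict that selects a checkpoint thread."""
    return {"configurable": {"thread_id": thread_id}}


# The same ID before and after a restart points at the same saved history
first_run = thread_config("social-media-plan-001")
after_restart = thread_config("social-media-plan-001")

# A different ID is a separate session with its own checkpoints
other_session = thread_config("blog-post-002")

# With a compiled graph `app` (not defined here), the config would be passed on
# every call, for example:
#   app.invoke(inputs, config=first_run)      # initial run
#   app.invoke(None, config=after_restart)    # continue from the latest checkpoint
#   app.get_state(after_restart)              # inspect the latest saved state
```

Treating the thread_id as the session key is what lets one database file hold the checkpoints of many independent tasks.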

💻 Practical Code Walkthrough (Specific Application in the Agency Project)

Now, let's apply these principles to our "AI Universal Content Creation Agency" project. We will simulate the process of a Planner agent conducting content planning, simulate an interruption midway, and then use SqliteSaver to recover from the point of interruption.

1. Define AgencyState

First, we need an AgencyState capable of carrying the working state of our agency. To demonstrate Checkpoints, we will simplify it to only include planning tasks and completed tasks.

import operator
from typing import Annotated, TypedDict, List
from langchain_core.messages import BaseMessage # Although not used in this issue, kept for AgencyState's generality
from langgraph.graph import StateGraph, END
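from langgraph.checkpoint.memory import MemorySaver # In-memory saver, ships with langgraph
from langgraph.checkpoint.sqlite import SqliteSaver # SQLite saver; recent releases require: pip install langgraph-checkpoint-sqlite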
import os
import time
import json # Used for data serialization in SqliteSaver

# 1. Define the working state of our AI content agency
# TypedDict defines the state structure, Annotated combined with operator.add indicates the list type is cumulative
class AgencyState(TypedDict):
    """
    Represents the overall working state of the AI content agency.
    """
    # List of tasks to be planned
    # Queue of tasks still waiting to be planned. No reducer here: each node returns
    # the full remaining queue, which replaces the previous value (so it can shrink)
    planning_tasks: List[str]
    # List of completed planning tasks
    completed_plans: Annotated[List[str], operator.add]
    # Name of the task currently being processed
    current_task: str
    # Agent execution path (used for tracking and debugging)
    agent_path: Annotated[List[str], operator.add]

# Clean up any existing old database files to ensure a fresh start for each run
# In a real production environment, do not delete arbitrarily; this is just for demonstration convenience
if os.path.exists("agency_checkpoints.sqlite"):
    os.remove("agency_checkpoints.sqlite")
    print("Cleaned up the old agency_checkpoints.sqlite file.")

2. Define Agent Node (PlannerAgent)

We will create a simplified PlannerAgent that simulates the execution of planning tasks and uses time.sleep() to simulate time-consuming operations, so we can "catch" the moment it is interrupted.

# 2. Define a simplified Planner Agent node
class PlannerAgent:
    """
    Content planner agent, responsible for breaking down large content tasks into smaller planning steps.
    """
    # Simulated planning output: which subtasks each task breaks down into
    SUBTASKS = {
        "Generate Social Media Content Plan": ["Determine Target Audience", "Brainstorm Topics", "Write Draft Outline"],
        "Determine Target Audience": ["Analyze User Personas", "Define Audience Characteristics"],
        "Brainstorm Topics": ["Brainstorm Trending Topics", "Market Trend Analysis"],
        "Write Draft Outline": ["Determine Article Structure", "Assign Chapter Tasks"],
    }

    def __init__(self, name: str):
        self.name = name

    def plan(self, state: AgencyState) -> dict:
        """
        Core logic of the Planner Agent: receives the current state, plans one task, and returns a state update.
        """
        current = state['current_task']
        print(f"\n[{self.name}] Processing task: '{current}'...")
        time.sleep(1.5) # Simulate thinking time during the planning process

        # Tasks still waiting in the queue (copied so the incoming state is not mutated)
        queue = list(state.get('planning_tasks', []))
        new_tasks = self.SUBTASKS.get(current, [])
        completed = []

        if new_tasks:
            # The task breaks down further: its subtasks go to the front of the queue
            print(f"[{self.name}] Breaking down '{current}'...")
            queue = new_tasks + queue
        elif current:
            # If the current task has no further subtasks, it is considered complete
            print(f"[{self.name}] Completed task: '{current}'")
            completed = [current]

        # Take the next queued task, or clear current_task when nothing is left
        next_task = queue.pop(0) if queue else ""
        return {
            "completed_plans": completed, # Appended to the completed list by operator.add
            "planning_tasks": queue, # Replaces the to-be-planned queue
            "current_task": next_task, # Update the current task
            "agent_path": [self.name]
        }
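A node returns only a partial update, and each field's reducer decides how that update is combined with the existing state. The following sketch imitates that merge in plain Python under simplifying assumptions: the `REDUCERS` table and the `merge_update` helper are hypothetical stand-ins, not LangGraph internals.

```python
import operator
from typing import Callable, Dict, Optional

# Hypothetical reducer table mirroring part of the AgencyState annotations:
# operator.add concatenates lists, None means "overwrite with the new value"
REDUCERS: Dict[str, Optional[Callable]] = {
    "completed_plans": operator.add,
    "agent_path": operator.add,
    "current_task": None,
}


def merge_update(state: dict, update: dict) -> dict:
    """Combine a node's partial update with the existing state, field by field."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)  # cumulative field
        else:
            merged[key] = value  # plain field: last write wins
    return merged


state = {
    "completed_plans": ["Analyze User Personas"],
    "agent_path": ["Content Planner"],
    "current_task": "Define Audience Characteristics",
}
update = {
    "completed_plans": ["Define Audience Characteristics"],
    "agent_path": ["Content Planner"],
    "current_task": "",
}
merged = merge_update(state, update)

# Returning an empty list for a cumulative field leaves it unchanged
unchanged = merge_update(state, {"completed_plans": []})
```

One consequence worth noting: a field reduced with operator.add can only grow, so returning an empty list never clears it. Any list that must shrink over time needs overwrite semantics instead.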

3. Build Graph and Integrate Checkpoints

We will create a simple single-node workflow in which the PlannerAgent runs in a loop until all planning tasks are completed. The key is that we pass the saver instance (SqliteSaver or MemorySaver) as the checkpointer argument of workflow.compile(), not to the StateGraph constructor.

# 3. Build LangGraph Workflow
def create_agency_workflow(checkpointer=None):
    """
    Create the AI content agency workflow and integrate the Checkpointer as needed.
    """
    workflow = StateGraph(AgencyState)
    planner = PlannerAgent("Content Planner")

    # Add Planner node
    workflow.add_node("planner", planner.plan)

    # Define routing logic: if there are still planning tasks, continue to hand them over to the Planner; otherwise, end.
    def route_next_task(state: AgencyState):
        """
        Determine the next route based on the current state.
        """
        # Check planning_tasks and current_task to ensure correct completion judgment
        remaining_tasks = state.get('planning_tasks', [])
        current_task_val = state.get('current_task', '')

        # Nothing in progress and nothing queued: the workflow is done
        if not current_task_val and not remaining_tasks:
            print("[Router] All planning tasks are completed, workflow ends.")
            return END

        # Otherwise there is a current task or queued tasks, so hand control back to the Planner
        print("[Router] There are still tasks, continue to hand over to Planner for processing.")
        return "planner"

    workflow.set_entry_point("planner") # Set entry point to planner
    workflow.add_conditional_edges(
        "planner",
        route_next_task,
        {"planner": "planner", END: END} # Route to planner or END
    )

    # Compile the workflow and pass in the checkpointer
    app = workflow.compile(checkpointer=checkpointer)
    return app

4. Run and Recover Demo

Now, let's demonstrate how SqliteSaver achieves breakpoint resumption, as well as the non-persistence of MemorySaver.

if __name__ == "__main__":
    # --- Demonstrate SqliteSaver (Persistent Storage) ---
    print("--- Demonstrate SqliteSaver (Persistent Storage) ---")
    db_file = "agency_checkpoints.sqlite"
    # Ensure each demo starts from a clean database file