Chapter 30: Advanced - Teams, CI, Extensibility

Learning Objectives

Apply this workflow to other scenarios: multi-person teams, CI, and new roles.

Multi-person Collaboration

flowchart TB
    Repo["Git Repository"] --> Shared["Team Shared (Committed to Git)"]
    Repo --> Personal["Personal (gitignore)"]

    Shared --> Sh1["CLAUDE.md"]
    Shared --> Sh2[".claude/agents/"]
    Shared --> Sh3[".claude/commands/"]
    Shared --> Sh4[".claude/hooks/"]
    Shared --> Sh5[".claude/settings.json"]
    Shared --> Sh6["openspec/"]

    Personal --> P1[".claude/settings.local.json"]
    Personal --> P2[".claude/telegram-notify.json"]
    Personal --> P3[".claude/.notify-sent"]

    style Shared fill:#c8e6c9
    style Personal fill:#fff9c4

Shared files are committed to Git; personal files are gitignored. Every new team member gets the same roles + rules immediately after cloning.

Multi-person Collaboration State Files

review/N.md         Committed to Git (review traces have audit value)
test-reports/N.md   Committed to Git
STUCK.md            Committed to Git
e2e-report.md       Committed to Git
.notify-sent        gitignore (personal state)
dist/               gitignore (runtime artifacts)

→ If A leaves work halfway through a run, B can git pull the next day, see review/N.md, and continue the run.
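
For the split above, a minimal .gitignore sketch (entries match the files listed in this chapter; adjust paths to your repo):

# .gitignore (sketch)
.claude/settings.local.json     # personal permission overrides
.claude/telegram-notify.json    # personal notification config
.claude/.notify-sent            # per-user notification state
dist/                           # runtime artifacts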

CI Integration

GitHub Actions Example:

# .github/workflows/test.yml
name: tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install -e ".[dev]"
      - run: pytest tests/ -v -m "unit or functional"
        # Does not run e2e (requires macOS GUI)

→ CI runs unit + functional tests, but not E2E.
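
For -m "unit or functional" to select anything, the markers must be registered. A minimal sketch, assuming pytest markers named unit, functional, and e2e (the names implied by the CI command above):

# pyproject.toml (sketch)
[tool.pytest.ini_options]
markers = [
    "unit: fast, isolated unit tests",
    "functional: tests that drive a real subprocess",
    "e2e: full end-to-end tests (requires macOS GUI; excluded in CI)",
]

Tests then carry @pytest.mark.unit and friends; pytest -m "unit or functional" selects only tests with one of those two markers.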

What CI Runs and Doesn't Run

Test Type                            CI Runs                   Local Runs
Unit Tests                           ✅                        ✅
Functional Tests (real subprocess)   ✅ (Linux runner)         ✅
E2E (macOS GUI)                      ✗                         ✅
Spec Consistency Check               ✅ (openspec validate)    ✅
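
The spec consistency row is one extra step in the same workflow. A sketch, assuming the openspec CLI is already installed on the runner (installation is project-specific and omitted here):

      # Fails the job if specs and pending change deltas are inconsistent
      - run: openspec validate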

Full CI Workflow:

flowchart TB
    Push["Developer git push"] --> Trigger["GitHub Actions Trigger"]
    Trigger --> Setup["Linux runner: setup Python 3.11"]
    Setup --> Install["pip install -e .[dev]"]
    Install --> Spec["openspec validate (Spec Consistency)"]
    Install --> Unit["pytest -m unit"]
    Install --> Func["pytest -m functional"]
    Spec --> Pass{"All Passed?"}
    Unit --> Pass
    Func --> Pass
    Pass -->|Yes| Green["✓ PR Status Green"]
    Pass -->|No| Red["✗ Block Merge"]
    Green --> Local["Locally Run E2E"]
    Local --> Approve["Human Review + Merge"]
    Note["⚠️ E2E cannot run on CI (macOS avfoundation)"] -.-> Trigger
    style Green fill:#c8e6c9
    style Red fill:#ffcdd2
    style Approve fill:#bbdefb

Extending Roles

Add agents according to your project needs:

.claude/agents/
├── developer.md           Core
├── tester.md              Core
├── reviewer.md            Core
├── e2e-tester.md          Core
├── architect.md           Core
├── doc-writer.md          Optional: Sync README/CHANGELOG updates
├── security-reviewer.md   Optional: Scan for vulnerabilities
├── refactor-agent.md      Optional: Long-term refactoring
├── migration-agent.md     Optional: Database schema
└── perf-tester.md         Optional: Performance regression

Each new role follows the same pattern: a permission matrix + a 4-part system prompt.
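
A skeleton for a new role at .claude/agents/doc-writer.md, following that pattern. This is a sketch: the frontmatter fields mirror Claude Code's agent file format, and the four section headings are placeholders for the 4-part structure used throughout this book:

---
name: doc-writer
description: Keeps README and CHANGELOG in sync with shipped changes
tools: Read, Grep, Glob, Edit, Write   # permission matrix: no Bash, no network
---
## 1. Role & boundaries   (what you own; what you never touch)
## 2. Inputs              (what you read: specs, tasks.md, git log)
## 3. Outputs             (what you may edit: README.md, CHANGELOG.md)
## 4. Escalation          (when to stop and report instead of guessing)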

Scenarios Not Suitable for This Workflow

✗ 1-hour one-off scripts
   → Overhead far outweighs benefits

✗ Major refactoring of an already entrenched legacy system
   → First, refactor in small steps to create spec-writable interfaces

✗ Teams tightly coupled with Jira/Notion
   → Dual system conflict, choose one first

✗ Exploratory prototypes (playing around in a Jupyter notebook)
   → Specifications will slow down experimentation

✗ Pure UI / design-driven projects
   → Specs are difficult to write with testable Scenarios

Real Value Scenarios for This Workflow

✅ Medium-sized tools/libraries (like doc2video)
✅ SaaS business logic (compliant, auditable)
✅ Long-term maintenance projects (>6 months)
✅ Multi-person collaboration
✅ Taking over projects from others
✅ AI-intensive development (>50% code written by AI)

Future Compatibility

Claude Code is evolving towards "native multi-agent" capabilities (see changelog 2.1.x series). However:

✅ Spec-driven is product-agnostic—OpenSpec can work with any agent system
✅ Role division principles are product-agnostic—any orchestrator can reuse them
✅ File-based state machine—does not depend on specific agent runtime
⚠️ Specific agent file formats may change—but migration cost is low (rewriting system prompts)
⚠️ Slash command format may change—but pseudocode logic is portable

The core methodology is transferable. Specific syntax will evolve with Claude Code upgrades.

What You Can Do Now

  • Adopt this workflow for your team (gitignore + sharing)
  • Configure CI to run tests (excluding E2E)
  • Add new role agents as needed
  • Determine which projects are not suitable for this workflow

flowchart TB
    Start["You"] --> S1["✓ Structure requirements with OpenSpec"]
    Start --> S2["✓ Design a multi-role agent system"]
    Start --> S3["✓ Write CLAUDE.md to make main Claude follow rules"]
    Start --> S4["✓ Automate the entire development pipeline with /dev"]
    Start --> S5["✓ Use hooks to prevent dangerous operations + proactive notifications"]
    Start --> S6["✓ Run a complete cycle from idea to ship"]
    Start --> S7["✓ Independently diagnose when stuck"]

    S1 & S2 & S3 & S4 & S5 & S6 & S7 --> End["From Zero to Autonomous Development Pipeline"]

    style End fill:#c8e6c9

Complete Capability List

Knowledge Layer (Part III):
  ✓ Write the proposal / design / spec / tasks quartet
  ✓ Testable format for Requirement + Scenario
  ✓ Delta operations (ADDED/MODIFIED/REMOVED/RENAMED)

Governance Layer (Part IV~VI):
  ✓ 4+ role agent design
  ✓ Escalation chain + architect as fallback
  ✓ CLAUDE.md Project Constitution
  ✓ File-based state machine
  ✓ /dev orchestration commands
  ✓ Permission + hook safety guardrails
  ✓ Stop hook notification integration

Practice (Part VII):
  ✓ Startup checklist
  ✓ Three major debugging tools
  ✓ Sandbox selection
  ✓ Teams / CI / Extension

Next Steps

  • Immediately: Run a complete cycle on your own project, applying what you've learned in each chapter.
  • In a week: Write your own CLAUDE.md, adding rules unique to your project.
  • In a month: Extend role agents to adapt the pipeline to your domain.
  • In three months: Teach this workflow to others (teaching is the best way to learn).

📎 Appendix A: 20 Q&A

The following 20 questions come from common beginner queries. Each answer is 100-200 words, providing a conclusion + brief reasoning.

About Philosophy (Q1~Q5)

Q1. Why not let a single Claude write all the code?

Simple: checks and balances. If one agent writes code, writes tests, and reviews its own code—it will write "just-enough tests to pass its own code" and give itself "favorable reviews." Three months later, you won't be able to trace the issues. Multi-roles = natural code review + test-driven development + black-box verification. Each role has an independent perspective and clear responsibilities, checking each other. This is like human organizations: developers don't review their own code, QA doesn't write product code, and tech leads oversee the big picture—division of labor isn't inefficiency, it's quality infrastructure.

Q2. What is the fundamental difference between OpenSpec and Notion / Confluence?

OpenSpec is "compiled"—specs are the system's current committed contracts, and changes are migrations (each change includes a proposal + design + tasks + deltas). Notion is "snapshot-based"—just text records, without version alignment. Specific differences: (1) OpenSpec has a delta mechanism, Notion writes full versions; (2) OpenSpec is in the same Git repository as the code, with traceable commit associations, while Notion is a separate system that can drift; (3) OpenSpec specs have machine-readable schemas (Requirement/Scenario) that AI can understand, Notion is free-form text. In short: OpenSpec is like database schema migration, Notion is like a Word document.

Q3. I'm already using Cursor / Copilot, do I still need this workflow?

They don't conflict; they solve problems at different layers. Cursor / Copilot provide "real-time assistance"—they complete, explain, and fix bugs as you write code, acting as in-IDE pair programming. This workflow is "batch autonomous"—you define specs and roles, and multi-agents run the full cycle autonomously, operating at a PM + team level of abstraction. The two can be combined: you use Cursor for details, and let /dev handle large chunks of work. If you only use Cursor, you'll be missing a layer when you "want to spend a week building a complete feature where requirements need to evolve long-term"—that layer is OpenSpec + multi-agent.

Q4. What scale is this workflow suitable for?

Minimum:  1-person MVP, but the project must have "long-term evolution potential"
          1-hour one-off scripts are not worth it
Maximum:  Tried with projects up to 10 people
          Larger projects require splitting into multiple OpenSpec repositories
Optimal:  1-3 people, 3 weeks to 6 months medium-sized projects (like doc2video)

Manually managing the details of 50+ tasks will lead to collapse; once the tools are set up, the marginal cost is extremely low. Not suitable for: pure 1-hour exploration / major refactoring of an already entrenched legacy system / teams tightly coupled with Jira/Notion (dual system conflict).

Q5. Are multi-agents very token-intensive? Will the bill explode?

It will be more expensive than a single Claude but controllable. Our doc2video project, with 13 groups and 61 tasks, is estimated to cost $15-$30 to run—the same scope manually would take 1-2 weeks of labor, costing thousands of dollars, so the ROI is extremely high. Control methods: (1) Default to Sonnet, escalate to Opus only when stuck (escalation tier, saves 5x); (2) Threshold circuit breakers prevent deadlocks; (3) Prompt caching automatically takes effect (repeatedly read specs are cached); (4) Immediately archive raw/ after a group completes, preventing context from accumulating indefinitely. Run one group at the beginning of the month to observe actual costs, calibrate expectations, then let it run.

About Practice (Q6~Q12)

Q6. Will agents "fight" each other?

Yes, but there are built-in defenses. The most common conflict: reviewer rejects, developer modifies, reviewer rejects new issues again—a tug-of-war. Two mechanisms prevent this: (1) F2 cumulative review rules—from the second round onwards, the reviewer only re-reviews the previous Required Changes, and cannot raise new issues (unless the previous fix introduced new problems); (2) Deadlock threshold—3 rejections automatically escalate to developer-deep to re-question assumptions, and if still stuck, escalate to architect for diagnosis. Ultimately, human intervention is needed approximately once every 5-10 groups, far less frequently than a fully manual process.

Q7. How do I add this workflow to my existing codebase?

Do not go back and backfill specs for everything. Specific steps: (1) openspec init + commit; (2) Reverse-engineer a specs/ that describes current capabilities—cover only the parts you are confident about, not everything; (3) Run all subsequent new features/refactorings through the change process (propose → apply → archive); (4) Let specs accumulate naturally through archiving; do not actively backfill. "Hard-filling specs" for old code is a bottomless pit—it wasn't written with testability in mind, and reverse-engineering Scenarios is extremely painful. Let specs evolve with new work.
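
As a command-level sketch (openspec init and openspec validate appear earlier in this chapter; the change name add-mp4-export is hypothetical):

# Adopting the workflow in an existing repo (sketch)
openspec init                    # scaffold openspec/ in the repo root
git add openspec/ && git commit -m "chore: adopt openspec"
# Each new feature then goes through propose -> apply -> archive;
# a change named add-mp4-export would live under
# openspec/changes/add-mp4-export/ (proposal, design, tasks + deltas)
openspec validate                # check spec consistency at any time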

Q8. How long does a change take to complete?

Roughly estimated by the number of tasks:

Scale     Number of tasks    Time
Small     5~10               30 minutes ~ 1 hour
Medium    20~30              2~4 hours
Large     50+                Half a day ~ one day

doc2video has 61 tasks and is estimated to finish the MVP in half a day. Variables: network quality, number of escalations, frequency of human intervention. The first group is usually twice as slow (environment setup); subsequent groups pick up speed. If a change is estimated to take more than 2 days, split it—OpenSpec changes work best when kept within 1 day.

Q9. Can code written by agents be shipped directly?

Technically yes, but in practice, the final gate is you. The reviewer + tester + e2e-tester already block 90% of issues, but before shipping to production, you should: (1) Run git diff and review it yourself; (2) Run it on staging once; (3) Check review/N.md for any places the reviewer marked "I'm not sure." Multi-agents block "obviously wrong" and "clearly non-compliant" issues; the remaining 10% are "style, taste, business intuition"—this is the human domain. Treat it as a trusted junior team—they can get work done, but the lead should take a look before shipping.

Q10. If it stops halfway, will context be lost?

No. The state is entirely in files (a core design of CLAUDE.md—see Ch 20). Specifically recoverable states: (1) Checkbox progress in tasks.md; (2) Round count and Required Changes in review/N.md; (3) PASS/FAIL in test-reports/N.md; (4) Diagnosis in STUCK.md. After restarting Claude Code, running /dev status will immediately show where it left off. Even if you /clear the main conversation or switch devices, all information is restored from files. This is why we firmly believe in a "file-based state machine"—in-memory state can be lost, files cannot.
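
A sketch of reading that state by hand (the file names are the ones listed above; the change path is hypothetical):

# Where did the run stop? Read it straight off the files.
grep -c "\[x\]" openspec/changes/add-mp4-export/tasks.md   # completed checkboxes
tail -n 20 review/3.md                 # latest review round + Required Changes
grep -E "PASS|FAIL" test-reports/3.md  # last recorded test verdict
cat STUCK.md 2>/dev/null               # diagnosis, if a run got stuck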

Q11. If CLAUDE.md is changed, does it need to be restarted?

It takes effect only in new sessions. The CLAUDE.md loaded in the current session is not automatically re-read—unless you /clear or restart Claude. Best practice: After modifying CLAUDE.md, immediately start a new session to verify. For major rule changes (role scheduling, state machine), a restart is essential. For minor rule changes (command cheat sheet), the current session might still work. How to check: Run /dev status and see if main Claude interprets the state according to the new rules—if it does, it has truly read them.

Q12. Can different projects share agents?

Yes, in two layers:

  • User-level (~/.claude/agents/<name>.md): Follows your account, available to all projects—suitable for generic agents like doc-writer, refactor-agent, security-reviewer.
  • Project-level (<project>/.claude/agents/): Follows the project, shared by the team—suitable for project-specific agents.

Anti-pattern: Putting project-specific rules at the user level—resulting in unexpected behavior in other projects. Recommendation: Put generic skeletons at the user level and project-specific constraints at the project level; a project agent overrides a same-name user agent (manually copy over whatever generic content it needs).
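
As a layout sketch (agent names are the examples from this chapter):

~/.claude/agents/                 # user level: follows your account
├── doc-writer.md
└── security-reviewer.md

<project>/.claude/agents/         # project level: committed, shared by the team
├── developer.md
├── tester.md
└── reviewer.md                   # same-name project agents take precedence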

About Design (Q13~Q17)

Q13. Is it wrong if agent files get longer and longer?

Yes. Over 200 lines usually indicates one of two problems:

  1. Agent responsibilities are too broad—split into multiple agents. Example: developer manages both "writing code + configuring CI," split out ci-agent.
  2. Writing what main Claude should write—e.g., putting the entire state machine into developer.md, which should be moved to CLAUDE.md.

Criterion: Can you remove this sentence without affecting the agent's behavior? If yes → delete. Agent files ideally range from 60-150 lines; exceeding this means you should review for redundancy.

Q14. Why use Opus for the reviewer, can't Sonnet work?

It can, but the effect