Issue 10 | Performance Benchmarks and Best Practices: Integrating Caveman into Your Development Workflow

⏱ Est. reading time: 17 min | Updated on 5/7/2026

🎯 Learning Objectives

After completing this issue, you will master:

  1. How to run Caveman's official Benchmark and Eval suites
  2. The Three-Arm Evaluation methodology: Why Caveman is better than "please answer briefly"
  3. A complete daily development workflow: The full chain from startup to commit
  4. Optimal mode selection strategies for different scenarios

📖 Core Content

10.1 Official Benchmark Data

Caveman's token compression effect isn't self-proclaimed; it's backed by real Claude API token count data.

| Test Prompt | Normal Tokens | Caveman Tokens | Compression Rate |
|---|---|---|---|
| React re-render explanation | 69 | 19 | 72% |
| Auth middleware bug | 89 | 23 | 74% |
| TypeScript generics tutorial | 156 | 42 | 73% |
| Express performance optimization advice | 203 | 51 | 75% |
| Docker deployment troubleshooting | 178 | 38 | 79% |
| Database index optimization | 145 | 33 | 77% |
| CSS Grid layout guide | 112 | 28 | 75% |
| Git branching strategy advice | 98 | 24 | 76% |

Statistical Summary:

  • Range: 22%-87%
  • Average: ~71-75%
  • Median: ~75%
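
The compression rate column is simply tokens saved relative to the Normal baseline. A quick sanity check of the table's arithmetic in Python:

# Verify the benchmark table's compression rates: 1 - caveman / normal
rows = [
    ("React re-render explanation", 69, 19),
    ("Auth middleware bug", 89, 23),
    ("TypeScript generics tutorial", 156, 42),
    ("Express performance optimization advice", 203, 51),
]

for name, normal, caveman in rows:
    rate = 1 - caveman / normal
    print(f"{name}: {rate:.0%}")  # 72%, 74%, 73%, 75%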

💡 Important: Caveman only affects output tokens. Thinking/reasoning tokens are completely unaffected. Caveman doesn't shrink the brain, it just shrinks the mouth.

10.2 Running Official Benchmarks

You can reproduce this data yourself:

# Clone the repository
git clone https://github.com/JuliusBrussee/caveman.git
cd caveman

# Run LLM evaluation (requires Claude CLI and a valid API Key)
uv run python evals/llm_run.py

# Analyze results offline (no API Key needed)
uv run --with tiktoken python evals/measure.py
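
Conceptually, the offline step just tokenizes saved responses and compares counts. Below is a minimal sketch of that kind of measurement; the file paths are hypothetical and this is not the repo's actual measure.py. Note that tiktoken's cl100k_base encoding only approximates Claude's tokenizer, which is why it works without an API key.

# token_compare.py - hypothetical offline comparison, not the repo's measure.py
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximation of Claude's tokenizer

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

# Hypothetical paths to saved responses for the same prompt
normal = count_tokens(open("responses/normal_01.txt").read())
caveman = count_tokens(open("responses/caveman_01.txt").read())
print(f"{normal} -> {caveman} tokens ({1 - caveman / normal:.0%} saved)")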

Three-Arm Evaluation Design

Caveman's Eval doesn't simply compare "Normal vs Caveman"; that would conflate Caveman's effect with "generic brief instructions."

graph TD
    A["Three-Arm Evaluation Design"]

    A --> B["Arm 1: Verbose<br/>(No Constraints)<br/>Claude Normal Response"]
    A --> C["Arm 2: Terse<br/>(Only 'be brief')<br/>General Brief Instruction"]
    A --> D["Arm 3: Caveman<br/>(Full Skill Rules)<br/>Structured Compression"]

    B --> E["Baseline Comparison"]
    C --> F["Proves Caveman ≠ Simply 'be brief'"]
    D --> G["Actual Compression Effect"]
    F -.->|"Comparison"| G

Why a Three-Arm Design?

If you only compare Verbose vs Caveman, you cannot distinguish whether the compression effect comes from:

  • Caveman's structured rules (the [thing] [action] [reason] pattern)
  • Or simply the fact that you told the Agent "please answer briefly"

In the three-arm design, Arm 2 (Terse) is the control group; it only says "be brief." If Caveman saves more tokens than Terse while maintaining higher accuracy, that proves Caveman's rule design itself has value and is not just a request for brevity.

Actual results: Caveman saves an additional 15-25% tokens compared to Terse mode, with higher technical accuracy.
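
The arithmetic makes the point concrete. With illustrative token counts (invented for this example, not the suite's actual numbers), a two-arm eval and a three-arm eval tell different stories:

# Illustrative three-arm comparison; counts are made up for the example
verbose, terse, caveman = 200, 70, 55

print(f"vs Verbose: {1 - caveman / verbose:.0%}")  # 72%, what a two-arm eval reports
print(f"vs Terse:   {1 - caveman / terse:.0%}")    # 21%, the extra savings only Arm 2 reveals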

10.3 Academic Background: Brevity β‰  Coarseness

A March 2026 paper, "Brevity Constraints Reverse Performance Hierarchies in Language Models", found:

graph LR
    A["Traditional Assumption<br/>More Tokens = Better Answer"] -->|"❌ Paper Disproved"| B["Experimental Results<br/>Brevity Constraint Improves Accuracy by 26%"]
    C["Large Models (Verbose)"] -->|"Add Brevity Constraint"| D["Accuracy Improvement"]
    E["Small Models (Concise)"] -->|"No Constraint"| F["Accuracy Even Higher"]
    D --> G["Conclusion: Verbosity is Noise,<br/>Not Signal"]
    F --> G

Key Findings:

  1. Brevity constraints improve accuracy by 26 percentage points (on specific benchmarks)
  2. Reverses model rankings: Smaller models that originally performed worse surpassed larger models under brevity constraints
  3. Verbosity is noise: The computational power models spend on rhetoric could be used for reasoning

This academically validates Caveman's core hypothesis: Remove the fluff, and reasoning becomes more accurate.

10.4 Complete Caveman Daily Workflow

graph TD
    A["🚀 Start Agent Session"] --> B["Hook Automatically Activates Caveman<br/>[CAVEMAN] Badge Lights Up"]
    B --> C{"Development Phase"}
    C -->|"🔨 Coding"| D["🪨 /caveman full<br/>Concise Technical Answers<br/>Troubleshooting, Writing Code"]
    C -->|"🐛 Debugging"| E["🔥 /caveman ultra<br/>Rapid Troubleshooting<br/>Minimal Text to Core Issue"]
    C -->|"📖 Learning"| F["🪶 /caveman lite<br/>Retains Full Sentences<br/>Easier Concept Understanding"]
    C -->|"🇨🇳 Chinese Projects"| G["📜 /caveman wenyan<br/>Classical Chinese Mode<br/>Most Token-Efficient for Chinese"]
    D --> H["✅ Code Modification Complete"]
    E --> H
    F --> H
    G --> H
    H --> I["🔍 /caveman-review<br/>One-Line Code Review<br/>L42: 🔴 bug: ..."]
    I --> J{"Review Passed?"}
    J -->|"❌ Issues Found"| K["Fix Issues"]
    K --> I
    J -->|"✅ Passed"| L["📝 /caveman-commit<br/>Refined Commit Message<br/>fix(auth): token <= not <"]
    L --> M["📦 git push"]
    M --> N["🗜️ /caveman:compress<br/>Compress CLAUDE.md<br/>Saves Tokens for Next Session"]
    N --> O["🎉 Done!"]

    style B fill:#FFD700
    style I fill:#87CEEB
    style L fill:#90EE90
    style N fill:#DDA0DD

10.5 Scenario × Mode Selection Matrix

| Work Scenario | Recommended Mode | Reason |
|---|---|---|
| Daily Coding | full | Balances readability and compression rate |
| Rapid Debugging | ultra | Minimal text to pinpoint root cause |
| Learning New Tech | lite | Requires more explanatory context |
| Code Review | /caveman-review | Dedicated review format |
| Git Commit | /caveman-commit | Dedicated commit format |
| Writing Documentation | Normal mode | Documentation requires full expression |
| Chinese Projects | wenyan | More token-efficient for Chinese |
| Pair Programming | lite | Colleagues also need to understand |
| CI/CD Review | ultra + review | Machine consumption, shorter is better |
| Context Compression | /caveman:compress | Compresses CLAUDE.md |

10.6 Full Workflow Comparison Across Platforms

| Workflow Step | Claude Code | Antigravity | Gemini CLI | Codex | OpenCode |
|---|---|---|---|---|---|
| 1. Session Start | Hook auto-activates | GEMINI.md rules | Extension auto | hooks.json | AGENTS.md |
| 2. Mode Switching | /caveman ultra | Natural language | /caveman ultra | $caveman ultra | Natural language |
| 3. Coding Interaction | ✅ Full Tool Calling | ✅ Full Tool Calling | ✅ Full Tool Calling | ✅ Full Tool Calling | ✅ Full Tool Calling |
| 4. Code Review | /caveman-review | Natural language | /caveman-review | $caveman-review | Natural language |
| 5. Committing Code | /caveman-commit | Natural language | /caveman-commit | $caveman-commit | Natural language |
| 6. Context Compression | /caveman:compress | Natural language | /caveman:compress | $caveman-compress | Natural language |
| 7. Status Monitoring | ✅ [CAVEMAN:MODE] | ❌ | ❌ | ❌ | ❌ |
| 8. Exiting Caveman | "stop caveman" | "stop caveman" | "stop caveman" | "stop caveman" | "stop caveman" |

10.7 Advanced Best Practices

Practice 1: CLAUDE.md Layered Strategy

~/.claude/CLAUDE.md          ← Global Caveman always-on (applies to all projects)
<project>/CLAUDE.md          ← Project-specific rules (already compressed)
<project>/CLAUDE.original.md ← Human-readable original (edit this)
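
One failure mode of this layout: you edit CLAUDE.original.md and forget to re-compress, so the agent keeps reading a stale CLAUDE.md. A hypothetical guard script (file names follow the layout above) that you could wire into a pre-commit hook:

# check_claude_md.py - hypothetical staleness guard for the layered strategy
import os
import sys

ORIGINAL, COMPRESSED = "CLAUDE.original.md", "CLAUDE.md"

if (os.path.exists(ORIGINAL) and os.path.exists(COMPRESSED)
        and os.path.getmtime(ORIGINAL) > os.path.getmtime(COMPRESSED)):
    print("CLAUDE.md is stale: CLAUDE.original.md changed, re-run /caveman:compress")
    sys.exit(1)
print("CLAUDE.md is up to date")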

Practice 2: Team-wide Unified Configuration

# Commit Caveman configuration in the project root
echo 'Terse like caveman. Technical substance exact...' >> CLAUDE.md
echo 'Terse like caveman. Technical substance exact...' >> GEMINI.md

# Ensure all team members use the same Caveman behavior
git add CLAUDE.md GEMINI.md
git commit -m "chore: add caveman always-on for team"
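
To keep the rule from silently disappearing in merges, a small hypothetical CI check can assert that both rule files still contain it:

# check_rules.py - hypothetical CI guard for the shared caveman rule
import sys
from pathlib import Path

RULE = "Terse like caveman."
missing = [name for name in ("CLAUDE.md", "GEMINI.md")
           if not Path(name).exists() or RULE not in Path(name).read_text()]

if missing:
    print("Caveman rule missing from: " + ", ".join(missing))
    sys.exit(1)
print("Caveman rule present in all rule files")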

Practice 3: CI/CD Integration

# .github/workflows/pr-review.yml (sketch; adapt to your setup)
- name: Caveman Code Review
  run: |
    # One possible approach: feed the PR diff to Claude Code in print mode
    # with the caveman-review rules, so each PR gets a one-line-per-issue review.
    git fetch origin "${{ github.base_ref }}"
    git diff "origin/${{ github.base_ref }}"... | \
      claude -p "Apply /caveman-review rules to this diff: one line per issue."

Practice 4: Combining with cavemem

# Install cavemem (memory compression)
# Combine with caveman (output compression) for dual optimization
npm install -g cavemem

# caveman compresses output → saves output tokens
# cavemem compresses memory → saves input tokens
# Combined → total token consumption reduced by 60%+

Practice 5: Customizing Caveman Rules

If you need domain-specific Caveman rules, you can create custom Skills:

<!-- .claude/skills/my-caveman/SKILL.md -->
## My Custom Caveman Rules

Base: Terse like caveman. Technical substance exact.

Additional rules for this project:
- Always mention file paths in full
- Include line numbers when discussing bugs
- Use Chinese for variable name explanations
- Keep API endpoint paths in backticks

📊 Return on Investment Summary

graph LR
    subgraph Investment["💰 Investment"]
        A1["Installation: 1 min"]
        A2["Configuration: 5 min"]
        A3["Learning: This 10-part tutorial"]
    end
    
    subgraph Return["📈 Return"]
        B1["Output Tokens: -75%"]
        B2["Input Tokens: -46%"]
        B3["Response Speed: +3x"]
        B4["Monthly Cost: -$46"]
        B5["Readability: ↑"]
    end
    
    Investment --> Return

| Metric | Without Caveman | With Caveman | Improvement |
|---|---|---|---|
| Avg. Tokens per Response | ~300 | ~80 | -73% |
| Input Tokens per Session | ~2,800 | ~1,500 | -46% |
| Daily Token Consumption | ~68,000 | ~19,200 | -72% |
| Monthly Cost (Est.) | ~$63 | ~$17 | -$46/month |
| Response Reading Time | ~15 sec | ~5 sec | -66% |
| Technical Accuracy | 100% | 100% | Unchanged |
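
The percentages follow directly from the raw counts, since cost scales linearly with tokens:

# ROI arithmetic from the table above
without, with_caveman = 68_000, 19_200            # daily tokens
reduction = 1 - with_caveman / without
print(f"Daily token reduction: {reduction:.0%}")  # 72%

# Monthly cost drops by the same ratio:
# $63 * (19_200 / 68_000) is about $17.8, in line with the table's ~$17 estimate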

πŸ“ Full Series Review

| Issue | Topic | Key Takeaways |
|---|---|---|
| 01 | What is Caveman | Token Compression Philosophy + Ecosystem Overview |
| 02 | Installation on Three Platforms | Claude Code / Antigravity / Gemini CLI Installation Comparison |
| 03 | In-depth Hooks Analysis | Auto-activation Engine + Flag File Mechanism |
| 04 | Four-Speed Modes | Lite / Full / Ultra / Classical Chinese + Switching Methods |
| 05 | /caveman Core Skill | Daily Development Practice + Response Modes |
| 06 | /caveman-commit | Refined Git Commits + Git Hook Integration |
| 07 | /caveman-review | One-Line Code Review + GitHub Actions |
| 08 | /caveman:compress | Compress CLAUDE.md + Input Token Optimization |
| 09 | Always-On Configuration | Five Platform Rule Files + Team Sharing |
| 10 | Benchmarks + Best Practices | Complete Workflow + Return on Investment |

🎓 Graduation Tasks

Complete the following tasks to become a qualified Caveman user:

  • Install Caveman on your primary Agent
  • Complete a full feature development using full mode
  • Review your own code using /caveman-review
  • Generate a commit message using /caveman-commit
  • Compress your CLAUDE.md using /caveman:compress
  • Configure Always-On to ensure automatic activation in the next session
  • (Bonus) Commit the configuration to Git so your team can also use Caveman

🔗 References