Issue 10 | Performance Benchmarks and Best Practices: Integrating Caveman into Your Development Workflow

⏱ Est. reading time: 17 min | Updated on 5/7/2026

🎯 Learning Objectives

After completing this issue, you will master:

  1. How to run Caveman's official Benchmark and Eval suites
  2. The Three-Arm Evaluation methodology: Why Caveman is better than "please answer briefly"
  3. A complete daily development workflow: The full chain from startup to commit
  4. Optimal mode selection strategies for different scenarios

📖 Core Content

10.1 Official Benchmark Data

Caveman's token compression effect isn't self-proclaimed; it's backed by real Claude API token count data.

| Test Prompt | Normal Tokens | Caveman Tokens | Compression Rate |
|---|---|---|---|
| React re-render explanation | 69 | 19 | 72% |
| Auth middleware bug | 89 | 23 | 74% |
| TypeScript generics tutorial | 156 | 42 | 73% |
| Express performance optimization advice | 203 | 51 | 75% |
| Docker deployment troubleshooting | 178 | 38 | 79% |
| Database index optimization | 145 | 33 | 77% |
| CSS Grid layout guide | 112 | 28 | 75% |
| Git branching strategy advice | 98 | 24 | 76% |

Statistical Summary:

  • Range: 22%-87%
  • Average: ~71-75%
  • Median: ~75%
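
The compression rate column is simply tokens saved relative to the Normal baseline. A quick sanity check of the table's arithmetic in Python:

# Verify the benchmark table's compression rates: 1 - caveman / normal
rows = [
    ("React re-render explanation", 69, 19),
    ("Auth middleware bug", 89, 23),
    ("TypeScript generics tutorial", 156, 42),
    ("Express performance optimization advice", 203, 51),
]

for name, normal, caveman in rows:
    rate = 1 - caveman / normal
    print(f"{name}: {rate:.0%}")  # 72%, 74%, 73%, 75%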

💡 Important: Caveman only affects output tokens. Thinking/reasoning tokens are completely unaffected. Caveman doesn't shrink the brain, it just shrinks the mouth.

10.2 Running Official Benchmarks

You can reproduce this data yourself:

# Clone the repository
git clone https://github.com/JuliusBrussee/caveman.git
cd caveman

# Run LLM evaluation (requires Claude CLI and a valid API Key)
uv run python evals/llm_run.py

# Analyze results offline (no API Key needed)
uv run --with tiktoken python evals/measure.py
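
Conceptually, the offline step just tokenizes saved responses and compares counts. Below is a minimal sketch of that kind of measurement; the file paths are hypothetical and this is not the repo's actual measure.py. Note that tiktoken's cl100k_base encoding only approximates Claude's tokenizer, which is why it works without an API key.

# token_compare.py - hypothetical offline comparison, not the repo's measure.py
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximation of Claude's tokenizer

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

# Hypothetical paths to saved responses for the same prompt
normal = count_tokens(open("responses/normal_01.txt").read())
caveman = count_tokens(open("responses/caveman_01.txt").read())
print(f"{normal} -> {caveman} tokens ({1 - caveman / normal:.0%} saved)")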

Three-Arm Evaluation Design

Caveman's Eval doesn't simply compare "Normal vs Caveman"; that would conflate Caveman's effect with "generic brief instructions."

graph TD
    A["Three-Arm Evaluation Design"]

    A --> B["Arm 1: Verbose<br/>(No Constraints)<br/>Claude Normal Response"]
    A --> C["Arm 2: Terse<br/>(Only 'be brief')<br/>General Brief Instruction"]
    A --> D["Arm 3: Caveman<br/>(Full Skill Rules)<br/>Structured Compression"]

    B --> E["Baseline Comparison"]
    C --> F["Proves Caveman ≠ Simply 'be brief'"]
    D --> G["Actual Compression Effect"]
    F -.->|"Comparison"| G

Why a Three-Arm Design?

If you only compare Verbose vs Caveman, you cannot distinguish whether the compression effect comes from:

  • Caveman's structured rules (the [thing] [action] [reason] pattern)
  • Or simply the fact that you told the Agent "please answer briefly"

In the three-arm design, Arm 2 (Terse) is the control group; it only says "be brief." If Caveman saves more tokens than Terse while maintaining higher accuracy, that proves Caveman's rule design itself has value and is not just a request for brevity.

Actual results: Caveman saves an additional 15-25% tokens compared to Terse mode, with higher technical accuracy.
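
The arithmetic makes the point concrete. With illustrative token counts (invented for this example, not the suite's actual numbers), a two-arm eval and a three-arm eval tell different stories:

# Illustrative three-arm comparison; counts are made up for the example
verbose, terse, caveman = 200, 70, 55

print(f"vs Verbose: {1 - caveman / verbose:.0%}")  # 72%, what a two-arm eval reports
print(f"vs Terse:   {1 - caveman / terse:.0%}")    # 21%, the extra savings only Arm 2 reveals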

10.3 Academic Background: Brevity β‰  Coarseness

A March 2026 paper, "Brevity Constraints Reverse Performance Hierarchies in Language Models", found:

graph LR
    A["Traditional Assumption<br/>More Tokens = Better Answer"] -->|"❌ Paper Disproved"| B["Experimental Results<br/>Brevity Constraint Improves Accuracy by 26%"]
    C["Large Models (Verbose)"] -->|"Add Brevity Constraint"| D["Accuracy Improvement"]
    E["Small Models (Concise)"] -->|"No Constraint"| F["Accuracy Even Higher"]
    D --> G["Conclusion: Verbosity is Noise,<br/>Not Signal"]
    F --> G

Key Findings:

  1. Brevity constraints improve accuracy by 26 percentage points (on specific benchmarks)
  2. Reverses model rankings: Smaller models that originally performed worse surpassed larger models under brevity constraints
  3. Verbosity is noise: The computational power models spend on rhetoric could be used for reasoning

This academically validates Caveman's core hypothesis: Remove the fluff, and reasoning becomes more accurate.

10.4 Complete Caveman Daily Workflow

graph TD
    A["🚀 Start Agent Session"] --> B["Hook Automatically Activates Caveman<br/>[CAVEMAN] Badge Lights Up"]
    B --> C{"Development Phase"}
    C -->|"🔨 Coding"| D["🪨 /caveman full<br/>Concise Technical Answers<br/>Troubleshooting, Writing Code"]
    C -->|"🐛 Debugging"| E["🔥 /caveman ultra<br/>Rapid Troubleshooting<br/>Minimal Text to Core Issue"]
    C -->|"📖 Learning"| F["🪶 /caveman lite<br/>Retains Full Sentences<br/>Easier Concept Understanding"]
    C -->|"🇨🇳 Chinese Projects"| G["📜 /caveman wenyan<br/>Classical Chinese Mode<br/>Most Token-Efficient for Chinese"]
    D --> H["✅ Code Modification Complete"]
    E --> H
    F --> H
    G --> H
    H --> I["🔍 /caveman-review<br/>One-Line Code Review<br/>L42: 🔴 bug: ..."]
    I --> J{"Review Passed?"}
    J -->|"❌ Issues Found"| K["Fix Issues"]
    K --> I
    J -->|"✅ Passed"| L["📝 /caveman-commit<br/>Refined Commit Message<br/>fix(auth): token <= not <"]
    L --> M["📦 git push"]
    M --> N["🗜️ /caveman:compress<br/>Compress CLAUDE.md<br/>Saves Tokens for Next Session"]
    N --> O["🎉 Done!"]

    style B fill:#FFD700
    style I fill:#87CEEB
    style L fill:#90EE90
    style N fill:#DDA0DD

10.5 Scenario × Mode Selection Matrix

| Work Scenario | Recommended Mode | Reason |
|---|---|---|
| Daily Coding | full | Balances readability and compression rate |
| Rapid Debugging | ultra | Minimal text to pinpoint root cause |
| Learning New Tech | lite | Requires more explanatory context |
| Code Review | /caveman-review | Dedicated review format |
| Git Commit | /caveman-commit | Dedicated commit format |
| Writing Documentation | Normal mode | Documentation requires full expression |
| Chinese Projects | wenyan | More token-efficient for Chinese |
| Pair Programming | lite | Colleagues also need to understand |
| CI/CD Review | ultra + review | Machine consumption, shorter is better |
| Context Compression | /caveman:compress | Compresses CLAUDE.md |

10.6 Full Workflow Comparison Across Platforms

| Workflow Step | Claude Code | Antigravity | Gemini CLI | Codex | OpenCode |
|---|---|---|---|---|---|
| 1. Session Start | Hook auto-activates | GEMINI.md rules | Extension auto | hooks.json | AGENTS.md |
| 2. Mode Switching | /caveman ultra | Natural language | /caveman ultra | $caveman ultra | Natural language |
| 3. Coding Interaction | ✅ Full Tool Calling | ✅ Full Tool Calling | ✅ Full Tool Calling | ✅ Full Tool Calling | ✅ Full Tool Calling |
| 4. Code Review | /caveman-review | Natural language | /caveman-review | $caveman-review | Natural language |
| 5. Committing Code | /caveman-commit | Natural language | /caveman-commit | $caveman-commit | Natural language |
| 6. Context Compression | /caveman:compress | Natural language | /caveman:compress | $caveman-compress | Natural language |
| 7. Status Monitoring | ✅ [CAVEMAN:MODE] | ❌ | ❌ | ❌ | ❌ |
| 8. Exiting Caveman | "stop caveman" | "stop caveman" | "stop caveman" | "stop caveman" | "stop caveman" |

10.7 Advanced Best Practices

Practice 1: CLAUDE.md Layered Strategy

~/.claude/CLAUDE.md          ← Global Caveman always-on (applies to all projects)
<project>/CLAUDE.md          ← Project-specific rules (already compressed)
<project>/CLAUDE.original.md ← Human-readable original (edit this)
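
One failure mode of this layout: you edit CLAUDE.original.md and forget to re-compress, so the agent keeps reading a stale CLAUDE.md. A hypothetical guard script (file names follow the layout above) that you could wire into a pre-commit hook:

# check_claude_md.py - hypothetical staleness guard for the layered strategy
import os
import sys

ORIGINAL, COMPRESSED = "CLAUDE.original.md", "CLAUDE.md"

if (os.path.exists(ORIGINAL) and os.path.exists(COMPRESSED)
        and os.path.getmtime(ORIGINAL) > os.path.getmtime(COMPRESSED)):
    print("CLAUDE.md is stale: CLAUDE.original.md changed, re-run /caveman:compress")
    sys.exit(1)
print("CLAUDE.md is up to date")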

Practice 2: Team-wide Unified Configuration

# Commit Caveman configuration in the project root
echo 'Terse like caveman. Technical substance exact...' >> CLAUDE.md
echo 'Terse like caveman. Technical substance exact...' >> GEMINI.md

# Ensure all team members use the same Caveman behavior
git add CLAUDE.md GEMINI.md
git commit -m "chore: add caveman always-on for team"
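
To keep the rule from silently disappearing in merges, a small hypothetical CI check can assert that both rule files still contain it:

# check_rules.py - hypothetical CI guard for the shared caveman rule
import sys
from pathlib import Path

RULE = "Terse like caveman."
missing = [name for name in ("CLAUDE.md", "GEMINI.md")
           if not Path(name).exists() or RULE not in Path(name).read_text()]

if missing:
    print("Caveman rule missing from: " + ", ".join(missing))
    sys.exit(1)
print("Caveman rule present in all rule files")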

Practice 3: CI/CD Integration

# .github/workflows/pr-review.yml (sketch; adapt to your setup)
- name: Caveman Code Review
  run: |
    # One possible approach: feed the PR diff to Claude Code in print mode
    # with the caveman-review rules, so each PR gets a one-line-per-issue review.
    git fetch origin "${{ github.base_ref }}"
    git diff "origin/${{ github.base_ref }}"... | \
      claude -p "Apply /caveman-review rules to this diff: one line per issue."

Practice 4: Combining with cavemem

# Install cavemem (memory compression)
# Combine with caveman (output compression) for dual optimization
npm install -g cavemem

# caveman compresses output → saves output tokens
# cavemem compresses memory → saves input tokens
# Combined → total token consumption reduced by 60%+

Practice 5: Customizing Caveman Rules

If you need domain-specific Caveman rules, you can create custom Skills:

<!-- .claude/skills/my-caveman/SKILL.md -->
## My Custom Caveman Rules

Base: Terse like caveman. Technical substance exact.

Additional rules for this project:
- Always mention file paths in full
- Include line numbers when discussing bugs
- Use Chinese for variable name explanations
- Keep API endpoint paths in backticks

📊 Return on Investment Summary

graph LR
    subgraph Investment["💰 Investment"]
        A1["Installation: 1 min"]
        A2["Configuration: 5 min"]
        A3["Learning: This 10-part tutorial"]
    end
    
    subgraph Return["📈 Return"]
        B1["Output Tokens: -75%"]
        B2["Input Tokens: -46%"]
        B3["Response Speed: +3x"]
        B4["Monthly Cost: -$46"]
        B5["Readability: ↑"]
    end
    
    Investment --> Return

| Metric | Without Caveman | With Caveman | Improvement |
|---|---|---|---|
| Avg. Tokens per Response | ~300 | ~80 | -73% |
| Input Tokens per Session | ~2,800 | ~1,500 | -46% |
| Daily Token Consumption | ~68,000 | ~19,200 | -72% |
| Monthly Cost (Est.) | ~$63 | ~$17 | -$46/month |
| Response Reading Time | ~15 sec | ~5 sec | -66% |
| Technical Accuracy | 100% | 100% | Unchanged |
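
The percentages follow directly from the raw counts, since cost scales linearly with tokens:

# ROI arithmetic from the table above
without, with_caveman = 68_000, 19_200            # daily tokens
reduction = 1 - with_caveman / without
print(f"Daily token reduction: {reduction:.0%}")  # 72%

# Monthly cost drops by the same ratio:
# $63 * (19_200 / 68_000) is about $17.8, in line with the table's ~$17 estimate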

πŸ“ Full Series Review

| Issue | Topic | Key Takeaways |
|---|---|---|
| 01 | What is Caveman | Token Compression Philosophy + Ecosystem Overview |
| 02 | Installation on Three Platforms | Claude Code / Antigravity / Gemini CLI Installation Comparison |
| 03 | In-depth Hooks Analysis | Auto-activation Engine + Flag File Mechanism |
| 04 | Four-Speed Modes | Lite / Full / Ultra / Classical Chinese + Switching Methods |
| 05 | /caveman Core Skill | Daily Development Practice + Response Modes |
| 06 | /caveman-commit | Refined Git Commits + Git Hook Integration |
| 07 | /caveman-review | One-Line Code Review + GitHub Actions |
| 08 | /caveman:compress | Compress CLAUDE.md + Input Token Optimization |
| 09 | Always-On Configuration | Five Platform Rule Files + Team Sharing |
| 10 | Benchmarks + Best Practices | Complete Workflow + Return on Investment |

🎓 Graduation Tasks

Complete the following tasks to become a qualified Caveman user:

  • Install Caveman on your primary Agent
  • Complete a full feature development using full mode
  • Review your own code using /caveman-review
  • Generate a commit message using /caveman-commit
  • Compress your CLAUDE.md using /caveman:compress
  • Configure Always-On to ensure automatic activation in the next session
  • (Bonus) Commit the configuration to Git so your team can also use Caveman

🔗 References