5 Practical Tips to Cut Claude Code Token Usage by 30%

I've been using Claude Code daily for the past few months. While the output quality is outstanding, the API bill at the end of the month was becoming hard to swallow. After experimenting with different approaches, I found a few habits that consistently cut my token consumption by 25% to 35% without sacrificing code quality. Here is what worked for me.

1. Put a CLAUDE.md at the Project Root

Claude Code reads CLAUDE.md automatically on startup and treats it as durable context. Without it, Claude has to re-discover your project structure in every session—costing you a significant amount of file-reading tokens.

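Keep this file under 200 lines. Its contents are loaded into context at the start of every session, so each extra line is paid for again and again, and in my experience a longer file also pushes Claude to spend extra tokens summarizing it. Here is a minimal template that works well: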

# Project: <name>

## Stack
- Language: Go 1.22 / TypeScript 5
- Framework: Gin / React 19
- DB: PostgreSQL via GORM

## Layout
- 'controller/' — HTTP handlers
- 'service/'    — business logic
- 'model/'      — DB models

## Conventions
- Use 'common.Marshal' instead of 'encoding/json'
- All new code must compile under 'go vet'
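
To keep the file inside that line budget, a small check can run before a session or in CI. This is a minimal sketch: the function name `claude_md_report` is hypothetical, and the 200-line limit is the rule of thumb from this article, not a limit enforced by Claude Code.

```python
from pathlib import Path

MAX_LINES = 200  # rule of thumb from this article, not an official limit


def claude_md_report(path: str = "CLAUDE.md", max_lines: int = MAX_LINES) -> str:
    """Return a one-line report on whether the file fits the line budget."""
    file = Path(path)
    if not file.exists():
        return f"{path}: missing"
    line_count = len(file.read_text(encoding="utf-8").splitlines())
    status = "ok" if line_count <= max_lines else "too long"
    return f"{path}: {line_count} lines ({status}, budget {max_lines})"


if __name__ == "__main__":
    print(claude_md_report())
```

Run it from the project root; if it reports "too long", trimming the Layout and Conventions sections is usually the easiest place to start.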

2 & 3. Maximize Prompt Caching

Anthropic's prompt caching feature is a game-changer: input tokens that are read from the cache cost only ~10% of the normal input price. In practice, you can keep your cache warm with these habits:

  • Don't change CLAUDE.md mid-session: Any edits to it will immediately invalidate your entire cached context.
  • Append rather than rewrite: When asking follow-up questions, append to the thread rather than editing and rewriting previous prompts.
  • Paste once, refer back: For long files, paste the code once and refer back to 'the file above' instead of re-pasting it in subsequent turns.
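
The reasoning behind these habits is that the cache matches on an unchanged prefix of the request. The toy model below is only a sketch of that idea: it treats a conversation as a list of text blocks and counts how many leading blocks are identical between two requests. The real cache works on token prefixes with its own breakpoints and expiry, and the block names here are made up for illustration.

```python
def cached_prefix_len(previous: list[str], current: list[str]) -> int:
    """Count the leading blocks that are identical in both requests."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # everything from the first difference onward is a cache miss
        count += 1
    return count


first = ["CLAUDE.md v1", "pasted file", "question 1"]

# Follow-up appended to the thread: the earlier blocks are untouched.
appended = first + ["question 2"]

# CLAUDE.md edited mid-session: the very first block changes.
rewritten = ["CLAUDE.md v2", "pasted file", "question 1", "question 2"]

print(cached_prefix_len(first, appended))   # 3
print(cached_prefix_len(first, rewritten))  # 0
```

Appending keeps all three earlier blocks reusable, while a change at the top of the context leaves nothing to reuse.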

For a 200K-token project context, my cache hit rate sits around 70%, which cuts input costs from $0.60 to roughly $0.22 per session: the uncached 30% costs $0.18, and the cached 70% adds about $0.04 at the ~10% read rate.
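
The arithmetic behind a per-session estimate like this can be sketched as follows. The function `input_cost` is a hypothetical helper, the 10% cached-read ratio is the approximate figure quoted earlier in this article, and the sketch ignores any surcharge for writing to the cache.

```python
def input_cost(full_cost: float, hit_rate: float, cached_read_ratio: float = 0.10) -> float:
    """Estimate input cost when a share of the tokens is served from the cache.

    full_cost         -- cost of the same input with no caching at all
    hit_rate          -- fraction of input tokens read from the cache (0..1)
    cached_read_ratio -- price of a cached read relative to a normal read
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    uncached_part = full_cost * (1 - hit_rate)
    cached_part = full_cost * hit_rate * cached_read_ratio
    return uncached_part + cached_part


print(round(input_cost(0.60, 0.70), 3))  # 0.222
```

With the numbers from this section, the uncached share accounts for $0.18 and the discounted cache reads add roughly another $0.04.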

4. Prefer the Read Tool Over Pasting Code

There are two primary ways to feed a file to Claude:
A) 'Here's the content: <paste 5000 lines of code>'
B) 'Read src/foo.go'

While both yield the same output, option (B) is far cheaper. Claude will only invoke its tool to read the file when absolutely necessary, often fetching just a 50-line slice rather than the entire file. With option (A), you are billed for all 5000 lines upfront, regardless of how much of it is actually processed.
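
To get a feel for the gap, the sketch below reads a 50-line window out of a generated 5000-line file and compares its size with the whole file. It is an illustration only, not how Claude Code's Read tool is implemented; `read_slice` and the generated sample file are hypothetical stand-ins.

```python
import tempfile
from itertools import islice
from pathlib import Path


def read_slice(path: str, start: int, count: int) -> list[str]:
    """Return `count` lines beginning at 1-based line number `start`."""
    with open(path, encoding="utf-8") as handle:
        return list(islice(handle, start - 1, start - 1 + count))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "foo.go"  # stand-in for src/foo.go
        sample.write_text(
            "".join(f"// line {i}\n" for i in range(1, 5001)), encoding="utf-8"
        )

        full_text = sample.read_text(encoding="utf-8")
        window = "".join(read_slice(str(sample), 1200, 50))
        share = len(window) / len(full_text)
        print(f"slice is {share:.1%} of the file's characters")
```

Character counts are only a rough proxy for tokens, but the proportion is the point: a targeted read touches a small fraction of what a full paste sends.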

5. Route Routine Tasks to Smaller Models

You don't need Opus for routine tasks like 'write a unit test for this function.' Claude Code allows you to switch models per session. Swapping to Sonnet (or Haiku for trivial edits) is perfect for mechanical tasks:

  • Generating boilerplate code
  • Adding structured logging
  • Renaming variables across a file
  • Writing simple test cases

In my workflow, Sonnet handles roughly 70% of daily edits, while I save Opus for deep reasoning tasks like architecture decisions, bug hunting, and complex refactoring.

Consider the pricing difference per million output tokens: Opus 4.7 ($75) vs Sonnet 4.5 ($15) vs Haiku 4 ($3). At those rates, routine work costs 5x less on Sonnet and 25x less on Haiku, which adds up rapidly.
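
A quick way to see what routing does to the bill is to compute a blended output price for a given mix of models. This sketch uses the per-million-token prices quoted in this article and assumes, for simplicity, that the share of edits handled by each model equals its share of output tokens; `blended_price` is a hypothetical helper.

```python
# Output prices per million tokens, as quoted in this article.
PRICE_PER_MTOK = {"opus": 75.0, "sonnet": 15.0, "haiku": 3.0}


def blended_price(mix: dict[str, float]) -> float:
    """Average output price per million tokens for a given model mix.

    `mix` maps a model name to its share of output tokens; shares must sum to 1.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(PRICE_PER_MTOK[model] * share for model, share in mix.items())


all_opus = blended_price({"opus": 1.0})
routed = blended_price({"opus": 0.3, "sonnet": 0.7})

print(round(routed, 2))                         # 33.0
print(f"{1 - routed / all_opus:.0%} cheaper")  # 56% cheaper
```

Under that assumption, sending 70% of the work to Sonnet more than halves the output cost compared with running everything on Opus.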

What Didn't Work

  • Manual Prompt 'Compression': Too much manual effort with marginal savings, and Claude often ended up missing critical context that got compressed away.
  • Ultra-cheap Third-party Relays: I tried sketchy, cheap 'Opus' endpoints twice, only to find they were open-source models disguised as Claude. The code quality dropped off a cliff.

[AgentUpdate Depth Analysis] As AI agents transition from simple wrapper interfaces to fully autonomous software engineering units like Claude Code and Cursor, managing token budget and context window size has emerged as the defining engineering challenge of the agentic era. The practical habits outlined in this article showcase the rise of "Context Engineering." By establishing CLAUDE.md as static workspace memory and utilizing prompt caching, developers are building a localized multi-tiered memory architecture. Looking ahead, the next generation of AI Agent platforms must feature built-in "cost-aware execution plans." Agents should autonomously handle model routing—delegating complex reasoning to frontier models while routing routine execution tasks to lightweight models—and dynamically prune active context windows. Mastering these memory-bound cost dynamics is a prerequisite for scaling developer-agent operations in production.
