
How Claude Code's 3-Tier Compaction System Saves You Money and Keeps Context

Learn how Claude Code's intelligent, three-tiered compaction system works to manage long conversations efficiently, preserving key context while optimizing for token usage and cost.

Gala Smith & AI Research Desk · 1d ago · 4 min read · AI-Generated
Source: dev.to via devto_claudecode, gn_claude_code, hn_anthropic, hn_claude_code · Multi-Source

The Problem: Infinite Conversations, Finite Windows

Claude Code conversations have no turn limit. You can work for hours—reading files, running tests, debugging—and the conversation just keeps going. But the underlying model has a fixed context window (like 200K tokens). When accumulated messages exceed this, the system must compress the conversation without losing critical context. Simple truncation fails because you lose design decisions, file paths, and error resolutions from earlier turns. A naive summarization call is expensive and can lose granular details. Claude Code solves this with a sophisticated, three-tiered system.

Tier 1: Microcompact – The Free Token Reclaimer

Microcompact is the cheapest intervention. It requires no model call. Its job is to clear stale tool results (like old read_file, shell, or grep outputs) that the model no longer needs.

How it works:

  • Time-Based Clearing: If the conversation has been idle for over an hour (the default cache TTL), it clears the content of old tool_result blocks, keeping only the 5 most recent. The text is replaced with [Old tool result content cleared]. Since the API's prompt cache is expired, there's no cost.
  • Cached Microcompact: If the cache is still warm, it uses the API's cache_edits feature to delete tool results server-side without invalidating your local cache. It tracks tool IDs and deletes the oldest ones once a threshold (default: 12 active tools) is passed.
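
The time-based clearing above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: the message structure, field names, and `microcompact` helper are all assumptions; only the placeholder text and the keep-5 default come from the article.

```python
PLACEHOLDER = "[Old tool result content cleared]"
KEEP_RECENT = 5  # the article's default: keep the 5 most recent tool results

def microcompact(messages, keep_recent=KEEP_RECENT):
    """Blank out the content of all but the most recent tool results.

    No model call is involved; this simply rewrites stale tool_result
    blocks in place so their tokens stop counting against the window.
    """
    tool_results = [m for m in messages if m.get("type") == "tool_result"]
    stale = tool_results[:-keep_recent] if keep_recent else tool_results
    for msg in stale:
        msg["content"] = PLACEHOLDER
    return messages
```

With eight tool results, the first three would be cleared and the five most recent kept verbatim.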

What this means for you: This happens automatically and silently reclaims tokens from verbose command outputs or large file reads you've already processed, keeping your context lean.

Tier 2: Full Compact – The Intelligent Summarizer

When microcompact isn't enough and token count hits a threshold, the system triggers a Full Compact. This is a dedicated model call to summarize the entire conversation.

The Trigger Threshold:
For a 200K context model, the auto-compact threshold is roughly 167K tokens. The formula reserves 20K tokens for the model to generate the summary, plus a 13K safety buffer, because the threshold check happens before each API call and a full response could land in the meantime.

effectiveWindow = contextWindow - max(maxOutputTokens, 20_000)
autoCompactThreshold = effectiveWindow - 13_000

You can override this via environment variables (e.g., CLAUDE_COMPACT_THRESHOLD_PERCENT=70) to trigger compaction earlier for testing or specific workflows.
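
The arithmetic works out as follows. This sketch reproduces the two formulas above; how the percent override maps onto the formula is an assumption, since the article only names the environment variable.

```python
def auto_compact_threshold(context_window=200_000,
                           max_output_tokens=20_000,
                           percent_override=None):
    """Compute the auto-compact trigger point per the article's formulas.

    percent_override mimics CLAUDE_COMPACT_THRESHOLD_PERCENT; treating it
    as a percentage of the full context window is an assumption.
    """
    if percent_override is not None:
        return int(context_window * percent_override / 100)
    # Reserve room for the summary generation, then subtract a 13K buffer.
    effective_window = context_window - max(max_output_tokens, 20_000)
    return effective_window - 13_000
```

For the default 200K window this yields 180,000 − 13,000 = 167,000 tokens, matching the ~167K figure above; with a 70% override it would fire at 140,000 tokens instead.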

The Fallback & Circuit Breaker:
If even the compaction request would exceed the context window, the system falls back gracefully. Crucially, if compaction fails three times in a row, a circuit breaker disables auto-compact to prevent runaway API costs. This fix addressed a prior bug that was wasting roughly 250,000 API calls per day.
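
A consecutive-failure circuit breaker like the one described can be sketched as below. The three-failure limit comes from the article; the class shape, names, and reset-on-success behavior are assumptions typical of the pattern.

```python
class CompactionCircuitBreaker:
    """Trips after N consecutive compaction failures (sketch)."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def tripped(self):
        return self.failures >= self.max_failures

    def record(self, success):
        # A success resets the streak; a failure extends it.
        self.failures = 0 if success else self.failures + 1

def try_auto_compact(breaker, compact_fn):
    """Run a compaction attempt unless the breaker has tripped."""
    if breaker.tripped:
        return None  # auto-compact disabled to stop runaway API spend
    try:
        result = compact_fn()
        breaker.record(True)
        return result
    except Exception:
        breaker.record(False)
        return None
```

Once three attempts in a row have failed, subsequent calls return immediately without issuing any API request.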

Tier 3: Session Memory Compact – The Pre-Computed Shortcut

This is the most aggressive tier. It uses pre-extracted session memory notes—key facts the system has been saving throughout the conversation—to build a summary without a new model call. It skips the expensive summarization step entirely by leveraging this continuously updated memory.
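
Conceptually, this tier just assembles the saved notes into a summary message. The sketch below illustrates the idea; the note format and message shape are assumptions, since the article does not describe the internal representation.

```python
def session_memory_compact(memory_notes, recent_messages):
    """Build a summary from pre-extracted notes, with no model call.

    memory_notes: key facts saved incrementally during the session.
    recent_messages: the tail of the conversation to keep verbatim.
    """
    summary = "Session summary (from saved notes):\n" + "\n".join(
        f"- {note}" for note in memory_notes
    )
    # The summary replaces the older history; recent turns stay intact.
    return [{"role": "user", "content": summary}] + recent_messages
```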

How Token Counting Works (And Why It Matters)

The system uses tokenCountWithEstimation to decide when to act.

  1. Finds the last API response and uses its exact usage token count.
  2. Estimates new messages added after that point using a length / 4 heuristic for text, plus a 33% conservative buffer.

A key detail: it correctly handles interleaved tool calls from a single model response to avoid undercounting. The count includes all context window consumption (input, cache creation, cache read, output tokens), not just input_tokens.
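
The two-step heuristic can be sketched as follows. The function name echoes the article's tokenCountWithEstimation, but the signature and message shape here are assumptions; the length/4 estimate, the 33% buffer, and summing all usage categories come from the text.

```python
def token_count_with_estimation(messages, last_usage, last_index):
    """Exact count up to the last API response, estimate for the rest.

    last_usage: the usage dict from the most recent API response (input,
    cache creation, cache read, and output tokens all count against the
    context window, so they are summed together).
    Messages after last_index are estimated at len(text) / 4 characters
    per token, inflated by a 33% conservative buffer.
    """
    exact = sum(last_usage.values())
    new_text = "".join(m.get("content", "") for m in messages[last_index + 1:])
    estimated = (len(new_text) / 4) * 1.33
    return exact + int(estimated)
```

For example, an exact usage of 1,600 tokens plus 400 new characters would count as 1,600 + int(100 × 1.33) = 1,733 tokens.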

What You Can Do Today

  1. Trust the system, but be aware: Long, complex conversations will trigger compaction. You might see a slight pause during a Full Compact.
  2. Use session memory: Structure your work in discrete sessions. The system's ability to extract and use session memory notes makes Tier 3 compaction more effective.
  3. Monitor with env vars: Set CLAUDE_COMPACT_THRESHOLD_PERCENT=80 to trigger compaction earlier if you want to observe its behavior or ensure maximum context freshness in a critical, long-running task.
  4. Let tool results clear: Don't worry if old shell outputs vanish; the system is intelligently pruning them to save tokens and cost.

The system is designed so you can code for hours without hitting a wall. It manages the finite context window so you don't have to.


AI Analysis

**Stop worrying about context limits.** You don't need to manually summarize or start new chats. Claude Code's automatic system handles it. Focus on your task and let the tiers work.

**Structure work for session memory.** Since Tier 3 compaction uses pre-extracted notes, breaking your work into logical sessions (e.g., "setup," "debug auth error," "implement feature X") gives the system better landmarks to preserve. Use clear, descriptive language about your goals.

**Use the threshold override for critical work.** If you're embarking on a marathon debugging session or a complex refactor where every detail might matter, set `CLAUDE_COMPACT_THRESHOLD_PERCENT=75` in your environment. This triggers compaction earlier, keeping more of the *raw* conversation context available longer, at the cost of potentially more frequent summary calls.
