The Problem: Infinite Conversations, Finite Windows
Claude Code conversations have no turn limit. You can work for hours—reading files, running tests, debugging—and the conversation just keeps going. But the underlying model has a fixed context window (like 200K tokens). When accumulated messages exceed this, the system must compress the conversation without losing critical context. Simple truncation fails because you lose design decisions, file paths, and error resolutions from earlier turns. A naive summarization call is expensive and can lose granular details. Claude Code solves this with a sophisticated, three-tiered system.
Tier 1: Microcompact – The Free Token Reclaimer
Microcompact is the cheapest intervention. It requires no model call. Its job is to clear stale tool results (like old read_file, shell, or grep outputs) that the model no longer needs.
How it works:
- Time-Based Clearing: If the conversation has been idle for over an hour (the default cache TTL), it clears the content of old tool_result blocks, keeping only the 5 most recent. The text is replaced with [Old tool result content cleared]. Since the API's prompt cache has already expired, there's no cost.
- Cached Microcompact: If the cache is still warm, it uses the API's cache_edits feature to delete tool results server-side without invalidating your local cache. It tracks tool IDs and deletes the oldest ones once a threshold (default: 12 active tools) is passed.
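The time-based clearing pass can be sketched in a few lines. This is an illustrative reconstruction, not Claude Code's actual implementation: the function name and the message shape ({"type": "tool_result", "content": ...}) are assumptions.

```python
CLEARED_TEXT = "[Old tool result content cleared]"
KEEP_RECENT = 5  # the 5 most recent tool results survive

def microcompact(messages):
    """Blank out stale tool_result blocks, keeping the newest KEEP_RECENT.

    `messages` is a list of dicts; tool results are assumed to look like
    {"type": "tool_result", "content": "..."}.
    """
    tool_indices = [i for i, m in enumerate(messages)
                    if m.get("type") == "tool_result"]
    # Everything except the last KEEP_RECENT tool results gets cleared.
    for i in tool_indices[:-KEEP_RECENT]:
        messages[i]["content"] = CLEARED_TEXT
    return messages
```

The key property is that nothing is deleted structurally: the tool_result blocks stay in place so the transcript remains valid, but their bulky payloads are replaced with a short marker.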
What this means for you: This happens automatically and silently reclaims tokens from verbose command outputs or large file reads you've already processed, keeping your context lean.
Tier 2: Full Compact – The Intelligent Summarizer
When microcompact isn't enough and token count hits a threshold, the system triggers a Full Compact. This is a dedicated model call to summarize the entire conversation.
The Trigger Threshold:
For a 200K context model, the auto-compact threshold is roughly 167K tokens. The formula reserves 20K tokens for the model to generate the summary, plus a 13K buffer: the check happens before each API call, so a full-length response can still land between one check and the next.
effectiveWindow = contextWindow - max(maxOutputTokens, 20_000)
autoCompactThreshold = effectiveWindow - 13_000
You can override this via environment variables (e.g., CLAUDE_COMPACT_THRESHOLD_PERCENT=70) to trigger compaction earlier for testing or specific workflows.
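The two formula lines translate directly into code. Here is a small sketch; the exact semantics of CLAUDE_COMPACT_THRESHOLD_PERCENT are my assumption (a percentage of the full context window), and the function name is hypothetical.

```python
def auto_compact_threshold(context_window, max_output_tokens,
                           percent_override=None):
    """Token count at which auto-compact fires.

    Reserves room for the summary (at least 20K output tokens) plus a
    13K safety buffer for responses that arrive between checks. If a
    percentage override is set, compaction triggers at that fraction of
    the window instead (assumed semantics of the env var).
    """
    if percent_override is not None:
        return context_window * percent_override // 100
    effective_window = context_window - max(max_output_tokens, 20_000)
    return effective_window - 13_000
```

With a 200K window and a 16K max-output setting, max(16_000, 20_000) keeps the 20K reservation, giving the ~167K threshold quoted above; a 70% override would fire at 140K instead.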
The Fallback & Circuit Breaker:
If even the compaction request would exceed the context window, the system has a fallback. Crucially, if compaction fails three times in a row, a circuit breaker stops auto-compact to prevent runaway API costs. This fixed a prior issue wasting ~250,000 API calls daily.
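The circuit breaker amounts to a consecutive-failure counter. The three-strike limit comes from the text; the class shape and reset-on-success behavior are illustrative assumptions.

```python
class CompactCircuitBreaker:
    """Disable auto-compact after 3 consecutive failures."""
    MAX_FAILURES = 3

    def __init__(self):
        self.failures = 0
        self.tripped = False

    def record(self, success):
        if success:
            self.failures = 0          # any success resets the streak
        else:
            self.failures += 1
            if self.failures >= self.MAX_FAILURES:
                self.tripped = True    # stop retrying: no runaway API calls

    def allow_compact(self):
        return not self.tripped
```

Without this guard, a compaction request that deterministically fails (e.g., one that itself exceeds the context window) would be retried on every subsequent call, which is exactly the runaway-cost failure mode described above.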
Tier 3: Session Memory Compact – The Pre-Computed Shortcut
This is the most aggressive tier. It uses pre-extracted session memory notes—key facts the system has been saving throughout the conversation—to build a summary without a new model call. It skips the expensive summarization step entirely by leveraging this continuously updated memory.
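In spirit, Tier 3 replaces the summarization model call with string assembly over notes that already exist. The function and data shapes below are hypothetical, but they show why this tier is essentially free.

```python
def session_memory_compact(notes, recent_messages):
    """Build a compacted transcript from pre-extracted memory notes.

    No model call is made: the summary is assembled from `notes` (facts
    saved throughout the session) and stitched in front of the most
    recent turns, which are kept verbatim.
    """
    summary = "Session memory:\n" + "\n".join(f"- {n}" for n in notes)
    return [{"role": "user", "content": summary}] + recent_messages
```

The trade-off is fidelity: the summary is only as good as the notes extracted along the way, which is why this is described as the most aggressive tier.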
How Token Counting Works (And Why It Matters)
The system uses tokenCountWithEstimation to decide when to act.
- Finds the last API response and uses its exact usage token count.
- Estimates new messages added after that point using a length / 4 heuristic for text, plus a 33% conservative buffer.
A key detail: it correctly handles interleaved tool calls from a single model response to avoid undercounting. The count includes all context window consumption (input, cache creation, cache read, output tokens), not just input_tokens.
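The counting strategy sketches out as follows: exact numbers for everything up to the last response, a cheap heuristic for anything newer. The function name mirrors the one in the text, but the signature and usage-field handling are my reconstruction based on the Anthropic API's standard usage fields.

```python
def token_count_with_estimation(last_usage, new_texts):
    """Estimate current context size without an extra tokenizer pass.

    last_usage: the `usage` block from the last API response. All four
    fields consume context window, not just input_tokens.
    new_texts: message text added since that response.
    """
    exact = (last_usage["input_tokens"]
             + last_usage["cache_creation_input_tokens"]
             + last_usage["cache_read_input_tokens"]
             + last_usage["output_tokens"])
    # length / 4 chars-per-token heuristic, padded by a 33% buffer so
    # the estimate errs toward compacting early rather than overflowing.
    estimated = sum(len(t) // 4 for t in new_texts)
    return exact + int(estimated * 1.33)
```

Overestimating here is deliberately safe: the worst case is compacting slightly early, whereas underestimating risks a request that blows past the context window.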
What You Can Do Today
- Trust the system, but be aware: Long, complex conversations will trigger compaction. You might see a slight pause during a Full Compact.
- Use session memory: Structure your work in discrete sessions. The system's ability to extract and use session memory notes makes Tier 3 compaction more effective.
- Monitor with env vars: Set CLAUDE_COMPACT_THRESHOLD_PERCENT=80 to trigger compaction earlier if you want to observe its behavior or ensure maximum context freshness in a critical, long-running task.
- Let tool results clear: Don't worry if old shell outputs vanish; the system is intelligently pruning them to save tokens and cost.
The system is designed so you can code for hours without hitting a wall. It manages the finite context window so you don't have to.









