You've set up Claude Code with specific quality gates: code must pass tests, stay under a token budget, and maintain a certain readability score. Three months later, your failure rate spikes. Is Claude getting worse? Or has the environment shifted around it?
This is eval drift — and it's why most developer monitoring systems degenerate into noisy alerting within months. The source article from Iris Eval identifies the core problem: static thresholds decay as model providers change pricing, input distributions shift, and API behavior evolves.
The Claude Code-Specific Drift Factors
For Claude Code users, drift happens in three predictable ways:
Model updates without announcements — Like the Stanford/Berkeley study cited in the source (Chen et al., 2023) where GPT-4's code generation accuracy dropped from 52% to 10% in three months, Claude's performance characteristics can shift between versions. Your token usage patterns from Claude 3.5 Sonnet might not apply to Claude 3.7.
Project evolution — Your
CLAUDE.mdfile evolves. Your codebase grows. The types of tasks you ask Claude Code to handle change from debugging to feature development to refactoring. A cost threshold calibrated for small fixes breaks when you start asking for architectural reviews.Pricing changes — When Anthropic adjusts token pricing (as they did with the transition to Claude 3.5 Sonnet), every dollar-based threshold becomes instantly stale. Your $0.50 per task limit might have been the 95th percentile; after a price increase, it could be the 60th.
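The percentile shift from a price change is easy to see with synthetic numbers. This sketch uses a made-up log-normal cost distribution (the parameters are illustrative, not measured data) to show a fixed dollar limit sliding down the percentile scale when per-task costs rise:

```python
# Illustration (synthetic data): a price increase moves a fixed dollar
# threshold to a lower percentile of the same task distribution.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-task costs in dollars, log-normally distributed
costs = rng.lognormal(mean=-1.5, sigma=0.5, size=1000)

def percentile_of(threshold, values):
    """Percentage of tasks whose cost is at or below the threshold."""
    return 100.0 * np.mean(values <= threshold)

before = percentile_of(0.50, costs)        # where $0.50 sits today
after = percentile_of(0.50, costs * 1.4)   # same tasks, 40% price increase
print(f"$0.50 limit: p{before:.0f} before, p{after:.0f} after")
```

The absolute numbers depend entirely on the assumed distribution; the point is only that the fixed limit's percentile drops while the tasks themselves haven't changed.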
Implement Self-Monitoring Today
You don't need Iris Eval to start detecting drift. Here's what you can implement right now:
1. Log Your Claude Code Sessions
Add a simple logging layer to your Claude Code workflow:
```bash
# Add to your shell profile: wrap your Claude Code invocation in a
# logging function (adjust the command name to match your install)
claude_logged() {
  mkdir -p ~/.claude_sessions
  claude "$@" | tee -a ~/.claude_sessions/"$(date +%Y-%m-%d)".log
}

# Later, extract key metrics with grep/awk:
# token usage, success/failure flags, task type
```
2. Track Distributions, Not Just Pass/Fail
Instead of binary "passed test/didn't pass," track:
- Token usage per task type (debug vs. generate vs. refactor)
- Time to completion
- Number of follow-up clarifications needed
- Test coverage percentage for generated code
Create weekly summary reports:
```bash
# Simple analysis script: average token usage across all logged sessions
cat ~/.claude_sessions/*.log \
  | grep "tokens_used" \
  | awk '{sum+=$2; count++} END {if (count) print "Avg tokens:", sum/count}'
```
3. Set Dynamic Percentile Thresholds
Instead of "cost must be < $0.50," use "cost must be < 95th percentile of last 100 tasks." This automatically adjusts for environmental shifts.
Here's a Python snippet to calculate running percentiles:
```python
import numpy as np
from collections import deque

class DynamicThreshold:
    """Rolling percentile threshold over the last `window_size` measurements."""

    def __init__(self, window_size=100, percentile=95):
        self.window = deque(maxlen=window_size)
        self.percentile = percentile

    def add_measurement(self, value):
        self.window.append(value)

    def get_threshold(self):
        if len(self.window) < 10:
            return None  # not enough data yet
        return np.percentile(list(self.window), self.percentile)

    def check(self, value):
        threshold = self.get_threshold()
        if threshold is None:
            return True  # insufficient data: pass
        return value <= threshold

# Usage for Claude Code token monitoring
token_monitor = DynamicThreshold(window_size=100, percentile=95)

# After each Claude session:
token_monitor.add_measurement(tokens_used)
is_acceptable = token_monitor.check(current_tokens)
```
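Since each Claude Code session runs as a separate process, the rolling window has to survive between runs. A minimal sketch, persisting the window as a JSON file (the path and helper names are assumptions for illustration):

```python
# Persist the rolling measurement window between sessions.
import json
from collections import deque
from pathlib import Path

STATE = Path.home() / ".claude_sessions" / "token_window.json"  # illustrative path

def load_window(maxlen=100):
    """Rebuild the rolling window from disk at session start."""
    values = json.loads(STATE.read_text()) if STATE.exists() else []
    return deque(values, maxlen=maxlen)

def save_window(window):
    """Write the window back after recording this session's measurement."""
    STATE.parent.mkdir(parents=True, exist_ok=True)
    STATE.write_text(json.dumps(list(window)))
```

Load at the start of a session, append the new measurement, save at the end; the deque's `maxlen` keeps the file bounded to the last 100 tasks.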
When to Investigate vs. When to Adjust
The source article's key insight: distinguish between actual quality drift (Claude is getting worse) and threshold drift (the environment changed).
Investigate when:
- Failure rate spikes but token usage/cost distributions haven't changed
- Specific task types degrade while others remain stable
- Your manual review confirms quality issues
Adjust thresholds when:
- Failure rate increases but manual review shows consistent quality
- The entire distribution shifted (e.g., all tasks use 20% more tokens)
- You know an external factor changed (Anthropic pricing update, model version change)
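The investigate-vs-adjust decision above can be roughed out in code. The cutoffs here (a 10-point failure-rate jump, a 5% vs. 20% distribution shift) are arbitrary starting points of my own, not values from the source article:

```python
# Rough triage heuristic: quality drift vs. threshold drift.
import statistics

def diagnose(baseline_tokens, recent_tokens, baseline_fail_rate, recent_fail_rate):
    """Return 'investigate', 'adjust', or 'monitor' for a metrics window.

    Cutoffs are arbitrary illustrative defaults; tune them to your data.
    """
    fail_delta = recent_fail_rate - baseline_fail_rate
    dist_shift = statistics.mean(recent_tokens) / statistics.mean(baseline_tokens) - 1
    if fail_delta > 0.10 and abs(dist_shift) < 0.05:
        return "investigate"  # failures up, distribution stable: likely quality drift
    if fail_delta > 0.10 and abs(dist_shift) >= 0.20:
        return "adjust"       # whole distribution moved: recalibrate thresholds first
    return "monitor"
```

Ambiguous cases (failure rate up, distribution shifted a little) fall through to "monitor", which in practice means: do the manual review the source article recommends.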
The CLAUDE.md Guardrail Pattern
Add self-calibration instructions to your CLAUDE.md:
```markdown
## Performance Monitoring
After completing tasks, provide:
1. Estimated token usage
2. Time spent (if tracked)
3. Confidence score (1-10)
4. Any edge cases that required extra work

## Quality Thresholds
- Code must pass existing tests (static)
- Token usage must stay at or below the 95th percentile of recent similar tasks (dynamic)
- Readability score > 8/10 (static)
```
This creates a feedback loop where Claude helps track its own performance characteristics.
What This Means for Your Daily Workflow
- Stop setting absolute thresholds for token usage or cost. Use percentiles instead.
- Create a weekly review habit — 10 minutes every Monday to check Claude Code metrics.
- Version your thresholds alongside your CLAUDE.md file. When you update prompts, reset your baseline distributions.
- Isolate variables — if you change model versions (Claude 3.5 to 3.7), track metrics separately for each.
The goal isn't to build a complex monitoring system. It's to recognize that your Claude Code setup exists in a shifting environment, and your quality gates need to shift with it.
For teams building more complex agent systems, the source article mentions Iris Eval's MCP-based evaluation system. For individual developers, the simple logging and percentile-based thresholds above will catch 80% of drift issues.