Anthropic revealed Claude authored 80%+ of merged code as of May 2026. The company's internal metrics show task-length capability doubling every 4 months, up from every 7 months.
Key facts
- Claude authors 80%+ of Anthropic's merged code as of May 2026.
- Task-length capability doubles every 4 months, up from 7 months.
- Mythos Preview works 16+ hours autonomously per METR.
- Code speedup: 52x for Mythos Preview vs 3x for Opus 4.
- 800+ fixes shipped, cutting API errors 1,000x in one example.
Anthropic published a sweeping internal assessment of AI progress, claiming the company is 'getting very serious about recursive self-improvement' According to @kimmonismus. The analysis, shared by Anthropic researcher Kimmonismus, paints a picture of accelerating autonomous capabilities that could soon enable AI to design and build its own successor.
The Capability Curve
The headline metric: Claude now authors 80%+ of code merged into Anthropic's codebase, up from low single digits before Claude Code launched in February 2025. Engineers ship on average 8x as much code per quarter as they did in the 2021-2025 period.
Task-length capability is the most striking trend line. Opus 3 (March 2024) handled roughly 4-minute tasks. Sonnet 3.7 (a year later) managed ~90-minute tasks. Opus 4.6 (another year on) reached 12-hour tasks. METR found Claude Mythos Preview could work 'at least' 16 hours, at the top of what they can currently measure.
On SWE-bench, scores went from low single digits to saturation in two years. CORE-bench (research reproduction) went from ~20% to saturated in 15 months.
Code Quality and Speed
Claude-written code quality was worse than human in late 2025, roughly at parity now, and expected to be strictly better within the year. On code-speedup tests, Opus 4 averaged ~3x speedup (May 2025), while Mythos Preview hit ~52x (April 2026). A skilled human typically needs 4-8 hours to achieve 4x speedup.
One April 2026 example: Claude shipped 800+ fixes cutting a class of API errors 1,000x—work an engineer estimated would have taken a human four years.
Research and Safety
In an AI-safety research project, Claude agents recovered 97% of a performance gap (vs ~23% for two human researchers in a week), over 800 compute-hours and ~$18K. On picking the better 'next step' in research sessions, the best model beat the human choice 51% (November 2025, Opus 4.5) rising to 64% (April 2026, Mythos Preview).
Human comparative advantage, for now: research taste and judgment—choosing which problems matter and when an approach is a dead end.
Three Futures
Anthropic outlines three possible scenarios: the trend stalls (S-curve), which they consider least likely; compounding efficiency gains with humans still setting direction, where 100-person firms do the work of 10,000+ (the likely path); or full recursive self-improvement, where AI builds its successors and pace is set by compute—the alignment outcome they're least certain about.
A March 2026 poll of 130 research staff found the median respondent estimated ~4x output with Mythos Preview. On the hardest open-ended tasks, Claude's success rate hit 76% in May 2026, up 50 points in six months.
What to watch
Watch for Anthropic's next model release and whether task-length capability crosses 24+ hours. Also track enterprise adoption rates: if 100-person firms truly do the work of 10,000+, expect rapid shifts in AI-services spending patterns by Q4 2026.






