MirrorCode Benchmark Costs $2,600 Per Run, Challenges AI Coding Limits

Epoch AI and METR launched MirrorCode, a long-horizon coding benchmark whose tasks cost up to $2,600 per run. Claude Opus 4.7 leads with a 56% solve rate.

Ala SMITH & AI Research Desk · 3d ago · 3 min read · AI-Generated

Source: epochai.substack.com via epoch_ai (single source)

What is the MirrorCode benchmark and how does it test AI coding abilities?

Epoch AI and METR launched MirrorCode, a long-horizon coding benchmark with tasks costing up to $2,600 per run over 19 days. Claude Opus 4.7 leads with a 56% solve rate.

TL;DR

MirrorCode tasks cost up to $2,600 per run. · Claude Opus 4.7 leads with a 56% solve rate. · Hyperscaler capex to exceed cash flows by end of 2026.

Epoch AI and METR launched MirrorCode, a benchmark whose tasks cost up to $2,600 per run. It tasks AI models with rebuilding 25 real-world programs without human help.

Key facts

MirrorCode tasks cost up to $2,600 per run.
Claude Opus 4.7 leads with 56% solve rate.
Hyperscaler capex to exceed cash flows by end of 2026.
Epoch scraped 1,604 Chinese AI job postings.
Hardest MirrorCode tasks take humans weeks to months.

Epoch AI, in collaboration with METR, released MirrorCode, a long-horizon coding benchmark designed to test the upper limits of autonomous AI software engineering. According to the Epoch AI blog post, MirrorCode tasks AI models with rebuilding 25 real-world programs spanning bioinformatics, Unix utilities, cryptography, and interpreters, with no access to source code and no human in the loop. The hardest programs are estimated to take a human engineer weeks to months to complete without AI assistance.

The $2,600 Inference Run

MirrorCode breaks from existing coding benchmarks by providing a massive inference budget. Many current SWE benchmarks cap inference at $1-$10 per task with runs lasting minutes or hours. By contrast, one MirrorCode task cost $2,600 for a single run, with the AI operating for 19 days without intervention. This shift aims to measure whether models can sustain coherent reasoning over extended periods, a key requirement for real-world software engineering.
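
To put those figures side by side, here is a minimal arithmetic sketch in Python. It uses only the numbers quoted above ($2,600, 19 days, and the $1-$10 cap range). The function names are hypothetical, and the per-day figure assumes spend was spread evenly across the run, which the source does not state.

```python
# Figures quoted in the article. Everything else here is illustrative.
RUN_COST_USD = 2600.0        # cost of the single most expensive reported run
RUN_DAYS = 19                # duration of that run
TYPICAL_CAP_LOW_USD = 1.0    # low end of the per-task cap on many SWE benchmarks
TYPICAL_CAP_HIGH_USD = 10.0  # high end of that cap


def cost_per_day(total_usd: float, days: int) -> float:
    """Average daily spend, ASSUMING cost accrues evenly over the run."""
    return total_usd / days


def budget_multiple(run_cost_usd: float, cap_usd: float) -> float:
    """How many times larger the run cost is than a given per-task cap."""
    return run_cost_usd / cap_usd


if __name__ == "__main__":
    print(f"Average spend per day: ${cost_per_day(RUN_COST_USD, RUN_DAYS):.2f}")
    print(f"Multiple of a $10 cap: {budget_multiple(RUN_COST_USD, TYPICAL_CAP_HIGH_USD):.0f}x")
    print(f"Multiple of a $1 cap:  {budget_multiple(RUN_COST_USD, TYPICAL_CAP_LOW_USD):.0f}x")
```

Under that even-spend assumption, the run averages roughly $137 per day, and its total budget is between 260 and 2,600 times the caps described above.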

Claude Opus 4.7 currently leads with a 56% solve rate, indicating significant room for improvement. The benchmark is openly available for researchers to test their models.

Hyperscaler Capex Squeeze

Separately, Epoch senior researcher Isabel Juniewicz found that the world's largest hyperscalers—Microsoft, Amazon, Alphabet, Meta, and Oracle—are increasing cash capital expenditures faster than their operating cash inflows. Most have already turned to external financing to fund AI infrastructure investments, or are considering doing so. This trend suggests that AI infrastructure spending is becoming a bet on future returns rather than a reflection of current profitability.

Chinese Lab Strategies

Epoch researchers scraped over 1,600 job postings from six major Chinese AI firms. They found that, like US labs, Chinese companies have distinct strategic personalities—some prioritize research, others application development. This granular view contrasts with the common narrative of a monolithic Chinese AI push.

Toward an AI R&D Taxonomy

Epoch also proposed a taxonomy for tracking which parts of AI research remain unautomated, aiming to quantify how close AI is to automating its own R&D. This framework could inform policy decisions about research investment and labor displacement.

Key Takeaways

Epoch AI and METR launched MirrorCode, a coding benchmark whose tasks cost up to $2,600 per run.
Claude Opus 4.7 leads with a 56% solve rate.

What to watch

Watch for follow-up benchmarks that extend MirrorCode's duration beyond 19 days, and for hyperscaler Q3 2026 earnings reports that will reveal whether external financing for AI infrastructure has increased. Also track whether Claude Opus 4.7's 56% solve rate improves with larger inference budgets.


AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

MirrorCode represents a meaningful departure from existing coding benchmarks like SWE-bench, which cap inference at trivial amounts. By allowing runs to span days and cost thousands of dollars, it tests whether models can maintain coherent reasoning over long horizons, a capability that may matter more for autonomous software agents than speed benchmarks. The 56% solve rate for Claude Opus 4.7 suggests that even frontier models struggle with sustained, multi-week software engineering tasks.

The hyperscaler capex finding is structurally important: if the largest cloud providers are borrowing to fund AI infrastructure, the ROI timeline for that spending becomes critical. If AI demand growth slows, these investments could become stranded assets.

The Chinese lab job-posting analysis adds nuance to the narrative of a unified Chinese AI push; the researchers found distinct strategic profiles among firms, mirroring the diversity seen in US labs.

Epoch's proposed taxonomy for AI R&D automation is a useful framework for tracking progress toward AI self-improvement. The key question is whether the rate of automation in research tasks accelerates as models improve, potentially creating a feedback loop that speeds up AI development further.

