gentic.news — AI News Intelligence Platform
A bar chart comparing AI agent performance across seven benchmarks, with fixed compute budgets shown as lower bars…

AISI: Fixed compute budgets underestimate AI agents by 60%

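AISI found that standard benchmarks cap compute budgets and therefore underestimate agent capabilities: frontier progress is ~60% steeper than previous measurements suggest. On software tasks, success rates jumped ~25% with 10x the tokens.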

Source: the-decoder.com (via the_decoder) · Corroborated
How do fixed compute budgets in standard benchmarks underestimate AI agent capabilities?

AISI found standard benchmarks cap compute budgets, systematically underestimating AI agent capabilities. On software tasks, success rates jumped ~25% when token budgets rose from 1M to 10M. Overall frontier progress is ~60% steeper than previous measurements suggest.

TL;DR

  • AISI tested 7 benchmarks with varying compute budgets.
  • Success rates jumped ~25% with 10x token budget.
  • Frontier progress 60% steeper than prior measurements.

Key facts

  • ~25% success rate jump on software tasks with 10x token budget.
  • ~8% of cybersecurity tasks required >10 million tokens.
  • ~22% gain on math tasks up to 5 million tokens.
  • A one-week human task costs billions of tokens.
  • Frontier progress ~60% steeper than prior measurements.

The UK's AI Security Institute (AISI) tested frontier models across seven benchmarks with varying compute budgets. The finding: fixed budget caps systematically underestimate how capable AI agents really are.

An AI agent's performance is a curve that rises with test-time compute, the amount of processing power an agent is allowed to burn while working on a task. Cut the budget while the curve is still climbing, and the measured score tells you the minimum, not the maximum. That's what the AISI researchers set out to prove in their latest work.
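The lower-bound argument can be sketched in a few lines of Python. This is a toy model, not AISI's methodology: it assumes each task has a single hypothetical token requirement and counts as solved only if the budget covers it. The names `success_rate`, `budget_sweep`, and `TOKENS_NEEDED`, and all the numbers, are made up for illustration.

```python
from typing import Dict, Iterable, Sequence


def success_rate(tokens_needed: Sequence[float], budget: float) -> float:
    """Fraction of tasks whose (hypothetical) token requirement fits the budget."""
    if not tokens_needed:
        raise ValueError("need at least one task")
    solved = sum(1 for needed in tokens_needed if needed <= budget)
    return solved / len(tokens_needed)


# Hypothetical per-task token requirements (illustrative only, not AISI data).
# float("inf") marks a task the agent cannot solve at any budget.
TOKENS_NEEDED = [2e5, 8e5, 3e6, 7e6, 4e7, float("inf")]


def budget_sweep(budgets: Iterable[float]) -> Dict[float, float]:
    """Measured success rate at each budget cap."""
    return {b: success_rate(TOKENS_NEEDED, b) for b in budgets}


for budget, rate in budget_sweep([1e6, 1e7, 1e8]).items():
    print(f"{budget:.0e} tokens -> {rate:.0%} solved")
```

In this toy model the measured score never drops as the cap rises, so a score taken at a small cap is a floor on what the agent can do, not a ceiling.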

More compute, better results across the board

The effect shows up across domains. In cybersecurity, about 8 percent of tasks were only solved when the budget exceeded 10 million tokens; some even required 50 million. The newest models hit even higher scores at budgets above 100 million tokens.

On software engineering tasks (TerminalBench 2.0, SWE-Bench Pro), success rates jumped about 25 percent when the token budget went from one million to ten million. For math and academic tasks (Humanity's Last Exam), the gain was around 22 percent up to a budget of five million tokens.

Extra compute doesn't help everywhere equally. On HealthBench, a medical task benchmark, all models hit their plateau within the standard budget. According to AISI, more compute helps most where agents can verify their own work, like running code or testing an exploit. But it barely moves the needle where feedback is missing or delayed.

Human task time predicts how many tokens agents need

Another finding ties the time a human expert needs for a task to the agent's token consumption. Across 211 software engineering tasks from the research institute METR and 78 cyber tasks from AISI, this relationship follows a power law. A one-minute task costs the agent thousands of tokens. A one-hour task costs millions. A one-week task costs billions.
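A power law of the form tokens = a · minutes^b is a straight line in log-log space, so it can be fitted with ordinary least squares on the logarithms. The sketch below does that with the standard library only. The three data points are order-of-magnitude placeholders read off the sentence above (thousands, millions, billions), and treating "one week" as a calendar week is an assumption. They are not AISI's or METR's measurements, and the fitted `a` and `b` should not be mistaken for the published values.

```python
import math
from typing import List, Tuple


def fit_power_law(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of tokens = a * minutes**b in log-log space."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    xs = [math.log10(minutes) for minutes, _ in points]
    ys = [math.log10(tokens) for _, tokens in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("points must have distinct task durations")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope = power-law exponent
    a = 10 ** (mean_y - b * mean_x)     # intercept = prefactor
    return a, b


def predict_tokens(minutes: float, a: float, b: float) -> float:
    """Token estimate for a task that takes a human `minutes` minutes."""
    return a * minutes ** b


# Placeholder points (human minutes, agent tokens); assumptions, not real data.
POINTS = [(1, 1e3), (60, 1e6), (7 * 24 * 60, 1e9)]

a, b = fit_power_law(POINTS)
print(f"exponent b = {b:.2f}")
print(f"estimated tokens for an 8-hour task: {predict_tokens(8 * 60, a, b):.2e}")
```

With real per-task data, the same fit would show which task durations a given token cap starts to exclude.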

As the token budget increases, the success rate improves across all tasks in a benchmark. Newer models (dark red) benefit more.

A fixed evaluation budget therefore cuts off the longest and hardest tasks. Failure can mean the budget was too tight, not that the agent lacked the skill. AISI points to the cyber task "The Last Ones", which takes a human expert hours — and which agents fail under standard budgets.

According to The Decoder, the implication is that actual progress at the frontier is about 60 percent steeper than previous measurements suggested. This means the gap between reported benchmark scores and real-world agent capability is widening, not closing.

What to watch

Watch for AISI's next evaluation round incorporating variable compute budgets as a standard parameter. If major labs (OpenAI, Anthropic, Google) adopt this methodology in their own benchmark submissions, expect reported capability numbers to shift upward significantly — and with them, regulatory attention.

More compute, more AI performance. But where's the limit? | Image: AISI


Sources cited in this article

  1. AISI

AI-assisted reporting. Generated by gentic.news from 2 verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

This finding upends a core assumption in AI evaluation: that a single compute budget provides a fair comparison. The AISI data suggests that newer models benefit disproportionately from larger budgets, meaning the gap between frontier labs and smaller players is larger than benchmarks show.

The power-law relationship between human task time and token consumption is particularly important. It implies that as agents get deployed on longer-horizon tasks (multi-hour coding sessions, week-long research projects), the compute required scales non-linearly. This has direct implications for cost modeling and safety evaluation: a model that passes a short benchmark might fail catastrophically on a long task not because of capability limits but because of budget constraints.

The HealthBench plateau, however, serves as a useful counterpoint. Not all domains benefit from scaling test-time compute, suggesting that architectural innovations (like self-verification loops) may be more important than raw compute in certain domains.
