Key facts
- ~25% success rate jump on software tasks with 10x token budget.
- ~8% of cybersecurity tasks required >10 million tokens.
- ~22% gain on math tasks up to 5 million tokens.
- A one-week human task costs billions of tokens.
- Frontier progress ~60% steeper than prior measurements.
The UK's AI Security Institute (AISI) tested frontier models across seven benchmarks with varying compute budgets. The finding: fixed budget caps systematically underestimate how capable AI agents really are.
An AI agent's performance is a curve that rises with test-time compute: the amount of computation, measured here in tokens, that an agent is allowed to spend while working on a task. Cut the budget while the curve is still climbing, and the measured score is a lower bound on what the agent can do, not its ceiling. That's what the AISI researchers set out to show in their latest work.
Key Takeaways
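- AISI found that capped compute budgets in standard benchmarks underestimate agent capabilities; frontier progress is about 60% steeper than earlier measurements suggested.
- Success rates on software tasks jumped ~25% with a 10x token budget.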
More compute, better results across the board
The effect shows up across domains. In cybersecurity, about 8 percent of tasks were only solved when the budget exceeded 10 million tokens; some even required 50 million. The newest models hit even higher scores at budgets above 100 million tokens.
On software engineering tasks (TerminalBench 2.0, SWE-Bench Pro), success rates jumped about 25 percent when the token budget went from one million to ten million. For math and academic tasks (Humanity's Last Exam), the gain was around 22 percent up to a budget of five million tokens.
Extra compute doesn't help everywhere equally. On HealthBench, a medical task benchmark, all models hit their plateau within the standard budget. According to AISI, more compute helps most where agents can verify their own work, like running code or testing an exploit. But it barely moves the needle where feedback is missing or delayed.
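To make the budget effect concrete, here is a minimal Python sketch of a budget sweep. It is not AISI's evaluation code. The function names, the `min_gain` threshold, and the per-task token counts are all hypothetical, chosen only to show how a score can keep rising with budget in one domain and flatten in another.

```python
from typing import List, Optional, Sequence, Tuple

def success_rate(tokens_to_solve: Sequence[Optional[int]], budget: int) -> float:
    """Fraction of tasks solved within `budget` tokens (None = never solved)."""
    solved = sum(1 for t in tokens_to_solve if t is not None and t <= budget)
    return solved / len(tokens_to_solve)

def sweep(tokens_to_solve: Sequence[Optional[int]],
          budgets: Sequence[int]) -> List[Tuple[int, float]]:
    """Success rate at each budget, ordered from smallest to largest budget."""
    return [(b, success_rate(tokens_to_solve, b)) for b in sorted(budgets)]

def has_plateaued(curve: Sequence[Tuple[int, float]], min_gain: float = 0.01) -> bool:
    """True if the last budget step improved the score by less than `min_gain`."""
    if len(curve) < 2:
        return False
    return curve[-1][1] - curve[-2][1] < min_gain

# Illustrative placeholder data, NOT measurements from the AISI study:
# tokens each task needed before it was solved, or None if never solved.
verifiable_tasks = [200_000, 800_000, 3_000_000, 9_000_000, 40_000_000, None]
no_feedback_tasks = [100_000, 300_000, 500_000, None, None, None]
budgets = [1_000_000, 10_000_000, 100_000_000]

verifiable_curve = sweep(verifiable_tasks, budgets)
no_feedback_curve = sweep(no_feedback_tasks, budgets)

for budget, rate in verifiable_curve:
    print(f"verifiable  budget={budget:>11,} success={rate:.2f}")
for budget, rate in no_feedback_curve:
    print(f"no feedback budget={budget:>11,} success={rate:.2f}")
```

With these made-up numbers the first curve is still climbing at the largest budget, so any score reported at a smaller cap would understate it, while the second is flat from the first budget onward.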
Human task time predicts how many tokens agents need
Another finding ties the time a human expert needs for a task to the agent's token consumption. Across 211 software engineering tasks from the research institute METR and 78 cyber tasks from AISI, this relationship follows a power law. A one-minute task costs the agent thousands of tokens. A one-hour task costs millions. A one-week task costs billions.
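A power law of the form tokens ≈ a · (human time)^b becomes a straight line in log-log space, so it can be fitted with ordinary least squares on the logarithms. The sketch below assumes that form and uses only the article's order-of-magnitude figures as placeholder points (1e3, 1e6, and 1e9 tokens, with a "week" taken as seven calendar days). These are not AISI's data points, and the fitted coefficients should not be read as the study's result.

```python
import math
from typing import Sequence, Tuple

def fit_power_law(human_minutes: Sequence[float],
                  tokens: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of tokens = a * minutes**b in log-log space."""
    xs = [math.log10(m) for m in human_minutes]
    ys = [math.log10(t) for t in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope of the log-log line = exponent
    a = 10 ** (mean_y - b * mean_x)    # intercept of the log-log line = prefactor
    return a, b

def predict_tokens(a: float, b: float, minutes: float) -> float:
    """Token estimate for a task that takes a human `minutes` minutes."""
    return a * minutes ** b

# Placeholder points from the article's rough magnitudes (assumed values):
# one minute -> thousands, one hour -> millions, one week -> billions.
minutes = [1, 60, 7 * 24 * 60]
tokens = [1e3, 1e6, 1e9]

a, b = fit_power_law(minutes, tokens)
print(f"exponent b = {b:.2f}")
print(f"estimated tokens for an 8-hour task: {predict_tokens(a, b, 8 * 60):.3g}")
```

On these placeholder points the exponent comes out above 1, meaning token cost grows faster than human task time, which is why a fixed cap bites hardest on the longest tasks.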

A fixed evaluation budget therefore cuts off the longest and hardest tasks. Failure can mean the budget was too tight, not that the agent lacked the skill. AISI points to the cyber task "The Last Ones", which takes a human expert hours and which agents fail under standard budgets.
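One way an evaluation harness could keep these two kinds of failure apart is to record whether a failed run actually hit its token cap. The following is a hypothetical sketch, not a description of AISI's tooling. The `Run` record, its field names, and the example values are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable

@dataclass
class Run:
    task: str          # task identifier
    solved: bool       # did the agent complete the task?
    tokens_used: int   # tokens consumed before the run ended
    budget: int        # token cap imposed by the evaluation

def classify(run: Run) -> str:
    """Label a run as solved, cut off by the budget, or stopped before the cap."""
    if run.solved:
        return "solved"
    if run.tokens_used >= run.budget:
        # The cap ended the run, so the result says little about capability.
        return "budget_limited"
    return "stopped_early"

def summarize(runs: Iterable[Run]) -> Counter:
    """Count runs per outcome label."""
    return Counter(classify(r) for r in runs)

# Hypothetical example runs (values are made up).
runs = [
    Run("task-a", solved=True, tokens_used=400_000, budget=1_000_000),
    Run("task-b", solved=False, tokens_used=1_000_000, budget=1_000_000),
    Run("task-c", solved=False, tokens_used=250_000, budget=1_000_000),
]

summary = summarize(runs)
print(dict(summary))
```

Runs labeled `budget_limited` are the natural candidates for a rerun with a larger cap before the task is counted as a capability failure.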
According to The Decoder, the implication is that actual progress at the frontier is about 60 percent steeper than previous measurements suggested. If that holds, the gap between reported benchmark scores and real-world agent capability is widening, not closing.
What to watch
Watch for AISI's next evaluation round incorporating variable compute budgets as a standard parameter. If major labs (OpenAI, Anthropic, Google) adopt this methodology in their own benchmark submissions, expect reported capability numbers to shift upward significantly, and regulatory attention to rise with them.

Source: the-decoder.com