Agent Harness Scaling: EFC Predicts Success at R2 0.99 vs 0.42

New research introduces Effective Feedback Compute (EFC), which predicts agent success at R2 0.99 vs 0.42 for raw tokens. Reallocating compute by EFC lifts success from 0.27 to 0.90, roughly 3.3x, at the same budget.

Ala SMITH & AI Research Desk · May 29, 2026 · AI-Generated

Source: x.com via @omarsar0 (single source)

What is Effective Feedback Compute (EFC) and how does it improve agent harness performance?

Effective Feedback Compute (EFC) counts only actionable feedback, predicting agent success at R2 0.99 vs 0.42 for raw token counts. Reallocating compute by EFC lifts success from 0.27 to 0.90 at the same budget.

TL;DR

EFC predicts agent success at R2 0.99. · Raw token counts explain only 33-42% of the variance in agent outcomes. · Reallocation lifts success from 0.27 to 0.90.

Effective Feedback Compute (EFC) predicts agent harness success at R2 0.99, versus 0.33-0.42 for raw token counts. The metric, introduced in new research shared by @omarsar0, counts only feedback an agent can act on.

Key facts

EFC predicts success at R2 0.99 vs 0.33-0.42 for raw counts.
Reallocation lifts success from 0.27 to 0.90 at same compute.
Raw token/call counts explain only 33-42% of the variance in agent outcomes.
Metric counts only feedback the agent can actually act on.

Most agent harness tuning treats every token and tool call as equally valuable. New research, shared by AI researcher @omarsar0, argues that this assumption is wrong and proposes a better coordinate.

The work proposes Effective Feedback Compute (EFC), a metric that counts only the feedback an agent can actually act on. According to @omarsar0, raw token and tool-call counts predict agent outcomes with an R2 of only 0.33 to 0.42. EFC pushes that to 0.99.
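To make the R2 comparison concrete, here is a minimal sketch of how R2 is computed for a single predictor and why two predictors of the same outcome can score very differently. The numbers are made up for illustration and are not data from the research; `raw_tokens` and `useful_feedback` are hypothetical stand-ins for the two kinds of measurement.

```python
from statistics import mean

def r_squared(x, y):
    """R^2 of a one-variable least-squares line fit of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares vs. total sum of squares.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Illustrative values only (assumption): five runs of a hypothetical harness.
success_rate = [0.10, 0.30, 0.50, 0.70, 0.90]
raw_tokens = [40_000, 10_000, 50_000, 20_000, 60_000]  # only loosely related to success
useful_feedback = [1, 3, 5, 7, 9]                      # tracks success closely

print("raw tokens R2:     ", round(r_squared(raw_tokens, success_rate), 2))
print("useful feedback R2:", round(r_squared(useful_feedback, success_rate), 2))
```

An R2 near 1 means the predictor accounts for almost all of the variation in the outcome; an R2 of 0.33 to 0.42 leaves most of it unexplained.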

The implication is structural: once you budget by useful feedback instead of raw volume, reallocation alone lifts success from 0.27 to 0.90 at the same compute. This turns harness design from guesswork into something you can predict.
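As a quick check on the size of that lift, the two success rates reported in the source can be turned into a relative and an absolute gain. Only the two rates come from the source; the variable names are illustrative.

```python
baseline = 0.27      # success rate before reallocation (from the source)
reallocated = 0.90   # success rate after reallocation (from the source)

relative_lift = reallocated / baseline
absolute_gain = reallocated - baseline

print(f"{relative_lift:.2f}x relative, +{absolute_gain:.2f} absolute")
# prints "3.33x relative, +0.63 absolute"
```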

Why this matters more than the press release suggests: The finding mirrors a pattern across AI scaling, where adding raw volume (more tokens, more calls) yields diminishing returns unless it is allocated well. Kaplan et al. 2020 showed that language-model loss follows power laws in parameters, data, and compute, and the Chinchilla work (Hoffmann et al. 2022) showed that, for a fixed compute budget, performance depends on how compute is split between model size and training tokens. EFC similarly reframes agent evaluation as a quality-of-feedback problem, not a volume problem. The R2 jump from 0.42 to 0.99 is not incremental; if it holds up, it suggests agent benchmarking has been measuring the wrong variable.

Practical implication for builders: If your agent harness currently optimizes for total tool calls or token throughput, you are likely over-allocating compute to low-value feedback loops. Shifting to EFC-based budgeting could yield success improvements of roughly 3.3x (0.27 to 0.90 in the reported result) without increasing compute spend. According to the source, the paper provides a framework for computing EFC from existing logs, so the analysis can be applied retroactively.
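The source does not give the paper's actual EFC definition, so the following is only a sketch of what computing an EFC-like number from existing logs might look like. It assumes each logged feedback step records its token cost and has already been labeled as actionable or not; the `FeedbackEvent` schema, the `actionable` flag, and both function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class FeedbackEvent:
    """One logged feedback step. Hypothetical schema, not taken from the paper."""
    tokens: int        # compute spent on this step, measured in tokens
    actionable: bool   # assumed label: could the agent act on this feedback?

def raw_compute(events):
    """Total tokens spent, treating every step as equally valuable."""
    return sum(e.tokens for e in events)

def effective_feedback_compute(events):
    """Tokens spent only on steps flagged as actionable (an assumed proxy for EFC)."""
    return sum(e.tokens for e in events if e.actionable)

# Illustrative log entries (made up).
log = [
    FeedbackEvent(tokens=1200, actionable=True),   # e.g. a failing test with a traceback
    FeedbackEvent(tokens=3000, actionable=False),  # e.g. a repeated, uninformative tool dump
    FeedbackEvent(tokens=800, actionable=True),
]

total = raw_compute(log)
useful = effective_feedback_compute(log)
print(total, useful, round(useful / total, 2))
```

The hard part in practice is the labeling step, which this sketch takes as given: how "actionable" is decided determines what the number means.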

Limitations not emphasized in the source: The source tweet and linked paper do not disclose the agent architectures tested, the task domains, or whether EFC generalizes across tool-use, code generation, and web navigation agents. The R2 0.99 figure is striking but requires replication across diverse harnesses. The company behind the research is also unnamed.

Key Takeaways

New research introduces Effective Feedback Compute (EFC), which predicts agent success at R2 0.99 vs 0.42 for raw tokens.
Reallocating compute by EFC lifts success from 0.27 to 0.90, roughly 3.3x, at the same budget.

What to watch

Watch for replication studies on EFC across diverse agent harnesses (e.g., SWE-Bench, WebArena, ToolBench). If the R2 0.99 holds across domains, expect a shift in how agent evaluation papers report compute budgets, and possibly a new industry standard for harness design.

Source: gentic.news · May 29, 2026 · Ala SMITH

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

The EFC finding is loosely analogous to the compute-allocation insight from Kaplan et al. 2020 and Hoffmann et al. 2022. Those papers showed that, for a fixed training budget, performance depends on how compute is split between model size and training tokens, not just on the total. EFC makes the parallel claim that feeding agents more feedback than they can act on wastes compute. The R2 improvement from 0.42 to 0.99 is unusually large; most ML metric improvements are measured in single-digit percentage points. This suggests either a genuinely new axis of optimization or that the test set is narrow. Without knowing the task distribution, the headline number should be treated as a strong signal, not a settled result.

A second-order implication: if EFC becomes standard, it will change how agent papers report results. Currently, papers highlight total tool calls or token throughput. Under EFC, the relevant metric becomes 'actionable feedback per compute unit.' This could make cross-paper comparisons more meaningful, or it could introduce a new metric that is itself gameable by defining 'actionable' narrowly.

The source tweet does not name the authors or institution behind the paper. This is a red flag for reproducibility. The link to an 'academy' suggests the work may be commercially affiliated. Readers should treat the R2 0.99 as a provisional result pending independent verification.

#scaling laws #research #ai agents #agent evaluation
