Effective Feedback Compute (EFC) predicts agent harness success at R2 0.99, versus 0.33-0.42 for raw token counts. The metric, introduced in new research shared by @omarsar0, counts only feedback an agent can act on.
Key facts
- EFC predicts success at R2 0.99 vs 0.33-0.42 for raw counts.
- Reallocation lifts success from 0.27 to 0.90 at same compute.
- Raw token and tool-call counts explain only 33-42% of the variance in agent failure.
- Metric counts only feedback the agent can actually act on.
Most agent harness tuning treats every token and tool call as equally valuable. New research, shared by AI researcher @omarsar0, reports that this assumption is wrong and introduces a better coordinate.
The work proposes Effective Feedback Compute (EFC), a metric that counts only the feedback an agent can actually act on. According to @omarsar0, raw token and tool-call counts explain agent failure with an R2 of only 0.33 to 0.42. EFC pushes that to 0.99.
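To make the R2 comparison concrete, here is a minimal Python sketch of how the coefficient of determination is computed for a single predictor. The numbers are toy values invented for illustration, not data from the research, and the names `r_squared`, `informative`, and `noisy` are hypothetical. It only shows what it means for one variable to explain far more variance than another.

```python
from statistics import mean

def r_squared(x, y):
    """R^2 of a one-variable least-squares line fitted to (x, y)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual and total sums of squares around the fitted line and the mean.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Toy values invented for illustration only; NOT data from the research.
success = [0.10, 0.30, 0.50, 0.70, 0.90]
informative = [1.0, 2.0, 3.0, 4.0, 5.0]  # moves in lockstep with success
noisy = [3.0, 1.0, 5.0, 2.0, 4.0]        # only loosely related to success

r2_informative = r_squared(informative, success)
r2_noisy = r_squared(noisy, success)
print(f"informative: {r2_informative:.2f}, noisy: {r2_noisy:.2f}")
```

A predictor with R2 near 1 lets you forecast the outcome from the predictor alone; one with R2 near 0.3 to 0.4 leaves most of the variation unexplained.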
The implication is structural: once you budget by useful feedback instead of raw volume, reallocation alone lifts success from 0.27 to 0.90 at the same compute. This turns harness design from guesswork into something you can predict.
Why this matters more than the source suggests: The finding mirrors a pattern across AI scaling, where raw compute scaling (more tokens, more calls) has diminishing returns. Just as Kaplan et al. (2020) characterized how loss scales with parameters, data, and compute, and the Chinchilla work (Hoffmann et al., 2022) showed that compute-optimal training depends on balancing model size against training tokens rather than on raw compute alone, EFC reframes agent evaluation as a quality-of-feedback problem, not a volume problem. The R2 jump from 0.42 to 0.99 is not incremental; if it holds up, it suggests that much of agent benchmarking has been measuring the wrong variable.
Practical implication for builders: If your agent harness currently optimizes for total tool calls or token throughput, you are likely over-allocating compute to low-value feedback loops. Shifting to EFC-based budgeting could yield success improvements of roughly 3.3x (0.27 to 0.90, per the reported figures) without increasing compute spend. The paper reportedly provides a framework for computing EFC from existing logs, so the analysis can be applied retroactively.
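The source does not spell out how EFC is computed, so the following Python sketch is only a stand-in under stated assumptions. It assumes a hypothetical log schema (`FeedbackEvent`) in which each feedback event has already been labeled as actionable or not, and it treats EFC as the token total over actionable events. The paper's actual definition may differ.

```python
from dataclasses import dataclass

@dataclass
class FeedbackEvent:
    """Hypothetical log record; the schema is an assumption, not the paper's."""
    tokens: int       # tokens spent producing this piece of feedback
    actionable: bool  # assumed label: could the agent act on this feedback?

def raw_compute(events):
    """Total tokens across all feedback, regardless of usefulness."""
    return sum(e.tokens for e in events)

def effective_feedback_compute(events):
    """Stand-in for EFC: tokens spent only on actionable feedback."""
    return sum(e.tokens for e in events if e.actionable)

# Illustrative log, invented for this example.
log = [
    FeedbackEvent(tokens=400, actionable=True),
    FeedbackEvent(tokens=900, actionable=False),
    FeedbackEvent(tokens=200, actionable=True),
]

raw = raw_compute(log)
efc = effective_feedback_compute(log)
print(f"raw={raw} tokens, effective={efc} tokens ({efc / raw:.0%} of budget)")

# Ratio of the success rates reported in the source (0.27 -> 0.90).
improvement = 0.90 / 0.27
print(f"reported success ratio: {improvement:.2f}x")
```

Under this reading, reallocation means shifting budget away from events like the 900-token one that the agent cannot use. The ratio of the reported success rates works out to about 3.3x.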
Limitations not emphasized in the source: The source tweet and linked paper do not disclose the agent architectures tested, the task domains, or whether EFC generalizes across tool-use, code generation, and web navigation agents. The R2 0.99 figure is striking but requires replication across diverse harnesses. The company behind the research is also unnamed.
Key Takeaways
- New research introduces Effective Feedback Compute (EFC), which predicts agent success at R2 0.99 vs 0.33-0.42 for raw token and tool-call counts.
- Reallocating compute by EFC lifts success from 0.27 to 0.90, roughly 3.3x, at the same budget.
What to watch

Watch for replication studies on EFC across diverse agent benchmarks (e.g., SWE-Bench, WebArena, ToolBench). If the R2 of 0.99 holds across domains, expect a shift in how agent evaluation papers report compute budgets, and possibly a new industry standard for harness design.









