gentic.news — AI News Intelligence Platform



Agentic Harness Engineering Boosts Coding Agents 7% on Terminal-Bench 2

Agentic Harness Engineering introduces a structured approach to evolving coding-agent harnesses, using revertible components, condensed experience, and falsifiable decisions. On Terminal-Bench 2, pass@1 climbs from 69.7% to 77.0% in ten iterations, beating human-designed baselines.

Gala Smith & AI Research Desk · 3h ago · 4 min read · AI-Generated

Key Takeaways

  • Agentic Harness Engineering introduces a structured approach to evolving coding-agent harnesses, using revertible components, condensed experience, and falsifiable decisions.
  • On Terminal-Bench 2, pass@1 climbs from 69.7% to 77.0% in ten iterations, beating human-designed baselines.

What Happened

LangChain's Harness Engineering: From Top 30 to Top 5 on ...

A new research paper introduces Agentic Harness Engineering, a framework that makes the evolution of coding-agent harnesses observable and controllable. The approach treats each edit as a contract that can be verified or reverted, using three layers: components as revertible files, experience as condensed evidence from millions of trajectory tokens, and decisions as falsifiable predictions checked against task outcomes.

On Terminal-Bench 2, pass@1 climbs from 69.7% to 77.0% in just ten iterations, surpassing both human-designed Codex-CLI (71.9%) and self-evolving baselines like ACE and TF-GRPO. The evolved harness also transfers across model families with +5.1 to +10.1 point gains, while using 12% fewer tokens than the seed on SWE-bench-verified.

Context

Most coding-agent harnesses today are still tuned by hand or through brittle trial-and-error self-evolution. This work provides the first credible recipe for letting the harness improve itself without drifting into noise. The framework is detailed in the paper linked from the source tweet.

How It Works

The framework operates through three layers:

  • Components as revertible files: Each part of the harness is stored as a file that can be reverted to a previous version.
  • Experience as condensed evidence: Millions of trajectory tokens are compressed into actionable evidence.
  • Decisions as falsifiable predictions: Each decision is checked against task outcomes, creating a feedback loop.

This structure allows the harness to evolve systematically, avoiding the noise and brittleness of earlier self-evolution approaches.
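The three layers above can be sketched as a single revertible, falsifiable edit step. The function, file layout, and callback names below are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of one harness-evolution step; names and structure
# are illustrative, not taken from the paper.
import os
import shutil
import tempfile

def evolve_step(harness_dir, edit, predicted_pass, evaluate):
    """Apply one harness edit as a revertible, falsifiable decision."""
    # Layer 1: components as revertible files -- snapshot before editing.
    backup = tempfile.mkdtemp(prefix="harness_")
    shutil.copytree(harness_dir, os.path.join(backup, "snapshot"))

    edit(harness_dir)                      # mutate component files in place

    # Layer 3: decisions as falsifiable predictions -- the edit commits to
    # a predicted pass rate; if task outcomes refute it, revert the files.
    observed_pass = evaluate(harness_dir)  # run tasks, measure pass@1
    if observed_pass < predicted_pass:
        shutil.rmtree(harness_dir)
        shutil.copytree(os.path.join(backup, "snapshot"), harness_dir)
        return False, observed_pass        # decision falsified, reverted
    return True, observed_pass             # decision kept
```

Because every component lives on disk as a plain file, the revert path is just a directory copy, which is what makes the loop resemble version control.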

Key Results


Metric                           Before   After               Notes
Terminal-Bench 2 pass@1          69.7%    77.0%               +7.3 points
Codex-CLI (human-designed)       71.9%    —                   Beaten by evolved harness
ACE baseline                     —        —                   Surpassed
TF-GRPO baseline                 —        —                   Surpassed
Cross-model transfer             —        +5.1 to +10.1 pts   Significant gains
SWE-bench-verified token usage   —        12% fewer tokens    Efficiency gain

Why It Matters

Harness work is the biggest hidden cost in most agent systems. This framework offers a systematic way to improve harnesses without manual tuning or noise accumulation. The cross-model transfer results suggest the approach generalizes well, and the token efficiency gain on SWE-bench-verified is a practical benefit for production systems.

gentic.news Analysis

This work addresses a critical bottleneck in the agentic coding ecosystem. As we've covered in previous articles, the gap between human-designed and self-evolved harnesses has been a persistent challenge. The Agentic Harness Engineering framework provides a structured middle ground: not fully manual, not fully random, but a guided evolution process with clear revert points.

The cross-model transfer results are particularly interesting. Most harness optimization work is model-specific, so seeing +5 to +10 point gains across model families suggests the framework captures something fundamental about the task structure rather than overfitting to a particular model's quirks. This could make it valuable for teams deploying multiple models or frequently updating their base model.

The token efficiency gain on SWE-bench-verified (12% fewer tokens) is also worth noting. In production systems where token costs are a major concern, this could translate to significant savings at scale.

One caveat: the paper reports results on Terminal-Bench 2 and SWE-bench-verified, but real-world coding tasks may present different challenges. The true test will be adoption and results from independent labs.

Frequently Asked Questions

What is Agentic Harness Engineering?

Agentic Harness Engineering is a framework for systematically evolving coding-agent harnesses. It treats each edit as a contract that can be verified or reverted, using three layers: components as revertible files, experience as condensed evidence from trajectory tokens, and decisions as falsifiable predictions checked against task outcomes.

How does Agentic Harness Engineering compare to existing methods?

On Terminal-Bench 2, the evolved harness achieves 77.0% pass@1 after ten iterations, surpassing human-designed Codex-CLI (71.9%) and self-evolving baselines like ACE and TF-GRPO. It also transfers across model families with +5.1 to +10.1 point gains.

What are the practical implications for AI developers?

This framework could reduce the hidden cost of harness tuning in agent systems. Developers can let the harness evolve systematically without manual intervention or noise accumulation, potentially improving both performance and token efficiency.

Where can I find the full paper?

The paper is available at the link shared in the source tweet: https://t.co/9fEgqwlTSf


AI Analysis

The key insight here is the structured evolution approach. Previous self-evolution methods for harnesses suffered from noise accumulation — each iteration might improve one metric while degrading another, and there was no way to revert problematic changes. By making components revertible and decisions falsifiable, this framework creates a disciplined optimization loop that resembles version control for machine learning pipelines. The fact that it beats human-designed baselines in just ten iterations suggests the manual tuning process has significant room for improvement.

From a practical standpoint, the cross-model transfer results are the most valuable finding. Most teams don't have the resources to tune harnesses separately for each model they use. A framework that can evolve a harness on one model and transfer it to others with consistent gains could be a major time and cost saver. The 12% token reduction on SWE-bench-verified is a nice bonus, though it's unclear whether this efficiency gain holds across all tasks or is specific to that benchmark.

The main limitation is that the paper focuses on two benchmarks (Terminal-Bench 2 and SWE-bench-verified). Real-world coding tasks involve more diverse environments, toolchains, and failure modes. It would be valuable to see results on more varied benchmarks or in production settings. Additionally, the framework's reliance on condensed experience from trajectory tokens raises questions about information loss — how much signal is discarded in the condensation process, and could that limit long-term improvement?
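As a toy illustration of the "condensed experience" layer discussed above, the sketch below collapses raw trajectories into per-component pass rates. The data shape and component names are hypothetical; the paper condenses millions of trajectory tokens, not simple tuples:

```python
# Toy reduction: condense many task trajectories into per-component
# evidence instead of keeping the raw trajectory tokens.
from collections import defaultdict

def condense(trajectories):
    """trajectories: list of (components_touched, passed) pairs.

    Returns a pass-rate estimate per harness component, the kind of
    compact evidence a later edit decision could be checked against.
    """
    stats = defaultdict(lambda: [0, 0])      # component -> [passes, runs]
    for components, passed in trajectories:
        for component in components:
            stats[component][1] += 1
            stats[component][0] += int(passed)
    return {c: passes / runs for c, (passes, runs) in stats.items()}
```

The point of the condensation step is that decisions are scored against this small summary rather than the full logs, which is where the information-loss question raised above comes from.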