A new paper reports an 88.5% average relative improvement across 126 model-environment settings by adapting the runtime interface around a frozen LLM. The code-as-agent-harness thesis suggests production agent improvements should target the harness, not the model.
Key facts
- 88.5% average relative improvement across 7 environments
- 126 model-environment settings tested
- 18 backbones evaluated
- Harness from one trajectory generalizes to 17 other backbones
- Method leaves LLM frozen, modifies only runtime interface
A new preprint, shared by @omarsar0, advances the 'code-as-agent-harness' thesis: frozen LLMs wrapped in adaptive runtimes outperform fine-tuned models across deterministic environments. The paper reports an average relative improvement of 88.5% across 7 deterministic environments, 126 model-environment settings, and 18 backbones [According to @omarsar0].
Crucially, a harness learned from one model trajectory generalizes to 17 other backbones. That tells you the harness is capturing environment structure, not model-specific patterns. This finding directly challenges the prevailing assumption that agent performance gains require model-level interventions such as SFT or RLHF.
How the harness works
The method leaves the LLM untouched. Instead, it converts recurring interaction failures into reusable interventions on the harness side. The harness is a runtime layer that intercepts model outputs, applies corrections based on past failures, and re-ranks candidate actions before execution. This mirrors production patterns at companies like Anthropic and OpenAI, where 'tool-use' wrappers and 'safety classifiers' sit between the model and environment.
The unique take
If you ship agents in production, your harness work is more portable than you might assume. The paper's generalization result implies that a team building a harness for one LLM can reuse that harness across model swaps — a significant operational cost saving. This is the opposite of the current trend where teams rebuild agent scaffolds for each new model release.
Limitations
The paper only evaluates deterministic environments (coding benchmarks, grid-world tasks). Stochastic or partially-observed environments may require different harness strategies. The preprint does not disclose training compute or harness complexity, making direct cost comparisons difficult.
Key Takeaways
- Paper shows 88.5% improvement by adapting runtime interface around frozen LLM.
- Harness generalizes across 18 backbones, challenging model-centric agent improvement.
What to watch

Watch for follow-up work extending the harness approach to stochastic environments (e.g., WebShop, ALFWorld) and whether production agent teams at Anthropic or OpenAI adopt harness-first debugging as standard practice. Also track if the preprint's generalization claim replicates on proprietary enterprise backbones.









