agent harness

30 articles about agent harness in AI news

Stanford, Meta 'Code as Agent Harness' Paper Rethinks AI Agent Design

Stanford and Meta's "Code as Agent Harness" paper proposes code-driven AI agent orchestration, potentially improving reliability over natural language prompts.

Jun 10, 2026 · 100% relevant

Meta-Stanford Survey: Code as Agent Harness Improves AI Reasoning

A survey from Meta, Stanford, and Illinois argues that AI agents work better with code as their main working layer, which it calls an agent harness.

May 25, 2026 · 89% relevant

Agent Harness Engineering: The 'OS' That Makes LLMs Useful

An analogy frames raw LLMs as CPUs that need an operating system. The agent harness, which manages tools, memory, and execution, is what creates useful applications, as illustrated by LangChain's benchmark jump.

Apr 7, 2026 · 85% relevant

MIT's 'Agent Harness' Unleashes Proactive AI That Can Independently Navigate Complex Tasks

MIT researchers have developed an 'agent harness' system that enables AI agents to proactively plan and execute multi-step tasks with minimal human intervention. The work is a step toward autonomous AI systems that can navigate complex, real-world scenarios independently.

Mar 5, 2026 · 85% relevant

Code-as-Agent Harness Thesis: 88.5% Gains Without Touching the LLM

The paper reports an 88.5% improvement from adapting the runtime interface around a frozen LLM. The harness generalizes across 18 backbones, challenging model-centric approaches to agent improvement.

May 23, 2026 · 84% relevant

Agent Harnessing: The Infrastructure That Makes AI Agents Work

A detailed technical guide argues that the model is not the hard part of building AI agents. The six-component harness — context management, memory, tools, control flow, verification, and coordination — is what separates production-grade agents from those that fail silently.

Apr 25, 2026 · 88% relevant

Agent Harness Debate: Anthropic vs. OpenAI vs. LangChain on Scaffolding

A central debate in agent engineering pits a 'thin harness' approach (Anthropic) against 'thick harness' designs (LangGraph). The infrastructure layer, not the model, is becoming the primary product differentiator.

Apr 10, 2026 · 85% relevant

Agent Harness Scaling: EFC Predicts Success at R² 0.99 vs 0.42

New research introduces Effective Feedback Compute (EFC), which predicts agent success with an R² of 0.99, versus 0.42 for raw tokens. Reallocating compute by EFC lifts success 3x at the same budget.

May 29, 2026 · 88% relevant

Anthropic Deploys Multi-Agent Harness to Scale Claude's Frontend Design & Autonomous Software Engineering

Anthropic engineers detail a multi-agent system that orchestrates multiple Claude instances to tackle complex, long-running software tasks like frontend design. The approach aims to overcome single-model context and reasoning limits.

Mar 24, 2026 · 85% relevant

Agentic Harness Engineering Boosts Coding Agents 7% on Terminal-Bench 2

Agentic Harness Engineering introduces a structured approach to evolving coding-agent harnesses, using revertible components, condensed experience, and falsifiable decisions. On Terminal-Bench 2, pass@1 climbs from 69.7% to 77.0% in ten iterations, beating human-designed baselines.

Apr 29, 2026 · 100% relevant

Meta-Harness Framework Automates AI Agent Engineering, Achieves 6x Performance Gap on Same Model

A new framework called Meta-Harness automates the optimization of AI agent harnesses—the system prompts, tools, and logic that wrap a model. By analyzing raw failure logs at scale, it improved text classification by 7.7 points while using 4x fewer tokens, demonstrating that harness engineering is a major leverage point as model capabilities converge.

Mar 30, 2026 · 91% relevant

Your AI Agent Is Only as Good as Its Harness — Here’s What That Means

An article from Towards AI emphasizes that the reliability and safety of an AI agent depend more on its controlling 'harness'—the system of protocols, tools, and observability layers—than on the underlying model. This concept is reportedly worth $2 billion but remains poorly understood by many developers.

Apr 19, 2026 · 100% relevant

Akshay Pachaar Inverts LLM Agent Architecture with 'Harness' Design

AI engineer Akshay Pachaar outlined a novel 'harness' architecture for LLM agents that externalizes intelligence into memory, skills, and protocols. He is building a minimal, didactic open-source implementation of this design.

Apr 18, 2026 · 89% relevant

Harness Engineering for AI Agents: Building Production-Ready Systems That Don’t Break

A technical guide on 'Harness Engineering'—a systematic approach to building reliable, production-ready AI agents that move beyond impressive demos. This addresses the critical industry gap where most agent pilots fail to reach deployment.

Apr 1, 2026 · 72% relevant

MiniMax M2.7 AI Agent Rewrites Its Own Harness, Achieving 9 Gold Medals on MLE Bench Lite Without Retraining

MiniMax's M2.7 agent autonomously rewrites its own operational harness—skills, memory, and workflow rules—through a self-optimization loop. After 100+ internal rounds, it earned 9 gold medals on OpenAI's MLE Bench Lite without weight updates.

Mar 31, 2026 · 95% relevant

5 Harness Internals That Changed How I Use Claude Code Daily

Rebuilding Claude Code's harness reveals that CLAUDE.md layers on a hidden base prompt, hooks can block tool calls, and subagents need abort trees—5 actionable takeaways for daily use.

Jun 25, 2026 · 100% relevant

Dynamic Workflows: A New Agent Primitive Emerges

Dynamic workflows generate harnesses on the fly for agent orchestrators, enabling branching and verified tasks across coding agents like Claude Code and Codex.

Jun 4, 2026 · 75% relevant

Grep Beats Vector Search in Agent Benchmarks, New Paper Finds

Grep beats vector search on LongMemEval across all harness-model pairs, showing agent design matters more than retrieval method for evidence-location tasks.

May 17, 2026 · 85% relevant

Claude Code's Six-Layer Architecture: Harness, Not Magic

Claude Code's six-layer architecture uses a 3-layer context compressor with a 92% threshold and a Redis-based multi-agent FSM protocol. The model is just one node in a harness.

May 10, 2026 · 100% relevant

POTEMKIN Framework Exposes Critical Trust Gap in Agentic AI Tools

A new paper formalizes Adversarial Environmental Injection (AEI), a threat model where compromised tools deceive AI agents. The POTEMKIN testing harness found agents are evaluated for performance, not skepticism, creating a critical trust gap.

Apr 22, 2026 · 75% relevant

Claude Code Reverse-Engineered: 98.4% of Codebase is Operational Harness

A reverse-engineering analysis of Claude Code reveals only 1.6% of its codebase is AI decision logic, with the rest being operational infrastructure. This challenges current agent design paradigms by prioritizing a robust deterministic harness over complex model routing.

Apr 18, 2026 · 100% relevant

Google DeepMind's AutoHarness: The AI Tool That Could Revolutionize How We Build Intelligent Systems

Google DeepMind's AutoHarness framework enables automatic testing and optimization of AI models without retraining, allowing developers to synthesize functional AI agents such as coding assistants more efficiently.

Mar 12, 2026 · 87% relevant

12-Metric Agent Eval Framework From 100+ Deployments Hits Production

A 12-metric evaluation framework for production AI agents, drawn from 100+ deployments, targets task success, cost, latency, tool use, and safety.

May 13, 2026 · 74% relevant

8-Agent System Builder: Anthropic's Simpler Approach Beat My 2-Day Build

An engineer built an 8-agent system in 2 days, and Anthropic's simpler 2-agent approach outperformed it. The lesson: a minimal agent architecture can beat complex orchestration.

May 12, 2026 · 81% relevant

Agentick Benchmark: GPT-5 Mini Tops at 0.309, No Agent Paradigm Dominates

The Agentick benchmark evaluates RL, LLM, VLM, and hybrid agents on 37 tasks. GPT-5 mini leads at 0.309 ONS, but no paradigm dominates. ASCII beats natural language.

May 11, 2026 · 98% relevant

Pylon: Self-Host Your Own AI Agent Pipeline That Fixes Sentry Errors via

Pylon is a self-hosted daemon that triggers sandboxed Claude Code agents from webhooks (Sentry, cron, chat) and reports results with human approval — no data leaves your machine.

Apr 27, 2026 · 95% relevant

Build Reusable Data Science Workflows with Claude Skills and Subagents

Claude Skills and Subagents let you package prompts into reusable modules, freeing data scientists from repetitive AI adjustments for EDA, modeling, and deployment.

Apr 26, 2026 · 99% relevant

LangFuse on Evaluating AI Agents in Production

The article outlines a practical methodology for monitoring and enhancing AI agent performance post-deployment. It emphasizes combining automated LLM-based evaluation with human feedback loops to create actionable datasets for fine-tuning.

Apr 23, 2026 · 78% relevant

Adobe, NVIDIA, WPP Launch Enterprise AI Agents for Marketing with OpenShell

NVIDIA expands collaborations with Adobe and WPP to build agentic AI systems for enterprise marketing workflows. The stack uses NVIDIA's OpenShell runtime to enforce security and policy compliance in multi-step creative and customer experience tasks.

Apr 20, 2026 · 100% relevant

Google Gemini's UI Harness Lags Behind Claude, GPT, Analyst Says

AI researcher Ethan Mollick notes the Gemini Pro 3.1 model is technically capable but hampered by a minimal user interface and tool harness, widening its gap with competitors Claude and ChatGPT.

Apr 19, 2026 · 79% relevant
