agent harness
30 articles about agent harness in AI news
Meta-Stanford Survey: Code as Agent Harness Improves AI Reasoning
A survey from Meta, Stanford, and Illinois argues that AI agents work better with code as their main working layer, which it calls an agent harness.
Agent Harness Engineering: The 'OS' That Makes LLMs Useful
A clear analogy frames raw LLMs as CPUs needing an operating system. The agent harness—managing tools, memory, and execution—is what creates useful applications, as proven by LangChain's benchmark jump.
MIT's 'Agent Harness' Unleashes Proactive AI That Can Independently Navigate Complex Tasks
MIT researchers have developed a groundbreaking 'agent harness' system that enables AI agents to proactively plan and execute multi-step tasks with minimal human intervention. This represents a significant leap toward truly autonomous AI systems that can navigate complex, real-world scenarios independently.
Code-as-Agent Harness Thesis: 88.5% Gains Without Touching the LLM
Paper shows 88.5% improvement by adapting runtime interface around frozen LLM. Harness generalizes across 18 backbones, challenging model-centric agent improvement.
Agent Harnessing: The Infrastructure That Makes AI Agents Work
A detailed technical guide argues that the model is not the hard part of building AI agents. The six-component harness — context management, memory, tools, control flow, verification, and coordination — is what separates production-grade agents from those that fail silently.
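As a rough illustration of the components named in this blurb, here is a minimal sketch of a harness loop. It is not taken from the guide: the `Harness` class, the `CALL <tool> <args>` reply convention, and the `scripted_model` stand-in are all hypothetical, and coordination between multiple agents is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    model: Callable[[str], str]                  # stand-in for an LLM call (hypothetical)
    tools: dict[str, Callable[[str], str]]       # tool name -> function taking a string argument
    memory: list[str] = field(default_factory=list)
    max_steps: int = 5
    max_context_chars: int = 2000

    def build_context(self, task: str) -> str:
        # Context management: keep the newest memory notes that fit the budget.
        kept, used = [], 0
        for note in reversed(self.memory):
            if used + len(note) > self.max_context_chars:
                break
            kept.append(note)
            used += len(note)
        return "\n".join([task, *reversed(kept)])

    def verify(self, answer: str) -> bool:
        # Verification: a trivial non-empty check stands in for real validation.
        return bool(answer.strip())

    def run(self, task: str) -> str:
        for _ in range(self.max_steps):          # control flow: bounded loop
            reply = self.model(self.build_context(task))
            if reply.startswith("CALL "):        # assumed tool-call convention
                name, _, arg = reply[5:].partition(" ")
                tool = self.tools.get(name)
                result = tool(arg) if tool else f"unknown tool {name}"
                self.memory.append(f"{name}({arg}) -> {result}")
                continue
            if self.verify(reply):
                return reply
            self.memory.append("verifier rejected an empty answer")
        raise RuntimeError("step budget exhausted without a verified answer")


def scripted_model(context: str) -> str:
    # Toy policy: call the tool once, then answer from the recorded result.
    if "add(2 3) -> 5" in context:
        return "The sum is 5"
    return "CALL add 2 3"


harness = Harness(
    model=scripted_model,
    tools={"add": lambda s: str(sum(int(x) for x in s.split()))},
)
result = harness.run("What is 2 + 3?")
```

The model here is only a function from text to text. The loop bound, tool dispatch, memory, and the final check all live outside it, which is the point the blurb attributes to the guide.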
Agent Harness Debate: Anthropic vs. OpenAI vs. LangChain on Scaffolding
A central debate in agent engineering pits a 'thin harness' approach (Anthropic) against 'thick harness' designs (LangGraph). The infrastructure layer, not the model, is becoming the primary product differentiator.
Anthropic Deploys Multi-Agent Harness to Scale Claude's Frontend Design & Autonomous Software Engineering
Anthropic engineers detail a multi-agent system that orchestrates multiple Claude instances to tackle complex, long-running software tasks like frontend design. The approach aims to overcome single-model context and reasoning limits.
Agentic Harness Engineering Boosts Coding Agents 7% on Terminal-Bench 2
Agentic Harness Engineering introduces a structured approach to evolving coding-agent harnesses, using revertible components, condensed experience, and falsifiable decisions. On Terminal-Bench 2, pass@1 climbs from 69.7% to 77.0% in ten iterations, beating human-designed baselines.
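The idea of revertible components can be sketched as a keep-or-revert loop over proposed harness changes. This is an assumption-laden toy, not the paper's procedure: `evolve`, `toy_score`, and the config keys are hypothetical, and a real setup would score each candidate by running a benchmark rather than a hand-written function.

```python
from typing import Callable

Config = dict[str, object]


def evolve(
    config: Config,
    proposals: list[Config],
    score: Callable[[Config], float],
) -> tuple[Config, list[str]]:
    """Apply each proposed change; keep it only if the score improves."""
    best = dict(config)
    best_score = score(best)
    log: list[str] = []
    for change in proposals:
        candidate = {**best, **change}           # apply the change on a copy
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
            log.append(f"kept {change} -> {candidate_score:.3f}")
        else:
            # Revert: the candidate is discarded and `best` is left untouched.
            log.append(f"reverted {change} -> {candidate_score:.3f}")
    return best, log


def toy_score(config: Config) -> float:
    # Hypothetical stand-in for a benchmark pass rate.
    value = 0.6
    if config.get("retries") == 2:
        value += 0.05
    if config.get("verbose"):
        value -= 0.1
    if config.get("timeout_s") == 30:
        value += 0.02
    return value


best, log = evolve(
    {"retries": 0},
    [{"retries": 2}, {"verbose": True}, {"timeout_s": 30}],
    toy_score,
)
```

Each log entry records a decision that a later run could contradict, which loosely mirrors the blurb's mention of falsifiable decisions.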
Meta-Harness Framework Automates AI Agent Engineering, Achieves 6x Performance Gap on Same Model
A new framework called Meta-Harness automates the optimization of AI agent harnesses—the system prompts, tools, and logic that wrap a model. By analyzing raw failure logs at scale, it improved text classification by 7.7 points while using 4x fewer tokens, demonstrating that harness engineering is a major leverage point as model capabilities converge.
Your AI Agent Is Only as Good as Its Harness — Here’s What That Means
An article from Towards AI emphasizes that the reliability and safety of an AI agent depend more on its controlling 'harness'—the system of protocols, tools, and observability layers—than on the underlying model. This concept is reportedly worth $2 billion but remains poorly understood by many developers.
Akshay Pachaar Inverts LLM Agent Architecture with 'Harness' Design
AI engineer Akshay Pachaar outlined a novel 'harness' architecture for LLM agents that externalizes intelligence into memory, skills, and protocols. He is building a minimal, didactic open-source implementation of this design.
Harness Engineering for AI Agents: Building Production-Ready Systems That Don’t Break
A technical guide on 'Harness Engineering'—a systematic approach to building reliable, production-ready AI agents that move beyond impressive demos. This addresses the critical industry gap where most agent pilots fail to reach deployment.
MiniMax M2.7 AI Agent Rewrites Its Own Harness, Achieving 9 Gold Medals on MLE Bench Lite Without Retraining
MiniMax's M2.7 agent autonomously rewrites its own operational harness—skills, memory, and workflow rules—through a self-optimization loop. After 100+ internal rounds, it earned 9 gold medals on OpenAI's MLE Bench Lite without weight updates.
Grep Beats Vector Search in Agent Benchmarks, New Paper Finds
Grep beats vector search on LongMemEval across all harness-model pairs, showing agent design matters more than retrieval method for evidence-location tasks.
Claude Code's Six-Layer Architecture: Harness, Not Magic
Claude Code's six-layer architecture uses a 3-layer context compressor at 92% threshold and Redis-based multi-agent FSM protocol. The model is just one node in a harness.
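A threshold-triggered compressor of the kind this blurb mentions can be sketched in a few lines. Only the 92% figure comes from the blurb. The single-layer design, the whitespace token count, the `keep_recent` parameter, and the placeholder summary string are assumptions for illustration and do not describe Claude Code's actual implementation.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())


def maybe_compact(
    messages: list[str],
    window_tokens: int,
    threshold: float = 0.92,
    keep_recent: int = 2,
) -> list[str]:
    """Replace older messages with a summary once usage reaches the threshold."""
    used = sum(count_tokens(m) for m in messages)
    if used < threshold * window_tokens:
        return messages                          # under the threshold: leave history alone
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if not old:
        return messages                          # nothing older than the recent tail
    # A real harness would ask a model to summarize `old`; this is a placeholder.
    summary = f"[summary of {len(old)} earlier messages]"
    return [summary, *recent]


history = ["a b c d"] * 5                        # 20 tokens under the toy counter
compacted = maybe_compact(history, window_tokens=20)
untouched = maybe_compact(history, window_tokens=100)
```

With a 20-token window the history sits at 100% usage, so the three oldest messages collapse into one summary line. With a 100-token window nothing changes.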
POTEMKIN Framework Exposes Critical Trust Gap in Agentic AI Tools
A new paper formalizes Adversarial Environmental Injection (AEI), a threat model where compromised tools deceive AI agents. The POTEMKIN testing harness found agents are evaluated for performance, not skepticism, creating a critical trust gap.
Claude Code Reverse-Engineered: 98.4% of Codebase is Operational Harness
A reverse-engineering analysis of Claude Code reveals only 1.6% of its codebase is AI decision logic, with the rest being operational infrastructure. This challenges current agent design paradigms by prioritizing a robust deterministic harness over complex model routing.
Google DeepMind's AutoHarness: The AI Tool That Could Revolutionize How We Build Intelligent Systems
Google DeepMind's AutoHarness framework enables automatic testing and optimization of AI models without retraining, allowing developers to synthesize functional AI agents like coding assistants with unprecedented efficiency.
12-Metric Agent Eval Framework From 100+ Deployments Hits Production
12-metric evaluation framework for production AI agents from 100+ deployments targets task success, cost, latency, tool use, and safety.
8-Agent System Builder: Anthropic's Simpler Approach Beat My 2-Day Build
Engineer built 8-agent system in 2 days; Anthropic's simpler 2-agent approach outperformed it. Lesson: minimal agent architecture beats complex orchestration.
Agentick Benchmark: GPT-5 Mini Tops at 0.309, No Agent Paradigm Dominates
Agentick benchmark evaluates RL, LLM, VLM, and hybrid agents on 37 tasks. GPT-5 mini leads at 0.309 ONS, but no paradigm dominates. ASCII beats natural language.
Pylon: Self-Host Your Own AI Agent Pipeline That Fixes Sentry Errors via
Pylon is a self-hosted daemon that triggers sandboxed Claude Code agents from webhooks (Sentry, cron, chat) and reports results with human approval — no data leaves your machine.
Build Reusable Data Science Workflows with Claude Skills and Subagents
Claude Skills and Subagents let you package prompts into reusable modules, freeing data scientists from repetitive AI adjustments for EDA, modeling, and deployment.
LangFuse on Evaluating AI Agents in Production
The article outlines a practical methodology for monitoring and enhancing AI agent performance post-deployment. It emphasizes combining automated LLM-based evaluation with human feedback loops to create actionable datasets for fine-tuning.
Adobe, NVIDIA, WPP Launch Enterprise AI Agents for Marketing with OpenShell
NVIDIA expands collaborations with Adobe and WPP to build agentic AI systems for enterprise marketing workflows. The stack uses NVIDIA's OpenShell runtime to enforce security and policy compliance in multi-step creative and customer experience tasks.
Google Gemini's UI Harness Lags Behind Claude, GPT, Analyst Says
AI researcher Ethan Mollick notes the Gemini Pro 3.1 model is technically capable but hampered by a minimal user interface and tool harness, widening its gap with competitors Claude and ChatGPT.
Autogenesis Protocol Enables Self-Evolving AI Agents Without Retraining
A new paper introduces Autogenesis, a self-evolving agent protocol. Agents can assess their own shortcomings, propose and test improvements, and update their operational framework in a continuous loop.
Claude Opus 4.7 Launches with 3.75MP Vision, Agentic Coding, and New Tokenizer
Anthropic launched Claude Opus 4.7 today with 3x higher vision resolution (3.75MP), self-verifying coding outputs, and stricter instruction following. The update targets enterprise agentic workflows and knowledge work benchmarks.
MiniMax Launches MaxHermes, Cloud-Hosted Agent with NousResearch
MiniMax has launched MaxHermes, a cloud-hosted version of the Hermes agent framework, in partnership with NousResearch. This provides a managed service for users of MiniMax's M2.7 model, aiming to simplify agent deployment.
Coding Agent UIs Converge on Side-by-Side Sessions, Says Omar Sar
AI researcher Omar Sar observes a UI convergence in coding agents like Cursor and Claude Code, moving towards flexible, multi-session interfaces that boost developer productivity and agent capability.