harness engineering
30 articles about harness engineering in AI news
Agentic Harness Engineering Boosts Coding Agents 7% on Terminal-Bench 2
Agentic Harness Engineering introduces a structured approach to evolving coding-agent harnesses, using revertible components, condensed experience, and falsifiable decisions. On Terminal-Bench 2, pass@1 climbs from 69.7% to 77.0% in ten iterations, beating human-designed baselines.
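The loop the article describes — apply a revertible change, test it against a benchmark, keep it only if the score improves — can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the harness is a dict of component weights and `evaluate` is a stand-in for a real benchmark run; all names are invented for the example.

```python
import random

def evaluate(harness):
    """Stand-in for a benchmark run: score a harness configuration.
    Here the score is just the sum of component weights."""
    return sum(harness.values())

def propose_change(harness):
    """Propose a single revertible edit: nudge one component's weight."""
    component = random.choice(list(harness))
    delta = random.choice([-0.1, 0.1])
    return component, delta

def evolve(harness, iterations=10, seed=0):
    """Greedy evolution loop: apply a change, keep it only if the
    benchmark score improves (a falsifiable decision), else revert."""
    random.seed(seed)
    best = evaluate(harness)
    for _ in range(iterations):
        component, delta = propose_change(harness)
        harness[component] += delta          # apply the candidate edit
        score = evaluate(harness)
        if score > best:
            best = score                     # keep: the change helped
        else:
            harness[component] -= delta      # revert: falsified
    return harness, best
```

Because every edit is revertible and every keep/revert decision is tied to a measured score, the loop can only hold steady or improve — the property the article attributes to the real system.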
Harness Engineering for AI Agents: Building Production-Ready Systems That Don’t Break
A technical guide to 'Harness Engineering', a systematic approach to building reliable, production-ready AI agents that move beyond impressive demos. It addresses a critical industry gap: most agent pilots never reach deployment.
Agent Harness Engineering: The 'OS' That Makes LLMs Useful
A clear analogy frames raw LLMs as CPUs needing an operating system. The agent harness — managing tools, memory, and execution — is what creates useful applications, as illustrated by LangChain's benchmark gains.
Meta-Harness Framework Automates AI Agent Engineering, Achieves 6x Performance Gap on Same Model
A new framework called Meta-Harness automates the optimization of AI agent harnesses—the system prompts, tools, and logic that wrap a model. By analyzing raw failure logs at scale, it improved text classification by 7.7 points while using 4x fewer tokens, demonstrating that harness engineering is a major leverage point as model capabilities converge.
Anthropic Deploys Multi-Agent Harness to Scale Claude's Frontend Design & Autonomous Software Engineering
Anthropic engineers detail a multi-agent system that orchestrates multiple Claude instances to tackle complex, long-running software tasks like frontend design. The approach aims to overcome single-model context and reasoning limits.
Claude Code Reverse-Engineered: 98.4% of Codebase is Operational Harness
A reverse-engineering analysis of Claude Code reveals only 1.6% of its codebase is AI decision logic, with the rest being operational infrastructure. This challenges current agent design paradigms by prioritizing a robust deterministic harness over complex model routing.
Agent Harness Debate: Anthropic vs. OpenAI vs. LangChain on Scaffolding
A central debate in agent engineering pits Anthropic's 'thin harness' approach against LangGraph's 'thick harness' designs. The infrastructure layer, not the model, is becoming the primary product differentiator.
Agent Harnessing: The Infrastructure That Makes AI Agents Work
A detailed technical guide argues that the model is not the hard part of building AI agents. The six-component harness — context management, memory, tools, control flow, verification, and coordination — is what separates production-grade agents from those that fail silently.
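The six components named above can be sketched as a minimal harness skeleton. This is an illustrative toy under assumptions of my own (the class, method names, and the `tool:<name>:<arg>` convention are invented, not from the guide); multi-agent coordination, the sixth component, is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    """Toy harness wiring together five of the six components:
    context management, memory, tools, control flow, verification."""
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)
    memory: list[str] = field(default_factory=list)
    max_context: int = 5    # context management: window size
    max_steps: int = 3      # control flow: hard iteration bound

    def context(self) -> list[str]:
        # Context management: only the most recent entries reach the model.
        return self.memory[-self.max_context:]

    def verify(self, output: str) -> bool:
        # Verification: reject empty or error-marked outputs rather than
        # letting the agent fail silently.
        return bool(output) and not output.startswith("ERROR")

    def run(self, task: str, model: Callable[[list[str]], str]) -> str:
        # Control flow: a bounded loop, never unbounded recursion.
        self.memory.append(f"task: {task}")
        for _ in range(self.max_steps):
            output = model(self.context())
            if output.startswith("tool:"):
                # Tools: dispatch "tool:<name>:<arg>" to a registered tool.
                _, name, arg = output.split(":", 2)
                tool = self.tools.get(name, lambda a: "ERROR: unknown tool")
                self.memory.append(f"tool {name} -> {tool(arg)}")
                continue
            if self.verify(output):
                self.memory.append(f"result: {output}")
                return output
            self.memory.append("retry: verification failed")
        return "gave up after max_steps"
```

Even in this toy form, the harness — not the model callable — owns the loop, the memory, and the failure handling, which is the guide's central claim.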
Akshay Pachaar Inverts LLM Agent Architecture with 'Harness' Design
AI engineer Akshay Pachaar outlined a novel 'harness' architecture for LLM agents that externalizes intelligence into memory, skills, and protocols. He is building a minimal, didactic open-source implementation of this design.
Stanford/MIT Paper: AI Performance Depends on 'Model Harnesses'
A new paper from Stanford and MIT introduces the concept of 'Model Harnesses,' arguing that the wrapper of prompts, tools, and infrastructure around a base model is a primary determinant of real-world AI performance.
Meta-Harness from Stanford/MIT Shows System Code Creates 6x AI Performance Gap
Stanford and MIT researchers show AI performance depends as much on the surrounding system code (the 'harness') as the model itself. Their Meta-Harness framework automatically improves this code, yielding significant gains in reasoning and classification tasks.
MiniMax M2.7 AI Agent Rewrites Its Own Harness, Achieving 9 Gold Medals on MLE Bench Lite Without Retraining
MiniMax's M2.7 agent autonomously rewrites its own operational harness—skills, memory, and workflow rules—through a self-optimization loop. After 100+ internal rounds, it earned 9 gold medals on OpenAI's MLE Bench Lite without weight updates.
Anthropic's 'Harness' Design: What It Means for Your Claude Code Workflows
Anthropic's new 'harness' architecture for long-running apps could enable Claude Code to manage more complex, persistent development tasks with greater stability.
Google DeepMind's AutoHarness: The AI Tool That Could Revolutionize How We Build Intelligent Systems
Google DeepMind's AutoHarness framework enables automatic testing and optimization of AI models without retraining, allowing developers to synthesize functional AI agents, such as coding assistants, far more efficiently.
The AI Paradox: Why Software Engineering Jobs Are Surging Despite Automation Fears
Citadel Securities data reveals software engineering job postings are spiking despite AI coding tools, illustrating the Jevons paradox where cheaper software creation drives increased demand for developers as companies expand digital initiatives.
How to Manage Multiple Claude Code Sessions with Harness and Preview
Two actionable tools to solve the core productivity bottlenecks when running multiple Claude Code agents: session management and review speed.
The AI Paradox: How Cheaper Code Creation Is Fueling a Software Engineering Boom
Contrary to fears of AI replacing developers, the Jevons Paradox suggests that making software creation cheaper through AI tools actually increases demand for human engineers who can design, review, and integrate complex systems at scale.
How Claude Code's System Prompt Engine Actually Works
Claude Code builds its system prompt dynamically from core instructions, conditional tool definitions, user files, and managed conversation history, revealing the critical role of context engineering.
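The assembly order described above — core instructions, then conditional tool definitions, then user files, then a truncated slice of history — can be sketched as a pure function. This is an illustrative sketch of the pattern, not Claude Code's actual implementation; every parameter name here is an assumption.

```python
def build_system_prompt(core: str,
                        tools: dict[str, str],
                        enabled_tools: set[str],
                        user_files: dict[str, str],
                        history: list[str],
                        history_budget: int = 4) -> str:
    """Assemble a system prompt dynamically from four sources."""
    parts = [core]
    # Conditional tool definitions: include only the enabled tools.
    for name, spec in sorted(tools.items()):
        if name in enabled_tools:
            parts.append(f"## Tool: {name}\n{spec}")
    # User files (e.g. project instructions) are appended verbatim.
    for path, content in sorted(user_files.items()):
        parts.append(f"## File: {path}\n{content}")
    # Managed history: keep only the most recent turns within budget.
    parts.extend(history[-history_budget:])
    return "\n\n".join(parts)
```

The key property is that the prompt is rebuilt from sources on every turn, so disabling a tool or trimming history changes the next prompt without any stateful bookkeeping.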
Anthropic Opus 4.7: 87.6% SWE-Bench, Constrained Cyber Capabilities
Anthropic released Claude Opus 4.7 on April 16, 2026, achieving 87.6% on SWE-Bench Verified and 64.3% on SWE-Bench Pro — leading GPT-5.4 and Gemini 3.1 Pro. The company also confirmed it deliberately constrained cybersecurity capabilities in Opus 4.7, with the more powerful Mythos Preview model (83.1% on CyberGym) restricted to select partners.
OpenAI Launches GPT-Rosalind for Drug Discovery, GPT-5.4-Cyber for Security
OpenAI launched GPT-Rosalind, a life sciences model performing above the 95th percentile of human experts on novel biological data, and GPT-5.4-Cyber, a cybersecurity variant. These releases, alongside a major Agents SDK update, signal a pivot from general AI to specialized, high-stakes enterprise domains.
Stop Rewriting CLAUDE.md: The 4-Stage Evolution That Cuts Context Waste 40%
Your CLAUDE.md should grow with your project through four intentional stages, adding rejected alternatives and 'never do this' rules to prevent Claude from re-litigating settled decisions.
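A late-stage CLAUDE.md following this pattern might record settled decisions alongside the alternatives that were rejected. The fragment below is entirely hypothetical — invented project and rules for illustration, not from the article:

```markdown
# Project conventions

## Decisions (do not re-litigate)
- Database: Postgres. Rejected: SQLite (no concurrent writers),
  MongoDB (schema drift caused incidents).
- API style: REST. Rejected: GraphQL (no operational experience on team).

## Never do this
- Never edit generated files under `src/gen/`; change the schema instead.
- Never add a dependency without recording the rejected alternatives here.
```

Recording the rejected option next to the decision is what stops the agent from re-proposing it in a later session.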
Claude Opus 4.7 Launches with 3.75MP Vision, Agentic Coding, and New Tokenizer
Anthropic launched Claude Opus 4.7 today with 3x higher vision resolution (3.75MP), self-verifying coding outputs, and stricter instruction following. The update targets enterprise agentic workflows and knowledge work benchmarks.
LLM Schema-Adaptive Method Enables Zero-Shot EHR Transfer
Researchers propose Schema-Adaptive Tabular Representation Learning, an LLM-driven method that transforms structured variables into semantic statements. It enables zero-shot alignment across unseen EHR schemas and outperforms clinical baselines, including neurologists, on dementia diagnosis tasks.
Cortical Labs Grows 200k Neurons on Chip, Connects to LLM
Cortical Labs grew 200,000 human brain cells on a chip and connected them to a large language model. This experiment explores hybrid biological-silicon intelligence.
Game Studios Show Wide Variance in AI Adoption, Wharton Report Finds
A Wharton School report, based on interviews at 20 game studios, finds a wide spectrum of organizational approaches to adopting generative AI tools, from aggressive integration to active resistance.
Managed Agents Emerge as Fastest Path from Prototype to Production
Developer Alex Albert highlights that managed agent services now offer the fastest path from weekend project to production-scale deployment, eliminating self-hosting complexity while maintaining flexibility.
FLAME: A Novel Framework for Efficient, High-Performance Sequential Recommendation
A new paper introduces FLAME, a training framework for sequential recommender systems. It uses a frozen 'anchor' network and a learnable network, combined via modular ensembles, to capture user behavior diversity efficiently. The result performs like an ensemble while retaining single-model inference speed.
McKinsey: AI Infrastructure Value Creation Outpaces Business Capture
McKinsey's latest analysis indicates the pace of value creation from AI infrastructure is exceeding the rate at which most businesses are capturing it, highlighting a growing implementation deficit.
How Claude Code Reverse-Engineered an FPGA Bitstream: A Template for Hardware Hacking
Learn the exact Claude Code workflow used to map an Altera Cyclone IV FPGA's bitstream format—from fuzzing scripts to documentation generation.
Scaling Law Plateau Not Universal: More Tokens Boost Reasoning AI Performance
Empirical evidence indicates the 'second scaling law'—performance gains from increased computation—does not fully plateau for many reasoning tasks. Benchmark results may be artificially limited by token budgets, not model capability.