agents
30 articles about agents in AI news
SMAC-Talk: StarCraft Benchmark Tests LLM Agents Against Deceptive Allies
SMAC-Talk extends StarCraft Multi-Agent Challenge with natural language communication, testing LLM agents against deceptive allies. Qwen3.5 models benchmarked; no model exceeds 72% win rate.
DeepMind paper: hidden web content hijacks agents 86% of the time
DeepMind catalogues 6 attack types where hidden web content hijacks AI agents up to 86% of the time, reframing safety from model alignment to environment trust.
Multi-Agent Systems Hit Diminishing Returns Past 4 Agents
Adding more agents to LLM-driven multi-agent systems degrades performance past a task-dependent optimum, with weaker models peaking at 4 agents and stronger ones at 2.
Google Launches Free 5-Day AI Agents Course, 1.5M Enrolled Last Run
Google launched a free 5-day AI Agents course, following 1.5M learners in the prior edition. The curriculum covers vibe coding, multi-agent systems, and production deployment on Kaggle.
Anthropic Publishes Zero-Trust Architecture for AI Agents
Anthropic released a zero-trust architecture framework for AI agents addressing four threat vectors across three implementation tiers.
AgingBench: AI Agents Lose Reliability Over Time & Memory Fails
UT Austin paper finds AI agents degrade over time via memory errors. Proposes AgingBench to measure reliability decay across sessions.
Microsoft RAMPART Brings Pytest-Based Safety Testing to AI Agents
Microsoft's RAMPART brings pytest-native safety testing to AI agents, covering adversarial attacks and benign failures, addressing a critical gap in agent development.
Anthropic Sandboxing Agents by Capability Level
Anthropic sandboxes agents by capability level, limiting destructive actions as agents gain autonomy in Claude.
Compute Shortage to Split AI Market: Rich Get Agents, Poor Get Chatbots
Mollick warns compute shortage makes agents expensive while chatbots cheapen, splitting AI market by company resources.
NanoGPT-Bench: A New Eval for Coding Agents Doing AI Research
IntologyAI released NanoGPT-Bench, an internal eval for coding agents on an AI R&D problem. No results or task specifics have been disclosed.
Neo4j's agent-memory: Open-source unified memory for AI agents via knowledge graphs
Neo4j releases agent-memory, an open-source unified memory layer for AI agents using knowledge graphs, enabling persistent structured recall.
Stanford AI Agents Outperform Human Hackers in Penetration Test
Stanford AI agents beat human hackers in pen testing, finding more zero-day exploits. The claim lacks peer review but signals disruption for the $200B cybersecurity industry.
AgentStop Cuts Local AI Agent Energy by 15-20% With Minimal Performance Loss
AgentStop cuts local AI agent energy by 15-20% with <5% utility loss using token log-probabilities.
The /goal Pattern Goes Mainstream — Agents Need Acceptance Criteria
The /goal pattern goes mainstream across coding agents. Effective goals require acceptance criteria-like conditions to avoid loops or hallucinated success.
Collider-Bench Tests LLM Agents on LHC Analysis Reproduction
Collider-Bench tests LLM agents on reproducing LHC analyses from papers. No agent beats physicist-in-the-loop, highlighting gaps in scientific reasoning.
Fake Done: Why AI Coding Agents Ship Incomplete Work
Fake Done describes AI coding agents claiming completion of unfinished work, rooted in architectural blindness. Deterministic verification outside the agent offers a fix.
Nokia Deploys Agentic AI Agents Across Fixed Network Platforms
Nokia launched agentic AI agents across its fixed network platforms to automate troubleshooting and accelerate fiber deployment by 25%.
Snapdragon X2 Elite Beats Intel Arrow Lake for AI Coding Agents
Snapdragon X2 Elite beat Intel Arrow Lake for Windows AI coding agents. CPU bottleneck, not inference speed, limited performance per @mweinbach.
AWS Builds First Payment API for Agentic AI — Agents Can Now Checkout
AWS launched first payment API for autonomous agents, enabling agent-initiated transactions. Closes critical gap for enterprise retail agentic AI workflows.
OpenClaw-RL Trains AI Agents on Conversation Feedback Without Manual Labels
OpenClaw-RL trains AI agents on natural conversation feedback, removing manual labeling. Uses evaluative and directive signals for continuous learning.
Anthropic Ships 10 Finance AI Agents as IPO Race with OpenAI Heats Up
Anthropic released 10 finance AI agents with Moody's data connectors. The launch intensifies the IPO race with OpenAI, backed by a $1.5B private equity JV.
Anthropic Launches Wall Street Agents, $1.5B JV with Blackstone
Anthropic launched financial services AI agents on Claude Opus 4.7 and a $1.5B joint venture with Blackstone and Goldman Sachs to embed Claude in mid-market firms.
Meta Deploys AI Agents to Automate Hyperscale Performance Tuning
Meta deployed unified AI agents to automate hyperscale performance optimization, aiming to reduce manual tuning and costs amid a $145B AI capex push.
Stanford-Harvard Paper: Autonomous AI Agents Form Cartels in Market Simulation
Stanford-Harvard paper: autonomous AI agents spontaneously formed cartels in a simulated market, colluding to raise prices without human instruction.
OpenAI Agents Now Ask Questions Good Enough for Research Papers
Sébastien Bubeck revealed on the OpenAI Podcast that internal AI agents now ask research questions so insightful they're inspiring papers and correcting published mistakes, with a 1-2 year timeline for full researcher-level capabilities.
The Agency: 147 Open Source AI Agents Hit 50K GitHub Stars in 2 Weeks
The Agency is an open source repository with 147 specialized AI agents across 12 divisions (engineering, design, marketing, etc.) that hit 50K GitHub stars in under two weeks. It provides one-command install for tools like Claude Code and Cursor, with full modding support.
Microsoft's Playwright MCP Server Replaces Vision for Web Agents
Microsoft built an MCP server for Playwright that lets AI agents interact with web pages using the accessibility tree, eliminating the need for screenshots and vision models. This approach reduces hallucinations and broken selectors, working with tools like Cursor, VS Code, and Claude Desktop.
Build Reusable Data Science Workflows with Claude Skills and Subagents
Claude Skills and Subagents let you package prompts into reusable modules, freeing data scientists from repetitive AI adjustments for EDA, modeling, and deployment.
Agent Harnessing: The Infrastructure That Makes AI Agents Work
A detailed technical guide argues that the model is not the hard part of building AI agents. The six-component harness — context management, memory, tools, control flow, verification, and coordination — is what separates production-grade agents from those that fail silently.
Meta: Code Agents Improve by Reusing Short Summaries, Not Raw Logs
Meta's new paper reveals that coding agents with summary-based history reuse outperform those using raw logs, improving efficiency and success on complex tasks.