LLM agents

30 articles about LLM agents in AI news

SMAC-Talk: StarCraft Benchmark Tests LLM Agents Against Deceptive Allies

SMAC-Talk extends the StarCraft Multi-Agent Challenge with natural-language communication, testing LLM agents against deceptive allies. Qwen3.5 models were benchmarked; none exceeds a 72% win rate.

Jun 5, 2026 · 70% relevant

Collider-Bench Tests LLM Agents on LHC Analysis Reproduction

Collider-Bench tests LLM agents on reproducing LHC analyses from papers. No agent beats physicist-in-the-loop, highlighting gaps in scientific reasoning.

May 15, 2026 · 92% relevant

MemoryCD: New Benchmark Tests LLM Agents on Real-World, Lifelong User Memory for Personalization

Researchers introduce MemoryCD, the first large-scale benchmark for evaluating LLM agents' long-context memory using real Amazon user data across 12 domains. It reveals current methods are far from satisfactory for lifelong personalization.

Mar 30, 2026 · 74% relevant

MetaClaw Enables Deployed LLM Agents to Learn Continuously with Fast & Slow Loops

MetaClaw introduces a two-loop system that lets production LLM agents learn from failures in real time through a fast skill-writing loop and update their core model later in a slow training loop, for a relative accuracy gain of up to 32%.

Mar 27, 2026 · 85% relevant

EnterpriseArena Benchmark Reveals LLM Agents Fail at Long-Horizon CFO-Style Resource Allocation

Researchers introduced EnterpriseArena, a 132-month enterprise simulator, to test LLM agents on CFO-style resource allocation. Only 16% of runs survived the full horizon, revealing a distinct capability gap for current models.

Mar 26, 2026 · 95% relevant

Retrieval-Augmented LLM Agents: Combined Fine-Tuning and Experience Retrieval Boosts Unseen Task Generalization

Researchers propose a pipeline integrating supervised fine-tuning with in-context experience retrieval for LLM agents. The combined approach significantly improves generalization to unseen tasks compared to using either method alone.

Mar 20, 2026 · 95% relevant

AgentDrift: How Corrupted Tool Data Causes Unsafe Recommendations in LLM Agents

New research reveals LLM agents making product recommendations can maintain ranking quality while suggesting unsafe items when their tools provide corrupted data. Standard metrics like NDCG fail to detect this safety drift, creating hidden risks for high-stakes applications.

Mar 16, 2026 · 95% relevant
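
To illustrate why a ranking metric such as NDCG can miss this kind of drift, here is a minimal sketch. It is not the paper's evaluation code: the item records, the `unsafe` flag, and the `unsafe_rate` helper are hypothetical, and the NDCG variant shown is the common linear-gain, log2-discount form. Because NDCG only sees relevance grades, swapping in an unsafe item with the same grade leaves the score unchanged.

```python
import math

def dcg(gains):
    """Discounted cumulative gain for a ranked list of relevance grades."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(gains):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

def unsafe_rate(items):
    """Hypothetical safety metric: fraction of recommended items flagged unsafe."""
    return sum(1 for item in items if item["unsafe"]) / len(items)

# Recommendations produced from clean tool data (illustrative records).
clean = [
    {"id": "A", "relevance": 3, "unsafe": False},
    {"id": "B", "relevance": 2, "unsafe": False},
    {"id": "C", "relevance": 1, "unsafe": False},
]

# Same relevance grades, but corrupted tool data let an unsafe item through.
drifted = [
    {"id": "A", "relevance": 3, "unsafe": False},
    {"id": "X", "relevance": 2, "unsafe": True},
    {"id": "C", "relevance": 1, "unsafe": False},
]

ndcg_clean = ndcg([item["relevance"] for item in clean])
ndcg_drifted = ndcg([item["relevance"] for item in drifted])

print(ndcg_clean, ndcg_drifted)                   # identical ranking quality
print(unsafe_rate(clean), unsafe_rate(drifted))   # safety differs
```

A separate safety-aware check, such as the unsafe-item rate above, is one way such drift could be surfaced alongside the ranking score.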

GeoAgentBench: New Dynamic Benchmark Tests LLM Agents on 117 GIS Tools

A new benchmark, GeoAgentBench, evaluates LLM-based GIS agents in a dynamic sandbox with 117 tools. It introduces a novel Plan-and-React agent architecture that outperforms existing frameworks in multi-step spatial tasks.

Apr 17, 2026 · 94% relevant

Omar Saro on Multi-User LLM Agents: A New Framework Frontier

AI researcher Omar Saro points out that all current LLM agent frameworks are designed for single-user instruction, creating a deployment barrier for team-based workflows. This identifies a major unsolved problem in making AI agents practically useful in organizations.

Apr 15, 2026 · 75% relevant

Multi-User LLM Agents Struggle: Gemini 3 Pro Scores 85.6% on Muses-Bench

A new benchmark reveals that LLMs struggle with multi-user scenarios in which agents face conflicting instructions. Gemini 3 Pro leads but achieves only an 85.6% average, with privacy-utility tradeoffs proving particularly difficult.

Apr 14, 2026 · 92% relevant

Study: LLM Agents Ignore Abstract 'Rules' in Self-Improvement, Rely Solely on Raw Action Histories

Research shows LLM-based agents fail to use condensed summary rules for improvement, performing identically when rules are corrupted. They rely entirely on copying raw historical logs, raising questions about true reasoning.

Mar 21, 2026 · 85% relevant

SDAR: Self-Distilled RL Stabilizes Multi-Turn LLM Agents, +9.4% on ALFWorld

SDAR gates self-distillation within GRPO to stabilize multi-turn LLM agent training, yielding +9.4% on ALFWorld and gains on WebShop and Search-QA across Qwen2.5 and Qwen3 models.

May 15, 2026 · 85% relevant

LLM Agents Will Reshape Personalization

Researchers propose that LLM-based assistants are reconfiguring how user representations are produced and exposed, requiring a shift toward inspectable, portable, and revisable user models across services. They identify five research fronts for the future of recommender systems.

Apr 23, 2026 · 84% relevant

ToolTree: A New Planning Paradigm for LLM Agents That Could Transform Complex Retail Operations

Researchers propose ToolTree, a Monte Carlo tree search-inspired method for LLM agent tool planning. It uses dual-stage evaluation and bidirectional pruning to improve foresight and efficiency in multi-step tasks, achieving ~10% gains over state-of-the-art methods.

Mar 16, 2026 · 70% relevant

LLM Agents Take the Wheel: How Rudder Revolutionizes Distributed GNN Training

Researchers have developed Rudder, a system that uses large language model agents to dynamically prefetch data in distributed graph neural network (GNN) training. By adapting to changing computational conditions in real time, it achieves up to a 91% performance improvement over traditional methods.

Mar 2, 2026 · 75% relevant

Agent4POI: LLM Agents Beat Static Embeddings by 23.2% on POI Rec

Agent4POI achieves a 23.2% relative gain over baselines by generating context-aware point-of-interest (POI) representations at inference time, indicating that static embeddings are insufficient.

May 18, 2026 · 76% relevant

Research Paper Proposes Security Framework for Autonomous AI Agents in Commerce

A Systematization of Knowledge (SoK) paper analyzes the emerging threat landscape for autonomous LLM agents conducting commerce. It identifies 12 attack vectors across five dimensions and proposes a layered defense architecture. This is a foundational security analysis for a nascent but high-stakes technology.

Apr 20, 2026 · 100% relevant

Akshay Pachaar Inverts LLM Agent Architecture with 'Harness' Design

AI engineer Akshay Pachaar outlined a novel 'harness' architecture for LLM agents that externalizes intelligence into memory, skills, and protocols. He is building a minimal, didactic open-source implementation of this design.

Apr 18, 2026 · 89% relevant

HORIZON Benchmark Diagnoses Long-Horizon Failures in GPT-5 and Claude Agents

A new benchmark called HORIZON systematically analyzes where and why LLM agents like GPT-5 and Claude fail on long-horizon tasks. The study collected over 3100 agent trajectories and provides a scalable method for failure attribution, offering practical guidance for building more reliable agents.

Apr 15, 2026 · 100% relevant

ContextSim: A New LLM Framework for Context-Aware Recommender System Simulation

A new arXiv preprint introduces ContextSim, a framework that uses LLM agents to simulate users interacting with recommender systems within realistic daily scenarios (time, location, needs). Experiments show it generates more human-aligned interactions and that RS parameters optimized with it yield improved real-world engagement.

Apr 14, 2026 · 92% relevant

LLMs Score Only 22% Win Rate in Multi-Agent Clue Game, Revealing Deductive Reasoning Gaps

Researchers created a text-based Clue game to test LLM agents' multi-step deductive reasoning. Across 18 games with GPT-4o-mini and Gemini-2.5-Flash agents, only 4 correct wins were achieved, showing fine-tuning on logic puzzles doesn't reliably improve performance.

Mar 19, 2026 · 75% relevant

Strategic AI Agents: Meta-Reinforcement Learning for Dynamic Retail Environments

MAGE introduces meta-RL to create LLM agents that strategically explore and exploit in changing environments. For retail, this enables adaptive pricing, inventory, and marketing systems that learn from continuous feedback without constant retraining.

Mar 5, 2026 · 65% relevant

Microsoft's EMPO²: A Memory-Augmented RL Framework That Supercharges LLM Agent Exploration

Microsoft has unveiled EMPO², a hybrid reinforcement learning framework that enhances LLM agents with augmented memory for true exploration. The system combines on- and off-policy optimization to discover novel states, achieving 128.6% performance gains over existing methods on ScienceWorld benchmarks.

Feb 28, 2026 · 85% relevant

LLM4Cov: How Offline Agent Learning is Revolutionizing Hardware Verification

Researchers have developed LLM4Cov, a novel framework that enables execution-aware LLM agents to learn from expensive simulator feedback without costly online reinforcement learning. The approach achieves 69.2% coverage in hardware verification tasks, outperforming larger models through innovative offline learning techniques.

Feb 20, 2026 · 75% relevant

A Reference Architecture for Agentic Hybrid Retrieval in Dataset Search

A new research paper presents a reference architecture for 'agentic hybrid retrieval' that orchestrates BM25, dense embeddings, and LLM agents to handle underspecified queries against sparse metadata. It introduces offline metadata augmentation and analyzes two architectural styles for quality attributes like governance and performance.

Apr 21, 2026 · 84% relevant

Agent Judges with Big Five Personas Match Human Evaluators, Show Logarithmic Score Saturation in New arXiv Study

A new arXiv study shows LLM agents conditioned with Big Five personalities produce evaluations indistinguishable from humans. Crucially, quality scores saturate logarithmically with panel size, while discovering unique issues follows a slower power law.

Apr 2, 2026 · 72% relevant

MiRA Framework Boosts Gemma3-12B to 43% Success Rate on WebArena-Lite, Surpassing GPT-4 and WebRL

Researchers propose MiRA, a milestone-based RL framework that improves long-horizon planning in LLM agents. It boosts Gemma3-12B's web navigation success from 6.4% to 43%, outperforming GPT-4-Turbo (17.6%) and the previous SOTA WebRL (38.4%).

Mar 23, 2026 · 77% relevant

ServiceNow Research Launches EnterpriseOps-Gym: A 512-Tool Benchmark for Testing Agentic Planning in Enterprise Environments

ServiceNow Research and Mila have released EnterpriseOps-Gym, a high-fidelity benchmark with 164 database tables and 512 tools across eight domains to evaluate LLM agents on long-horizon enterprise workflows.

Mar 18, 2026 · 95% relevant

Verified Multi-Agent Orchestration: A Plan-Execute-Verify-Replan Framework for Complex Query Resolution

Researchers propose VMAO, a framework coordinating specialized LLM agents through verification-driven iteration. It decomposes complex queries into parallelizable DAGs, verifies completeness, and replans adaptively. On market research queries, it significantly improved answer quality over single-agent baselines.

Mar 13, 2026 · 75% relevant
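
The idea of decomposing a query into a parallelizable DAG can be sketched as follows. This is an illustration only, not VMAO's implementation: the sub-question names and the `parallel_batches` helper are hypothetical, and the verify and replan steps are left out. It uses Python's standard `graphlib.TopologicalSorter` to group sub-questions into batches whose members have no unmet dependencies and could therefore be handed to agents concurrently.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each sub-question maps to the sub-questions it depends on.
plan = {
    "market_size": set(),
    "competitors": set(),
    "pricing": {"competitors"},
    "summary": {"market_size", "pricing"},
}

def parallel_batches(dag):
    """Group DAG nodes into batches that can run concurrently.

    Every node in a batch has all of its dependencies satisfied by
    earlier batches. Raises graphlib.CycleError if the plan has a cycle.
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)                 # mark the batch as finished
    return batches

batches = parallel_batches(plan)
for step, batch in enumerate(batches, start=1):
    print(f"step {step}: {batch}")
```

In a fuller plan-execute-verify-replan loop, a verification step would presumably inspect each batch's results and add or revise nodes in `plan` before the next batch runs.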

OpenAI Open-Sources Agents SDK, Supports 100+ LLMs

OpenAI has open-sourced its internal Agents SDK, a lightweight framework for building multi-agent systems. It features three core primitives, works with over 100 LLMs, and quickly gained 18.9k GitHub stars.

Apr 18, 2026 · 95% relevant
