gentic.news — AI News Intelligence Platform

[Figure: a bar chart comparing RL, LLM, VLM, hybrid, and human agent scores on the Agentick benchmark, with GPT-5 mini…]

Agentick Benchmark: GPT-5 Mini Tops at 0.309, No Agent Paradigm Dominates

Agentick benchmark evaluates RL, LLM, VLM, and hybrid agents on 37 tasks. GPT-5 mini leads at 0.309 ONS, but no paradigm dominates. ASCII beats natural language.

Source: arxiv.org (via arxiv_ai; single source)
TL;DR

37 procedurally generated tasks across six capabilities · GPT-5 mini leads at 0.309 oracle-normalized score · Reasoning harness multiplies LLM performance 3-10x

The Agentick benchmark pits RL, LLM, VLM, hybrid, and human agents against one another on 37 tasks. GPT-5 mini leads at a 0.309 oracle-normalized score, but no paradigm dominates.

Key facts

  • 37 procedurally generated tasks across six capability categories
  • 27 agent configurations evaluated over 90,000+ episodes
  • GPT-5 mini leads at 0.309 oracle-normalized score
  • Reasoning harness improves LLM performance 3-10x
  • ASCII observations outperform natural language across all agents

Researchers from Google DeepMind and Université de Montréal released Agentick, a unified benchmark for sequential decision-making agents. The benchmark provides 37 procedurally generated tasks across six capability categories — including planning, multi-agent coordination, and memory — with four difficulty levels and five observation modalities [per the arXiv preprint].

Key Findings from 90,000+ Episodes

An evaluation spanning 27 agent configurations and over 90,000 episodes reveals stark performance gaps. GPT-5 mini leads overall at 0.309 oracle-normalized score (ONS), while PPO trained for 2 million steps achieves 0.287 ONS. However, PPO dominates planning and multi-agent tasks, where LLM-based agents lag significantly.
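The oracle-normalized score can be illustrated with a short sketch. The article does not spell out the exact formula, so the min-max form below, which rescales an agent's episodic return between a random baseline and the oracle policy's return, is an assumption; the `random_return` default of 0 is likewise illustrative.

```python
def oracle_normalized_score(agent_return: float,
                            oracle_return: float,
                            random_return: float = 0.0) -> float:
    """Normalize an agent's episodic return against an oracle policy.

    1.0 means oracle-level performance; 0.0 means no better than the
    random baseline. (Assumed min-max form; the paper's exact
    normalization is not quoted in this article.)
    """
    if oracle_return == random_return:
        raise ValueError("oracle and random baselines coincide")
    return (agent_return - random_return) / (oracle_return - random_return)

# On this reading, GPT-5 mini's 0.309 ONS means roughly 31% of the
# oracle's return above the random baseline:
print(oracle_normalized_score(agent_return=30.9, oracle_return=100.0))
```

Under this normalization, the gap between GPT-5 mini (0.309) and PPO (0.287) is about two points of oracle-relative return, which is why per-capability breakdowns matter more than the headline average.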

The reasoning harness — a chain-of-thought wrapper — multiplies LLM performance by 3-10x, suggesting that prompting strategy matters more than model scale for these tasks. Surprisingly, ASCII observations consistently outperform natural language observations across all agent types, challenging the assumption that richer representations always help.
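A chain-of-thought harness of the kind described can be sketched as a thin wrapper around the model call. Everything here is an assumption for illustration: `call_llm` is a hypothetical placeholder returning a canned reply, and the prompt format and parsing are not taken from the paper.

```python
def call_llm(prompt: str) -> str:
    # Placeholder: a real harness would query a model endpoint here.
    return "THOUGHT: the key is up and to the left.\nACTION: up"

def harnessed_policy(observation: str, legal_actions: list[str]) -> str:
    """Wrap an LLM call so it reasons before committing to an action."""
    prompt = (
        "You control an agent in a grid-world task.\n"
        f"Observation (ASCII):\n{observation}\n"
        f"Legal actions: {', '.join(legal_actions)}\n"
        "Reason step by step after 'THOUGHT:', then output exactly one "
        "legal action after 'ACTION:'."
    )
    reply = call_llm(prompt)
    # Parse the final ACTION line; fall back to the first legal action
    # if the model's output is malformed.
    for line in reversed(reply.splitlines()):
        if line.startswith("ACTION:"):
            action = line.removeprefix("ACTION:").strip()
            if action in legal_actions:
                return action
    return legal_actions[0]

print(harnessed_policy("#K.\n.@.", ["up", "down", "left", "right"]))
```

The point of the sketch is that the 3-10x gain comes from restructuring the same model's inference, not from a larger model, which is what makes the finding about prompting strategy versus scale notable.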

No Silver Bullet for Agent Architectures

Agentick's design explicitly addresses the fragmentation in agent evaluation. Existing benchmarks often favor one paradigm — RL on Gym environments or LLMs on static QA — making cross-paradigm comparison impossible. Agentick provides a single Gymnasium-compatible interface, oracle reference policies for all tasks, pre-built SFT datasets, and a live leaderboard [according to the paper].
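A single Gymnasium-compatible interface means every agent, whether RL or LLM, interacts through the same `reset`/`step` loop. The toy environment below is a made-up stand-in, not an Agentick task; it only shows the standard five-tuple `step` contract an evaluator would run against.

```python
class ToyKeyDoor:
    """Made-up two-step task: pick up the key, then open the door."""

    def reset(self, seed=None):
        self.t = 0
        return "#K.\n.@D", {}          # ASCII observation, empty info dict

    def step(self, action):
        self.t += 1
        terminated = self.t >= 2        # done after key + door
        reward = 1.0 if terminated else 0.0
        # Gymnasium contract: (obs, reward, terminated, truncated, info)
        return "#..\n..@", reward, terminated, False, {}

env = ToyKeyDoor()
obs, info = env.reset(seed=0)
episode_return, terminated, truncated = 0.0, False, False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step("right")
    episode_return += reward
print(episode_return)  # 1.0
```

Because the loop is identical for every paradigm, a PPO policy, a harnessed LLM, and a human operator can all be scored by the same evaluator, which is exactly the cross-paradigm comparison the benchmark is built for.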

The benchmark's capability-decomposed structure reveals that different architectures excel in different sub-skills. Hybrid agents combining RL policies with LLM reasoning show promise but still trail specialists in their respective domains.

Implications for RL Post-Training

Agentick ships with pre-built SFT datasets, positioning it as a training ground for RL post-training of foundation models in sequential environments. This directly addresses a gap identified in recent work: foundation models lack robust sequential decision-making capabilities that RL-from-scratch agents possess.

The paper notes that even the best-performing agent — GPT-5 mini at 0.309 ONS — leaves substantial room for improvement. An oracle-normalized score of 1.0 represents perfect performance, meaning current agents achieve less than one-third of optimal behavior.

Key Takeaways

  • Agentick benchmark evaluates RL, LLM, VLM, and hybrid agents on 37 tasks.
  • GPT-5 mini leads at 0.309 ONS, but no paradigm dominates.
  • ASCII observations beat natural-language observations across all agent types.

What to watch

Watch for the Agentick leaderboard updates as more labs submit results. Key metric: whether any agent crosses 0.5 ONS within six months, and whether hybrid RL-LLM architectures narrow the gap with PPO on planning tasks.

Figure 1: Two observation modalities for KeyDoorPuzzle at medium difficulty. Left: isometric pixel rendering (512×…).



AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

Agentick fills a genuine gap in agent evaluation: the inability to compare RL agents trained from scratch against foundation model agents on common ground. The finding that ASCII observations outperform natural language is counterintuitive but aligns with earlier work showing that LLMs struggle with spatial reasoning from text. The 3-10x improvement from reasoning harnesses underscores how much headroom exists in prompt engineering relative to model scale.

The most striking result may be PPO's dominance on planning and multi-agent tasks — domains where LLMs have been heavily marketed. This suggests that for tasks requiring multi-step coordination, learned policies still beat prompted reasoning. The benchmark's design as both evaluation and training infrastructure positions it as a potential standard for RL post-training of foundation models, a direction several labs are pursuing.

A limitation: the paper does not disclose compute costs for each agent configuration, making it hard to assess efficiency. The oracle-normalized score also masks variance across difficulty levels — an agent that excels at easy tasks but fails hard ones could have a misleadingly high average.