gentic.news — AI News Intelligence Platform

[Figure: a bar chart comparing RL, LLM, VLM, hybrid, and human agent scores on the Agentick benchmark, with GPT-5 mini…]

Agentick Benchmark: GPT-5 Mini Tops at 0.309, No Agent Paradigm Dominates

Agentick benchmark evaluates RL, LLM, VLM, and hybrid agents on 37 tasks. GPT-5 mini leads at 0.309 ONS, but no paradigm dominates. ASCII beats natural language.

Source: arxiv.org (via arxiv_ai; single source)
TL;DR

37 procedurally generated tasks across six capabilities · GPT-5 mini leads at 0.309 oracle-normalized score · Reasoning harness multiplies LLM performance 3-10x

The Agentick benchmark pits RL, LLM, VLM, hybrid, and human agents against one another on 37 tasks. GPT-5 mini leads at a 0.309 oracle-normalized score, but no paradigm dominates.

Key facts

  • 37 procedurally generated tasks across six capability categories
  • 27 agent configurations evaluated over 90,000+ episodes
  • GPT-5 mini leads at 0.309 oracle-normalized score
  • Reasoning harness improves LLM performance 3-10x
  • ASCII observations outperform natural language across all agents

Researchers from Google DeepMind and Université de Montréal released Agentick, a unified benchmark for sequential decision-making agents. The benchmark provides 37 procedurally generated tasks across six capability categories — including planning, multi-agent coordination, and memory — with four difficulty levels and five observation modalities [per the arXiv preprint].

Key Findings from 90,000+ Episodes

An evaluation spanning 27 agent configurations and over 90,000 episodes reveals stark performance gaps. GPT-5 mini leads overall at 0.309 oracle-normalized score (ONS), while PPO trained for 2 million steps achieves 0.287 ONS. However, PPO dominates planning and multi-agent tasks, where LLM-based agents lag significantly.
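The oracle-normalized score can be illustrated with a short sketch. The article does not spell out the exact formula, so the min-max form below, which rescales an agent's episodic return between a random baseline and the oracle policy's return, is an assumption; the `random_return` default of 0 is likewise illustrative.

```python
def oracle_normalized_score(agent_return: float,
                            oracle_return: float,
                            random_return: float = 0.0) -> float:
    """Normalize an agent's episodic return against an oracle policy.

    1.0 means oracle-level performance; 0.0 means no better than the
    random baseline. (Assumed min-max form; the paper's exact
    normalization is not quoted in this article.)
    """
    if oracle_return == random_return:
        raise ValueError("oracle and random baselines coincide")
    return (agent_return - random_return) / (oracle_return - random_return)

# On this reading, GPT-5 mini's 0.309 ONS means roughly 31% of the
# oracle's return above the random baseline:
print(oracle_normalized_score(agent_return=30.9, oracle_return=100.0))
```

Under this normalization, the gap between GPT-5 mini (0.309) and PPO (0.287) is about two points of oracle-relative return, which is why per-capability breakdowns matter more than the headline average.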

The reasoning harness — a chain-of-thought wrapper — multiplies LLM performance by 3-10x, suggesting that prompting strategy matters more than model scale for these tasks. Surprisingly, ASCII observations consistently outperform natural language observations across all agent types, challenging the assumption that richer representations always help.
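A chain-of-thought harness of the kind described can be sketched as a thin wrapper around the model call. Everything here is an assumption for illustration: `call_llm` is a hypothetical placeholder returning a canned reply, and the prompt format and parsing are not taken from the paper.

```python
def call_llm(prompt: str) -> str:
    # Placeholder: a real harness would query a model endpoint here.
    return "THOUGHT: the key is up and to the left.\nACTION: up"

def harnessed_policy(observation: str, legal_actions: list[str]) -> str:
    """Wrap an LLM call so it reasons before committing to an action."""
    prompt = (
        "You control an agent in a grid-world task.\n"
        f"Observation (ASCII):\n{observation}\n"
        f"Legal actions: {', '.join(legal_actions)}\n"
        "Reason step by step after 'THOUGHT:', then output exactly one "
        "legal action after 'ACTION:'."
    )
    reply = call_llm(prompt)
    # Parse the final ACTION line; fall back to the first legal action
    # if the model's output is malformed.
    for line in reversed(reply.splitlines()):
        if line.startswith("ACTION:"):
            action = line.removeprefix("ACTION:").strip()
            if action in legal_actions:
                return action
    return legal_actions[0]

print(harnessed_policy("#K.\n.@.", ["up", "down", "left", "right"]))
```

The point of the sketch is that the 3-10x gain comes from restructuring the same model's inference, not from a larger model, which is what makes the finding about prompting strategy versus scale notable.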

No Silver Bullet for Agent Architectures

Agentick's design explicitly addresses the fragmentation in agent evaluation. Existing benchmarks often favor one paradigm — RL on Gym environments or LLMs on static QA — making cross-paradigm comparison impossible. Agentick provides a single Gymnasium-compatible interface, oracle reference policies for all tasks, pre-built SFT datasets, and a live leaderboard [according to the paper].
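A single Gymnasium-compatible interface means every agent, whether RL or LLM, interacts through the same `reset`/`step` loop. The toy environment below is a made-up stand-in, not an Agentick task; it only shows the standard five-tuple `step` contract an evaluator would run against.

```python
class ToyKeyDoor:
    """Made-up two-step task: pick up the key, then open the door."""

    def reset(self, seed=None):
        self.t = 0
        return "#K.\n.@D", {}          # ASCII observation, empty info dict

    def step(self, action):
        self.t += 1
        terminated = self.t >= 2        # done after key + door
        reward = 1.0 if terminated else 0.0
        # Gymnasium contract: (obs, reward, terminated, truncated, info)
        return "#..\n..@", reward, terminated, False, {}

env = ToyKeyDoor()
obs, info = env.reset(seed=0)
episode_return, terminated, truncated = 0.0, False, False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step("right")
    episode_return += reward
print(episode_return)  # 1.0
```

Because the loop is identical for every paradigm, a PPO policy, a harnessed LLM, and a human operator can all be scored by the same evaluator, which is exactly the cross-paradigm comparison the benchmark is built for.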

The benchmark's capability-decomposed structure reveals that different architectures excel in different sub-skills. Hybrid agents combining RL policies with LLM reasoning show promise but still trail specialists in their respective domains.

Implications for RL Post-Training

Agentick ships with pre-built SFT datasets, positioning it as a training ground for RL post-training of foundation models in sequential environments. This directly addresses a gap identified in recent work: foundation models lack robust sequential decision-making capabilities that RL-from-scratch agents possess.

The paper notes that even the best-performing agent — GPT-5 mini at 0.309 ONS — leaves substantial room for improvement. An oracle-normalized score of 1.0 represents perfect performance, meaning current agents achieve less than one-third of optimal behavior.

Key Takeaways

  • Agentick benchmark evaluates RL, LLM, VLM, and hybrid agents on 37 tasks.
  • GPT-5 mini leads at 0.309 ONS, but no paradigm dominates.
  • ASCII observations beat natural-language observations across all agent types.

What to watch

Watch for the Agentick leaderboard updates as more labs submit results. Key metric: whether any agent crosses 0.5 ONS within six months, and whether hybrid RL-LLM architectures narrow the gap with PPO on planning tasks.

Figure 1: Two observation modalities for KeyDoorPuzzle at medium difficulty. Left: isometric pixel rendering (512×…).



AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

Agentick fills a genuine gap in agent evaluation: the inability to compare RL agents trained from scratch against foundation model agents on common ground. The finding that ASCII observations outperform natural language is counterintuitive but aligns with earlier work showing that LLMs struggle with spatial reasoning from text. The 3-10x improvement from reasoning harnesses underscores how much headroom exists in prompt engineering relative to model scale.

The most striking result may be PPO's dominance on planning and multi-agent tasks — domains where LLMs have been heavily marketed. This suggests that for tasks requiring multi-step coordination, learned policies still beat prompted reasoning. The benchmark's design as both evaluation and training infrastructure positions it as a potential standard for RL post-training of foundation models, a direction several labs are pursuing.

A limitation: the paper does not disclose compute costs for each agent configuration, making it hard to assess efficiency. The oracle-normalized score also masks variance across difficulty levels — an agent that excels at easy tasks but fails hard ones could have a misleadingly high average.