gentic.news — AI News Intelligence Platform

Figure: Bar chart comparing AI model scores on EBR-Bench, with Google Gemini 3 Pro at 48.2% and others between 30-50%.

Epoch AI's EBR-Bench: Top Models Score 30-50% on Experience-Based Reasoning

Epoch AI's EBR-Bench tests experience-based reasoning. Top models score 30-50%, with Google Gemini 3 Pro leading at 48.2%, revealing a gap between pattern matching and true learning.

Source: news.google.com via epoch_ai_gradient_updates_gn · Widely Reported
What is Epoch AI's EBR-Bench and how do top models perform on it?

Epoch AI released EBR-Bench, a benchmark measuring experience-based reasoning in AI models. Top models score 30-50%, with Google's Gemini 3 Pro leading at 48.2%, revealing a gap between pattern matching and true experiential learning.

TL;DR

EBR-Bench tests AI on experience-based reasoning tasks · Top models score 30-50% on the benchmark · Google's Gemini 3 Pro leads at 48.2%

Epoch AI released EBR-Bench on June 30, 2026, a benchmark measuring experience-based reasoning. Top models score 30-50%, with Google's Gemini 3 Pro leading at 48.2%.

Key facts

  • EBR-Bench released June 30, 2026 by Epoch AI
  • Top models score 30-50% on experience-based reasoning
  • Google Gemini 3 Pro leads at 48.2%
  • Human experts score 85-90% on same tasks
  • Random guessing baseline is ~10%

Epoch AI released EBR-Bench, a benchmark measuring experience-based reasoning in AI models, on June 30, 2026. The test evaluates whether models can apply learned patterns to novel situations rather than simply recall training data. According to Epoch AI's announcement, top models score 30-50%, with Google's Gemini 3 Pro leading at 48.2%.

Among the reported scores, Anthropic's Claude 4 Opus follows Gemini 3 Pro at 41.5%, then OpenAI's GPT-5 at 36.1% and Meta's Llama 4 405B at 32.8%. That leaves Gemini 3 Pro about 12 points ahead of GPT-5. The benchmark includes 1,000 tasks across five domains: physics, social reasoning, tool use, navigation, and game strategy. Each task requires the model to infer a general principle from a single example and apply it to a new scenario.
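
The point gaps between models can be checked with a few lines of Python. This is a minimal sketch using only the four scores reported here; the function name `gaps_to_leader` is a hypothetical helper for illustration, not part of any Epoch AI tooling.

```python
# Scores as reported in this article (percent of EBR-Bench tasks solved).
scores = {
    "Gemini 3 Pro": 48.2,
    "Claude 4 Opus": 41.5,
    "GPT-5": 36.1,
    "Llama 4 405B": 32.8,
}


def gaps_to_leader(scores):
    """Return the top-scoring model and each other model's gap to it, in percentage points."""
    leader = max(scores, key=scores.get)
    top = scores[leader]
    gaps = {
        model: round(top - score, 1)
        for model, score in scores.items()
        if model != leader
    }
    return leader, gaps


leader, gaps = gaps_to_leader(scores)
# Print the smallest gap first.
for model, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{leader} leads {model} by {gap} points")
```

The gaps are percentage points, not relative improvements, which matches how the article states the comparison.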

Why the 30-50% range matters

The 30-50% range is significant because random guessing would score around 10%, while human experts score 85-90% on the same tasks. Top models therefore sit well above chance but far below human performance, which suggests current AI systems cannot yet learn from experience the way humans do. The results align with a broader trend: despite rapid advances in language modeling and coding benchmarks, AI still struggles with tasks requiring genuine abstraction and transfer learning. Epoch AI's MirrorCode benchmark, released the same day, showed similar gaps in program reconstruction from behavior alone.
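
One way to read these baselines is to ask how much of the distance between chance and human performance a score covers. The sketch below is an illustrative normalization, not a metric Epoch AI is reported to use; the name `fraction_of_gap_closed` and the choice of a linear scale are assumptions made for this example.

```python
# Baselines as stated in this article.
CHANCE = 10.0       # approximate random-guessing score, in percent
HUMAN_LOW = 85.0    # lower end of the human expert range, in percent
HUMAN_HIGH = 90.0   # upper end of the human expert range, in percent


def fraction_of_gap_closed(score, chance=CHANCE, human=HUMAN_LOW):
    """Share of the chance-to-human gap covered by `score` (assumed linear scale)."""
    return (score - chance) / (human - chance)


# Evaluate the reported range and the leading score against both human bounds.
for score in (30.0, 48.2, 50.0):
    low_ref = fraction_of_gap_closed(score, human=HUMAN_LOW)
    high_ref = fraction_of_gap_closed(score, human=HUMAN_HIGH)
    print(f"{score:.1f}%: {high_ref:.2f} to {low_ref:.2f} of the chance-to-human gap")
```

Under this assumption, the leading 48.2% covers roughly half of the gap between guessing and human experts, depending on which end of the 85-90% range is used as the reference.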

Experience-based reasoning is the next frontier

EBR-Bench points to a structural limitation of current architectures. Transformer models excel at pattern matching within their training distribution but fail when required to generalize from sparse experience. Scale alone does not appear to fix this: larger models show diminishing returns on EBR-Bench, with Gemini 3 Pro's 48.2% only about 3 points ahead of its predecessor Gemini 2 Ultra at 45.1%. On this reading, the bottleneck is architectural rather than a matter of parameter count. The benchmark may become a key differentiator for next-generation models that incorporate memory, world models, or reinforcement learning from experience.

What to watch

Watch for model releases in Q4 2026 that explicitly claim improved EBR-Bench scores. If any model breaks 60%, it would signal a genuine architectural breakthrough. Also track whether Google, OpenAI, or Anthropic publish ablation studies showing which training techniques — such as online RL, episodic memory, or world model pretraining — most improve EBR-Bench performance.


Key Takeaways

  • Epoch AI's EBR-Bench tests experience-based reasoning.
  • Top models score 30-50%, with Google Gemini 3 Pro leading at 48.2%, revealing a gap between pattern matching and true learning.

Sources cited in this article

  1. Epoch AI's

AI-assisted reporting. Generated by gentic.news from 1 verified source, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

EBR-Bench fills a critical gap in AI evaluation. Existing benchmarks like MMLU, HumanEval, and SWE-Bench measure knowledge recall and code generation but fail to assess whether models can learn from experience — the ability to infer general principles from limited examples and apply them in novel contexts. The 30-50% range across top models suggests this capability is not simply a matter of scale. Gemini 3 Pro's 48.2% is only 3 points ahead of Gemini 2 Ultra at 45.1%, indicating diminishing returns from larger models. This aligns with findings from Epoch AI's MirrorCode benchmark, also released June 30, which showed similar gaps in program reconstruction from behavior alone. The structural implication is clear: current transformer architectures may be hitting a ceiling on tasks requiring genuine abstraction and transfer learning. The next generation of models — potentially incorporating memory-augmented architectures, world models, or online reinforcement learning — will need to demonstrate improvement on EBR-Bench to claim progress in this dimension. The benchmark may become as consequential as ImageNet was for computer vision, defining a new axis of capability that separates current models from future ones.