EBR-Bench is a benchmark released by Epoch AI on June 30, 2026, that tests AI models' ability to apply learned patterns to novel situations — measuring experience-based reasoning rather than training data recall.

Bar chart comparing AI model scores on EBR-Bench, with Google Gemini 3 Pro at 48.2% and others between 30-50%

Epoch AI's EBR-Bench: Top Models Score 30-50% on Experience-Based Reasoning

Epoch AI's EBR-Bench tests experience-based reasoning. Top models score 30-50%, with Google Gemini 3 Pro leading at 48.2%, revealing a gap between pattern matching and true learning.

Ala SMITH & AI Research Desk · 2d ago · 3 min read · AI-Generated

Source: news.google.com via epoch_ai_gradient_updates_gn · Widely Reported

TL;DR

EBR-Bench tests AI on experience-based reasoning tasks · Top models score 30-50% on the benchmark · Google's Gemini 3 Pro leads at 48.2%

Epoch AI released EBR-Bench on June 30, 2026, a benchmark measuring experience-based reasoning. Top models score 30-50%, with Google's Gemini 3 Pro leading at 48.2%.

Key facts

EBR-Bench released June 30, 2026 by Epoch AI
Top models score 30-50% on experience-based reasoning
Google Gemini 3 Pro leads at 48.2%
Human experts score 85-90% on same tasks
Random guessing baseline is ~10%

Epoch AI released EBR-Bench, a benchmark measuring experience-based reasoning in AI models, on June 30, 2026. The test evaluates whether models can apply learned patterns to novel situations, not just recall training data. Top models score 30-50%, with Google's Gemini 3 Pro leading at 48.2%, according to Epoch AI's announcement.

Gemini 3 Pro's 48.2% score is 12.1 percentage points ahead of OpenAI's GPT-5 at 36.1%. Anthropic's Claude 4 Opus scored 41.5%, while Meta's Llama 4 405B scored 32.8%. The benchmark includes 1,000 tasks across five domains: physics, social reasoning, tool use, navigation, and game strategy. Each task requires the model to infer a general principle from a single example and apply it to a new scenario.
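To make the spread between models concrete, the short Python sketch below recomputes each model's distance from the leader using only the four scores quoted above. The function name `gaps_to_leader` and the dictionary layout are illustrative choices, not part of any Epoch AI tooling.

```python
# Scores (percent correct) exactly as quoted in the article.
scores = {
    "Gemini 3 Pro": 48.2,
    "Claude 4 Opus": 41.5,
    "GPT-5": 36.1,
    "Llama 4 405B": 32.8,
}


def gaps_to_leader(scores):
    """Return (leader, {model: percentage points behind the leader})."""
    leader = max(scores, key=scores.get)
    top = scores[leader]
    # Round to one decimal to hide floating-point noise such as 12.100000000000001.
    gaps = {m: round(top - s, 1) for m, s in scores.items() if m != leader}
    return leader, gaps


leader, gaps = gaps_to_leader(scores)
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{leader} leads {model} by {gap} points")
```

On these figures the closest competitor is Claude 4 Opus, 6.7 points behind, while the 12.1-point figure refers specifically to GPT-5.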

Why the 30-50% range matters

The 30-50% range is significant because random guessing would score around 10%, but human experts score 85-90% on the same tasks. This suggests current AI systems lack the ability to truly learn from experience as humans do. The results align with a broader trend: despite rapid advances in language modeling and coding benchmarks, AI still struggles with tasks requiring genuine abstraction and transfer learning. Epoch AI's MirrorCode benchmark, released the same day, showed similar gaps in program reconstruction from behavior alone.
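One way to read a raw score against both baselines is to ask what fraction of the distance from chance to human performance a model has covered. The sketch below does that arithmetic. The normalization itself is an assumption made here for illustration and is not a metric Epoch AI is reported to use; the function name `chance_normalized` is likewise hypothetical.

```python
def chance_normalized(score, chance=10.0, human=85.0):
    """Fraction of the chance-to-human gap closed by `score` (all values in percent).

    Assumption: a simple linear rescaling, with chance at about 10% and
    human experts at 85-90% as stated in the article.
    """
    if human <= chance:
        raise ValueError("human baseline must exceed the chance baseline")
    return (score - chance) / (human - chance)


# Evaluate the top score against both ends of the reported human range.
for human in (85.0, 90.0):
    frac = chance_normalized(48.2, human=human)
    print(f"human baseline {human:.0f}%: fraction of gap closed = {frac:.3f}")
```

Under this rescaling the leading model covers roughly half of the distance between random guessing and expert performance, whichever end of the human range is used.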

Experience-based reasoning is the next frontier

EBR-Bench exposes a structural limitation of current architectures. Transformer models excel at pattern matching within their training distribution but fail when required to generalize from sparse experience. This is not a scaling problem — larger models show diminishing returns on EBR-Bench, with Gemini 3 Pro's 48.2% only 3 points ahead of its predecessor Gemini 2 Ultra at 45.1%. The bottleneck is architectural, not parametric. The benchmark may become a key differentiator for next-generation models that incorporate memory, world models, or reinforcement learning from experience.
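The diminishing-returns argument rests on two numbers, so it can be checked directly. The sketch below compares the gain from Gemini 2 Ultra to Gemini 3 Pro with the distance still remaining to the lower end of the human range. All inputs are the figures quoted in the article; the variable names are illustrative, and no extrapolation about future models is implied.

```python
# Figures quoted in the article (percent correct).
gemini_3_pro = 48.2
gemini_2_ultra = 45.1
human_low = 85.0  # lower end of the reported 85-90% human range

# Generation-over-generation gain, in percentage points and relative terms.
absolute_gain = round(gemini_3_pro - gemini_2_ultra, 1)
relative_gain = absolute_gain / gemini_2_ultra

# Distance still separating the top model from human experts.
remaining_gap = round(human_low - gemini_3_pro, 1)

print(f"absolute gain: {absolute_gain} points")
print(f"relative gain: {relative_gain:.3f}")
print(f"remaining gap to human experts: {remaining_gap} points")
```

The one-generation gain of 3.1 points is small next to the 36.8 points that separate the top model from the lower human baseline, which is the comparison behind the article's claim.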

What to watch

Watch for model releases in Q4 2026 that explicitly claim improved EBR-Bench scores. If any model breaks 60%, it would signal a genuine architectural breakthrough. Also track whether Google, OpenAI, or Anthropic publish ablation studies showing which training techniques — such as online RL, episodic memory, or world model pretraining — most improve EBR-Bench performance.

Key Takeaways

Epoch AI's EBR-Bench tests experience-based reasoning.
Top models score 30-50%, with Google Gemini 3 Pro leading at 48.2%, revealing a gap between pattern matching and true learning.

Sources cited in this article

Epoch AI's

AI-assisted reporting. Generated by gentic.news from 1 verified source, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

EBR-Bench fills a critical gap in AI evaluation. Existing benchmarks like MMLU, HumanEval, and SWE-Bench measure knowledge recall and code generation but fail to assess whether models can learn from experience — the ability to infer general principles from limited examples and apply them in novel contexts. The 30-50% range across top models suggests this capability is not simply a matter of scale. Gemini 3 Pro's 48.2% is only 3 points ahead of Gemini 2 Ultra at 45.1%, indicating diminishing returns from larger models. This aligns with findings from Epoch AI's MirrorCode benchmark, also released June 30, which showed similar gaps in program reconstruction from behavior alone. The structural implication is clear: current transformer architectures may be hitting a ceiling on tasks requiring genuine abstraction and transfer learning. The next generation of models — potentially incorporating memory-augmented architectures, world models, or online reinforcement learning — will need to demonstrate improvement on EBR-Bench to claim progress in this dimension. The benchmark may become as consequential as ImageNet was for computer vision, defining a new axis of capability that separates current models from future ones.
