Epoch AI released EBR-Bench on June 30, 2026, a benchmark measuring experience-based reasoning. Top models score 30-50%, with Google's Gemini 3 Pro leading at 48.2%.
Key facts
- EBR-Bench released June 30, 2026 by Epoch AI
- Top models score 30-50% on experience-based reasoning
- Google Gemini 3 Pro leads at 48.2%
- Human experts score 85-90% on the same tasks
- Random guessing baseline is ~10%
Epoch AI released EBR-Bench, a benchmark measuring experience-based reasoning in AI models, on June 30, 2026. The test evaluates whether models can apply learned patterns to novel situations rather than simply recall training data. According to Epoch AI's announcement, top models score 30-50%, with Google's Gemini 3 Pro leading at 48.2%.
Gemini 3 Pro's 48.2% puts it 6.7 points ahead of Anthropic's Claude 4 Opus at 41.5% and 12.1 points ahead of OpenAI's GPT-5 at 36.1%. Meta's Llama 4 405B scored 32.8%. The benchmark includes 1,000 tasks across five domains: physics, social reasoning, tool use, navigation, and game strategy. Each task requires the model to infer a general principle from a single example and apply it to a new scenario.
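The point gaps between models can be rechecked directly from the reported scores. The sketch below is only an illustration: the `SCORES` dictionary and the helper name `gaps_to_leader` are assumptions made for this example and are not part of any Epoch AI tooling. The numbers are the scores quoted in this article.

```python
# Scores (percent) as reported in this article.
SCORES = {
    "Gemini 3 Pro": 48.2,
    "Claude 4 Opus": 41.5,
    "GPT-5": 36.1,
    "Llama 4 405B": 32.8,
}


def gaps_to_leader(scores):
    """Return the top-scoring model and each other model's gap to it.

    Hypothetical helper for illustration. Gaps are in percentage points,
    rounded to one decimal place to match the precision of the inputs.
    """
    leader = max(scores, key=scores.get)
    top = scores[leader]
    gaps = {
        model: round(top - score, 1)
        for model, score in scores.items()
        if model != leader
    }
    return leader, gaps


if __name__ == "__main__":
    leader, gaps = gaps_to_leader(SCORES)
    # List the trailing models from the smallest gap to the largest.
    for model, gap in sorted(gaps.items(), key=lambda item: item[1]):
        print(f"{model}: {gap:.1f} points behind {leader}")
```

Rounding to one decimal place keeps floating-point noise out of the printed gaps, since the inputs themselves carry only one decimal.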
Why the 30-50% range matters
The 30-50% range matters because it sits well above the roughly 10% expected from random guessing yet far below the 85-90% that human experts score on the same tasks. This suggests current AI systems cannot yet learn from experience the way humans do. The results align with a broader trend: despite rapid advances in language modeling and coding benchmarks, AI still struggles with tasks requiring genuine abstraction and transfer learning. Epoch AI's work on MirrorCode, released the same day, showed similar gaps in program reconstruction from behavior alone.
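One way to make the comparison with the random and human baselines concrete is to rescale a raw score so that chance maps to 0 and human performance maps to 1. This is a common normalization but an assumption here: nothing in the announcement as reported says Epoch AI uses it, and the function name `chance_normalized` is hypothetical. The baseline values are the approximate figures quoted in this article.

```python
def chance_normalized(score, chance=10.0, human=85.0):
    """Rescale a raw percentage so that chance = 0.0 and human = 1.0.

    Illustrative helper, not an official EBR-Bench metric. The defaults
    use the ~10% random baseline and the lower end of the 85-90% human
    range quoted in the article.
    """
    if human <= chance:
        raise ValueError("human baseline must exceed the chance baseline")
    return (score - chance) / (human - chance)


if __name__ == "__main__":
    top_score = 48.2  # Gemini 3 Pro, as reported
    # Evaluate against both ends of the reported human range.
    for human in (85.0, 90.0):
        value = chance_normalized(top_score, human=human)
        print(f"human baseline {human:.0f}%: normalized score {value:.2f}")
```

Under these assumptions, the leading 48.2% covers roughly half of the distance between random guessing and human experts (about 0.48 to 0.51, depending on which end of the human range is used).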
Experience-based reasoning is the next frontier
EBR-Bench exposes a structural limitation of current architectures. Transformer models excel at pattern matching within their training distribution but fail when required to generalize from sparse experience. This is not a scaling problem: larger models show diminishing returns on EBR-Bench, with Gemini 3 Pro's 48.2% only 3.1 points ahead of its predecessor Gemini 2 Ultra at 45.1%. The bottleneck is architectural, not parametric. The benchmark may become a key differentiator for next-generation models that incorporate memory, world models, or reinforcement learning from experience.
What to watch
Watch for model releases in Q4 2026 that explicitly claim improved EBR-Bench scores. If any model breaks 60%, it would signal a genuine architectural breakthrough. Also track whether Google, OpenAI, or Anthropic publish ablation studies showing which training techniques (such as online RL, episodic memory, or world model pretraining) most improve EBR-Bench performance.
Source: news.google.com
Key Takeaways

- Epoch AI's EBR-Bench tests experience-based reasoning.
- Top models score 30-50%, with Google Gemini 3 Pro leading at 48.2%, revealing a gap between pattern matching and true learning.









