gentic.news — AI News Intelligence Platform
Benchmarks Lab

Live frontier-AI benchmarks — every score, every model, no marketing.

Benchmark Catalogue

Every benchmark this lab tracks. Each tile shows current SOTA, holder, and the lab's calibrated reliability score — because not every benchmark is created equal.

Verified runs — feed

The last benchmark observations the lab confirmed. Pulled live from /api/v1/lab/findings.

All findings →

Continue in

Agents Lab →

Which models actually ship as agents — OSWorld, GAIA, τ-bench scores in production.

Continue in

Predictions Lab →

Forecasts on the next SOTA — when does GPT-6 break HLE? Which benchmark falls next?

AI Model Leaderboard

Ranked by real benchmarks + news momentum from 89+ sources

41 models, 20 with benchmarks

Top Companies

Top Performers

Models with verified benchmark scores — sort by any metric

1. Anthropic, 61, rising
2. OpenAI, 57, new. Arena ELO: 1450; SWE-bench: 80.0
3. DeepSeek, 53, new. Arena ELO: 1380; MMLU-Pro: 85.0
4. Anthropic, 52, steady. Arena ELO: 1470; SWE-bench: 79.6; MMLU-Pro: 85.0
5. Moonshot AI, 51, quiet
6. Google, 51, fading. Arena ELO: 1485; SWE-bench: 80.6; MMLU-Pro: 90.1
7. OpenAI, 49, quiet
8. Anthropic, 48, fading
9. Google, 48, quiet. Arena ELO: 1473; MMLU-Pro: 88.6
10. DeepSeek, 44, fading. Arena ELO: 1436; MMLU-Pro: 84.0

Rising & Noteworthy

Trending models gaining momentum in the news — some may lack benchmark data

Claude Mythos

Anthropic

new
3 mentions/7d, has benchmarks, product launch
Fable 5

Anthropic

cooling

Anthropic's Claude Fable 5 is a de-fanged, public version of the Mythos-class AI model, restricted from discussing dangerous topics like cybersecurity. It represents the biggest capability step up sin

2 mentions/7d, product launch
Multimodal Large Language Model
new

Multimodal Large Language Models (MLLMs) are advanced AI systems, developed by organizations like OpenAI and Google, that process and reason across multiple data types like text, images, and audio.

1 mention/7d
Gemini 3 Deep Think

Google

new

Gemini 3 Deep Think: AI model update designed for science: Deep Think analyzes the drawing, models the complex shape and generates a file to create the physical object with 3D printing

1 mention/7d, research milestone
Qwen 3.5 4B

Alibaba

new

Qwen 3.5 4B, developed by Alibaba, is a smaller, open-source model from the Qwen 3.5 series designed for efficient local deployment with competitive performance.

1 mention/7d, product launch
GPT-4 Turbo

OpenAI

new

GPT-4 Turbo, developed by OpenAI, is a large language model featuring a 128K context window, faster response times, and more cost-effective operation than its predecessor.

1 mention/7d, product launch
GPT-4V

OpenAI

new

GPT-4V, developed by OpenAI, is a multimodal large language model that processes and generates text from both image and text inputs.

1 mention/7d, research milestone
DeepSeek V4

DeepSeek

cooling
1 mention/7d, product launch
Kimi K2.5

Moonshot AI

quiet

Moonshot's January 2026 open visual agentic model; OSWorld-Verified 63.3%.

0 mentions/7d, has benchmarks, research milestone
GPT-5.2 Pro

OpenAI

quiet

GPT-5.2 is OpenAI's latest flagship large language model, released on December 11, 2025. Succeeding GPT-5.1, it is a family of three large language models within the GPT series. It comes in two modes:

0 mentions/7d, has benchmarks, research milestone
LLaMA 3

Meta

quiet

Meta's LLaMA 3 is its latest large language model, released in two primary sizes (8B and 70B parameters) and trained on approximately 15 trillion tokens for enhanced reasoning and coding capabilities.

0 mentions/7d, has benchmarks, research milestone
Minimax M3

MiniMax

fading
0 mentions/7d, has benchmarks, research milestone
Claude Haiku 4.5

Anthropic

quiet
0 mentions/7d, has benchmarks, research milestone
Claude 3.5 Sonnet

Anthropic

quiet

Claude 3.5 Sonnet is a multimodal language model developed by Anthropic, first released in June 2024. It achieves an MMLU-Pro score of 78.0, an Arena ELO rating of 1268, and a SWE-bench Verified score

0 mentions/7d, has benchmarks, product launch
GPT-4o

OpenAI

fading

GPT-4o is OpenAI’s multimodal language model, first observed on 2026-02-16, also tracked under the alias GPT-4. It natively processes text, images, and audio within a unified architecture. As of its i

0 mentions/7d, has benchmarks, product launch

How scoring works: Models with benchmark data: 50% benchmarks + 30% relevance + 20% buzz. Without benchmarks: 60% relevance + 40% buzz. Benchmark colors: green = top 25%, yellow = middle 50%, gray = bottom 25%.
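The weighting and color rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page does not say how each signal is normalized, so the sketch assumes every input is already on a common 0-100 scale, and the names `ModelSignals`, `composite_score`, and `benchmark_color` are hypothetical, not part of any gentic.news API. The quartile boundaries (which side a score exactly on the 25% line falls) are also an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ModelSignals:
    """Hypothetical container; all values assumed to be on a 0-100 scale."""
    relevance: float
    buzz: float
    benchmarks: Optional[float] = None  # None when no benchmark data exists


def composite_score(s: ModelSignals) -> float:
    """Apply the weights stated on the page."""
    if s.benchmarks is not None:
        # With benchmark data: 50% benchmarks + 30% relevance + 20% buzz
        return 0.5 * s.benchmarks + 0.3 * s.relevance + 0.2 * s.buzz
    # Without benchmark data: 60% relevance + 40% buzz
    return 0.6 * s.relevance + 0.4 * s.buzz


def benchmark_color(value: float, all_values: Sequence[float]) -> str:
    """Green = top 25%, yellow = middle 50%, gray = bottom 25%."""
    # Fraction of tracked scores strictly better than this one
    share_above = sum(v > value for v in all_values) / len(all_values)
    if share_above < 0.25:
        return "green"
    if share_above >= 0.75:
        return "gray"
    return "yellow"


with_bench = composite_score(ModelSignals(relevance=50, buzz=40, benchmarks=70))
without_bench = composite_score(ModelSignals(relevance=50, buzz=40))
print(round(with_bench, 1), round(without_bench, 1))

sample = [10, 20, 30, 40, 50, 60, 70, 80]  # made-up scores for illustration
print([benchmark_color(v, sample) for v in sample])
```

Note that under these weights a model with identical relevance and buzz scores differently depending on whether benchmark data exists, so the two groups are not directly comparable on the composite alone.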

Go deeper

Each benchmarked model has a live entity profile. Compare them head-to-head, or jump to the per-vertical leaderboards.

Frequently asked questions

What is an AI benchmark, and why do they matter?
An AI benchmark is a standardized test that measures a model's capability on a defined task — coding (SWE-Bench Verified), reasoning (MMLU-Pro, GPQA), math (AIME, MATH-500), tool use (Berkeley Function-Calling), agentic tasks (OSWorld-Verified, GDPval), or chat preference (Chatbot Arena ELO). Benchmarks matter because they let buyers compare frontier models on apples-to-apples tasks rather than vibes. The catch: any single benchmark is a partial view, several have been gamed via data leakage, and the 2026 trend is composite scoring across 5–10 verified evals.
Which AI model is #1 in 2026?
There is no single #1 — it depends on the task. As of April 2026: Claude Opus 4.7 leads SWE-Bench Verified at 87.6%. Holo3-35B-A3B leads OSWorld-Verified at 80.4%. GPT-5.4 and Gemini 3.0 Pro lead Chatbot Arena ELO. DeepSeek V4 dominates the open-weights price-performance Pareto. Surfer 2 leads WebVoyager at 97.1%. Our leaderboard above ranks by composite news-momentum + verified benchmark score, updated hourly.
How does gentic.news rank AI models?
We combine three signals: (1) verified benchmark scores from the official leaderboards (OSWorld-Verified, BrowseComp, SWE-Bench Verified, Terminal-Bench 2.0, etc.), (2) Chatbot Arena ELO when available, and (3) news-momentum from our knowledge graph — how many sources cite the model, sentiment polarity, and entity centrality. Verified benchmark scores dominate the ranking; momentum is a tiebreaker. Models without verifiable scores show momentum-only and are flagged.
Are AI benchmarks gamed?
Several have been. A 2026 Berkeley RDI study showed eight major agentic benchmarks could be gamed to near-100% via config leakage or DOM injection. Data contamination is a known issue with MMLU and GSM8K. The 2025–2026 response was the held-out, contamination-resistant generation: SWE-Bench Pro, Terminal-Bench 2.0, OSWorld-Verified, GDPval. We exclude unverified or self-reported scores from the rankings on this page and link to the original leaderboard URL for every score.
Where does the data come from?
Benchmark scores: official leaderboards (os-world.github.io, browsecomp.openai.com, swebench.com, lmarena.ai), maker publications (Anthropic, OpenAI, Google DeepMind, Moonshot, ByteDance), and independent verification when models can be self-hosted. News momentum: 89+ AI sources scanned every 3 hours by our agent pipeline. Knowledge-graph centrality: computed nightly from entity co-occurrence in the news corpus. Methodology: gentic.news/methodology.
Which benchmarks should I trust in 2026?
For agents: OSWorld-Verified, BrowseComp, Terminal-Bench 2.0 (the 2026 'core triad'), plus GDPval for economic-impact realism. For coding: SWE-Bench Verified, SWE-Bench Pro, LiveCodeBench. For reasoning: GPQA Diamond, MMLU-Pro, AIME. For chat preference: Chatbot Arena ELO. For tool use: Berkeley Function-Calling Leaderboard. Avoid leaning on any single number — composite views are more robust to leakage.
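To see why a composite view dampens the effect of leakage, consider a small worked example. The eval names and scores below are made up for illustration, and the composite is assumed to be a plain unweighted mean, which is not necessarily how any real leaderboard aggregates.

```python
from statistics import mean


def composite(scores: dict[str, float]) -> float:
    """Unweighted mean across evals (an assumed aggregation rule)."""
    return mean(scores.values())


# Hypothetical scores on five evals for one model
clean = {"eval_a": 70.0, "eval_b": 72.0, "eval_c": 68.0, "eval_d": 71.0, "eval_e": 69.0}

# Same model, but eval_a is inflated by 25 points through hypothetical leakage
leaked = dict(clean, eval_a=95.0)

single_shift = leaked["eval_a"] - clean["eval_a"]
composite_shift = composite(leaked) - composite(clean)

print(f"single-benchmark shift: {single_shift:.1f} points")
print(f"composite shift:        {composite_shift:.1f} points")
```

With five equally weighted evals, inflating one of them moves the composite by only a fifth as much, which is the sense in which a composite is more robust than any single number.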
