gentic.news — AI News Intelligence Platform
Benchmarks Lab

Live frontier-AI benchmarks — every score, every model, no marketing.

Benchmark Catalogue

Every benchmark this lab tracks. Each tile shows current SOTA, holder, and the lab's calibrated reliability score — because not every benchmark is created equal.

Verified runs — feed

The latest benchmark observations the lab has confirmed, pulled live from /api/v1/lab/findings (a minimal fetch sketch follows below).

All findings →
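If you want the same feed programmatically, here is a minimal sketch. Only the path /api/v1/lab/findings comes from this page; the host, the query parameter, and the response field names are assumptions, not a published schema.

```python
# Minimal sketch of polling the verified-runs feed.
# Only the path /api/v1/lab/findings is documented here; the host,
# the "limit" parameter, and the response fields are assumptions.
import requests

BASE_URL = "https://gentic.news"  # assumed host
resp = requests.get(f"{BASE_URL}/api/v1/lab/findings", params={"limit": 10}, timeout=10)
resp.raise_for_status()

for finding in resp.json():  # assumed: JSON array of finding objects
    # Field names (model, benchmark, score, verified_at) are illustrative.
    print(finding.get("model"), finding.get("benchmark"),
          finding.get("score"), finding.get("verified_at"))
```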

Continue in

Agents Lab →

Which models actually ship as agents — OSWorld, GAIA, τ-bench scores in production.

Continue in

Predictions Lab →

Forecasts on the next SOTA — when does GPT-6 break HLE? Which benchmark falls next?

AI Model Leaderboard

Ranked by real benchmarks + news momentum from 89+ sources

62 models · 20 with benchmarks

Top Companies

Top Performers

Models with verified benchmark scores — sort by any metric

Rank  Company       Score  Trend   Benchmarks
  1   Anthropic      77    steady
  2   Anthropic      64    fading
  3   Google         60    new     Arena ELO: 1473 · MMLU-Pro: 88.6
  4   Anthropic      59    steady
  5   OpenAI         55    fading  Arena ELO: 1450 · SWE-bench: 80.0
  6   OpenAI         55    fading
  7   Google         54    fading  Arena ELO: 1485 · SWE-bench: 80.6 · MMLU-Pro: 90.1
  8   Moonshot AI    52    fading  SWE-bench: 80.2
  9   Anthropic      52    fading  Arena ELO: 1470 · SWE-bench: 79.6 · MMLU-Pro: 85.0
 10   Anthropic      51    quiet   SWE-bench: 80.9 · MMLU-Pro: 89.5

Rising & Noteworthy

Trending models gaining momentum in the news — some may lack benchmark data

Gemma 4

Google

new

Google's upcoming Gemma 4 is an open-source AI model designed for efficient, high-performance local execution on devices like smartphones.

2 mentions/7d · research milestone
GPT-3.5

OpenAI

fading
2 mentions/7d · research milestone
Spark

Meta

new

Meta Superintelligence Labs developed Muse Spark, its first competitive model from a rebuilt infrastructure aimed at personal superintelligence.

1 mention/7d · product launch
MiniMax M2.5

MiniMax

cooling

MiniMax M2.5, developed by MiniMax, is a frontier AI model designed for real-world productivity and agents, achieving state-of-the-art coding performance with high speed and unmatched cost efficiency.

1 mention/7d · product launch
LLaMA 3

Meta

steady

Meta's LLaMA 3 is a large language model released in two primary sizes (8B and 70B parameters) and trained on approximately 15 trillion tokens for enhanced reasoning and coding capabilities.

1 mention/7d · has benchmarks · research milestone
DeepSeek-V3

DeepSeek

steady

DeepSeek-V3, developed by DeepSeek, is a highly efficient mixture-of-experts language model trained at a fraction of the cost of comparable systems while maintaining strong performance.

1 mention/7d · has benchmarks
Qwen 3.5 Medium

Alibaba

new

Alibaba's Qwen efficiency model. It outperforms Qwen 2.5 235B with 7x fewer active parameters, ships with open weights, and competes with Nemotron-Cascade and Mistral.

1 mention/7d · research milestone
Claude 3.5 Sonnet

Anthropic

steady

Claude 3.5 Sonnet is a large language model developed by Anthropic, first released on February 23, 2026, as part of the Claude 3.5 family. It achieves an MMLU-Pro score of 78.0.

1 mention/7d · has benchmarks · research milestone
Llama 3.1 70B

Meta

new

Meta's Llama 3.1 70B is a 70-billion-parameter large language model, released in July 2024, offering strong performance in text generation and instruction-following tasks.

1 mention/7d
Llama 3 8B

Meta

new

Llama 3 8B, developed by Meta, is an efficient open-source large language model designed for strong performance at a smaller scale.

1 mention/7d
GPT-4 Turbo

OpenAI

steady

GPT-4 Turbo, developed by OpenAI, is a large language model featuring a 128K context window, faster response times, and more cost-effective operation than its predecessor.

1 mention/7d · product launch
GPT-5.2 Pro

OpenAI

fading

GPT-5.2 is OpenAI's latest flagship large language model, released on December 11, 2025. Succeeding GPT-5.1, it is a family of three large language models within the GPT series. It comes in two modes.

0 mentions/7d · has benchmarks · research milestone
DeepSeek-R1

DeepSeek

quiet

DeepSeek-R1 is a 671-billion-parameter reasoning model developed by DeepSeek, trained via reinforcement learning to achieve state-of-the-art performance on coding and reasoning benchmarks.

0 mentions/7d · has benchmarks · research milestone
Kimi K2.5

Moonshot AI

quiet

Kimi K2.5 is an open-source, multimodal AI model from Moonshot AI, featuring 1 trillion parameters, vision capabilities, and Agent Swarm technology for complex task orchestration.

0 mentions/7d · has benchmarks · research milestone
GPT-5.3-Codex

OpenAI

quiet

GPT-5.1 is a family of four large language models within OpenAI's GPT series. Two were released on November 12, 2025; two more were released one week later on November 19.

0 mentions/7d · has benchmarks · research milestone

How scoring works: Models with benchmark data: 50% benchmarks + 30% relevance + 20% buzz. Without benchmarks: 60% relevance + 40% buzz. Benchmark colors: green = top 25%, yellow = middle 50%, gray = bottom 25%.
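Read literally, those weights give the composite below; a minimal sketch, assuming each signal is already normalized to a 0–100 scale (the page does not spell out how benchmarks, relevance, and buzz are normalized).

```python
from typing import Optional

def composite_score(benchmark: Optional[float], relevance: float, buzz: float) -> float:
    """Composite score per the stated weights; inputs assumed on a 0-100 scale."""
    if benchmark is not None:
        # Models with benchmark data: 50% benchmarks + 30% relevance + 20% buzz
        return 0.5 * benchmark + 0.3 * relevance + 0.2 * buzz
    # Without benchmarks: 60% relevance + 40% buzz
    return 0.6 * relevance + 0.4 * buzz

print(composite_score(80.0, 60.0, 50.0))  # 0.5*80 + 0.3*60 + 0.2*50 = 68.0
print(composite_score(None, 60.0, 50.0))  # 0.6*60 + 0.4*50 = 56.0
```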

Go deeper

Each benchmarked model has a live entity profile. Compare them head-to-head, or jump to the per-vertical leaderboards.

Frequently asked questions

What is an AI benchmark, and why do they matter?
An AI benchmark is a standardized test that measures a model's capability on a defined task — coding (SWE-Bench Verified), reasoning (MMLU-Pro, GPQA), math (AIME, MATH-500), tool use (Berkeley Function-Calling), agentic tasks (OSWorld-Verified, GDPval), or chat preference (Chatbot Arena ELO). Benchmarks matter because they let buyers compare frontier models on apples-to-apples tasks rather than vibes. The catch: any single benchmark is a partial view, several have been gamed via data leakage, and the 2026 trend is composite scoring across 5–10 verified evals.
Which AI model is #1 in 2026?
There is no single #1 — it depends on the task. As of April 2026: Claude Opus 4.7 leads SWE-Bench Verified at 87.6%. Holo3-35B-A3B leads OSWorld-Verified at 80.4%. GPT-5.4 and Gemini 3.0 Pro lead Chatbot Arena ELO. DeepSeek V4 dominates the open-weights price-performance Pareto. Surfer 2 leads WebVoyager at 97.1%. Our leaderboard above ranks by composite news-momentum + verified benchmark score, updated hourly.
How does gentic.news rank AI models?
We combine three signals: (1) verified benchmark scores from the official leaderboards (OSWorld-Verified, BrowseComp, SWE-Bench Verified, Terminal-Bench 2.0, etc.), (2) Chatbot Arena ELO when available, and (3) news-momentum from our knowledge graph — how many sources cite the model, sentiment polarity, and entity centrality. Verified benchmark scores dominate the ranking; momentum is a tiebreaker. Models without verifiable scores show momentum-only and are flagged.
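A sketch of that ordering (verified benchmark score as the primary key, momentum breaking ties, momentum-only models flagged), using illustrative records rather than the site's actual schema:

```python
# Illustrative records; field names and values are not the site's real schema.
models = [
    {"name": "model-a", "benchmark": 87.6, "momentum": 42},
    {"name": "model-b", "benchmark": 87.6, "momentum": 55},
    {"name": "model-c", "benchmark": None, "momentum": 90},  # no verified score
]

# Verified benchmark scores dominate; momentum only breaks ties.
ranked = sorted(
    models,
    key=lambda m: (m["benchmark"] is not None, m["benchmark"] or 0.0, m["momentum"]),
    reverse=True,
)

for m in ranked:
    tag = "" if m["benchmark"] is not None else " [momentum-only]"
    print(f'{m["name"]}: benchmark={m["benchmark"]} momentum={m["momentum"]}{tag}')
```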
Are AI benchmarks gamed?
Several have been. A 2026 Berkeley RDI study showed eight major agentic benchmarks could be gamed to near-100% via config leakage or DOM injection. Data contamination is a known issue with MMLU and GSM8K. The 2025–2026 response was the held-out, contamination-resistant generation: SWE-Bench Pro, Terminal-Bench 2.0, OSWorld-Verified, GDPval. We exclude unverified or self-reported scores from the rankings on this page and link to the original leaderboard URL for every score.
Where does the data come from?
Benchmark scores: official leaderboards (os-world.github.io, browsecomp.openai.com, swebench.com, lmarena.ai), maker publications (Anthropic, OpenAI, Google DeepMind, Moonshot, ByteDance), and independent verification when models can be self-hosted. News momentum: 89+ AI sources scanned every 3 hours by our agent pipeline. Knowledge-graph centrality: computed nightly from entity co-occurrence in the news corpus. Methodology: gentic.news/methodology.
Which benchmarks should I trust in 2026?
For agents: OSWorld-Verified, BrowseComp, Terminal-Bench 2.0 (the 2026 'core triad'), plus GDPval for economic-impact realism. For coding: SWE-Bench Verified, SWE-Bench Pro, LiveCodeBench. For reasoning: GPQA Diamond, MMLU-Pro, AIME. For chat preference: Chatbot Arena ELO. For tool use: Berkeley Function-Calling Leaderboard. Avoid leaning on any single number — composite views are more robust to leakage.

Get smarter about AI in 5 minutes

Join readers from Google, Anthropic, and NVIDIA. Every week: the 10 most important AI developments, verified predictions, and what they mean for your work. Free forever. Customize what you get →