Live frontier-AI benchmarks — every score, every model, no marketing.
Benchmark Catalogue
Every benchmark this lab tracks. Each tile shows current SOTA, holder, and the lab's calibrated reliability score — because not every benchmark is created equal.
Verified runs — feed
The latest benchmark observations the lab has confirmed, pulled live from /api/v1/lab/findings.
- May 5
[SEO] Citation audit — 5 pages need fixing
Citation audit 2026-05-05: 5/16 pages flagged as not-citable. Weak: /benchmarks, /claude-code, /entity/github, /entity/claude-opus-4-6, /entity/gpt-4o
- May 5
[KG] GPT-5.2 Pro — risk
GPT-5.2 Pro, OpenAI's flagship LLM released December 2025, already faces succession pressure. Recent headlines show GPT-5.5 Pro leapfrogging benchmarks and sustaining 2-hour bug-fixing sessions, while GPT-5.2 Pro's mention count (just 6 in 30 days) signals waning attention. The model deploys RLHF, Chain-of-Thought, and Instruction Tuning — proven but not novel techniques. Endorsed by LessWrong and researcher Will Brian, it targets scientific discovery use cases. Yet with GPT-5.5 rumored as a 'qu
- May 4
[SEO] Citation audit — 7 pages need fixing
Citation audit 2026-05-04: 7/16 pages flagged as not-citable. Weak: /benchmarks, /claude-code, /entity/openai, /entity/meta, /entity/microsoft, /entity/claude-opus-4-6, /entity/gpt-4o
- May 3
[SEO] Citation audit — 6 pages need fixing
Citation audit 2026-05-03: 6/16 pages flagged as not-citable. Weak: /benchmarks, /claude-code, /entity/anthropic, /entity/openai, /entity/claude-opus-4-6, /entity/gpt-4o
- May 2
[KG] Claude Mythos Preview — risk
Anthropic's Claude Mythos Preview has become the first model to pass the UK AI Safety Institute's cyber evaluation, but its deployment by the National Security Agency under a 'supply chain risk' label reveals a sharp tension. Developed by Anthropic, the model is already used by Project Glasswing and regulated by both the UK and AI Security Institutes. It directly competes with OpenAI's GPT-5.3 and Codex 5.3, yet GPT-5.5 recently tied it in enterprise cyber attack tests. Despite constrained cyber
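The feed above comes from the /api/v1/lab/findings endpoint. A minimal sketch of consuming and filtering it client-side; the payload shape and field names (`date`, `tag`, `title`) are assumptions, since the endpoint's schema is not documented here, so a hardcoded sample stands in for a live response.

```python
import json

# Hypothetical sample of what /api/v1/lab/findings might return;
# the real response schema is an assumption, not documented above.
SAMPLE = json.loads("""
[
  {"date": "2026-05-05", "tag": "SEO", "title": "Citation audit - 5 pages need fixing"},
  {"date": "2026-05-05", "tag": "KG",  "title": "GPT-5.2 Pro - risk"},
  {"date": "2026-05-04", "tag": "SEO", "title": "Citation audit - 7 pages need fixing"}
]
""")

def filter_findings(findings, tag):
    """Return findings matching a tag, newest first."""
    matches = [f for f in findings if f["tag"] == tag]
    return sorted(matches, key=lambda f: f["date"], reverse=True)

seo = filter_findings(SAMPLE, "SEO")
print([f["date"] for f in seo])  # ['2026-05-05', '2026-05-04']
```

Sorting on ISO-8601 date strings works lexicographically, so no date parsing is needed for the newest-first ordering.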
Continue in Agents Lab → Which models actually ship as agents — OSWorld, GAIA, τ-bench scores in production.
Continue in Predictions Lab → Forecasts on the next SOTA — when does GPT-6 break HLE? Which benchmark falls next?
AI Model Leaderboard
Ranked by real benchmarks + news momentum from 89+ sources
62 models, 20 with benchmarks
Top Companies
Top Performers
Models with verified benchmark scores — sort by any metric
| # | Model | Score | Arena ELO | SWE-bench | MMLU-Pro | Buzz | Sent. | $/M |
|---|---|---|---|---|---|---|---|---|
| 🥇 | Claude Opus 4.7 (Anthropic) | 77 | — | — | — | steady | — | — |
| 🥈 | Claude Opus 4.6 (Anthropic) | 64 | — | — | — | fading | — | — |
| 🥉 | Gemini 3 Flash (Google) | 60 | 1473 | — | 88.6 | new | — | $3 |
| 4 | Claude Mythos Preview (Anthropic) | 59 | — | — | — | steady | — | — |
| 5 | GPT-5 (OpenAI) | 55 | 1450 | 80.0 | — | fading | — | $14 |
| 6 | GPT-5.3 (OpenAI) | 55 | — | — | — | fading | — | $14 |
| 7 | Gemini 3 Pro (Google) | 54 | 1485 | 80.6 | 90.1 | fading | — | $12 |
| 8 | Kimi K2.6 (Moonshot AI) | 52 | — | 80.2 | — | fading | — | — |
| 9 | Claude Sonnet 4.6 (Anthropic) | 52 | 1470 | 79.6 | 85.0 | fading | — | — |
| 10 | Claude 4.5 (Anthropic) | 51 | — | 80.9 | 89.5 | quiet | — | $25 |
Rising & Noteworthy
Trending models gaining momentum in the news — some may lack benchmark data
- Google: Google's upcoming Gemma 4 is an open-source AI model designed for efficient, high-performance local execution on devices like smartphones.
- Meta: Meta Superintelligence Labs developed Muse Spark, its first competitive model from a rebuilt infrastructure aimed at personal superintelligence.
- MiniMax: MiniMax M2.5 is a frontier AI model designed for real-world productivity and agents, achieving state-of-the-art coding performance with high speed and unmatched cost efficiency.
- Meta: Meta's LLaMA 3 is its latest large language model, released in two primary sizes (8B and 70B parameters) and trained on approximately 15 trillion tokens for enhanced reasoning and coding capabilities.
- DeepSeek: DeepSeek-V3 is a highly efficient mixture-of-experts language model trained at a fraction of the cost of comparable systems while maintaining strong performance.
- Alibaba: Alibaba's Qwen efficiency model outperforms Qwen 2.5 235B with 7x fewer active parameters. Open-weight; competes with Nemotron-Cascade and Mistral.
- Anthropic: Claude 3.5 Sonnet is a large language model developed by Anthropic, first released on February 23, 2026, as part of the Claude 3.5 family. It achieves an MMLU-Pro score of 78.0 and an Arena ELO rating of
- Meta: Meta's Llama 3.1 70B is a 70-billion-parameter large language model, released in July 2024, offering strong performance in text generation and instruction-following tasks.
- Meta: Llama 3 8B, developed by Meta, is an efficient open-source large language model designed for strong performance at a smaller scale.
- OpenAI: GPT-4 Turbo, developed by OpenAI, is a large language model featuring a 128K context window, faster response times, and more cost-effective operation than its predecessor.
- OpenAI: GPT-5.2 is OpenAI's latest flagship large language model, released on December 11, 2025. Succeeding GPT-5.1, it is a family of three large language models within the GPT series. It comes in two modes:
- DeepSeek: DeepSeek-R1 is a 671-billion-parameter reasoning model developed by DeepSeek, trained via reinforcement learning to achieve state-of-the-art performance on coding and reasoning benchmarks.
- Moonshot AI: Kimi K2.5 is an open-source, multimodal AI model from Moonshot AI, featuring 1 trillion parameters, vision capabilities, and Agent Swarm technology for complex task orchestration.
- OpenAI: GPT-5.1 is a family of four large language models within OpenAI's GPT series. Two were released on November 12, 2025; two more were released one week later on November 19.
How scoring works: models with benchmark data are scored 50% benchmarks + 30% relevance + 20% buzz; models without benchmarks are scored 60% relevance + 40% buzz. Benchmark colors: green = top 25%, yellow = middle 50%, gray = bottom 25%.
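The stated weighting can be sketched as a small function. The 0-100 input scale and the use of `None` for missing benchmark data are assumptions for illustration; only the weights themselves come from the description above.

```python
def composite_score(benchmark=None, relevance=0.0, buzz=0.0):
    """Blend the three signals per the stated weights.

    With benchmark data:    50% benchmarks + 30% relevance + 20% buzz.
    Without benchmark data: 60% relevance + 40% buzz.
    All inputs are assumed to be on a 0-100 scale.
    """
    if benchmark is not None:
        return 0.5 * benchmark + 0.3 * relevance + 0.2 * buzz
    return 0.6 * relevance + 0.4 * buzz

print(composite_score(benchmark=80, relevance=70, buzz=60))  # 73.0
print(composite_score(relevance=70, buzz=60))                # 66.0
```

Note that the two branches are not directly comparable: a model gains or loses rank purely by whether benchmark data exists, which is one reason the table above separates benchmarked models from the "Rising & Noteworthy" list.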
Go deeper
Each benchmarked model has a live entity profile. Compare them head-to-head, or jump to the per-vertical leaderboards.
Frequently asked questions
What is an AI benchmark, and why do they matter?
Which AI model is #1 in 2026?
How does gentic.news rank AI models?
Are AI benchmarks gamed?
Where does the data come from?
Which benchmarks should I trust in 2026?
Get smarter about AI in 5 minutes
Join readers from Google, Anthropic, and NVIDIA. Every week: the 10 most important AI developments, verified predictions, and what they mean for your work. Free forever. Customize what you get →