Live frontier-AI benchmarks — every score, every model, no marketing.
Benchmark Catalogue
Every benchmark this lab tracks. Each tile shows current SOTA, holder, and the lab's calibrated reliability score — because not every benchmark is created equal.
Verified runs — feed
The last benchmark observations the lab confirmed. Pulled live from /api/v1/lab/findings.
- May 11
[SEO] Citation audit — 5 pages need fixing
Citation audit 2026-05-11: 5/16 pages flagged as not-citable. Weak: /benchmarks, /claude-code, /entity/microsoft, /entity/gpt-4o, /entity/amazon
- May 11
[SEO] Citation audit — 7 pages need fixing
Citation audit 2026-05-11: 7/16 pages flagged as not-citable. Weak: /benchmarks, /claude-code, /entity/anthropic, /entity/nvidia, /entity/meta, /entity/gpt-4o, /entity/amazon
- May 11
[SEO] Citation audit — 8 pages need fixing
Citation audit 2026-05-11: 8/16 pages flagged as not-citable. Weak: /benchmarks, /claude-code, /entity/anthropic, /entity/google, /entity/nvidia, /entity/meta, /entity/github, /entity/gpt-4o
- May 10
[KG] Moonshot AI — momentum
Moonshot AI, founded by Transformer-XL co-author Yang Zhilin, is no longer just the Kimi chatbot with its headline 2M-token context window. Backed by Alibaba, Tencent, and Sequoia at an $18B+ valuation, the company has shipped a torrent of models in weeks: Kimi K2.5, K2.6, and the thinking-optimized K2.6 variant. The latest K2.6 hits 58.6% on SWE-Bench Pro, leading open-source coding benchmarks and reportedly matching Claude Opus on coding tasks. Yet the graph reveals a tension: Moonshot directl
- May 10
[KG] GPT-3.5 — risk
GPT-3.5, OpenAI's aging workhorse, now competes with eight entities, from Claude Opus 4.7 to Microsoft Excel, signaling that it is being benchmarked against both frontier models and productivity tools. Developed by OpenAI and regulated by the UK AI Safety Institute, the model faces mounting pressure. Recent headlines all focus on GPT-5.5, suggesting GPT-3.5 has been leapfrogged internally. It still powers Codex 5.3, but that dependency may soon shift to newer architectures
Continue in
Agents Lab →
Which models actually ship as agents — OSWorld, GAIA, τ-bench scores in production.
Continue in
Predictions Lab →
Forecasts on the next SOTA — when does GPT-6 break HLE? Which benchmark falls next?
AI Model Leaderboard
Ranked by real benchmarks + news momentum from 89+ sources
53 models · 18 with benchmarks
Top Companies
Top Performers
Models with verified benchmark scores — sort by any metric
| # | Model | Score | Arena ELO | SWE-bench | MMLU-Pro | Buzz | Sent. | $/M |
|---|---|---|---|---|---|---|---|---|
| 🥇 | Kimi K2.6 (Moonshot AI) | 64 | — | 80.2 | — | new | | — |
| 🥈 | Claude Sonnet 4.6 (Anthropic) | 64 | 1470 | 79.6 | 85.0 | new | | — |
| 🥉 | Claude Mythos Preview (Anthropic) | 63 | — | — | — | surging | | — |
| 4 | LLaMA 3 (Meta) | 57 | — | — | 63.0 | new | | Free |
| 5 | Gemini 3 Pro (Google) | 53 | 1485 | 80.6 | 90.1 | quiet | | $12 |
| 6 | Claude 4.5 (Anthropic) | 51 | — | 80.9 | 89.5 | quiet | | $25 |
| 7 | GPT-5.3 (OpenAI) | 51 | — | — | — | fading | | $14 |
| 8 | GPT-4o (OpenAI) | 50 | 1286 | 38.4 | 73.0 | new | | $10 |
| 9 | Gemini 3 Flash (Google) | 48 | 1473 | — | 88.6 | fading | | $3 |
| 10 | GPT-5 (OpenAI) | 47 | 1450 | 80.0 | — | fading | | $14 |
Rising & Noteworthy
Trending models gaining momentum in the news — some may lack benchmark data
Anthropic
Claude Opus 4.6 is Anthropic's flagship LLM released February 5, 2026. Successor to Claude Opus 4.5; superseded by Opus 4.7 on April 16, 2026. 1M-token context window, 128k max output. Benchmarks: 80.
OpenAI
Anthropic
Anthropic
Claude 3.5 Sonnet is a large language model developed by Anthropic, first released on February 23, 2026, as part of the Claude 3.5 family. It achieves an MMLU-Pro score of 78.0, an Arena ELO rating of
DeepSeek
Anthropic
Anthropic
Claude 3.5 Opus, developed by Anthropic, is a flagship AI model known for its top-tier reasoning and advanced capabilities, including computer use.
OpenAI
OpenAI's GPT-OSS-120B is a 120-billion parameter open-weight reasoning model designed to push the frontier of accuracy while optimizing inference cost.
Gemma 4 2B is a 2-billion parameter open model from Google, designed for efficiency on smartphones and edge devices.
Meta
Llama 3 8B, developed by Meta, is an efficient open-source large language model designed for strong performance at a smaller scale.
OpenAI
GPT-5.2 is OpenAI's latest flagship large language model, released on December 11, 2025. Succeeding GPT-5.1, it is a family of three large language models within the GPT series. It comes in two modes:
Moonshot AI
Kimi K2.5 is an open-source, multimodal AI model from Moonshot AI, featuring 1 trillion parameters, vision capabilities, and Agent Swarm technology for complex task orchestration.
DeepSeek
DeepSeek-R1 is a 671-billion-parameter reasoning model developed by DeepSeek, trained via reinforcement learning to achieve state-of-the-art performance on coding and reasoning benchmarks.
DeepSeek
DeepSeek-V3, developed by DeepSeek, is a highly efficient mixture-of-experts language model trained at a fraction of the cost of comparable systems while maintaining strong performance.
Anthropic
How scoring works: Models with benchmark data: 50% benchmarks + 30% relevance + 20% buzz. Without benchmarks: 60% relevance + 40% buzz. Benchmark colors: green = top 25%, yellow = middle 50%, gray = bottom 25%.
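The weighting above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual implementation. The function names `composite_score` and `benchmark_color` are hypothetical. It assumes every component is already normalized to a common 0-100 scale, and the exact percentile cut-offs (75 and 25, with boundaries inclusive on the upper band) are also assumptions, since the page only gives the band sizes.

```python
from typing import Optional


def composite_score(relevance: float, buzz: float,
                    benchmarks: Optional[float] = None) -> float:
    """Blend component scores using the weights stated on the page.

    Assumption: all inputs share one scale (for example 0-100).
    """
    if benchmarks is None:
        # Models without benchmark data: 60% relevance + 40% buzz.
        return 0.6 * relevance + 0.4 * buzz
    # Models with benchmark data: 50% benchmarks + 30% relevance + 20% buzz.
    return 0.5 * benchmarks + 0.3 * relevance + 0.2 * buzz


def benchmark_color(percentile: float) -> str:
    """Map a percentile rank (0 = worst, 100 = best) to a color band.

    Assumption: boundary values fall into the higher band.
    """
    if percentile >= 75:
        return "green"   # top 25%
    if percentile >= 25:
        return "yellow"  # middle 50%
    return "gray"        # bottom 25%


if __name__ == "__main__":
    # A model with benchmark data versus one without.
    print(composite_score(relevance=50, buzz=50, benchmarks=80))
    print(composite_score(relevance=60, buzz=40))
    print(benchmark_color(90), benchmark_color(50), benchmark_color(10))
```

One consequence of the two formulas is that a model with no benchmark data can still outrank a benchmarked one if its relevance and buzz are high enough, which is consistent with unbenchmarked entries appearing near the top of the table.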
Go deeper
Each benchmarked model has a live entity profile. Compare them head-to-head, or jump to the per-vertical leaderboards.
Frequently asked questions
What is an AI benchmark, and why do they matter?
Which AI model is #1 in 2026?
How does gentic.news rank AI models?
Are AI benchmarks gamed?
Where does the data come from?
Which benchmarks should I trust in 2026?
Get smarter about AI in 5 minutes
Join readers from Google, Anthropic, and NVIDIA. Every week: the 10 most important AI developments, verified predictions, and what they mean for your work. Free forever. Customize what you get →