Live frontier-AI benchmarks — every score, every model, no marketing.
Benchmark Catalogue
Every benchmark this lab tracks. Each tile shows current SOTA, holder, and the lab's calibrated reliability score — because not every benchmark is created equal.
Verified runs — feed
The last benchmark observations the lab confirmed. Pulled live from /api/v1/lab/findings.
- Jun 22
Benchmark extraction: Extracted 2 benchmark scores
Ran a benchmark extraction cycle. Result: 2 benchmark scores extracted.
- Jun 21
[AUTOREASON] Qwen 3.5 Medium — 2 iterations
Qwen 3.5 Medium is a family of open-weight language models released by Alibaba on February 24, 2026, comprising four configurations: the MoE Qwen3.5-122B-A10B (122B total parameters, 10B active), Qwen3.5-35B-A3B (35B total, 3B active), the dense Qwen3.5-27B, and the lightweight Qwen3.5-Flash. On the MMLU-Pro benchmark, the 122B-A10B model scores 74.2, and it achieves 81.3 on HumanEval, surpassing Alibaba’s prior Qwen 2.5-235B-A22B MoE model while activating 10B parameters versus 22B. All variant
- Jun 21
[AUTOREASON] Intel — 2 iterations
Intel Corporation (NASDAQ: INTC) is an American integrated device manufacturer and semiconductor design company, founded in 1968 and headquartered in Santa Clara, California. As of early March 2026, Intel is executing a foundry-first strategy under CEO Lip-Bu Tan, with its Intel 18A process node entering risk production and third-party manufacturing expanding through Intel Foundry Services. The company has announced its 15th-gen Core Ultra 'Arrow Lake' processor lineup and Gaudi 3 AI accelerator
- Jun 21
[KG] GPT-5.3 — risk
GPT-5.3, OpenAI's Mixture of Experts flagship first observed February 2026, posts a GPQA of 92.0 and SWE-bench Pro of 56.8 at $1.75/$14 per million tokens. It deploys RLHF, Chain-of-Thought, and Sparse MoE — but faces a four-front competitive war. Claude Mythos Preview, Claude Opus 4.7, GPT-Rosalind, and even the legacy GPT-3.5 all claim rival status. The model runs on Qualcomm hardware and powers Computer Use, yet its own successor GPT-5.5 already tops benchmarks at double the API cost. Agent C
- Jun 21
[KG] Codex 5.3 — moat
Codex 5.3, OpenAI's third major program synthesis model released March 19, 2026, posts a 94.7% pass@1 on HumanEval—up from 92.1% in Codex 5.0. It now handles multi-file repository-scale tasks with 89.3% functional correctness, a clear escalation against rivals Claude Mythos Preview, Claude Code, and Qwen 3.6. Yet a June 2026 study reveals AI coding agents, including Codex, miss 81–86% of critical code lines in repository sweeps, undermining the headline metric. Codex is embedded in ChatGPT Works
Continue in Agents Lab → Which models actually ship as agents: OSWorld, GAIA, and τ-bench scores in production.
Continue in Predictions Lab → Forecasts on the next SOTA: when does GPT-6 break HLE? Which benchmark falls next?
AI Model Leaderboard
Ranked by real benchmarks + news momentum from 89+ sources
41 models · 20 with benchmarks
Top Companies
Top Performers
Models with verified benchmark scores — sort by any metric
| # | Model | Score | Arena ELO | SWE-bench | MMLU-Pro | Buzz | Sent. | $/M |
|---|---|---|---|---|---|---|---|---|
| 🥇 | Claude Opus 4.6 (Anthropic) | 61 | — | — | — | rising | | — |
| 🥈 | GPT-5 (OpenAI) | 57 | 1450 | 80.0 | — | new | | $14 |
| 🥉 | DeepSeek-V3 (DeepSeek) | 53 | 1380 | — | 85.0 | new | | $0.28 |
| 4 | Claude Sonnet 4.6 (Anthropic) | 52 | 1470 | 79.6 | 85.0 | steady | | — |
| 5 | Kimi K2.6 (Moonshot AI) | 51 | — | — | — | quiet | | — |
| 6 | Gemini 3 Pro (Google) | 51 | 1485 | 80.6 | 90.1 | fading | | $12 |
| 7 | GPT-5.3 (OpenAI) | 49 | — | — | — | quiet | | $14 |
| 8 | Claude Mythos Preview (Anthropic) | 48 | — | — | — | fading | | — |
| 9 | Gemini 3 Flash (Google) | 48 | 1473 | — | 88.6 | quiet | | $3 |
| 10 | DeepSeek-R1 (DeepSeek) | 44 | 1436 | — | 84.0 | fading | | $2.19 |
Rising & Noteworthy
Trending models gaining momentum in the news — some may lack benchmark data
Anthropic
Anthropic
Anthropic's Claude Fable 5 is a de-fanged, public version of the Mythos-class AI model, restricted from discussing dangerous topics like cybersecurity. It represents the biggest capability step up sin
Multimodal Large Language Models (MLLMs) are advanced AI systems, developed by organizations like OpenAI and Google, that process and reason across multiple data types like text, images, and audio.
Gemini 3 Deep Think: AI model update designed for science: Deep Think analyzes the drawing, models the complex shape and generates a file to create the physical object with 3D printing
Alibaba
Qwen 3.5 4B, developed by Alibaba, is a smaller, open-source model from the Qwen 3.5 series designed for efficient local deployment with competitive performance.
OpenAI
GPT-4 Turbo, developed by OpenAI, is a large language model featuring a 128K context window, faster response times, and more cost-effective operation than its predecessor.
OpenAI
GPT-4V, developed by OpenAI, is a multimodal large language model that processes and generates text from both image and text inputs.
DeepSeek
Moonshot AI
Moonshot's January 2026 open visual agentic model; OSWorld-Verified 63.3%.
OpenAI
GPT-5.2 is OpenAI's latest flagship large language model, released on December 11, 2025. Succeeding GPT-5.1, it is a family of three large language models within the GPT series. It comes in two modes:
Meta
Meta's LLaMA 3 is its latest large language model, released in two primary sizes (8B and 70B parameters) and trained on approximately 15 trillion tokens for enhanced reasoning and coding capabilities.
MiniMax
Anthropic
Anthropic
Claude 3.5 Sonnet is a multimodal language model developed by Anthropic, first released in June 2024. It achieves an MMLU-Pro score of 78.0, an Arena ELO rating of 1268, and a SWE-bench Verified score
OpenAI
GPT-4o is OpenAI’s multimodal language model, first observed on 2026-02-16, also tracked under the alias GPT-4. It natively processes text, images, and audio within a unified architecture. As of its i
How scoring works: Models with benchmark data: 50% benchmarks + 30% relevance + 20% buzz. Without benchmarks: 60% relevance + 40% buzz. Benchmark colors: green = top 25%, yellow = middle 50%, gray = bottom 25%.
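A minimal sketch of the weighting described above, not the site's actual implementation. It assumes each component (benchmarks, relevance, buzz) is already normalized to a common 0-100 scale, and it reads "top 25%" as the share of tracked models scoring strictly lower. The function names `composite_score` and `benchmark_color` are hypothetical.

```python
from typing import Optional, Sequence


def composite_score(relevance: float, buzz: float,
                    benchmarks: Optional[float] = None) -> float:
    """Blend component scores using the published weights.

    Assumption: all inputs share one scale (for example 0-100).
    """
    if benchmarks is None:
        # Models without benchmark data: 60% relevance + 40% buzz.
        return 0.60 * relevance + 0.40 * buzz
    # Models with benchmark data: 50% benchmarks + 30% relevance + 20% buzz.
    return 0.50 * benchmarks + 0.30 * relevance + 0.20 * buzz


def benchmark_color(value: float, all_values: Sequence[float]) -> str:
    """Map a benchmark value to green / yellow / gray by rank.

    Assumption: rank is the fraction of models scoring strictly lower.
    """
    below = sum(1 for v in all_values if v < value)
    fraction_below = below / len(all_values)
    if fraction_below >= 0.75:
        return "green"   # top 25%
    if fraction_below < 0.25:
        return "gray"    # bottom 25%
    return "yellow"      # middle 50%


if __name__ == "__main__":
    print(composite_score(relevance=50, buzz=100, benchmarks=80))
    print(composite_score(relevance=50, buzz=100))
    print(benchmark_color(40, [10, 20, 30, 40]))
```

Under these weights, a model with no benchmark data can still place highly on relevance and buzz alone, which may explain why some rows in the table carry a score but no benchmark values.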
Go deeper
Each benchmarked model has a live entity profile. Compare them head-to-head, or jump to the per-vertical leaderboards.
Frequently asked questions
What is an AI benchmark, and why do they matter?
Which AI model is #1 in 2026?
How does gentic.news rank AI models?
Are AI benchmarks gamed?
Where does the data come from?
Which benchmarks should I trust in 2026?