benchmarks

30 articles about benchmarks in AI news

OpenAI GPT-5.5-Cyber Beats Anthropic Mythos on Security Benchmarks

OpenAI's GPT-5.5-Cyber beats Anthropic's Mythos on security benchmarks. Updated Codex plugin auto-patches after scanning 30M commits.

Jun 23, 2026 · 100% relevant

OpenAI shows small doses of beneficial-trait RL improve 44 of 53 safety benchmarks — and the gains generalize

OpenAI researchers Jagadeesh, Saab, Singhal et al. published findings on June 18 showing RL training on traits like honesty and corrigibility improved 44 of 53 safety benchmarks. Gains generalized across domains not used in training, and the model resisted harmful fine-tuning better than the baseline.

Jun 19, 2026 · 95% relevant

Visual-Seeker: Active Visual Reasoning Beats Proprietary MLLMs on 5 Benchmarks

Visual-Seeker achieves SOTA on five multimodal search benchmarks, surpassing proprietary models by actively harvesting visual evidence during search.

Jun 16, 2026 · 72% relevant

MiniMax M3 Exceeds Human Gold-Medal on Math Benchmarks via MaxProof

MiniMax's M3 exceeded human gold-medal performance on math benchmarks via MaxProof, but no scores or details were disclosed.

Jun 12, 2026 · 100% relevant

NVIDIA Vera CPU Benchmarks: 1.55x Faster Than Intel Xeon in Phoronix Tests

NVIDIA Vera CPU benchmarks show 1.55x the performance of the Intel Xeon 6980P and a 10% lead over the AMD EPYC 9575F, with 1.2 TB/s memory bandwidth.

May 27, 2026 · 100% relevant

Microsoft SkillOpt Trains Agent Skills in Text Space, Beats 52/52 Benchmarks

Microsoft's SkillOpt trains agent skills in text space, achieving best or tied-best results in all 52 settings across 6 benchmarks and 7 models.

May 25, 2026 · 89% relevant

ByteDance Lance 3B MoE Beats 7B Models on Multimodal Benchmarks

ByteDance released Lance, a 3B multimodal MoE model that beats 7B+ models on benchmarks through multi-task synergy and specialized pathways.

May 19, 2026 · 90% relevant

New Paper Coins 'Curation Debt' — Benchmarks Measure Data Leakage, Not Capability

New paper coins 'curation debt' — benchmarks like MMLU measure data leakage, not capability. Proposes adversarial dynamic benchmarks.

May 16, 2026 · 85% relevant

Simple Graph Heuristic Beats Generative Recommenders on 10 of 14 Benchmarks

A no-training graph heuristic beats generative recommenders on 10 of 14 benchmarks, exposing shortcut-solvable datasets. Relative NDCG@10 gains hit 44% on Amazon CDs.

May 11, 2026 · 100% relevant

GPT-4.1 Hits 24.65% Derm Accuracy on Real Cases vs 42.25% Benchmarks

Multimodal LLMs show 10-20 point accuracy drops from benchmarks to real hospital cases. GPT-4.1 falls from 42.25% to 24.65%.

May 7, 2026 · 92% relevant

o1 Outperforms Human Doctors on Medical Benchmarks & ER Cases

o1 beat human physicians on medical benchmarks and real ER cases, per a new paper. Authors urge prospective trials.

May 1, 2026 · 87% relevant

Personalized LLM Benchmarks: Individual Rankings Diverge from Aggregate (ρ=0.04)

A new study of 115 Chatbot Arena users finds personalized LLM rankings diverge dramatically from aggregate benchmarks, with an average Bradley-Terry correlation of only ρ=0.04. This challenges the validity of one-size-fits-all model evaluations.

Apr 22, 2026 · 93% relevant

FiMMIA Paper Exposes Broken MIA Benchmarks, Challenges Hessian Theory

A paper accepted at EACL 2026 shows membership inference attack (MIA) benchmarks suffer from data leakage, allowing model-free classifiers to achieve up to 99.9% AUC. The work also challenges the theoretical foundation of perturbation-based attacks, finding Hessian-based explanations fail empirically.

Apr 18, 2026 · 84% relevant

LLM Evaluation Beyond Benchmarks

The source critiques traditional LLM benchmarks as inadequate for assessing performance in live applications. It proposes a shift toward creating continuous test suites that mirror actual user interactions and business logic to ensure reliability and safety.

Apr 14, 2026 · 72% relevant

Mythos AI Model Reportedly 'Destroys' Benchmarks in Early Leak

A viral tweet claims the unreleased Mythos AI model 'destroys every other model' based on leaked benchmarks. No official confirmation or technical details are available.

Apr 7, 2026 · 85% relevant

MLPerf 6.0: NVIDIA Sweeps New Benchmarks, AMD MI355X Within 30% on Select Tests

MLPerf 6.0 results show NVIDIA winning every new benchmark, with its GB300 NVL72 system achieving nearly 3x more throughput than six months ago. AMD's MI355X showed progress, coming within 10-30% on select single-node tests but skipping most new benchmarks.

Apr 7, 2026 · 85% relevant

Nobody Warns You About Eval Drift: 7 Ways Benchmarks Rot

A critical examination of how AI evaluation benchmarks degrade over time, losing their ability to reflect real-world performance. This 'eval drift' poses a silent risk to any team relying on static metrics for model validation and deployment decisions.

Mar 22, 2026 · 72% relevant

Health AI Benchmarks Show 'Validity Gap': 0.6% of Queries Use Raw Medical Records, 5.5% Cover Chronic Care

Analysis of 18,707 health queries across six public benchmarks reveals a structural misalignment with clinical reality. Benchmarks over-index on wellness data (17.7%) while under-representing lab values (5.2%), imaging (3.8%), and safety-critical scenarios.

Mar 20, 2026 · 77% relevant

EMBRAG Framework Achieves SOTA on KGQA Benchmarks via Embedding-Space Rule Generation

Researchers propose EMBRAG, a framework that uses LLMs to generate logical rules from a query, then performs multi-hop reasoning in knowledge graph embedding space. It sets new state-of-the-art on two KGQA benchmarks.

Mar 17, 2026 · 84% relevant

Evaluating AI Agents in Practice: Benchmarks, Frameworks, and Lessons Learned

A new report details the practical challenges and emerging best practices for evaluating AI agents in real-world applications, moving beyond simple benchmarks to assess reliability, safety, and business value.

Mar 17, 2026 · 90% relevant

Stanford/CMU Study: AI Agent Benchmarks Focus on 7.6% of Jobs, Ignoring Management, Legal, and Interpersonal Work

Researchers analyzed 43 AI benchmarks against 72,000+ real job tasks and found they overwhelmingly test programming/math skills, which represent only 7.6% of actual economic work. Management, legal, and interpersonal tasks—which dominate the labor market—are almost entirely absent from evaluation.

Mar 16, 2026 · 85% relevant

vLLM Semantic Router: A New Approach to LLM Orchestration Beyond Simple Benchmarks

The article critiques current LLM routing benchmarks as solving only the easy part, introducing vLLM Semantic Router as a comprehensive solution for production-grade LLM orchestration with semantic understanding.

Mar 16, 2026 · 75% relevant

Survey Benchmarks Four Approaches to Synthetic Brain Signal Generation for BCI Data Scarcity

A comprehensive survey categorizes and benchmarks four methodological approaches to generating synthetic brain signals for BCIs, addressing data scarcity and privacy constraints. The authors provide an open-source codebase for comparing knowledge-based, feature-based, model-based, and translation-based generative algorithms.

Mar 16, 2026 · 84% relevant

The Jagged Frontier: What AI Coding Benchmarks Reveal and Conceal

New analysis of AI coding benchmarks like METR shows they capture real ability but miss key 'jagged' limitations. While performance correlates highly across tests and improves exponentially, crucial gaps in reasoning and reliability remain hard to measure.

Mar 11, 2026 · 85% relevant

Qwen 3.5 Small Models Defy Expectations, Outperforming Giants in Key AI Benchmarks

Alibaba's Qwen 3.5 small models (4B and 9B parameters) are reportedly outperforming much larger competitors like GPT-OSS-120B on several metrics. These compact models feature a 262K context window, early-fusion vision-language training, and hybrid architecture, achieving impressive scores on MMLU-Pro and other benchmarks.

Mar 2, 2026 · 95% relevant

Google's Gemini 3.1 Pro: The Quiet Revolution That's Redefining AI Benchmarks

Google's Gemini 3.1 Pro preview, released in November 2025, has achieved remarkable performance leaps within just three months. The modest version numbering belies what industry observers describe as 'significant jumps' across most benchmarks, positioning it as a new state-of-the-art contender.

Feb 19, 2026 · 85% relevant

GPT-5.5 Tops Benchmarks, Costs 2x API Price, Still Hallucinates

OpenAI launched GPT-5.5, an agentic model that tops Terminal-Bench 2.0 at 82.7% and surpasses Claude Opus 4.7 and Gemini 3.1 Pro on coding and math. However, independent testing shows higher hallucination rates, and effective API costs come out 20% above GPT-5.4 even though per-token prices doubled.

Apr 25, 2026 · 100% relevant

MIT's RLM Handles 10M+ Tokens, Outperforms RAG on Long-Context Benchmarks

MIT researchers introduced Recursive Language Models (RLMs), which treat long documents as an external environment and use code to search, slice, and filter data, achieving 58.00 on a hard long-context benchmark versus 0.04 for standard models.

Apr 23, 2026 · 95% relevant

The Silent Threat to AI Benchmarks: 8 Sources of Eval Contamination

The article warns that subtle data contamination in evaluation pipelines—from benchmark leakage to temporal overlap—can create misleading performance metrics. Identifying these eight leakage sources is essential for trustworthy AI validation.

Apr 17, 2026 · 74% relevant

Alibaba's ABot Models Top Embodied AI Benchmarks, Beat Google & NVIDIA

Alibaba's mapping division, Amap, launched three embodied AI models that topped the AGIbot World Challenge and World Arena, beating Google and NVIDIA. The ABot-M0 model for manipulation is fully open-source.

Apr 15, 2026 · 99% relevant
