benchmarks
30 articles about benchmarks in AI news
ByteDance Lance 3B MoE Beats 7B Models on Multimodal Benchmarks
ByteDance released Lance, a 3B multimodal MoE model that beats 7B+ models on benchmarks through multi-task synergy and specialized pathways.
New Paper Coins 'Curation Debt' — Benchmarks Measure Data Leakage, Not Capability
New paper coins 'curation debt' — benchmarks like MMLU measure data leakage, not capability. Proposes adversarial dynamic benchmarks.
Simple Graph Heuristic Beats Generative Recommenders on 10 of 14 Benchmarks
A no-training graph heuristic beats generative recommenders on 10 of 14 benchmarks, exposing shortcut-solvable datasets. Relative NDCG@10 gains hit 44% on Amazon CDs.
GPT-4.1 Hits 24.65% Derm Accuracy on Real Cases vs 42.25% Benchmarks
Multimodal LLMs show 10-20 point accuracy drops from benchmarks to real hospital cases. GPT-4.1 falls from 42.25% to 24.65%.
o1 Outperforms Human Doctors on Medical Benchmarks & ER Cases
o1 beat human physicians on medical benchmarks and real ER cases, per a new paper. Authors urge prospective trials.
Personalized LLM Benchmarks: Individual Rankings Diverge from Aggregate (ρ=0.04)
A new study of 115 Chatbot Arena users finds personalized LLM rankings diverge dramatically from aggregate benchmarks, with an average Bradley-Terry correlation of only ρ=0.04. This challenges the validity of one-size-fits-all model evaluations.
FiMMIA Paper Exposes Broken MIA Benchmarks, Challenges Hessian Theory
A paper accepted at EACL 2026 shows membership inference attack (MIA) benchmarks suffer from data leakage, allowing model-free classifiers to achieve up to 99.9% AUC. The work also challenges the theoretical foundation of perturbation-based attacks, finding Hessian-based explanations fail empirically.
LLM Evaluation Beyond Benchmarks
The source critiques traditional LLM benchmarks as inadequate for assessing performance in live applications. It proposes a shift toward creating continuous test suites that mirror actual user interactions and business logic to ensure reliability and safety.
Mythos AI Model Reportedly 'Destroys' Benchmarks in Early Leak
A viral tweet claims the unreleased Mythos AI model 'destroys every other model' based on leaked benchmarks. No official confirmation or technical details are available.
MLPerf 6.0: NVIDIA Sweeps New Benchmarks, AMD MI355X Within 30% on Select Tests
MLPerf 6.0 results show NVIDIA winning every new benchmark, with its GB300 NVL72 system achieving nearly 3x more throughput than six months ago. AMD's MI355X showed progress, coming within 10-30% on select single-node tests but skipping most new benchmarks.
Nobody Warns You About Eval Drift: 7 Ways Benchmarks Rot
A critical examination of how AI evaluation benchmarks degrade over time, losing their ability to reflect real-world performance. This 'eval drift' poses a silent risk to any team relying on static metrics for model validation and deployment decisions.
RedNote's 3B-Parameter Multimodal OCR Model Ranks Second to Gemini 3 Pro on Document Parsing Benchmarks
RedNote has released a 3-billion parameter multimodal OCR model that converts text, charts, diagrams, and tables into structured formats like Markdown and HTML. It reportedly ranks second only to Google's Gemini 3 Pro on OCR benchmarks.
Research Identifies 'Giant Blind Spot' in AI Scaling: Models Improve on Benchmarks Without Understanding
A new research paper argues that current AI scaling approaches have a fundamental flaw: models improve on narrow benchmarks without developing genuine understanding, creating a 'giant blind spot' in progress measurement.
Health AI Benchmarks Show 'Validity Gap': 0.6% of Queries Use Raw Medical Records, 5.5% Cover Chronic Care
Analysis of 18,707 health queries across six public benchmarks reveals a structural misalignment with clinical reality. Benchmarks over-index on wellness data (17.7%) while under-representing lab values (5.2%), imaging (3.8%), and safety-critical scenarios.
EMBRAG Framework Achieves SOTA on KGQA Benchmarks via Embedding-Space Rule Generation
Researchers propose EMBRAG, a framework that uses LLMs to generate logical rules from a query, then performs multi-hop reasoning in knowledge graph embedding space. It sets new state-of-the-art on two KGQA benchmarks.
Evaluating AI Agents in Practice: Benchmarks, Frameworks, and Lessons Learned
A new report details the practical challenges and emerging best practices for evaluating AI agents in real-world applications, moving beyond simple benchmarks to assess reliability, safety, and business value.
Stanford & CMU Study: AI Benchmarks Show 'Severe Misalignment' with Real-World Job Economics
Researchers from Stanford and Carnegie Mellon found that standard AI benchmarks poorly reflect the economic value and complexity of real human jobs, creating a 'severe misalignment' in how progress is measured.
Stanford/CMU Study: AI Agent Benchmarks Focus on 7.6% of Jobs, Ignoring Management, Legal, and Interpersonal Work
Researchers analyzed 43 AI benchmarks against 72,000+ real job tasks and found they overwhelmingly test programming/math skills, which represent only 7.6% of actual economic work. Management, legal, and interpersonal tasks—which dominate the labor market—are almost entirely absent from evaluation.
vLLM Semantic Router: A New Approach to LLM Orchestration Beyond Simple Benchmarks
The article critiques current LLM routing benchmarks as solving only the easy part, introducing vLLM Semantic Router as a comprehensive solution for production-grade LLM orchestration with semantic understanding.
Survey Benchmarks Four Approaches to Synthetic Brain Signal Generation for BCI Data Scarcity
A comprehensive survey categorizes and benchmarks four methodological approaches to generating synthetic brain signals for BCIs, addressing data scarcity and privacy constraints. The authors provide an open-source codebase for comparing knowledge-based, feature-based, model-based, and translation-based generative algorithms.
The Jagged Frontier: What AI Coding Benchmarks Reveal and Conceal
New analysis of AI coding benchmarks like METR shows they capture real ability but miss key 'jagged' limitations. While performance correlates highly across tests and improves exponentially, crucial gaps in reasoning and reliability remain hard to measure.
Qwen 3.5 Small Models Defy Expectations, Outperforming Giants in Key AI Benchmarks
Alibaba's Qwen 3.5 small models (4B and 9B parameters) are reportedly outperforming much larger competitors like GPT-OSS-120B on several metrics. These compact models feature a 262K context window, early-fusion vision-language training, and hybrid architecture, achieving impressive scores on MMLU-Pro and other benchmarks.
Google's Gemini 3.1 Pro: The Quiet Revolution That's Redefining AI Benchmarks
Google's Gemini 3.1 Pro preview, released in November 2025, has achieved remarkable performance leaps within just three months. The modest version numbering belies what industry observers describe as 'significant jumps' across most benchmarks, positioning it as a new state-of-the-art contender.
GPT-5.5 Tops Benchmarks, Costs 2x API Price, Still Hallucinates
OpenAI launched GPT-5.5, an agentic model that tops Terminal-Bench 2.0 at 82.7% and surpasses Claude Opus 4.7 and Gemini 3.1 Pro on coding and math. However, independent testing shows higher hallucination rates and effective API costs 20% above GPT-5.4 despite doubled token prices.
MIT's RLM Handles 10M+ Tokens, Outperforms RAG on Long-Context Benchmarks
MIT researchers introduced Recursive Language Models (RLMs), which treat long documents as an external environment and use code to search, slice, and filter data, achieving 58.00 on a hard long-context benchmark versus 0.04 for standard models.
The Silent Threat to AI Benchmarks: 8 Sources of Eval Contamination
The article warns that subtle data contamination in evaluation pipelines—from benchmark leakage to temporal overlap—can create misleading performance metrics. Identifying these eight leakage sources is essential for trustworthy AI validation.
Alibaba's ABot Models Top Embodied AI Benchmarks, Beat Google & NVIDIA
Alibaba's mapping division, Amap, launched three embodied AI models that topped the AGIbot World Challenge and World Arena, beating Google and NVIDIA. The ABot-M0 model for manipulation is fully open-source.
VoxCPM2 Open-Source Voice AI Outperforms ElevenLabs on Key Benchmarks
Researchers from OpenBMB and Tsinghua University released VoxCPM2, a 2B-parameter open-source voice AI that clones voices from short clips and creates voices from text descriptions. It outperforms ElevenLabs on the Minimax-MLS benchmark and runs locally with no API costs.
Apple M5 Max NPU Benchmarks 2x Faster Than Intel Panther Lake NPU in Parakeet v3 AI Inference Test
A leaked benchmark using the Parakeet v3 AI speech recognition model shows Apple's next-generation M5 Max Neural Processing Unit (NPU) delivering double the inference speed of Intel's competing Panther Lake NPU. This real-world test provides early performance data in the intensifying on-device AI hardware race.
Agent Psychometrics: New Framework Predicts Task-Level Success in Agentic Coding Benchmarks with 0.81 AUC
A new research paper introduces a framework using Item Response Theory and task features to predict success on individual agentic coding tasks, achieving 0.81 AUC. This enables benchmark designers to calibrate difficulty without expensive evaluations.