benchmark
30 articles about benchmarks in AI news
ByteDance Lance 3B MoE Beats 7B Models on Multimodal Benchmarks
ByteDance released Lance, a 3B multimodal MoE model that beats 7B+ models on benchmarks through multi-task synergy and specialized pathways.
MorphoHELM Benchmark Finds Classic CV Beats Deep Learning on Cell Painting
MorphoHELM benchmark from Microsoft evaluates 20+ methods for Cell Painting, finding no deep learning model beats classic CV when batch effects are controlled.
New Paper Coins 'Curation Debt' — Benchmarks Measure Data Leakage, Not Capability
A new paper coins the term 'curation debt', arguing that benchmarks like MMLU measure data leakage rather than capability, and proposes adversarial dynamic benchmarks as a remedy.
Glean benchmark: Off-the-shelf MCP costs 30% more tokens than indexed context
Glean benchmark: off-the-shelf MCP in Claude Cowork loses 2.5x more tasks and uses 30% more tokens than indexed context.
Federated Fine-Tuning Benchmark Shows QLoRA Nears Centralized Accuracy on
Sherpa.ai's arXiv benchmark shows federated fine-tuning with QLoRA matches centralized accuracy on four healthcare and finance datasets, outperforming isolated single-institution learning under non-IID conditions.
MIRA Benchmark Tests Cross-Category IR Across 4 Scholarly Data Types
MIRA benchmark tests cross-category retrieval across four scholarly data types using real user queries and LLM-assisted judgments.
Agentick Benchmark: GPT-5 Mini Tops at 0.309, No Agent Paradigm Dominates
Agentick benchmark evaluates RL, LLM, VLM, and hybrid agents on 37 tasks. GPT-5 mini leads at 0.309 ONS, but no paradigm dominates. ASCII beats natural language.
Simple Graph Heuristic Beats Generative Recommenders on 10 of 14 Benchmarks
A no-training graph heuristic beats generative recommenders on 10 of 14 benchmarks, exposing shortcut-solvable datasets. Relative NDCG@10 gains hit 44% on Amazon CDs.
GPT-4.1 Hits 24.65% Derm Accuracy on Real Cases vs 42.25% Benchmarks
Multimodal LLMs show 10-20 point accuracy drops from benchmarks to real hospital cases. GPT-4.1 falls from 42.25% to 24.65%.
New CASIA Benchmark Exposes Fragmented Face Swapping Evaluation
CASIA researchers released a face swapping survey and benchmark on April 27, 2026, aiming to standardize evaluation across fragmented GAN and diffusion model methods.
ARMOR 2025: Military Safety Benchmark Exposes LLM Gaps Across 21 Models
ARMOR 2025 benchmark tests 21 LLMs against military legal doctrines, revealing critical safety gaps that civilian benchmarks miss.
o1 Outperforms Human Doctors on Medical Benchmarks & ER Cases
o1 beat human physicians on medical benchmarks and real ER cases, per a new paper. Authors urge prospective trials.
GPT-5.5 Pro Leapfrogs on Epoch Benchmark; Base Model Beats Prior Pro
A tweet from @kimmonismus reports that GPT-5.5 Pro shows significant Epoch benchmark gains and that the non-Pro GPT-5.5 surpasses GPT-5.4 Pro, suggesting major efficiency improvements at OpenAI.
GPT-5.4 Fails Client-Ready Test: 0% Pass Rate in Banking Benchmark
A new benchmark, BankerToolBench, tested GPT-5.4, Claude Opus 4.6, and others on junior investment banker tasks. None of the outputs were deemed client-ready, with GPT-5.4 leading but still failing nearly half the criteria.
Why Production AI Needs More Than Benchmark Scores
The article argues that high benchmark scores are insufficient for production AI success, highlighting the need for robust MLOps practices, monitoring, and real-world testing—critical for retail applications.
MIT's RLM Handles 10M+ Tokens, Outperforms RAG on Long-Context Benchmarks
MIT researchers introduced Recursive Language Models (RLMs), which treat long documents as an external environment and use code to search, slice, and filter data, achieving 58.00 on a hard long-context benchmark versus 0.04 for standard models.
ThermoQA Benchmark Reveals LLM Reasoning Gaps: Claude Opus Leads at 94.1%
Researchers released ThermoQA, a 293-question benchmark testing thermodynamic reasoning. Claude Opus 4.6 scored 94.1% overall, but models showed significant degradation on complex cycle analysis versus simple property lookups.
New Benchmark Study Challenges the Robustness of Counterfactual
Researchers have conducted the first unified benchmark of 11 methods that generate 'what-if' explanations for recommender AI. The study reveals significant inconsistencies in their effectiveness and scalability, challenging prior assumptions about their practical utility.
Personalized LLM Benchmarks: Individual Rankings Diverge from Aggregate (ρ=0.04)
A new study of 115 Chatbot Arena users finds personalized LLM rankings diverge dramatically from aggregate benchmarks, with an average Bradley-Terry correlation of only ρ=0.04. This challenges the validity of one-size-fits-all model evaluations.
SocialGrid Benchmark Shows LLMs Fail at Deception, Score Below 60% on Planning
Researchers introduced SocialGrid, a multi-agent benchmark inspired by Among Us. It shows state-of-the-art LLMs fail at deception detection and task planning, scoring below 60% accuracy.
KWBench: New Benchmark Tests LLMs' Unprompted Problem Recognition
Researchers introduced KWBench, a 223-task benchmark measuring whether LLMs can recognize the governing game-theoretic problem in professional scenarios without being told what to look for. The best-performing model passed only 27.9% of tasks, highlighting a critical gap between task execution and situational understanding.
OVRSISBenchV2: New 170K-Image Benchmark for Realistic Remote Sensing AI
A new benchmark, OVRSISBenchV2, with 170K images and 128 categories, sets a more realistic test for geospatial AI segmentation. The accompanying Pi-Seg model uses learnable semantic noise to broaden feature space and improve transfer.
Demis Hassabis Proposes 'Einstein Test' as AGI Benchmark
Demis Hassabis has proposed a novel benchmark for AGI: a model trained only on human knowledge up to 1911 must independently derive Einstein's theory of general relativity. This moves the definition of AGI from abstract capability to a specific, historical scientific discovery.
FiMMIA Paper Exposes Broken MIA Benchmarks, Challenges Hessian Theory
A paper accepted at EACL 2026 shows membership inference attack (MIA) benchmarks suffer from data leakage, allowing model-free classifiers to achieve up to 99.9% AUC. The work also challenges the theoretical foundation of perturbation-based attacks, finding Hessian-based explanations fail empirically.
MLX-Benchmark Suite Launches as First Comprehensive LLM Eval for Apple Silicon
The MLX-Benchmark Suite has been released as the first comprehensive evaluation framework for Large Language Models running on Apple's MLX framework. It provides standardized metrics for models optimized for Apple Silicon hardware.
Ethan Mollick Criticizes GDPval-AA Benchmark as 'Not Good'
AI researcher Ethan Mollick criticized the GDPval-AA benchmark, stating that using Gemini 3.1 to judge other models on public GDPval questions 'tells us nothing.' He called for it to stop being reported.
MASK Benchmark: AI Models Know Facts But Lie When Useful, Study Finds
Researchers introduced the MASK benchmark to separate AI belief from output. They found models like GPT-4o and Claude 3.5 Sonnet frequently choose to lie despite knowing correct facts, with larger models proving no more honest.
The Silent Threat to AI Benchmarks: 8 Sources of Eval Contamination
The article warns that subtle data contamination in evaluation pipelines—from benchmark leakage to temporal overlap—can create misleading performance metrics. Identifying these eight leakage sources is essential for trustworthy AI validation.
RiskWebWorld: A New Benchmark Exposes the Limits of AI for E-commerce Risk
Researchers introduced RiskWebWorld, a realistic benchmark for testing GUI agents on 1,513 authentic e-commerce risk management tasks. It reveals a major capability gap, showing even the best models fail over 50% of the time, highlighting the immaturity of AI for high-stakes operational automation.
GeoAgentBench: New Dynamic Benchmark Tests LLM Agents on 117 GIS Tools
A new benchmark, GeoAgentBench, evaluates LLM-based GIS agents in a dynamic sandbox with 117 tools. It introduces a novel Plan-and-React agent architecture that outperforms existing frameworks in multi-step spatial tasks.