benchmark
30 articles about benchmark in AI news
Zhipu GLM-5.2 beats Anthropic's Mythos on bug-hunt benchmark
Zhipu AI's GLM-5.2 beat Anthropic's Claude Opus 4.8 on a cybersecurity bug-hunting benchmark, then matched it with extra instructions, marking another 'DeepSeek moment'.
GPT-5.6 Sol, Terra, Luna: Benchmark Performance Depends on Which Test You Use
OpenAI released GPT-5.6 as three tiers—Sol, Terra, Luna—on June 27, 2026. Sol tops Terminal-Bench 2.1 but trails competitors on other benchmarks. The release shifts focus to tiered pricing and efficiency, but access remains restricted.
Epoch AI's CursorBench Benchmarks AI Code Editing at Scale
Epoch AI launched CursorBench, a 500-task benchmark for AI code editors. It reveals a 15% accuracy gap vs. humans and 3x latency variance.
SciCode: Epoch AI Launches Benchmark Measuring AI Research Ability
Epoch AI launched SciCode benchmark testing LLMs on real research coding tasks. Top models score below 30%, exposing gap between coding benchmarks and scientific ability.
MirrorCode Benchmark Costs $2,600 Per Run, Challenges AI Coding Limits
Epoch AI and METR launched MirrorCode, a $2,600-per-run coding benchmark. Claude Opus 4.7 leads with 56% solve rate.
Zhipu GLM-5.2 tops global coding benchmarks, sparks 'DeepSeek moment'
Zhipu AI's GLM-5.2 ranks top-3 globally on a coding benchmark, with US engineers calling it a daily driver superior to GPT-5.5.
OpenAI GPT-5.5-Cyber Beats Anthropic Mythos on Security Benchmarks
OpenAI's GPT-5.5-Cyber beats Anthropic's Mythos on security benchmarks. Updated Codex plugin auto-patches after scanning 30M commits.
OpenAI shows small doses of beneficial-trait RL improve 44 of 53 safety benchmarks — and the gains generalize
OpenAI researchers Jagadeesh, Saab, Singhal et al. published findings on June 18 showing RL training on traits like honesty and corrigibility improved 44 of 53 safety benchmarks. Gains generalized across domains not used in training, and the model resisted harmful fine-tuning better than the baselin
Estonian Institute: Claude Tops Russian Propaganda Benchmark, Mistral Trails
Estonian Language Institute benchmark tests 60 AI models vs Russian propaganda. Claude tops, Mistral trails with 36.67% misinformation rate.
Visual-Seeker: Active Visual Reasoning Beats Proprietary MLLMs on 5 Benchmarks
Visual-Seeker achieves SOTA on five multimodal search benchmarks, surpassing proprietary models by actively harvesting visual evidence during search.
NVIDIA Blackwell Ultra Leads First Agentic AI Benchmark, 20x Agents/MW vs Hopper
NVIDIA Blackwell Ultra NVL72 leads the first AgentPerf benchmark for agentic AI, delivering 20x more agents per megawatt than Hopper.
MiniMax M3 Exceeds Human Gold-Medal on Math Benchmarks via MaxProof
MiniMax's M3 exceeded human gold-medal on math benchmarks via MaxProof, but no scores or details were disclosed.
MacArena: 421-Task macOS Benchmark Reveals 26% CUA Ranking Inversion
MacArena benchmark of 421 macOS tasks reveals 26% performance gap for top models on native tasks, suggesting current CUAs overfit to Linux distributions.
WorldBench: Top MLLM Scores 64% on Visually Diverse Benchmark
WorldBench, a new multimodal benchmark, tests 15 MLLMs on visually diverse images. Top model scores 64.0%, exposing fundamental gaps in visual understanding.
SMAC-Talk: StarCraft Benchmark Tests LLM Agents Against Deceptive Allies
SMAC-Talk extends StarCraft Multi-Agent Challenge with natural language communication, testing LLM agents against deceptive allies. Qwen3.5 models benchmarked; no model exceeds 72% win rate.
New 474-Game Benchmark Reveals LLMs Collapse on Counterfactual Reasoning
New 474-game benchmark reveals LLMs fail on counterfactual reasoning, with larger drops than contextual perturbations. Highlights metacognitive gaps in agentic AI.
SpatialBench: New Benchmark Tests Foundation Models on 3D Tasks
SpatialBench, a new benchmark from ropedia_ai, evaluates spatial foundation models across 7 tasks and 5 datasets, testing depth estimation, surface normal prediction, and 3D object detection.
NVIDIA Vera CPU Benchmarks: 1.55x Faster Than Intel Xeon in Phoronix Tests
NVIDIA Vera CPU benchmarks show 1.55x performance over Intel Xeon 6980P and 10% over AMD EPYC 9575F, with 1.2 TB/s memory bandwidth.
Microsoft SkillOpt Trains Agent Skills in Text Space, Beats 52/52 Benchmarks
Microsoft's SkillOpt trains agent skills in text space, achieving best or tied-best results in all 52 settings across 6 benchmarks and 7 models.
HAVEN Benchmark Exposes MLLM Gap Between Fluency and Video Understanding
HAVEN benchmark tests MLLMs on hierarchical video understanding across frame, shot, and video levels. Results show top models lack grounded multimodal reasoning despite fluent text generation.
ByteDance Lance 3B MoE Beats 7B Models on Multimodal Benchmarks
ByteDance released Lance, a 3B multimodal MoE model that beats 7B+ models on benchmarks through multi-task synergy and specialized pathways.
MorphoHELM Benchmark Finds Classic CV Beats Deep Learning on Cell Painting
MorphoHELM benchmark from Microsoft evaluates 20+ methods for Cell Painting, finding no deep learning model beats classic CV when batch effects are controlled.
New Paper Coins 'Curation Debt' — Benchmarks Measure Data Leakage, Not Capability
New paper coins 'curation debt' — benchmarks like MMLU measure data leakage, not capability. Proposes adversarial dynamic benchmarks.
Glean benchmark: Off-the-shelf MCP costs 30% more tokens than indexed context
Glean benchmark: off-the-shelf MCP in Claude Cowork loses 2.5x more tasks and uses 30% more tokens than indexed context.
Federated Fine-Tuning Benchmark Shows QLoRA Nears Centralized Accuracy on
Sherpa.ai's arXiv benchmark shows federated fine-tuning with QLoRA matches centralized accuracy on four healthcare and finance datasets, outperforming isolated single-institution learning under non-IID conditions.
MIRA Benchmark Tests Cross-Category IR Across 4 Scholarly Data Types
MIRA benchmark tests cross-category retrieval across four scholarly data types using real user queries and LLM-assisted judgments.
Agentick Benchmark: GPT-5 Mini Tops at 0.309, No Agent Paradigm Dominates
Agentick benchmark evaluates RL, LLM, VLM, and hybrid agents on 37 tasks. GPT-5 mini leads at 0.309 ONS, but no paradigm dominates. ASCII beats natural language.
Simple Graph Heuristic Beats Generative Recommenders on 10 of 14 Benchmarks
A no-training graph heuristic beats generative recommenders on 10 of 14 benchmarks, exposing shortcut-solvable datasets. Relative NDCG@10 gains hit 44% on Amazon CDs.
GPT-4.1 Hits 24.65% Derm Accuracy on Real Cases vs 42.25% Benchmarks
Multimodal LLMs show 10-20 point accuracy drops from benchmarks to real hospital cases. GPT-4.1 falls from 42.25% to 24.65%.
ARMOR 2025: Military Safety Benchmark Exposes LLM Gaps Across 21 Models
ARMOR 2025 benchmark tests 21 LLMs against military legal doctrines, revealing critical safety gaps that civilian benchmarks miss.