benchmark

30 articles about benchmark in AI news

Zhipu GLM-5.2 beats Anthropic's Mythos on bug-hunt benchmark

Zhipu AI's GLM-5.2 beat Anthropic's Claude Opus 4.8 on a cybersecurity bug-hunting benchmark, then matched it with extra instructions, marking another 'DeepSeek moment'.

Jun 29, 2026 | 75% relevant

GPT-5.6 Sol, Terra, Luna: Benchmark Performance Depends on Which Test You Use

OpenAI released GPT-5.6 as three tiers—Sol, Terra, Luna—on June 27, 2026. Sol tops Terminal-Bench 2.1 but trails competitors on other benchmarks. The release shifts focus to tiered pricing and efficiency, but access remains restricted.

Jun 28, 2026 | 74% relevant

Epoch AI's CursorBench Benchmarks AI Code Editing at Scale

Epoch AI launched CursorBench, a 500-task benchmark for AI code editors. It reveals a 15% accuracy gap vs. humans and 3x latency variance.

Jun 27, 2026 | 95% relevant

SciCode: Epoch AI Launches Benchmark Measuring AI Research Ability

Epoch AI launched SciCode, a benchmark that tests LLMs on real research coding tasks. Top models score below 30%, exposing a gap between coding benchmarks and scientific ability.

Jun 27, 2026 | 95% relevant

MirrorCode Benchmark Costs $2,600 Per Run, Challenges AI Coding Limits

Epoch AI and METR launched MirrorCode, a $2,600-per-run coding benchmark. Claude Opus 4.7 leads with 56% solve rate.

Jun 26, 2026 | 75% relevant

Zhipu GLM-5.2 tops global coding benchmarks, sparks 'DeepSeek moment'

Zhipu AI's GLM-5.2 ranks top-3 globally on a coding benchmark, with US engineers calling it a daily driver superior to GPT-5.5.

Jun 26, 2026 | 100% relevant

OpenAI GPT-5.5-Cyber Beats Anthropic Mythos on Security Benchmarks

OpenAI's GPT-5.5-Cyber beats Anthropic's Mythos on security benchmarks. Updated Codex plugin auto-patches after scanning 30M commits.

Jun 23, 2026 | 100% relevant

OpenAI shows small doses of beneficial-trait RL improve 44 of 53 safety benchmarks — and the gains generalize

OpenAI researchers Jagadeesh, Saab, Singhal et al. published findings on June 18 showing that RL training on traits like honesty and corrigibility improved 44 of 53 safety benchmarks. Gains generalized across domains not used in training, and the model resisted harmful fine-tuning better than the baseline.

Jun 19, 2026 | 95% relevant

Estonian Institute: Claude Tops Russian Propaganda Benchmark, Mistral Trails

An Estonian Language Institute benchmark tests 60 AI models against Russian propaganda. Claude ranks first; Mistral trails with a 36.67% misinformation rate.

Jun 16, 2026 | 72% relevant

Visual-Seeker: Active Visual Reasoning Beats Proprietary MLLMs on 5 Benchmarks

Visual-Seeker achieves SOTA on five multimodal search benchmarks, surpassing proprietary models by actively harvesting visual evidence during search.

Jun 16, 2026 | 72% relevant

NVIDIA Blackwell Ultra Leads First Agentic AI Benchmark, 20x Agents/MW vs Hopper

NVIDIA Blackwell Ultra NVL72 leads the first AgentPerf benchmark for agentic AI, delivering 20x more agents per megawatt than Hopper.

Jun 12, 2026 | 92% relevant

MiniMax M3 Exceeds Human Gold-Medal on Math Benchmarks via MaxProof

MiniMax's M3 exceeded human gold-medal performance on math benchmarks via MaxProof, but no scores or details were disclosed.

Jun 12, 2026 | 100% relevant

MacArena: 421-Task macOS Benchmark Reveals 26% CUA Ranking Inversion

The MacArena benchmark of 421 macOS tasks reveals a 26% performance gap for top models on native tasks, suggesting that current computer-use agents (CUAs) overfit to Linux distributions.

Jun 8, 2026 | 95% relevant

WorldBench: Top MLLM Scores 64% on Visually Diverse Benchmark

WorldBench, a new multimodal benchmark, tests 15 MLLMs on visually diverse images. Top model scores 64.0%, exposing fundamental gaps in visual understanding.

Jun 8, 2026 | 92% relevant

SMAC-Talk: StarCraft Benchmark Tests LLM Agents Against Deceptive Allies

SMAC-Talk extends StarCraft Multi-Agent Challenge with natural language communication, testing LLM agents against deceptive allies. Qwen3.5 models benchmarked; no model exceeds 72% win rate.

Jun 5, 2026 | 70% relevant

New 474-Game Benchmark Reveals LLMs Collapse on Counterfactual Reasoning

New 474-game benchmark reveals LLMs fail on counterfactual reasoning, with larger drops than contextual perturbations. Highlights metacognitive gaps in agentic AI.

Jun 2, 2026 | 92% relevant

SpatialBench: New Benchmark Tests Foundation Models on 3D Tasks

SpatialBench, a new benchmark from ropedia_ai, evaluates spatial foundation models across 7 tasks and 5 datasets, testing depth estimation, surface normal prediction, and 3D object detection.

May 27, 2026 | 91% relevant

NVIDIA Vera CPU Benchmarks: 1.55x Faster Than Intel Xeon in Phoronix Tests

NVIDIA Vera CPU benchmarks show 1.55x performance over Intel Xeon 6980P and 10% over AMD EPYC 9575F, with 1.2 TB/s memory bandwidth.

May 27, 2026 | 100% relevant

Microsoft SkillOpt Trains Agent Skills in Text Space, Beats 52/52 Benchmarks

Microsoft's SkillOpt trains agent skills in text space, achieving best or tied-best results in all 52 settings across 6 benchmarks and 7 models.

May 25, 2026 | 89% relevant

HAVEN Benchmark Exposes MLLM Gap Between Fluency and Video Understanding

HAVEN benchmark tests MLLMs on hierarchical video understanding across frame, shot, and video levels. Results show top models lack grounded multimodal reasoning despite fluent text generation.

May 21, 2026 | 85% relevant

ByteDance Lance 3B MoE Beats 7B Models on Multimodal Benchmarks

ByteDance released Lance, a 3B multimodal MoE model that beats 7B+ models on benchmarks through multi-task synergy and specialized pathways.

May 19, 2026 | 90% relevant

MorphoHELM Benchmark Finds Classic CV Beats Deep Learning on Cell Painting

MorphoHELM benchmark from Microsoft evaluates 20+ methods for Cell Painting, finding no deep learning model beats classic CV when batch effects are controlled.

May 18, 2026 | 74% relevant

New Paper Coins 'Curation Debt' — Benchmarks Measure Data Leakage, Not Capability

New paper coins 'curation debt' — benchmarks like MMLU measure data leakage, not capability. Proposes adversarial dynamic benchmarks.

May 16, 2026 | 85% relevant

Glean benchmark: Off-the-shelf MCP costs 30% more tokens than indexed context

Glean benchmark: off-the-shelf MCP in Claude Cowork loses 2.5x more tasks and uses 30% more tokens than indexed context.

May 15, 2026 | 88% relevant

Federated Fine-Tuning Benchmark Shows QLoRA Nears Centralized Accuracy on

Sherpa.ai's arXiv benchmark shows federated fine-tuning with QLoRA matches centralized accuracy on four healthcare and finance datasets, outperforming isolated single-institution learning under non-IID conditions.

May 15, 2026 | 88% relevant

MIRA Benchmark Tests Cross-Category IR Across 4 Scholarly Data Types

MIRA benchmark tests cross-category retrieval across four scholarly data types using real user queries and LLM-assisted judgments.

May 13, 2026 | 76% relevant

Agentick Benchmark: GPT-5 Mini Tops at 0.309, No Agent Paradigm Dominates

Agentick benchmark evaluates RL, LLM, VLM, and hybrid agents on 37 tasks. GPT-5 mini leads at 0.309 ONS, but no paradigm dominates. ASCII beats natural language.

May 11, 2026 | 98% relevant

Simple Graph Heuristic Beats Generative Recommenders on 10 of 14 Benchmarks

A no-training graph heuristic beats generative recommenders on 10 of 14 benchmarks, exposing shortcut-solvable datasets. Relative NDCG@10 gains hit 44% on Amazon CDs.

May 11, 2026 | 100% relevant

GPT-4.1 Hits 24.65% Derm Accuracy on Real Cases vs 42.25% Benchmarks

Multimodal LLMs show 10-20 point accuracy drops from benchmarks to real hospital cases. GPT-4.1 falls from 42.25% to 24.65%.

May 7, 2026 | 92% relevant

ARMOR 2025: Military Safety Benchmark Exposes LLM Gaps Across 21 Models

ARMOR 2025 benchmark tests 21 LLMs against military legal doctrines, revealing critical safety gaps that civilian benchmarks miss.

May 5, 2026 | 92% relevant
