benchmark improvement
30 articles about benchmark improvement in AI news
Cursor Launches Composer 2 with $0.50/M Input Token Pricing, Claims Major Benchmark Gains
Cursor has released Composer 2, a coding AI model priced at $0.50 per million input tokens and $2.50 per million output tokens. The company reports significant benchmark improvements over previous versions across CursorBench, Terminal-Bench 2.0, and SWE-bench Multilingual.
MiniMax M2.7 Achieves 30% Internal Benchmark Gain via Self-Improvement Loops, Ties Gemini 3.1 on MLE Bench Lite
MiniMax had its M2.7 model run 100+ autonomous development cycles—analyzing failures, modifying code, and evaluating changes—resulting in a 30% performance improvement. The model now handles 30-50% of the research workflow and tied Gemini 3.1 in ML competition trials.
GPT-5.5 Pro Leapfrogs on Epoch Benchmark; Base Model Beats Prior Pro
A tweet from @kimmonismus reveals GPT-5.5 Pro shows significant Epoch benchmark gains, and the non-Pro GPT-5.5 surpasses GPT-5.4 Pro, suggesting major efficiency improvements at OpenAI.
Memento-Skills Agent System Achieves 116.2% Relative Improvement on Humanity's Last Exam Without LLM Updates
Memento-Skills is a generalist agent system that autonomously constructs and adapts task-specific agents through experience. It enables continual learning without updating LLM parameters, achieving 26.2% and 116.2% relative improvements on GAIA and Humanity's Last Exam benchmarks.
The Hidden Contamination Crisis: How Semantic Duplicates Are Skewing AI Benchmark Results
New research reveals that LLM training data contains widespread 'soft contamination' through semantic duplicates of benchmark test data, artificially inflating performance metrics and raising questions about genuine AI capability improvements.
MIT's RLM Handles 10M+ Tokens, Outperforms RAG on Long-Context Benchmarks
MIT researchers introduced Recursive Language Models (RLMs), which treat long documents as an external environment and use code to search, slice, and filter data, achieving 58.00 on a hard long-context benchmark versus 0.04 for standard models.
ThermoQA Benchmark Reveals LLM Reasoning Gaps: Claude Opus Leads at 94.1%
Researchers released ThermoQA, a 293-question benchmark testing thermodynamic reasoning. Claude Opus 4.6 scored 94.1% overall, but models showed significant degradation on complex cycle analysis versus simple property lookups.
Personalized LLM Benchmarks: Individual Rankings Diverge from Aggregate (ρ=0.04)
A new study of 115 Chatbot Arena users finds personalized LLM rankings diverge dramatically from aggregate benchmarks, with an average Bradley-Terry correlation of only ρ=0.04. This challenges the validity of one-size-fits-all model evaluations.
SocialGrid Benchmark Shows LLMs Fail at Deception, Score Below 60% on Planning
Researchers introduced SocialGrid, a multi-agent benchmark inspired by Among Us. It shows state-of-the-art LLMs fail at deception detection and task planning, scoring below 60% accuracy.
GPT-5.5 Limited Rollout Begins, Frontend Improvements Noted
OpenAI has started a limited rollout of GPT-5.5 to select users, with early reports highlighting significant frontend quality improvements. This suggests an incremental update focused on user experience rather than core model capabilities.
MLX-Benchmark Suite Launches as First Comprehensive LLM Eval for Apple Silicon
The MLX-Benchmark Suite has been released as the first comprehensive evaluation framework for Large Language Models running on Apple's MLX framework. It provides standardized metrics for models optimized for Apple Silicon hardware.
RiskWebWorld: A New Benchmark Exposes the Limits of AI for E-commerce Risk
Researchers introduced RiskWebWorld, a realistic benchmark for testing GUI agents on 1,513 authentic e-commerce risk management tasks. It reveals a major capability gap, showing even the best models fail over 50% of the time, highlighting the immaturity of AI for high-stakes operational automation.
OpenAI Quietly Phasing Out MRCR Benchmark in Claude Evaluations
An OpenAI engineer confirmed the company is phasing out the MRCR benchmark from Claude's system card, citing its poor correlation with real-world performance and high evaluation cost. This reflects a broader industry move toward more practical, cost-effective evaluation methods.
HORIZON Benchmark Diagnoses Long-Horizon Failures in GPT-5 and Claude Agents
A new benchmark called HORIZON systematically analyzes where and why LLM agents like GPT-5 and Claude fail on long-horizon tasks. The study collected over 3100 agent trajectories and provides a scalable method for failure attribution, offering practical guidance for building more reliable agents.
LLM Evaluation Beyond Benchmarks
The source critiques traditional LLM benchmarks as inadequate for assessing performance in live applications. It proposes a shift toward creating continuous test suites that mirror actual user interactions and business logic to ensure reliability and safety.
Mythos AI Model Reportedly 'Destroys' Benchmarks in Early Leak
A viral tweet claims the unreleased Mythos AI model 'destroys every other model' based on leaked benchmarks. No official confirmation or technical details are available.
MLPerf 6.0: NVIDIA Sweeps New Benchmarks, AMD MI355X Within 30% on Select Tests
MLPerf 6.0 results show NVIDIA winning every new benchmark, with its GB300 NVL72 system achieving nearly 3x more throughput than six months ago. AMD's MI355X showed progress, coming within 10-30% on select single-node tests but skipping most new benchmarks.
Rank, Don't Generate: A New Benchmark for Factual, Ranked Explanations in Recommendation Systems
A new research paper formalizes explainable recommendation as a statement-level ranking problem, not a generation task. It introduces the StaR benchmark, built from Amazon reviews, showing that simple popularity baselines can outperform state-of-the-art models in personalized explanation ranking.
Apple M5 Max NPU Benchmarks 2x Faster Than Intel Panther Lake NPU in Parakeet v3 AI Inference Test
A leaked benchmark using the Parakeet v3 AI speech recognition model shows Apple's next-generation M5 Max Neural Processing Unit (NPU) delivering double the inference speed of Intel's competing Panther Lake NPU. This real-world test provides early performance data in the intensifying on-device AI hardware race.
OmniSch Benchmark Exposes Major Gaps in LMMs for PCB Schematic Understanding
Researchers introduced OmniSch, a benchmark with 1,854 real PCB schematics, to evaluate LMMs on converting diagrams to netlist graphs. Results show current models have unreliable grounding, brittle parsing, and inconsistent connectivity reasoning for engineering artifacts.
QAsk-Nav Benchmark Enables Separate Scoring of Navigation and Dialogue for Collaborative AI Agents
A new benchmark called QAsk-Nav enables separate evaluation of navigation and question-asking for collaborative embodied AI agents. The accompanying Light-CoNav model outperforms state-of-the-art methods while being significantly more efficient.
Agent Psychometrics: New Framework Predicts Task-Level Success in Agentic Coding Benchmarks with 0.81 AUC
A new research paper introduces a framework using Item Response Theory and task features to predict success on individual agentic coding tasks, achieving 0.81 AUC. This enables benchmark designers to calibrate difficulty without expensive evaluations.
ReCUBE Benchmark Reveals GPT-5 Scores Only 37.6% on Repository-Level Code Generation
Researchers introduce ReCUBE, a benchmark isolating LLMs' ability to use repository-wide context for code generation. GPT-5 achieves just a 37.57% strict pass rate, showing the task remains highly challenging.
Frontier AI Models Reportedly Score Below 1% on ARC-AGI v3 Benchmark
A social media post claims frontier AI models have achieved below 1% performance on the ARC-AGI v3 benchmark, suggesting a potential saturation point for current scaling approaches. No specific models or scores were disclosed.
Meta's Hyperagents Enable Self-Referential AI Improvement, Achieving 0.710 Accuracy on Paper Review
Meta researchers introduce Hyperagents, where the self-improvement mechanism itself can be edited. The system autonomously discovered innovations like persistent memory, improving from 0.0 to 0.710 test accuracy on paper review tasks.
Ego2Web Benchmark Bridges Egocentric Video and Web Agents, Exposing Major Performance Gaps
Researchers introduce Ego2Web, the first benchmark requiring AI agents to understand real-world first-person video and execute related web tasks. Their novel Ego2WebJudge evaluation method achieves 84% human agreement, while state-of-the-art agents perform poorly across all task categories.
ReXInTheWild Benchmark Reveals VLMs Struggle with Medical Photos: Gemini-3 Leads at 78%, MedGemma Trails at 37%
Researchers introduced ReXInTheWild, a benchmark of 955 clinician-verified questions based on 484 real medical photographs. Leading multimodal models show wide performance gaps, with Gemini-3 scoring 78% accuracy while the specialized MedGemma model achieved only 37%.
Research Identifies 'Giant Blind Spot' in AI Scaling: Models Improve on Benchmarks Without Understanding
A new research paper argues that current AI scaling approaches have a fundamental flaw: models improve on narrow benchmarks without developing genuine understanding, creating a 'giant blind spot' in progress measurement.
EMBRAG Framework Achieves SOTA on KGQA Benchmarks via Embedding-Space Rule Generation
Researchers propose EMBRAG, a framework that uses LLMs to generate logical rules from a query, then performs multi-hop reasoning in knowledge graph embedding space. It sets new state-of-the-art on two KGQA benchmarks.
Evaluating AI Agents in Practice: Benchmarks, Frameworks, and Lessons Learned
A new report details the practical challenges and emerging best practices for evaluating AI agents in real-world applications, moving beyond simple benchmarks to assess reliability, safety, and business value.