methodology

30 articles about methodology in AI news

Claude Opus 4.7 Matches Dedicated NMR Software on Chemistry Tasks

Claude Opus 4.7 matches NMR software on chemistry tasks per Anthropic blog, but methodology and benchmarks undisclosed.

Jun 5, 202694% relevant

LangFuse on Evaluating AI Agents in Production

The article outlines a practical methodology for monitoring and enhancing AI agent performance post-deployment. It emphasizes combining automated LLM-based evaluation with human feedback loops to create actionable datasets for fine-tuning.

Apr 23, 202678% relevant

Google Launches PaperBanana AI to Format Raw Methods into Publication Text

Google has launched PaperBanana, an AI tool designed to transform unstructured methodology notes into polished, publication-ready text. This targets a key bottleneck in academic writing, automating the formatting and structuring of methods sections.

Apr 16, 202687% relevant

Google's PaperBanana AI Generates Academic Diagrams, Beats Human Designs 3:1

Google released PaperBanana, an AI system that transforms raw methodology text into publication-ready academic diagrams using a 5-agent creative pipeline. In blind evaluations, humans preferred its outputs nearly 3 out of 4 times over manually designed figures.

Apr 16, 202695% relevant

Study of 1,222 Users Claims ChatGPT Use Reduces Cognitive Effort

A viral social media post references a study of 1,222 people, claiming it proves ChatGPT use reduces cognitive effort. The claim lacks published methodology or data, highlighting the ongoing debate over AI's impact on human cognition.

Apr 7, 202687% relevant

Google's Groundsource: Using AI to Mine Historical Disaster Data from Global News

Google AI Research has unveiled Groundsource, a novel methodology using the Gemini model to transform unstructured global news reports into structured historical datasets. The system addresses critical data gaps in disaster management, starting with 2.6 million urban flash flood events.

Mar 13, 202675% relevant

The Trust Revolution: New AI Benchmark Promises Unprecedented Transparency and Integrity

A new AI benchmark system introduces a dual-check methodology with monthly refreshes to prevent memorization, offering full transparency through open-source verification and independence from tool vendors.

Feb 26, 202685% relevant

New AI Coding Benchmark Sets Standard with Real-World Pull Requests

A groundbreaking AI coding benchmark uses real GitHub pull requests instead of synthetic tests, measuring both precision and recall across 8 tools. The transparent methodology includes publishing all results, even unfavorable ones.

Feb 24, 202685% relevant

Zhipu's GLM 5.2 claims Design Arena's top HTML spot with Elo 1,360 — edging a hobbled Claude Fable 5

Zhipu AI's 753-billion-parameter open-weight model GLM 5.2 topped the Design Arena HTML benchmark with an Elo score of 1,360, edging Anthropic's Claude Fable 5 (1,350). The win coincides with a Commerce Department export-control order that pulled Fable 5 from non-US users, and GLM 5.2's API pricing

Jun 20, 202692% relevant

OpenAI Can Predict Model Failures via Past Chat Replay

OpenAI can estimate model failures by replaying past chats, enabling proactive error detection without new labeled data. No benchmark numbers disclosed.

Jun 18, 2026100% relevant

SciRisk-Bench Tests 10 Risk Dimensions Across 7 Science Disciplines

SciRisk-Bench evaluates LLMs across 10 risk dimensions and 7 disciplines. Safety omission and lab safety show highest vulnerability.

Jun 18, 202668% relevant

OpenAI DeploymentSim predicts GPT-5 errors 92% of the time pre-launch

OpenAI's Deployment Simulation predicted GPT-5 errors with 92% accuracy using 1.3M real conversations, outperforming standard safety tests.

Jun 17, 202690% relevant

Anthropic Study: Senior Engineers Beat Juniors With AI by 31%

Anthropic study: senior engineers achieve 31% higher success rate with Claude Code than juniors, challenging the democratization narrative.

Jun 16, 2026100% relevant

AI editor matches pro on 84% of video cuts in blind test

AI editor matched pro on 84% of video cuts in blind test of 4-hour project. Suggests editorial judgment is partially automatable.

Jun 15, 202665% relevant

NVIDIA Blackwell Ultra Leads First Agentic AI Benchmark, 20x Agents/MW vs Hopper

NVIDIA Blackwell Ultra NVL72 leads the first AgentPerf benchmark for agentic AI, delivering 20x more agents per megawatt than Hopper.

Jun 12, 202692% relevant

General LLMs Beat Clinical AI Tools in Doctor Study

Frontier LLMs beat clinical AI tools like OpenEvidence in all evaluations, matching Google Search AI Overview.

Jun 12, 202683% relevant

MiniMax M3 Exceeds Human Gold-Medal on Math Benchmarks via MaxProof

MiniMax's M3 exceeded human gold-medal on math benchmarks via MaxProof, but no scores or details were disclosed.

Jun 12, 2026100% relevant

MCP Server Report: 54% of 39,762 Servers Have Zero Community Adoption —

54% of 39,762 MCP servers are invisible to AI agents due to zero community adoption. Use Agent Tool Intelligence's new grading model to boost your server's discoverability.

Jun 12, 202690% relevant

MiniMax-M3 Scores 55 on AI Index, Open-Source Lead Looms

MiniMax-M3 scored 55 on the Artificial Analysis Intelligence Index, set to become the leading open-source model once weights are released.

Jun 8, 202687% relevant

Nemotron 3 Ultra matches GPT-5.5 on physics test at 10X lower cost

Nemotron 3 Ultra matched GPT-5.5 on a physics test at 10X lower cost ($0.051 vs $0.57), highlighting MoE efficiency.

Jun 5, 202685% relevant

Law Profs Prefer AI Answers 75% of Time in Stanford Study

Stanford researchers found law professors preferred AI answers 75% of time in blind legal analysis test, per @rohanpaul_ai.

Jun 3, 202685% relevant

Open-Weight Models Trail Frontier AI by Four Months: EpochAI

EpochAI finds open-weight models trail frontier closed-source models by four months, a small gap reflecting rapid catch-up.

May 29, 202679% relevant

No Rigorous Productivity Tests Exist for Post-2025 Autonomous Coding Tools

No productivity studies exist for autonomous coding tools launched December 2025. All research predates the Claude Code/Codex revolution, creating a major knowledge gap.

May 26, 202672% relevant

Cerebras Hits 981 Tokens/sec on 1T-Parameter Kimi K2.6, Claims 6.7× GPU Cloud Speedup

Cerebras reported 981 tokens/sec on the 1T-parameter Kimi K2.6 model, a 6.7× speedup over the next GPU cloud, validated by an independent third party.

May 23, 202693% relevant

Claude Reaches 30M Daily Users; Anthropic Scales

Claude reportedly reaches 30 million daily users per a third-party claim, though Anthropic has not confirmed the figure. The milestone, if accurate, shows growing consumer adoption but lags behind ChatGPT.

May 22, 202679% relevant

Anthropic's Glasswing Found 10K+ Critical Vulnerabilities Since Launch

Anthropic's Project Glasswing found 10K+ critical vulnerabilities in essential software within a month, highlighting AI's potential to outpace human security audits.

May 22, 2026100% relevant

Composer 2.5 Scores 62 on Coding Index at $0.07 vs. $4-5 for Rivals

Composer 2.5 scores 62 on coding index at $0.07/task vs $4-5 for rivals scoring 65-66. 60x cost savings with near-parity performance.

May 21, 202683% relevant

OpenAI Model Disproves Erdős Conjecture, First AI to Solve Open Math Problem

OpenAI reasoning model disproves 1946 Erdős conjecture, first AI to solve open math problem. Cross-domain proof verified by Gowers.

May 21, 202697% relevant

Glean benchmark: Off-the-shelf MCP costs 30% more tokens than indexed context

Glean benchmark: off-the-shelf MCP in Claude Cowork loses 2.5x more tasks and uses 30% more tokens than indexed context.

May 15, 202688% relevant

CLAUDE.md Wastes 7K+ Tokens Per Turn; Skills Cut to 50

A 1,000-line CLAUDE.md burns 7,000-10,000 tokens per turn on instructions the model already knows. Skills using progressive disclosure cut that to ~50 tokens.

May 15, 2026100% relevant

Explore More

AI Agents Large Language Models Claude Code OpenAI RAG MCP Fine-tuning Benchmarks Open Source AI AI Safety