methodology
30 articles about methodology in AI news
Claude Opus 4.7 Matches Dedicated NMR Software on Chemistry Tasks
Claude Opus 4.7 matches dedicated NMR software on chemistry tasks, according to an Anthropic blog post, but the methodology and benchmarks are undisclosed.
LangFuse on Evaluating AI Agents in Production
The article outlines a practical methodology for monitoring and enhancing AI agent performance post-deployment. It emphasizes combining automated LLM-based evaluation with human feedback loops to create actionable datasets for fine-tuning.
Google Launches PaperBanana AI to Format Raw Methods into Publication Text
Google has launched PaperBanana, an AI tool designed to transform unstructured methodology notes into polished, publication-ready text. This targets a key bottleneck in academic writing, automating the formatting and structuring of methods sections.
Google's PaperBanana AI Generates Academic Diagrams, Beats Human Designs 3:1
Google released PaperBanana, an AI system that transforms raw methodology text into publication-ready academic diagrams using a 5-agent creative pipeline. In blind evaluations, humans preferred its outputs nearly 3 out of 4 times over manually designed figures.
Study of 1,222 Users Claims ChatGPT Use Reduces Cognitive Effort
A viral social media post references a study of 1,222 people, claiming it proves ChatGPT use reduces cognitive effort. The claim lacks published methodology or data, highlighting the ongoing debate over AI's impact on human cognition.
Google's Groundsource: Using AI to Mine Historical Disaster Data from Global News
Google AI Research has unveiled Groundsource, a novel methodology using the Gemini model to transform unstructured global news reports into structured historical datasets. The system addresses critical data gaps in disaster management, starting with 2.6 million urban flash flood events.
The Trust Revolution: New AI Benchmark Promises Unprecedented Transparency and Integrity
A new AI benchmark system introduces a dual-check methodology with monthly refreshes to prevent memorization, offering full transparency through open-source verification and independence from tool vendors.
New AI Coding Benchmark Sets Standard with Real-World Pull Requests
A groundbreaking AI coding benchmark uses real GitHub pull requests instead of synthetic tests, measuring both precision and recall across 8 tools. The transparent methodology includes publishing all results, even unfavorable ones.
Zhipu's GLM 5.2 claims Design Arena's top HTML spot with Elo 1,360 — edging a hobbled Claude Fable 5
Zhipu AI's 753-billion-parameter open-weight model GLM 5.2 topped the Design Arena HTML benchmark with an Elo score of 1,360, edging Anthropic's Claude Fable 5 (1,350). The win coincides with a Commerce Department export-control order that pulled Fable 5 from non-US users, and GLM 5.2's API pricing
OpenAI Can Predict Model Failures via Past Chat Replay
OpenAI can estimate model failures by replaying past chats, enabling proactive error detection without new labeled data. No benchmark numbers disclosed.
SciRisk-Bench Tests 10 Risk Dimensions Across 7 Science Disciplines
SciRisk-Bench evaluates LLMs across 10 risk dimensions and 7 disciplines. Safety omission and lab safety show highest vulnerability.
OpenAI DeploymentSim predicts GPT-5 errors 92% of the time pre-launch
OpenAI's Deployment Simulation predicted GPT-5 errors with 92% accuracy using 1.3M real conversations, outperforming standard safety tests.
Anthropic Study: Senior Engineers Beat Juniors With AI by 31%
Anthropic study: senior engineers achieve 31% higher success rate with Claude Code than juniors, challenging the democratization narrative.
AI editor matches pro on 84% of video cuts in blind test
An AI editor matched a professional on 84% of video cuts in a blind test on a 4-hour project, suggesting editorial judgment is partially automatable.
NVIDIA Blackwell Ultra Leads First Agentic AI Benchmark, 20x Agents/MW vs Hopper
NVIDIA Blackwell Ultra NVL72 leads the first AgentPerf benchmark for agentic AI, delivering 20x more agents per megawatt than Hopper.
General LLMs Beat Clinical AI Tools in Doctor Study
Frontier LLMs beat clinical AI tools like OpenEvidence in all evaluations, matching Google Search AI Overview.
MiniMax M3 Exceeds Human Gold-Medal on Math Benchmarks via MaxProof
MiniMax's M3 exceeded human gold-medal performance on math benchmarks via MaxProof, but no scores or details were disclosed.
MCP Server Report: 54% of 39,762 Servers Have Zero Community Adoption
54% of 39,762 MCP servers are invisible to AI agents due to zero community adoption. Use Agent Tool Intelligence's new grading model to boost your server's discoverability.
MiniMax-M3 Scores 55 on AI Index, Open-Source Lead Looms
MiniMax-M3 scored 55 on the Artificial Analysis Intelligence Index, set to become the leading open-source model once weights are released.
Nemotron 3 Ultra matches GPT-5.5 on physics test at 10X lower cost
Nemotron 3 Ultra matched GPT-5.5 on a physics test at 10X lower cost ($0.051 vs $0.57), highlighting MoE efficiency.
Law Profs Prefer AI Answers 75% of Time in Stanford Study
Stanford researchers found law professors preferred AI answers 75% of the time in a blind legal analysis test, per @rohanpaul_ai.
Open-Weight Models Trail Frontier AI by Four Months: EpochAI
EpochAI finds open-weight models trail frontier closed-source models by four months, a small gap reflecting rapid catch-up.
No Rigorous Productivity Tests Exist for Post-2025 Autonomous Coding Tools
No productivity studies exist for autonomous coding tools launched December 2025. All research predates the Claude Code/Codex revolution, creating a major knowledge gap.
Cerebras Hits 981 Tokens/sec on 1T-Parameter Kimi K2.6, Claims 6.7× GPU Cloud Speedup
Cerebras reported 981 tokens/sec on the 1T-parameter Kimi K2.6 model, a 6.7× speedup over the next-fastest GPU cloud, validated by an independent third party.
Claude Reaches 30M Daily Users; Anthropic Scales
Claude reportedly reaches 30 million daily users per a third-party claim, though Anthropic has not confirmed the figure. The milestone, if accurate, shows growing consumer adoption but lags behind ChatGPT.
Anthropic's Glasswing Found 10K+ Critical Vulnerabilities Since Launch
Anthropic's Project Glasswing found 10K+ critical vulnerabilities in essential software within a month, highlighting AI's potential to outpace human security audits.
Composer 2.5 Scores 62 on Coding Index at $0.07 vs. $4-5 for Rivals
Composer 2.5 scores 62 on coding index at $0.07/task vs $4-5 for rivals scoring 65-66. 60x cost savings with near-parity performance.
OpenAI Model Disproves Erdős Conjecture, First AI to Solve Open Math Problem
OpenAI reasoning model disproves 1946 Erdős conjecture, first AI to solve open math problem. Cross-domain proof verified by Gowers.
Glean benchmark: Off-the-shelf MCP costs 30% more tokens than indexed context
Glean benchmark: off-the-shelf MCP in Claude Cowork loses 2.5x more tasks and uses 30% more tokens than indexed context.
CLAUDE.md Wastes 7K+ Tokens Per Turn; Skills Cut to 50
A 1,000-line CLAUDE.md burns 7,000-10,000 tokens per turn on instructions the model already knows. Skills using progressive disclosure cut that to ~50 tokens.