gentic.news — AI News Intelligence Platform

methodology

30 articles about methodology in AI news

Claude Opus 4.7 Matches Dedicated NMR Software on Chemistry Tasks

Claude Opus 4.7 matches NMR software on chemistry tasks, per an Anthropic blog post, but the methodology and benchmarks are undisclosed.

94% relevant

LangFuse on Evaluating AI Agents in Production

The article outlines a practical methodology for monitoring and enhancing AI agent performance post-deployment. It emphasizes combining automated LLM-based evaluation with human feedback loops to create actionable datasets for fine-tuning.

78% relevant

Google Launches PaperBanana AI to Format Raw Methods into Publication Text

Google has launched PaperBanana, an AI tool designed to transform unstructured methodology notes into polished, publication-ready text. This targets a key bottleneck in academic writing, automating the formatting and structuring of methods sections.

87% relevant

Google's PaperBanana AI Generates Academic Diagrams, Beats Human Designs 3:1

Google released PaperBanana, an AI system that transforms raw methodology text into publication-ready academic diagrams using a 5-agent creative pipeline. In blind evaluations, humans preferred its outputs nearly 3 out of 4 times over manually designed figures.

95% relevant

Study of 1,222 Users Claims ChatGPT Use Reduces Cognitive Effort

A viral social media post references a study of 1,222 people, claiming it proves ChatGPT use reduces cognitive effort. The claim lacks published methodology or data, highlighting the ongoing debate over AI's impact on human cognition.

87% relevant

Google's Groundsource: Using AI to Mine Historical Disaster Data from Global News

Google AI Research has unveiled Groundsource, a novel methodology using the Gemini model to transform unstructured global news reports into structured historical datasets. The system addresses critical data gaps in disaster management, starting with 2.6 million urban flash flood events.

75% relevant

The Trust Revolution: New AI Benchmark Promises Unprecedented Transparency and Integrity

A new AI benchmark system introduces a dual-check methodology with monthly refreshes to prevent memorization, offering full transparency through open-source verification and independence from tool vendors.

85% relevant

New AI Coding Benchmark Sets Standard with Real-World Pull Requests

A groundbreaking AI coding benchmark uses real GitHub pull requests instead of synthetic tests, measuring both precision and recall across 8 tools. The transparent methodology includes publishing all results, even unfavorable ones.

85% relevant

Zhipu's GLM 5.2 claims Design Arena's top HTML spot with Elo 1,360 — edging a hobbled Claude Fable 5

Zhipu AI's 753-billion-parameter open-weight model GLM 5.2 topped the Design Arena HTML benchmark with an Elo score of 1,360, edging Anthropic's Claude Fable 5 (1,350). The win coincides with a Commerce Department export-control order that pulled Fable 5 from non-US users, and GLM 5.2's API pricing

92% relevant

OpenAI Can Predict Model Failures via Past Chat Replay

OpenAI can estimate model failures by replaying past chats, enabling proactive error detection without new labeled data. No benchmark numbers disclosed.

100% relevant

SciRisk-Bench Tests 10 Risk Dimensions Across 7 Science Disciplines

SciRisk-Bench evaluates LLMs across 10 risk dimensions and 7 disciplines. Safety omission and lab safety show highest vulnerability.

68% relevant

OpenAI DeploymentSim predicts GPT-5 errors 92% of the time pre-launch

OpenAI's Deployment Simulation predicted GPT-5 errors with 92% accuracy using 1.3M real conversations, outperforming standard safety tests.

90% relevant

Anthropic Study: Senior Engineers Beat Juniors With AI by 31%

Anthropic study: senior engineers achieve 31% higher success rate with Claude Code than juniors, challenging the democratization narrative.

100% relevant

AI editor matches pro on 84% of video cuts in blind test

An AI editor matched a professional on 84% of video cuts in a blind test on a 4-hour project, suggesting that editorial judgment is partially automatable.

65% relevant

NVIDIA Blackwell Ultra Leads First Agentic AI Benchmark, 20x Agents/MW vs Hopper

NVIDIA Blackwell Ultra NVL72 leads the first AgentPerf benchmark for agentic AI, delivering 20x more agents per megawatt than Hopper.

92% relevant

General LLMs Beat Clinical AI Tools in Doctor Study

Frontier LLMs beat clinical AI tools like OpenEvidence in all evaluations, matching Google Search AI Overview.

83% relevant

MiniMax M3 Exceeds Human Gold-Medal on Math Benchmarks via MaxProof

MiniMax's M3 exceeded human gold-medal performance on math benchmarks via MaxProof, but no scores or details were disclosed.

100% relevant

MCP Server Report: 54% of 39,762 Servers Have Zero Community Adoption

54% of 39,762 MCP servers are invisible to AI agents due to zero community adoption. Use Agent Tool Intelligence's new grading model to boost your server's discoverability.

90% relevant

MiniMax-M3 Scores 55 on AI Index, Open-Source Lead Looms

MiniMax-M3 scored 55 on the Artificial Analysis Intelligence Index, set to become the leading open-source model once weights are released.

87% relevant

Nemotron 3 Ultra matches GPT-5.5 on physics test at 10X lower cost

Nemotron 3 Ultra matched GPT-5.5 on a physics test at 10X lower cost ($0.051 vs $0.57), highlighting MoE efficiency.

85% relevant

Law Profs Prefer AI Answers 75% of Time in Stanford Study

Stanford researchers found law professors preferred AI answers 75% of the time in a blind legal analysis test, per @rohanpaul_ai.

85% relevant

Open-Weight Models Trail Frontier AI by Four Months: EpochAI

EpochAI finds open-weight models trail frontier closed-source models by four months, a small gap reflecting rapid catch-up.

79% relevant

No Rigorous Productivity Tests Exist for Post-2025 Autonomous Coding Tools

No productivity studies exist for autonomous coding tools launched in December 2025. All research predates the Claude Code/Codex revolution, creating a major knowledge gap.

72% relevant

Cerebras Hits 981 Tokens/sec on 1T-Parameter Kimi K2.6, Claims 6.7× GPU Cloud Speedup

Cerebras reported 981 tokens/sec on the 1T-parameter Kimi K2.6 model, a 6.7× speedup over the next GPU cloud, validated by an independent third party.

93% relevant

Claude Reaches 30M Daily Users; Anthropic Scales

Claude reportedly reaches 30 million daily users per a third-party claim, though Anthropic has not confirmed the figure. If accurate, the figure shows growing consumer adoption but still trails ChatGPT.

79% relevant

Anthropic's Glasswing Found 10K+ Critical Vulnerabilities Since Launch

Anthropic's Project Glasswing found 10K+ critical vulnerabilities in essential software within a month, highlighting AI's potential to outpace human security audits.

100% relevant

Composer 2.5 Scores 62 on Coding Index at $0.07 vs. $4-5 for Rivals

Composer 2.5 scores 62 on coding index at $0.07/task vs $4-5 for rivals scoring 65-66. 60x cost savings with near-parity performance.

83% relevant

OpenAI Model Disproves Erdős Conjecture, First AI to Solve Open Math Problem

OpenAI reasoning model disproves 1946 Erdős conjecture, first AI to solve open math problem. Cross-domain proof verified by Gowers.

97% relevant

Glean benchmark: Off-the-shelf MCP costs 30% more tokens than indexed context

Glean benchmark: off-the-shelf MCP in Claude Cowork loses 2.5x more tasks and uses 30% more tokens than indexed context.

88% relevant

CLAUDE.md Wastes 7K+ Tokens Per Turn; Skills Cut to 50

A 1,000-line CLAUDE.md burns 7,000-10,000 tokens per turn on instructions the model already knows. Skills using progressive disclosure cut that to ~50 tokens.

100% relevant