reasoning

30 articles about reasoning in AI news

30B-A3B Reasoning Model Hits Gold Medal on Physics, Math Olympiads

30B-A3B reasoning model from @stingning achieves gold-medal level on physics and math Olympiads, released on Hugging Face.

May 16, 2026 · 85% relevant

LASAR Cuts Latent Reasoning Steps in Half for GenRec at 20x Speedup Over CoT

LASAR nearly halves latent reasoning steps and achieves 20x speedup over explicit CoT in generative recommendation, outperforming baselines on three datasets.

May 12, 2026 · 80% relevant

RAG's New Frontier: When to Retrieve During Reasoning

A new RAG paradigm retrieves at multiple reasoning steps via a learned gate, boosting multi-hop QA by 15-20% on HotpotQA.

May 1, 2026 · 75% relevant

ThermoQA Benchmark Reveals LLM Reasoning Gaps: Claude Opus Leads at 94.1%

Researchers released ThermoQA, a 293-question benchmark testing thermodynamic reasoning. Claude Opus 4.6 scored 94.1% overall, but models showed significant degradation on complex cycle analysis versus simple property lookups.

Apr 23, 2026 · 78% relevant

NVIDIA's Audio Flamingo Next: 30-Min Audio, Time-Grounded Reasoning

NVIDIA has launched Audio Flamingo Next, a next-generation open audio-language model supporting 30-minute audio inputs and time-grounded reasoning. Trained on over 1 million hours of data, it reportedly outperforms larger models on key audio understanding benchmarks.

Apr 19, 2026 · 95% relevant

Microsoft's MEMENTO Method Reduces LLM Reasoning Memory by 3x

Microsoft researchers introduced MEMENTO, a method where LLMs generate structured 'notes' during multi-step reasoning, reducing the memory footprint of the reasoning process by 3x while maintaining performance. This addresses a key bottleneck in deploying complex reasoning models.

Apr 16, 2026 · 80% relevant

SPPO: Sequence-Level PPO Cuts RL Training Time 5.9x for Math Reasoning

Researchers introduced SPPO, a sequence-level PPO algorithm that reformulates reasoning as a contextual bandit. It achieves a 5.9x speedup over GRPO while matching performance on AIME, AMC, and MATH benchmarks at 1.5B and 7B scales.

Apr 15, 2026 · 91% relevant

How Downgrading to Claude Code 2.1.106 Fixes Model Reasoning Issues

Developers report model reasoning improvements by downgrading to Claude Code 2.1.106 and disabling the Claude Agent feature in global settings.

Apr 14, 2026 · 96% relevant

Baidu's RLVR Method Boosts Open-Ended Reasoning by 3.29 Points on 14B Model

Baidu researchers developed RLVR, a method that reformulates subjective tasks like writing as verifiable multiple-choice questions for reinforcement learning. This approach improved a 14B reasoning model by an average of 3.29 points across seven open-ended benchmarks compared to standard RLHF.

Apr 13, 2026 · 85% relevant

AGIBOT Launches $536K 'Reasoning to Action' Challenge for Robotics

AGIBOT has announced a $536,000 prize competition targeting the 'Reasoning to Action' problem in robotics. This challenge aims to bridge high-level reasoning with low-level control, a critical hurdle for deploying generalist robots.

Apr 11, 2026 · 85% relevant

MLX Enables Local Grounded Reasoning for Satellite, Security, Robotics AI

Apple's MLX framework is enabling 'local grounded reasoning' for AI applications in satellite imagery, security systems, and robotics, moving complex tasks from the cloud to on-device processing.

Apr 11, 2026 · 85% relevant

Google Gemini Launches Notebooks for AI-Powered Long-Form Reasoning

Google has launched Gemini Notebooks, a persistent workspace for long-form AI reasoning and iterative project development. This feature directly targets the 'second brain' use case for AI assistants.

Apr 8, 2026 · 85% relevant

Massive Video Reasoning Dataset Released, Reportedly 1000x Larger Than Predecessors

An unverified report claims the release of a video reasoning dataset roughly 1000x larger than existing benchmarks. If true, it would be a significant resource for training next-generation video understanding models.

Apr 8, 2026 · 99% relevant

CMU Study: Top LLMs Fail Simple Contradiction Tests, Lack True Reasoning

Carnegie Mellon researchers tested 14 leading LLMs on simple contradiction tasks; all failed consistently, revealing fundamental reasoning gaps despite advanced benchmarks.

Apr 6, 2026 · 89% relevant

Scaling Law Plateau Not Universal: More Tokens Boost Reasoning AI Performance

Empirical evidence indicates the 'second scaling law'—performance gains from increased computation—does not fully plateau for many reasoning tasks. Benchmark results may be artificially limited by token budgets, not model capability.

Apr 5, 2026 · 85% relevant

Study Finds LLM 'Brain Activity' Collapses Under Hard Questions, Revealing Internal Reasoning Limits

New research shows language models' internal activation patterns shrink and simplify when faced with difficult reasoning tasks, suggesting they may rely on shortcuts rather than deep reasoning. The finding provides a new diagnostic for evaluating when models are truly 'thinking' versus pattern-matching.

Mar 31, 2026 · 85% relevant

ViGoR-Bench Exposes 'Logical Desert' in SOTA Visual AI: 20+ Models Fail Physical, Causal Reasoning Tasks

Researchers introduce ViGoR-Bench, a unified benchmark testing visual generative models on physical, causal, and spatial reasoning. It reveals significant deficits in over 20 leading models, challenging the 'performance mirage' of current evaluations.

Mar 30, 2026 · 94% relevant

QuatRoPE: New Positional Embedding Enables Linear-Scale 3D Spatial Reasoning in LLMs, Outperforming Quadratic Methods

Researchers propose QuatRoPE, a novel positional embedding method that encodes 3D object relations with linear input scaling. Paired with IGRE, it improves spatial reasoning in LLMs while preserving their original language capabilities.

Mar 27, 2026 · 79% relevant

LLM Multi-Agent Framework 'Shared Workspace' Proposed to Improve Complex Reasoning via Task Decomposition

A new research paper proposes a multi-agent framework where LLMs split complex reasoning tasks across specialized agents that collaborate via a shared workspace. This approach aims to overcome single-model limitations in planning and tool use.

Mar 25, 2026 · 85% relevant

SIDReasoner: A New Framework for Reasoning-Enhanced Generative Recommendation

Researchers propose SIDReasoner, a two-stage framework that improves LLM-based recommendation by enhancing reasoning over Semantic IDs. It strengthens the alignment between item tokens and language, enabling better interpretability and cross-domain generalization without extensive labeled reasoning data.

Mar 25, 2026 · 82% relevant

Luma Labs Launches Uni-1: An Autoregressive Transformer for Image Generation with a Pre-Generation Reasoning Phase

Luma Labs has released Uni-1, a foundational image model that uses an autoregressive transformer to reason about user intent before generating pixels. It aims to address the 'intent gap' common in diffusion models by adding a structured reasoning step.

Mar 24, 2026 · 88% relevant

New 'Step-by-Step Feedback' Reward Model Trains AI Agents to Fix Reasoning Errors

Researchers introduce a reward model that provides granular, step-by-step feedback to AI agents during training, helping them identify and correct reasoning errors. The approach aims to improve agent performance on complex, multi-step tasks.

Mar 23, 2026 · 85% relevant

Research Suggests Social Reasoning and Logical Thinking Improve AI Agent Team Collaboration

A research paper indicates that incorporating social reasoning and logical thinking capabilities into AI agent teams leads to more effective collaboration. The findings were highlighted in a tweet by AI researcher Rohan Paul.

Mar 22, 2026 · 87% relevant

Reasoning Training Fails to Improve Embedding Quality: Study Finds No Transfer to General Language Understanding

Research shows that training AI models for step-by-step reasoning does not improve their ability to create semantic embeddings for search or general QA. Advanced reasoning models perform identically to base models on standard retrieval benchmarks.

Mar 21, 2026 · 85% relevant

LLMs Score Only 22% Win Rate in Multi-Agent Clue Game, Revealing Deductive Reasoning Gaps

Researchers created a text-based Clue game to test LLM agents' multi-step deductive reasoning. Across 18 games with GPT-4o-mini and Gemini-2.5-Flash agents, only 4 correct wins were achieved, showing fine-tuning on logic puzzles doesn't reliably improve performance.

Mar 19, 2026 · 75% relevant
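
As a quick check on the Clue item above, the 22% headline figure follows directly from the reported 4 correct wins in 18 games. The helper name `win_rate` is an illustrative assumption, not something from the study:

```python
def win_rate(wins: int, games: int) -> float:
    """Return wins as a percentage of games played."""
    if games <= 0:
        raise ValueError("games must be positive")
    return 100.0 * wins / games


# Figures as reported in the summary: 4 correct wins across 18 games.
rate = win_rate(4, 18)
print(f"{rate:.1f}%")  # prints "22.2%"
```

With only 18 games, each additional win shifts the rate by about 5.6 percentage points, so the figure is best read as a rough indicator.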

Fine-Tuning Gemma 3 1B-IT for Financial Reasoning with QLoRA

A technical guide details using QLoRA and reasoning-augmented data to fine-tune Google's Gemma 3 1B-IT model for financial analysis. This demonstrates a method to specialize small language models for complex, domain-specific tasks.

Mar 18, 2026 · 89% relevant

Video Reasoning Models Use Chain-of-Steps in Diffusion Denoising, Not Cross-Frame Analysis

New research reveals video reasoning models don't analyze frames sequentially but instead use a Chain-of-Steps mechanism within diffusion denoising, developing emergent working memory and self-correction.

Mar 18, 2026 · 85% relevant

ReasonGR: A Framework for Multi-Step Semantic Reasoning in Generative Retrieval

Researchers propose ReasonGR, a framework to enhance generative retrieval models' ability to handle complex, numerical queries requiring multi-step reasoning. Tested on financial QA, it improves accuracy for tasks like analyzing reports.

Mar 16, 2026 · 80% relevant

CRYSTAL Benchmark Reveals Universal Step-Disorder in MLLMs: No Model Preserves >60% of Reasoning Steps in Correct Order

Researchers introduce CRYSTAL, a 6,372-instance benchmark evaluating multimodal reasoning through verifiable steps. It reveals systematic failures in 20 tested MLLMs, including universal cherry-picking and disordered reasoning chains.

Mar 16, 2026 · 95% relevant

Anthropic Surpasses Google in Extended Context AI, Redefining Long-Form Reasoning

Anthropic's Claude has reportedly outperformed Google's models in maintaining attention and reasoning across extended contexts, marking a significant shift in the AI landscape where context length has become a critical competitive frontier.

Mar 14, 2026 · 87% relevant
