reasoning
30 articles about reasoning in AI news
30B-A3B Reasoning Model Hits Gold Medal on Physics, Math Olympiads
A 30B-A3B reasoning model from @stingning reaches gold-medal-level performance on physics and math Olympiads and has been released on Hugging Face.
LASAR Cuts Latent Reasoning Steps in Half for GenRec at 20x Speedup Over CoT
LASAR nearly halves latent reasoning steps and achieves 20x speedup over explicit CoT in generative recommendation, outperforming baselines on three datasets.
RAG's New Frontier: When to Retrieve During Reasoning
A new RAG paradigm retrieves at multiple reasoning steps via a learned gate, boosting multi-hop QA by 15-20% on HotpotQA.
ThermoQA Benchmark Reveals LLM Reasoning Gaps: Claude Opus Leads at 94.1%
Researchers released ThermoQA, a 293-question benchmark testing thermodynamic reasoning. Claude Opus 4.6 scored 94.1% overall, but models showed significant degradation on complex cycle analysis versus simple property lookups.
NVIDIA's Audio Flamingo Next: 30-Min Audio, Time-Grounded Reasoning
NVIDIA has launched Audio Flamingo Next, a next-generation open audio-language model supporting 30-minute audio inputs and time-grounded reasoning. Trained on over 1 million hours of data, it reportedly outperforms larger models on key audio understanding benchmarks.
Microsoft's MEMENTO Method Reduces LLM Reasoning Memory by 3x
Microsoft researchers introduced MEMENTO, a method where LLMs generate structured 'notes' during multi-step reasoning, reducing the memory footprint of the reasoning process by 3x while maintaining performance. This addresses a key bottleneck in deploying complex reasoning models.
SPPO: Sequence-Level PPO Cuts RL Training Time 5.9x for Math Reasoning
Researchers introduced SPPO, a sequence-level PPO algorithm that reformulates reasoning as a contextual bandit. It achieves a 5.9x speedup over GRPO while matching performance on AIME, AMC, and MATH benchmarks at 1.5B and 7B scales.
How Downgrading to Claude Code 2.1.106 Fixes Model Reasoning Issues
Developers report improved model reasoning after downgrading to Claude Code 2.1.106 and disabling the Claude Agent feature in global settings.
Baidu's RLVR Method Boosts Open-Ended Reasoning by 3.29 Points on 14B Model
Baidu researchers developed RLVR, a method that reformulates subjective tasks like writing as verifiable multiple-choice questions for reinforcement learning. This approach improved a 14B reasoning model by an average of 3.29 points across seven open-ended benchmarks compared to standard RLHF.
AGIBOT Launches $536K 'Reasoning to Action' Challenge for Robotics
AGIBOT has announced a $536,000 prize competition targeting the 'Reasoning to Action' problem in robotics. This challenge aims to bridge high-level reasoning with low-level control, a critical hurdle for deploying generalist robots.
MLX Enables Local Grounded Reasoning for Satellite, Security, Robotics AI
Apple's MLX framework is enabling 'local grounded reasoning' for AI applications in satellite imagery, security systems, and robotics, moving complex tasks from the cloud to on-device processing.
Google Gemini Launches Notebooks for AI-Powered Long-Form Reasoning
Google has launched Gemini Notebooks, a persistent workspace for long-form AI reasoning and iterative project development. This feature directly targets the 'second brain' use case for AI assistants.
Massive Video Reasoning Dataset Released, Reportedly 1000x Larger Than Predecessors
An unverified report claims the release of a video reasoning dataset roughly 1000x larger than existing benchmarks. If true, it would be a significant resource for training next-generation video understanding models.
CMU Study: Top LLMs Fail Simple Contradiction Tests, Lack True Reasoning
Carnegie Mellon researchers tested 14 leading LLMs on simple contradiction tasks; all failed consistently, revealing fundamental reasoning gaps despite advanced benchmarks.
Scaling Law Plateau Not Universal: More Tokens Boost Reasoning AI Performance
Empirical evidence indicates the 'second scaling law'—performance gains from increased computation—does not fully plateau for many reasoning tasks. Benchmark results may be artificially limited by token budgets, not model capability.
Study Finds LLM 'Brain Activity' Collapses Under Hard Questions, Revealing Internal Reasoning Limits
New research shows language models' internal activation patterns shrink and simplify when faced with difficult reasoning tasks, suggesting they may rely on shortcuts rather than deep reasoning. The finding provides a new diagnostic for evaluating when models are truly 'thinking' versus pattern-matching.
ViGoR-Bench Exposes 'Logical Desert' in SOTA Visual AI: 20+ Models Fail Physical, Causal Reasoning Tasks
Researchers introduce ViGoR-Bench, a unified benchmark testing visual generative models on physical, causal, and spatial reasoning. It reveals significant deficits in over 20 leading models, challenging the 'performance mirage' of current evaluations.
QuatRoPE: New Positional Embedding Enables Linear-Scale 3D Spatial Reasoning in LLMs, Outperforming Quadratic Methods
Researchers propose QuatRoPE, a novel positional embedding method that encodes 3D object relations with linear input scaling. Paired with IGRE, it improves spatial reasoning in LLMs while preserving their original language capabilities.
LLM Multi-Agent Framework 'Shared Workspace' Proposed to Improve Complex Reasoning via Task Decomposition
A new research paper proposes a multi-agent framework where LLMs split complex reasoning tasks across specialized agents that collaborate via a shared workspace. This approach aims to overcome single-model limitations in planning and tool use.
SIDReasoner: A New Framework for Reasoning-Enhanced Generative Recommendation
Researchers propose SIDReasoner, a two-stage framework that improves LLM-based recommendation by enhancing reasoning over Semantic IDs. It strengthens the alignment between item tokens and language, enabling better interpretability and cross-domain generalization without extensive labeled reasoning data.
Luma Labs Launches Uni-1: An Autoregressive Transformer for Image Generation with a Pre-Generation Reasoning Phase
Luma Labs has released Uni-1, a foundational image model that uses an autoregressive transformer to reason about user intent before generating pixels. It aims to address the 'intent gap' common in diffusion models by adding a structured reasoning step.
New 'Step-by-Step Feedback' Reward Model Trains AI Agents to Fix Reasoning Errors
Researchers introduce a reward model that provides granular, step-by-step feedback to AI agents during training, helping them identify and correct reasoning errors. The approach aims to improve agent performance on complex, multi-step tasks.
Research Suggests Social Reasoning and Logical Thinking Improve AI Agent Team Collaboration
A research paper indicates that incorporating social reasoning and logical thinking capabilities into AI agent teams leads to more effective collaboration. The findings were highlighted in a tweet by AI researcher Rohan Paul.
Reasoning Training Fails to Improve Embedding Quality: Study Finds No Transfer to General Language Understanding
Research shows that training AI models for step-by-step reasoning does not improve their ability to create semantic embeddings for search or general QA. Advanced reasoning models perform identically to base models on standard retrieval benchmarks.
LLMs Score Only 22% Win Rate in Multi-Agent Clue Game, Revealing Deductive Reasoning Gaps
Researchers created a text-based Clue game to test LLM agents' multi-step deductive reasoning. Across 18 games with GPT-4o-mini and Gemini-2.5-Flash agents, only 4 correct wins were achieved, showing fine-tuning on logic puzzles doesn't reliably improve performance.
Fine-Tuning Gemma 3 1B-IT for Financial Reasoning with QLoRA
A technical guide details using QLoRA and reasoning-augmented data to fine-tune Google's Gemma 3 1B-IT model for financial analysis. This demonstrates a method to specialize small language models for complex, domain-specific tasks.
Video Reasoning Models Use Chain-of-Steps in Diffusion Denoising, Not Cross-Frame Analysis
New research reveals video reasoning models don't analyze frames sequentially but instead use a Chain-of-Steps mechanism within diffusion denoising, developing emergent working memory and self-correction.
ReasonGR: A Framework for Multi-Step Semantic Reasoning in Generative Retrieval
Researchers propose ReasonGR, a framework to enhance generative retrieval models' ability to handle complex, numerical queries requiring multi-step reasoning. Tested on financial QA, it improves accuracy for tasks like analyzing reports.
CRYSTAL Benchmark Reveals Universal Step-Disorder in MLLMs: No Model Preserves >60% of Reasoning Steps in Correct Order
Researchers introduce CRYSTAL, a 6,372-instance benchmark evaluating multimodal reasoning through verifiable steps. It reveals systematic failures in 20 tested MLLMs, including universal cherry-picking and disordered reasoning chains.
Anthropic Surpasses Google in Extended Context AI, Redefining Long-Form Reasoning
Anthropic's Claude has reportedly outperformed Google's models in maintaining attention and reasoning across extended contexts, marking a significant shift in the AI landscape where context length has become a critical competitive frontier.