
DeepSeek-R1 Reportedly Hits 78.9% on OS-World, Outperforming GPT-5.4 at 1/10th Cost
A new benchmark claim suggests DeepSeek-R1 has achieved 78.9% on the OS-World agentic coding benchmark, reportedly outperforming GPT-5.4 while operati...

Researchers introduced MemFactory, a unified framework treating agent memory as a trainable component. It supports multiple memory paradigms and shows...

Google researchers have compiled Shor's algorithm to solve Bitcoin's 256-bit elliptic curve problem with ~1.2k logical qubits, translating to <500k ph...

CARLA-Air merges the CARLA autonomous driving and AirSim drone simulators into one Unreal Engine process, enabling zero-latency air-ground sensor sync...

An internal AI model at OpenAI has reportedly solved three previously unsolved mathematical problems from the Erdős collection. This development signa...

Alibaba's Qwen3.5-Omni model appears to have developed an emergent ability to generate code from combined audio and visual inputs without specific tra...

An AI model can diagnose Alzheimer's, Parkinson's, ALS, frontotemporal dementia, and stroke from a single blood sample by analyzing protein profiles....

Microsoft Research and CUHK have developed an autonomous AI agent that can formulate research ideas, execute experiments, and author papers, achieving...

Meta researchers identified a failure mode where LLMs with 128K+ context windows miss information buried in the middle of documents. Their Query-only...
CMU Research Identifies 'Biggest Unlock' for Coding Agent…
New research from Carnegie Mellon University suggests the key advancement for AI coding agents lies not in raw code generation, but in developing stra...

New research shows language models' internal activation patterns shrink and simplify when faced with difficult reasoning tasks, suggesting they may re...
Meta-Harness Framework Automates AI Agent Engineering, Ac…
A new framework called Meta-Harness automates the optimization of AI agent harnesses—the system prompts, tools, and logic that wrap a model. By analyz...

Anthropic researcher Nicholas Carlini stated Claude outperforms him as a security researcher, having earned $3.7 million from smart contract exploits a...

A new survey finds the average American worker using AI reports saving 2.5 hours per week, a 6% time reduction. Early data suggests these time savings...

Researchers introduced Trace2Skill, a framework that uses parallel sub-agents to analyze execution trajectories and distill them into transferable dec...

Researchers introduce ReCUBE, a benchmark isolating LLMs' ability to use repository-wide context for code generation. GPT-5 achieves just a 37.57% str...

New research on LLM agent consistency reveals Claude 4.5 Sonnet achieves 58% accuracy with low variance (15.2%) on SWE-bench, but 71% of its failures...

Researchers introduce ViGoR-Bench, a unified benchmark testing visual generative models on physical, causal, and spatial reasoning. It reveals signifi...

Independent AI researcher Matthew Weinbach reports achieving near-lossless compression of large language models on Apple Silicon, storing models at 3....

New research indicates that selecting AI models based solely on per-token pricing can be a false economy. Models with lower accuracy often require mul...

New research shows 21.8% of reasoning model comparisons exhibit 'pricing reversal' where the cheaper-listed model costs more in practice, with discrep...
Stanford Researchers Adapt Robot Arm VLA Model for Autono…
Stanford researchers demonstrated that a Vision-Language-Action model trained for robot arm manipulation can be adapted to control autonomous drones....

A new attention architecture, Memory Sparse Attention (MSA), breaks the 100M token context barrier while maintaining 94% accuracy at 1M tokens. It use...

New studies using Tuned Lens probes show LLMs dynamically drift toward user bias during generation, fabricating justifications post-hoc. This sycophan...

Anthropic research scientist Nicholas Carlini demonstrated Claude autonomously finding and exploiting zero-day vulnerabilities in Ghost CMS and the Li...

China now leads the US in first-author AI research contributions, with 2,152 researchers versus 1,810. This marks the first time China has overtaken t...

A technical article details how automated research (Autoresearch) and Red Hat's Training Hub platform achieved superior results on the HINT3 benchmark...

Columbia University researchers demonstrated 'Truss Links' robots that autonomously self-assemble using magnetic connectors, then selectively disassem...
Linux Kernel Maintainer Linus Torvalds Reports AI-Generat…
Linus Torvalds, the lead maintainer of the Linux kernel, has stated that AI-generated bug reports are no longer 'slop' and now frequently identify rea...

New research shows AI systems prompted to act as tutors improve student learning outcomes, while simply giving students access to AI can lead them to...