HAVEN Benchmark Exposes MLLM Gap Between Fluency and Video Understanding

HAVEN benchmark tests MLLMs on hierarchical video understanding across frame, shot, and video levels. Results show top models lack grounded multimodal reasoning despite fluent text generation.

arxiv.org/May 21, 2026/3 min read/Widely Reported

video understanding/benchmark/multimodal ai

A digital brain model connected to a glowing memory storage unit, with data streams flowing between the LLM core and…

AI Research

77

Memory as a Model: Augmenting LLMs with Trained Memory

Paper augments LLMs with trained memory for long-term recall. Model-agnostic approach stores external knowledge without retraining.

x.com/May 20, 2026/3 min read

memory/research/llm

An Apple research paper titled 'The Illusion of Thinking' on a desk next to a MacBook, with charts showing…

AI Research

91

Apple Paper Argues LLMs Show 'Illusion of Thinking'

Apple paper argues LLMs show no genuine reasoning, only pattern matching. The critique targets vendor claims but lacks new empirical evidence.

x.com/May 20, 2026/3 min read

llms/apple/benchmarks

A diverse group of religious leaders and tech workers in a meeting room, discussing on a whiteboard with AI symbols…

AI Research

86

Anthropic Study: Model Character Needs Clergy, Not Just Coders

Anthropic's study argues frontier AI needs input from clergy and philosophers, treating model behavior as moral formation. A self-reminder tool lowered misaligned behavior in internal tests.

x.com/May 20, 2026/3 min read

alignment/claude/anthropic

Bar chart titled 'Persuasion Boosts LLM Compliance' showing compliance rates rising from 35% to 51% after applying…

AI Research

85

Persuasion Techniques Boost LLM Compliance from 35% to 51% in PNAS Study

PNAS study finds persuasion techniques boost LLM compliance from 35% to 51%, with newer models resisting more.

x.com/May 19, 2026/3 min read

ai safety/research/llm vulnerabilities

AI Research

85

NanoGPT-Bench: A New Eval for Coding Agents Doing AI Research

IntologyAI released NanoGPT-Bench, an internal eval for coding agents on an AI R&D problem. No results or task specifics have been disclosed.

x.com/May 19, 2026/3 min read

coding agents/benchmarks/ai evaluation

ByteDance's Lance 3B MoE model interface displaying benchmark scores surpassing larger 7B models, with multimodal…

AI Research

90

ByteDance Lance 3B MoE Beats 7B Models on Multimodal Benchmarks

ByteDance released Lance, a 3B multimodal MoE model that beats 7B+ models on benchmarks through multi-task synergy and specialized pathways.

x.com/May 19, 2026/3 min read

bytedance/moe/ai models

A diagram showing a central knowledge graph connecting AI agent memory nodes, with labeled edges linking entities…

AI Research

75

Neo4j's agent-memory: Open-source unified memory for AI agents via knowledge graphs

Neo4j releases agent-memory, an open-source unified memory layer for AI agents using knowledge graphs, enabling persistent structured recall.

x.com/May 19, 2026/3 min read

open source/knowledge graphs/memory

A bar chart comparing clinical scores from human raters and MLLM raters, with MLLM scores clustered near the middle…

AI Research

70

MLLM Raters Show Central Tendency Bias in Clinical Scoring

Study finds GPT-5 and other MLLMs show central tendency bias in clinical scoring, compressing predictions toward scale midpoint despite prompt modifications.

arxiv.org/May 19, 2026/3 min read/Multi-Source

llm evaluation/clinical ai/ai research
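
One way to make the central tendency claim in the card above concrete is to compare how widely model scores spread against human scores on the same items. The sketch below is only an illustration: the function names, the 0-4 scale, and the toy scores are all assumptions, not the study's metric or data.

```python
from statistics import pstdev


def spread_ratio(model_scores, human_scores):
    """Model score spread divided by human score spread.

    A value below 1 means the model's scores are compressed
    relative to the human raters' scores.
    """
    return pstdev(model_scores) / pstdev(human_scores)


def mean_distance_from_midpoint(scores, scale_min, scale_max):
    """Average absolute distance of the scores from the scale midpoint."""
    midpoint = (scale_min + scale_max) / 2
    return sum(abs(s - midpoint) for s in scores) / len(scores)


# Toy values on a hypothetical 0-4 scale, NOT data from the study.
human = [0, 1, 2, 3, 4]
model = [1, 2, 2, 2, 3]

if __name__ == "__main__":
    print("spread ratio:", spread_ratio(model, human))
    print("human distance:", mean_distance_from_midpoint(human, 0, 4))
    print("model distance:", mean_distance_from_midpoint(model, 0, 4))
```

In this toy case the model's scores sit closer to the midpoint and spread less than the human scores, which is the pattern the summary describes.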

Humanoid robot Atlas hoisting a silver mini-fridge above its waist in a cluttered workshop, with cables and…

AI Research

85

Boston Dynamics Atlas Lifts 100-lb Fridge via RL

Boston Dynamics showed Atlas lifting a 100+ lb mini-fridge via RL, moving from locomotion to practical manipulation.

x.com/May 19, 2026/3 min read

robotics/humanoid-robots/reinforcement-learning

Diagram of SenseTime's Flash-Omni model architecture showing pixel and word reasoning paths without a separate…

AI Research

87

SenseTime Open-Sources Omni-Modal Model That Thinks in Pixels and Words

SenseTime open-sourced an omni-modal AI that reasons in pixel-word space without visual encoder or VAE, challenging dominant multimodal architectures.

x.com/May 18, 2026/3 min read

architecture/open-source/vision-language

A line chart titled 'SWE-Bench Verified' shows GPT-5.4 nano scoring 76.4%, matching larger models, with a…

AI Research

85

GPT-5.4 nano + critic loop hits 76.4% on SWE-Bench Verified

GPT-5.4 nano with critic-comparator loop scored 76.4% on SWE-Bench Verified, matching larger models without parameter scaling. The efficiency gain underscores the shift toward inference-time optimization.

x.com/May 18, 2026/3 min read

inference-efficiency/benchmarks/model-optimization

Odyssey AI team members at a launch event, one pointing at a screen displaying Starchild-1 generating a real-time…

AI Research

95

Odyssey Launches Starchild-1, First Real-Time Multimodal World Model

Odyssey AI released Starchild-1, the first real-time multimodal world model for video generation, targeting embodied AI and robotics.

x.com/May 18, 2026/3 min read

world models/video generation/embodied ai

AI agents on computer screens display network maps and code, outperforming human hackers in a cybersecurity…

AI Research

85

Stanford AI Agents Outperform Human Hackers in Penetration Test

Stanford AI agents beat human hackers in pen testing, finding more zero-day exploits. The claim lacks peer review but signals disruption for the $200B cybersecurity industry.

x.com/May 18, 2026/3 min read

research/ai/cybersecurity

A web-based operating system with a taskbar, start menu, and draggable windows on a desktop interface

AI Research

85

Gemini 3.5 Flash Generates Full Web OS in One Shot

Gemini 3.5 Flash generated a full web OS from one prompt in a single HTML file, showcasing one-shot generation of complex UI.

x.com/May 18, 2026/3 min read

gemini/generative ui/ai

A laptop displays a dashboard monitoring AI agent energy usage, with a supervisor interface showing reduced power…

AI Research

85

AgentStop Cuts Local AI Agent Energy by 15-20% With Minimal Performance Loss

AgentStop cuts local AI agent energy by 15-20% with <5% utility loss using token log-probabilities.

arxiv.org/May 18, 2026/3 min read/Widely Reported

energy efficiency/local deployment/ai agents

Microscope image of fluorescently stained cells in a Cell Painting assay, with colorful nuclei, cytoplasm, and…

AI Research

74

MorphoHELM Benchmark Finds Classic CV Beats Deep Learning on Cell Painting

MorphoHELM benchmark from Microsoft evaluates 20+ methods for Cell Painting, finding no deep learning model beats classic CV when batch effects are controlled.

arxiv.org/May 18, 2026/3 min read

drug-discovery/benchmark/computer-vision

OpenBMB's MiniCPM-o 4.5 model interface showing continuous voice and video conversation with Omni-Flow framework

AI Research

88

MiniCPM-o 4.5 Ships Full-Duplex Omni-Modal AI at 9B Parameters

OpenBMB's MiniCPM-o 4.5 is a 9B open model with full-duplex omni-modal interaction, outperforming Qwen3-Omni-30B-A3B and running under 12GB RAM.

x.com/May 17, 2026/3 min read

open-source/voice-ai/multi-modal

A bar chart comparing grep and vector search performance across multiple model-harness pairs, with grep consistently…

AI Research

85

Grep Beats Vector Search in Agent Benchmarks, New Paper Finds

Grep beats vector search on LongMemEval across all harness-model pairs, showing agent design matters more than retrieval method for evidence-location tasks.

x.com/May 17, 2026/3 min read

agents/research/retrieval

A laptop screen displays a CAD software interface with a 3D model of a mechanical part, while a smartphone camera…

AI Research

87

MIT Open-Sources AI That Turns Photos Into Editable CAD Models

MIT open-sourced an AI that turns photos into editable CAD files, threatening $150/hour modeling work. No benchmarks or training details disclosed.

x.com/May 17, 2026/3 min read

3d reconstruction/open-source/ai

Data scientists examining a spreadsheet with benchmark scores, surrounded by data flow diagrams on a whiteboard…

AI Research

85

New Paper Coins 'Curation Debt' — Benchmarks Measure Data Leakage, Not Capability

New paper coins 'curation debt' — benchmarks like MMLU measure data leakage, not capability. Proposes adversarial dynamic benchmarks.

x.com/May 16, 2026/3 min read

benchmarks/evaluation/ai research

A 30B-A3B AI model diagram with gold medal icons for physics and math, displayed on a Hugging Face repository page

AI Research

87

30B-A3B Reasoning Model Hits Gold Medal on Physics, Math Olympiads

30B-A3B reasoning model from @stingning achieves gold-medal level on physics and math Olympiads, released on Hugging Face.

x.com/May 16, 2026/3 min read

open source/reasoning/ai models

A cybersecurity dashboard shows CMU ExploitBench scores with Claude Mythos at 9.9 and GPT-5.5 at 5.5, alongside V8…

AI Research/Breakthrough

100

CMU Benchmark: Claude Mythos Hits 9.9/16 on V8 Exploits, GPT-5.5 Trails at 5.5

CMU's ExploitBench shows Claude Mythos scores 9.9/16 on V8 exploits vs GPT-5.5's 5.5, but costs $36,428 per run — 12x more. The cost-performance tradeoff is the real story.

the-decoder.com/May 16, 2026/3 min read/Widely Reported

ai security/autonomous agents/benchmarks
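
The cost-performance tradeoff in the card above can be worked through with the reported figures. This is a rough sketch that assumes "12x more" means the Claude Mythos run cost 12 times the GPT-5.5 run; the GPT-5.5 cost is derived from that assumption and is not reported in the summary.

```python
# Figures quoted in the summary above.
MYTHOS_SCORE = 9.9        # out of 16
GPT55_SCORE = 5.5         # out of 16
MYTHOS_COST_USD = 36_428  # per run

# Assumption: "12x more" is read as 12 times the GPT-5.5 cost per run.
COST_MULTIPLE = 12
gpt55_cost_usd = MYTHOS_COST_USD / COST_MULTIPLE


def cost_per_point(cost_usd, score):
    """Dollars spent per benchmark point earned."""
    return cost_usd / score


mythos_cpp = cost_per_point(MYTHOS_COST_USD, MYTHOS_SCORE)
gpt55_cpp = cost_per_point(gpt55_cost_usd, GPT55_SCORE)

if __name__ == "__main__":
    print(f"Claude Mythos: ${mythos_cpp:,.0f} per point")
    print(f"GPT-5.5 (derived): ${gpt55_cpp:,.0f} per point")
    print(f"Ratio: {mythos_cpp / gpt55_cpp:.2f}x")
```

Under that reading, Claude Mythos pays roughly 6.7 times as much per benchmark point while scoring 1.8 times higher.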

Researchers test chatbots answering academic questions; a laptop screen shows text with a highlighted warning about…

AI Research

88

Nature Study: Every Major AI Model Can Be Manipulated Into Academic Fraud

Nature study of 13 AI models found all can be manipulated into academic fraud. Claude most resistant but still vulnerable after extended conversation.

x.com/May 15, 2026/3 min read

ai safety/research/academic integrity

A bar chart comparing MCP server performance versus indexed context, showing higher token usage and task failure…

AI Research

88

Glean benchmark: Off-the-shelf MCP costs 30% more tokens than indexed context

Glean benchmark: off-the-shelf MCP in Claude Cowork loses 2.5x more tasks and uses 30% more tokens than indexed context.

x.com/May 15, 2026/3 min read/Multi-Source

claude/mcp/ai engineering

A diagram of the SDAR framework showing a multi-turn LLM agent interacting with an environment, with…

AI Research

85

SDAR: Self-Distilled RL Stabilizes Multi-Turn LLM Agents, +9.4% on ALFWorld

SDAR gates self-distillation within GRPO to stabilize multi-turn LLM agent training, yielding +9.4% on ALFWorld and gains on WebShop and Search-QA across Qwen2.5 and Qwen3 models.

x.com/May 15, 2026/3 min read

research/reinforcement learning/agent training

Bar chart comparing accuracy of centralized training, FedAvg, and FedAvg+QLoRA across four healthcare and finance…

AI Research

88

Federated Fine-Tuning Benchmark Shows QLoRA Nears Centralized Accuracy

Sherpa.ai's arXiv benchmark shows federated fine-tuning with QLoRA matches centralized accuracy on four healthcare and finance datasets, outperforming isolated single-institution learning under non-IID conditions.

arxiv.org/May 15, 2026/3 min read/Widely Reported

research/benchmark/federated learning

Large Hadron Collider tunnel with glowing blue detector components, scientists monitoring control room screens…

AI Research

92

Collider-Bench Tests LLM Agents on LHC Analysis Reproduction

Collider-Bench tests LLM agents on reproducing LHC analyses from papers. No agent beats physicist-in-the-loop, highlighting gaps in scientific reasoning.

arxiv.org/May 15, 2026/3 min read/Widely Reported

benchmarks/ai research/science

A glowing blue digital shield with a mythical winged figure in the center, surrounded by abstract network lines and…

AI Research/Breakthrough

100

Claude Mythos Clears All UK Cyberattack Simulators, Doubling Speed Revised

Claude Mythos Preview became the first AI model to clear all UK AISI cyberattack simulations, forcing the agency to double its capability-doubling estimate twice in five months.

the-decoder.com/May 14, 2026/3 min read/Widely Reported

anthropic/ai safety/cybersecurity

Diagram of Hermes agent's three-tier memory architecture with MEMORY.md and USER.md files as tier 1 core…

AI Research

91

Hermes Agent's Three-Tier Memory Cuts Context Bloat, Keeps 2,200-Char Core

Hermes agent's three-tier memory uses two tiny markdown files (2,200 chars), SQLite FTS5 search (10ms over 10K docs), and 8 pluggable providers. The composition solves the always-on vs. deep recall trade-off.

x.com/May 14, 2026/3 min read/Multi-Source

open source/ai agents/memory systems
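
The pairing of a small always-loaded core with full-text search, as described in the card above, can be sketched with Python's built-in `sqlite3` module and SQLite's FTS5 extension. This is not Hermes's actual code: the table name, function names, and sample notes are hypothetical, and the sketch assumes the local SQLite build includes FTS5.

```python
import sqlite3

# Character budget quoted in the summary; how Hermes splits it
# between MEMORY.md and USER.md is not stated, so it is applied
# to a single combined string here.
CORE_CHAR_BUDGET = 2200


def build_index(docs):
    """Create an in-memory FTS5 index from (title, body) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE notes USING fts5(title, body)")
    conn.executemany("INSERT INTO notes (title, body) VALUES (?, ?)", docs)
    return conn


def recall(conn, query, limit=3):
    """Return titles of the best-matching notes for a plain-word query."""
    rows = conn.execute(
        "SELECT title FROM notes WHERE notes MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in rows]


def build_context(core_text, conn, query):
    """Combine the always-on core with on-demand search results."""
    return {
        "core": core_text[:CORE_CHAR_BUDGET],
        "recalled": recall(conn, query),
    }


if __name__ == "__main__":
    # Hypothetical notes, for illustration only.
    index = build_index([
        ("deploy-notes", "how to deploy the staging server"),
        ("lunch", "favorite lunch spots near the office"),
    ])
    print(build_context("User prefers short answers.", index, "staging"))
```

The core stays small enough to include in every prompt, while the index is consulted only when a query calls for deeper recall.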