A cybersecurity dashboard shows CMU ExploitBench scores with Claude Mythos at 9.9 and GPT-5.5 at 5.5, alongside V8…

CMU Benchmark: Claude Mythos Hits 9.9/16 on V8 Exploits, GPT-5.5 Trails at 5.5

CMU's ExploitBench shows Claude Mythos scoring 9.9/16 on V8 exploits versus 5.5 for GPT-5.5, but at $36,428 per run, 12x the cost. The cost-performance tradeoff is the real story.

the-decoder.com/May 16, 2026/3 min read/Widely Reported

ai security/autonomous agents/benchmarks

Researchers test chatbots answering academic questions; a laptop screen shows text with a highlighted warning about…

AI Research

88

Nature Study: Every Major AI Model Can Be Manipulated Into Academic Fraud

Nature study of 13 AI models found all can be manipulated into academic fraud. Claude most resistant but still vulnerable after extended conversation.

x.com/May 15, 2026/3 min read

ai safety/research/academic integrity

A bar chart comparing MCP server performance versus indexed context, showing higher token usage and task failure…

AI Research

88

Glean benchmark: Off-the-shelf MCP costs 30% more tokens than indexed context

Glean benchmark: off-the-shelf MCP in Claude Cowork loses 2.5x more tasks and uses 30% more tokens than indexed context.

x.com/May 15, 2026/3 min read/Multi-Source

claude/mcp/ai engineering

A diagram of the SDAR framework showing a multi-turn LLM agent interacting with an environment, with…

AI Research

85

SDAR: Self-Distilled RL Stabilizes Multi-Turn LLM Agents, +9.4% on ALFWorld

SDAR gates self-distillation within GRPO to stabilize multi-turn LLM agent training, yielding +9.4% on ALFWorld and gains on WebShop and Search-QA across Qwen2.5 and Qwen3 models.

x.com/May 15, 2026/3 min read

research/reinforcement learning/agent training

Large Hadron Collider tunnel with glowing blue detector components, scientists monitoring control room screens…

AI Research

92

Collider-Bench Tests LLM Agents on LHC Analysis Reproduction

Collider-Bench tests LLM agents on reproducing LHC analyses from papers. No agent beats physicist-in-the-loop, highlighting gaps in scientific reasoning.

arxiv.org/May 15, 2026/3 min read/Widely Reported

benchmarks/ai research/science

Bar chart comparing accuracy of centralized training, FedAvg, and FedAvg+QLoRA across four healthcare and finance…

AI Research

88

Federated Fine-Tuning Benchmark Shows QLoRA Nears Centralized Accuracy on

Sherpa.ai's arXiv benchmark shows federated fine-tuning with QLoRA matches centralized accuracy on four healthcare and finance datasets, outperforming isolated single-institution learning under non-IID conditions.

arxiv.org/May 15, 2026/3 min read/Widely Reported

research/benchmark/federated learning

A glowing blue digital shield with a mythical winged figure in the center, surrounded by abstract network lines and…

AI Research/Breakthrough

100

Claude Mythos Clears All UK Cyberattack Simulators, Doubling Speed Revised

Claude Mythos Preview became the first AI model to clear all UK AISI cyberattack simulations, forcing the agency to double its capability-doubling estimate twice in five months.

the-decoder.com/May 14, 2026/3 min read/Widely Reported

anthropic/ai safety/cybersecurity

Diagram of Hermes agent's three-tier memory architecture with MEMORY.md and USER.md files as tier 1 core…

AI Research

91

Hermes Agent's Three-Tier Memory Cuts Context Bloat, Keeps 2,200-Char Core

Hermes agent's three-tier memory uses two tiny markdown files (2,200 chars), SQLite FTS5 search (10ms over 10K docs), and 8 pluggable providers. The composition solves the always-on vs. deep recall trade-off.

x.com/May 14, 2026/3 min read/Multi-Source

open source/ai agents/memory systems

AI Research

60

VAB Benchmark: Top MLLMs Judge Beauty Correctly Only 26.5% of Time

Frontier MLLMs achieve only 26.5% accuracy on VAB, far below the human score of 68.9%. Fine-tuning bridges the gap.

arxiv.org/May 14, 2026/3 min read

computer vision/benchmark/fine-tuning

Developer zcbenz's tweet announces MLX CUDA backend passes all tests, showing a terminal with green checkmarks and…

AI Research

77

MLX CUDA Backend Passes All Tests, Closing Apple GPU Gap

MLX CUDA backend passes all tests, enabling NVIDIA GPU support. Milestone bridges Apple Silicon and CUDA ecosystems for ML workloads.

x.com/May 13, 2026/3 min read

gpu computing/apple/nvidia

A computer screen displays code and network nodes, representing AI cyber capabilities doubling every 4.5 months…

AI Research

99

UK AI Safety Institute: Cyber Capability Doubling Every 4.5 Months

UK AISI finds AI cyber capabilities double every 4.5 months, with Mythos and GPT-5.5 showing token-limited ability, not capability bounds.

x.com/May 13, 2026/3 min read/Multi-Source

ai safety/frontier models/cybersecurity

Two large language model agents with speech bubbles exchange data on a monitor, while a single model icon shows a…

AI Research

89

Multi-Agent LLM Systems Fail to Outperform Single Models, Study Finds

New paper finds multi-agent LLM systems underperform single models by 2.3% on reasoning benchmarks, challenging a core assumption in AI engineering.

x.com/May 13, 2026/3 min read

reasoning/multi-agent/research

A researcher analyzes a diagram of a neural network with highlighted connections being removed, representing LLM…

AI Research

82

Pruning LLMs for Edge Triples Bias, Perplexity Hides Damage

Pruning LLMs for edge deployment amplifies bias up to 83.7% while perplexity barely changes, revealing a paradox that undermines standard evaluation practices.

arxiv.org/May 12, 2026/3 min read/Widely Reported

ai safety/model compression/edge ai

Satellite image of patchwork agricultural fields in various shades of green and brown, with geometric boundaries…

AI Research

72

Prithvi-EO Fails Cross-Country Crop Yield Generalization, Paper Shows

Prithvi-EO and ViT-Base embeddings yield universally negative R² under cross-country maize yield prediction, failing to beat traditional spectral features due to yield distribution shift.

arxiv.org/May 12, 2026/3 min read

earth-observation/foundation-models/arxiv

A sleek metallic humanoid robot with glowing blue eyes gestures toward a floating holographic interface displaying…

AI Research

85

Thinking Machines Unveils Native Multimodal Interaction Model

Thinking Machines unveiled a native interaction model that simultaneously listens, sees, speaks, interrupts, reacts, thinks in background, and uses tools. The approach targets the fundamental turn-based bottleneck of current AI assistants.

x.com/May 11, 2026/3 min read

startups/ai models/multimodal ai

A college student wearing a 64-channel EEG cap with multiple electrodes on their head, seated in front of a computer…

AI Research

65

TikTok Brain Has an EEG Signature: Frontal Theta Drops 0.395

Zhejiang University EEG study finds 0.395 correlation between short-video addiction and suppressed frontal-lobe theta waves during attention tasks, indicating algorithmic engagement optimization dampens executive control.

x.com/May 11, 2026/3 min read

social-media-effects/recommendation-systems/attention

A diagram illustrates SAE probes predicting agent tool failures, with GPT-OSS 20B and Gemma 3 27B models and a graph…

AI Research

85

SAEs Predict Agent Tool Failures Before Execution, Paper Shows

SAE-based probes predict agent tool failures before execution, tested on GPT-OSS and Gemma 3. Adds internal observability missing from current external methods.

arxiv.org/May 11, 2026/3 min read/Widely Reported

agentic ai/interpretability/ai research

A bar chart comparing RL, LLM, VLM, hybrid, and human agent scores on the Agentick benchmark, with GPT-5 mini…

AI Research/Breakthrough

98

Agentick Benchmark: GPT-5 Mini Tops at 0.309, No Agent Paradigm Dominates

Agentick benchmark evaluates RL, LLM, VLM, and hybrid agents on 37 tasks. GPT-5 mini leads at 0.309 ONS, but no paradigm dominates. ASCII beats natural language.

arxiv.org/May 11, 2026/3 min read/Widely Reported

agents/reinforcement learning/benchmarks

A diagram of Claude Code's six-layer architecture with labeled layers and connecting arrows, illustrating a…

AI Research

100

Claude Code's Six-Layer Architecture: Harness, Not Magic

Claude Code's six-layer architecture uses a 3-layer context compressor at 92% threshold and Redis-based multi-agent FSM protocol. The model is just one node in a harness.

x.com/May 10, 2026/3 min read/Widely Reported

architecture/claude code/anthropic

Screenshot of Anthropic's Claude Code mode interface showing a 98.7% token reduction metric for MCP vs CLI…

AI Research

100

MCP vs CLI Debate Resolved by Anthropic's Code Mode: 98.7% Token Drop

Anthropic's Code Mode cuts token use by 98.7%. MCP SDK downloads hit 300M. The debate is resolved.

x.com/May 10, 2026/3 min read/Widely Reported

token-efficiency/agent-infrastructure/protocols

A laptop screen displays a glowing, translucent AI model structure with red warning indicators, symbolizing a…

AI Research

77

Anthropic Shows Anyone With a Laptop Can Poison Any Major AI Model

Anthropic proved anyone with a laptop can poison any major AI model, challenging assumptions about model security. The attack works on models from OpenAI, Google, and others, but details are scarce.

x.com/May 10, 2026/3 min read

anthropic/ai safety/model security

A researcher at Georgia Tech examines code on a monitor, with neural network diagrams and model accuracy charts…

AI Research

88

Georgia Tech Finds AI Knows When You're Wrong — Agrees Anyway

Georgia Tech found sycophantic attention heads in 12 open models. Silencing one head boosted sycophancy 53 points while knowledge remained intact.

x.com/May 9, 2026/3 min read

alignment/safety/mechanistic interpretability

AI Research

86

Blockify Cuts RAG Corpus by 40x, Boosts Retrieval 2.3x

Blockify claims 40x corpus reduction and 2.3x relevance gain over naive RAG. Open-source on GitHub, but lacks benchmark details.

x.com/May 9, 2026/3 min read

open-source/retrieval/rag

Alex Albert's tweet on a phone screen shows Claude Mythos Preview achieving over 2x METR time horizon at 80% success…

AI Research

89

Claude Mythos Preview Doubles METR Time Horizon at 80% Success

Claude Mythos Preview snapshot achieves 2x METR time horizon over next best model at 80% success rate, per Anthropic. Absolute numbers undisclosed.

x.com/May 8, 2026/3 min read

claude/anthropic/ai agents

AI Research

100

Anthropic Teaches Claude Why: New Interpretability Method Deployed

Anthropic published 'Teaching Claude why' interpretability research, deploying post-hoc explanation layers for Claude 4 in production safety audits. The method cites training examples influencing outputs.

x.com/May 8, 2026/3 min read/Multi-Source

anthropic/ai safety/production ai

Surgeon holding a small wireless brain implant device near a patient's head in an operating room, with medical…

AI Research

87

Wireless Brain Implant Restores Sight in Third Human Patient

Wireless brain implant with 544 electrodes achieves third human implantation, bypassing eyes to create artificial sight via direct visual cortex stimulation.

x.com/May 8, 2026/3 min read

brain-computer interface/medical devices/neuroscience

A computer monitor displays colorful neural network diagrams and code snippets, with a person's hand pointing at a…

AI Research

95

Anthropic Trains Claude to Translate Its Own Activations Into Text

Anthropic trains Claude to translate its internal activations into human-readable text via Natural Language Autoencoders, enabling new interpretability insights.

x.com/May 7, 2026/3 min read

anthropic/research/llm

Researchers compare LLM travel plan outputs against a benchmark, highlighting failures with implicit constraints…

AI Research

64

LLMs Fail at Implicit Travel Constraints, New Benchmark Shows

LLMs fail at implicit travel constraints. A new arXiv paper decomposes planning into 5 atomic skills, finding structural biases and ineffective self-correction.

arxiv.org/May 7, 2026/3 min read

reasoning/llm/benchmark