llm

30 articles about llm in AI news

BayesBench: LLMs Match Bayesian Posteriors But Fail Downstream Prediction

BayesBench tests 7 LLMs on multi-turn Bayesian reasoning. Scaling improves latent inference but not prediction, exposing a critical gap for agentic deployment.

Jul 1, 202687% relevant

LLMs Spontaneously Develop Human-Like Brain Regions for Language, Math

LLMs spontaneously develop human-like brain regions for language, math, physics, and social reasoning, per @LiorOnAI. Two optimization processes converged on the same solution.

Jun 30, 202685% relevant

3 MCP Gateway Security Gaps LiteLLM's Audit Found (And How to Fix Them in

LiteLLM's audit revealed 3 MCP gateway gaps: fail-open resolver, unpinned servers, opt-in least-privilege. Fix them in Claude Code with version pinning and allowed_tools.

Jun 30, 202685% relevant

FreeLLMAPI Aggregates 1.7B Free Tokens/Month Across 11 Providers

FreeLLMAPI aggregates 11 free LLM providers into one endpoint, offering 1.7B tokens/month with automatic fallover. Reduces friction for side projects but faces provider tolerance risks.

Jun 28, 202675% relevant

LLMs Default to Zod Schemas, Breaking MCPFusion Security Contracts

LLMs default to raw Zod schemas, bypassing MCPFusion's defineModel() and risking data leaks. The Developer Prover enforces MVA architecture via rejection.

Jun 28, 202685% relevant

OpenAI, Broadcom Unveil Jalapeño ASIC for LLM Inference

OpenAI and Broadcom unveiled Jalapeño, a custom ASIC for LLM inference, targeting volume deployment by late 2026. No performance metrics were disclosed.

Jun 24, 2026100% relevant

Miami Startup Claims 12M-Token LLM Inference at $8 vs. $2,600 on Claude

Miami startup claims 12M-token LLM inference for $8 vs. $2,600 on Claude Opus 4.6. No paper or benchmarks released yet.

Jun 21, 202690% relevant

Zalando Introduces MLLM-Based Evaluation for Product Retrieval

Zalando presents a multimodal LLM-based evaluation for product retrieval, aiming to enhance search relevance in e-commerce. This matters as it could set a new standard for assessing AI in retail search.

Jun 21, 202692% relevant

Never Let the LLM Write the Joins

This article details a two-phase text-to-SQL pipeline: Phase A deterministically plans (intent, entity resolution, joins, RBAC) and Phase B executes with bounded LLM calls. The subject graph caches entity mappings lazily, and security is enforced before the model sees any schema.

Jun 16, 202682% relevant

Metric Match Cuts LLM Judge Annotation Cost 32.5% via Subset Selection

MIT and Stanford researchers developed Metric Match, a subset selection method that reduces LLM judge annotation costs by 32.5% and estimation error by 18.7%, achieving a 0.838 win-rate against random selection.

Jun 16, 202670% relevant

Omaha Steaks Shrinks Average Delivery Time to 1.24 Days via Fulfillment

Omaha Steaks cut delivery from 6.2 to 1.24 days via five new fulfillment centers and a UPS Roadie partnership. CEO Nate Rempe says same-day delivery now covers 40-45% of the U.S.

Jun 15, 202674% relevant

General LLMs Beat Clinical AI Tools in Doctor Study

Frontier LLMs beat clinical AI tools like OpenEvidence in all evaluations, matching Google Search AI Overview.

Jun 12, 202683% relevant

Clinical LLM Rejection Predictor Hits AUROC 0.719 in 4.5-Month Study

Clinical LLM rejection predictor achieves AUROC 0.719 in 4.5-month study using deployment-specific context to forecast user rejection before response generation.

Jun 12, 202672% relevant

SVoT Boosts MLLM Spatial Reasoning by 65% via RL-Verified Visual Chains

SVoT uses RL to verify MLLM spatial reasoning states, achieving up to 65% accuracy gains on OOD tests across five domains including Pacman and Gather.

Jun 11, 202688% relevant

UniSound U2 Cuts Token Use 25%, Joins Top Chinese LLM Tier

UniSound's U2 foundation model cuts token consumption by 25% while matching top Chinese LLM performance, entering the top tier with an efficiency-first design.

Jun 9, 202671% relevant

PRS 2026: Netflix Workshop Reveals Industry Shift to LLM-Powered

Netflix's 2026 PRS workshop featured DoorDash, LinkedIn, Pinterest, Google DeepMind, and Stanford, showcasing how LLMs are transforming personalization, recommendation, and search. The event underscored the industry's shift toward integrating large language models into core recommendation pipelines.

Jun 8, 202698% relevant

WorldBench: Top MLLM Scores 64% on Visually Diverse Benchmark

WorldBench, a new multimodal benchmark, tests 15 MLLMs on visually diverse images. Top model scores 64.0%, exposing fundamental gaps in visual understanding.

Jun 8, 202692% relevant

Chinese LLMs Surge on OpenRouter as U.S. AI Traffic Shifts

Chinese LLMs now drive most weekly token growth on OpenRouter, with American startups routing more traffic to them, per @rohanpaul_ai. The shift reflects utility over brand loyalty.

Jun 8, 2026100% relevant

SMAC-Talk: StarCraft Benchmark Tests LLM Agents Against Deceptive Allies

SMAC-Talk extends StarCraft Multi-Agent Challenge with natural language communication, testing LLM agents against deceptive allies. Qwen3.5 models benchmarked; no model exceeds 72% win rate.

Jun 5, 202670% relevant

ChatHealthAI: EHR Foundation Model + Frozen LLM Hits 79.8% F1 on Length-of-Stay

ChatHealthAI aligns CLMBR-T-Base with a frozen LLM via a task-aware resampler, achieving 79.8% F1 on EHRSHOT length-of-stay prediction while enabling interpretable reasoning.

Jun 3, 202692% relevant

New 474-Game Benchmark Reveals LLMs Collapse on Counterfactual Reasoning

New 474-game benchmark reveals LLMs fail on counterfactual reasoning, with larger drops than contextual perturbations. Highlights metacognitive gaps in agentic AI.

Jun 2, 202692% relevant

Microsoft Markitdown: One-Command File-to-Markdown for LLMs

Microsoft open-sourced Markitdown, a one-command file-to-markdown converter for LLMs, improving output quality by leveraging markdown training data.

May 31, 202675% relevant

Claude.md Hits 152K GitHub Stars; Karpathy Notes LLM Failure Patterns

Claude.md hits 152K GitHub stars. Karpathy notes LLMs fail consistently, driving demand for standardized prompt templates.

May 25, 202677% relevant

ModelBest Drops BitCPM-CANN: First 1.58-bit LLM on Ascend 910B

ModelBest released BitCPM-CANN, the first 1.58-bit ternary LLM on Ascend 910B NPUs, using 6× less VRAM than BF16 with minimal capability loss.

May 24, 202687% relevant

Code-as-Agent Harness Thesis: 88.5% Gains Without Touching the LLM

Paper shows 88.5% improvement by adapting runtime interface around frozen LLM. Harness generalizes across 18 backbones, challenging model-centric agent improvement.

May 23, 202684% relevant

Claude Code Ships /workflows, Replaces LLM Orchestrator with Code

Claude Code /workflows replaces LLM orchestrator with code-based control flow, solving the token tax problem from multi-agent context buildup.

May 23, 2026100% relevant

HAVEN Benchmark Exposes MLLM Gap Between Fluency and Video Understanding

HAVEN benchmark tests MLLMs on hierarchical video understanding across frame, shot, and video levels. Results show top models lack grounded multimodal reasoning despite fluent text generation.

May 21, 202685% relevant

Memory as a Model: Augmenting LLMs with Trained Memory

Paper augments LLMs with trained memory for long-term recall. Model-agnostic approach stores external knowledge without retraining.

May 20, 202677% relevant

OpenAI Readies General-Purpose LLM With Test-Time Compute Scaling

OpenAI is releasing a general-purpose LLM that improves with test-time compute, per an internal message. The model shows math gains without specialized training.

May 20, 202685% relevant

Apple Paper Argues LLMs Show 'Illusion of Thinking'

Apple paper argues LLMs show no genuine reasoning, only pattern matching. The critique targets vendor claims but lacks new empirical evidence.

May 20, 202691% relevant

Explore More

AI Agents Large Language Models Claude Code OpenAI RAG MCP Fine-tuning Benchmarks Open Source AI AI Safety