LLM
30 articles about LLM in AI news
SMAC-Talk: StarCraft Benchmark Tests LLM Agents Against Deceptive Allies
SMAC-Talk extends StarCraft Multi-Agent Challenge with natural language communication, testing LLM agents against deceptive allies. Qwen3.5 models benchmarked; no model exceeds 72% win rate.
ChatHealthAI: EHR Foundation Model + Frozen LLM Hits 79.8% F1 on Length-of-Stay
ChatHealthAI aligns CLMBR-T-Base with a frozen LLM via a task-aware resampler, achieving 79.8% F1 on EHRSHOT length-of-stay prediction while enabling interpretable reasoning.
New 474-Game Benchmark Reveals LLMs Collapse on Counterfactual Reasoning
New 474-game benchmark reveals LLMs fail on counterfactual reasoning, with larger drops than contextual perturbations. Highlights metacognitive gaps in agentic AI.
Microsoft Markitdown: One-Command File-to-Markdown for LLMs
Microsoft open-sourced Markitdown, a one-command file-to-Markdown converter for LLMs. Feeding models Markdown is meant to improve output quality, since Markdown is well represented in their training data.
Claude.md Hits 152K GitHub Stars; Karpathy Notes LLM Failure Patterns
Claude.md hits 152K GitHub stars. Karpathy notes that LLMs fail in consistent, recurring patterns, which is driving demand for standardized prompt templates.
ModelBest Drops BitCPM-CANN: First 1.58-bit LLM on Ascend 910B
ModelBest released BitCPM-CANN, the first 1.58-bit ternary LLM on Ascend 910B NPUs, using 6× less VRAM than BF16 with minimal capability loss.
Code-as-Agent Harness Thesis: 88.5% Gains Without Touching the LLM
Paper shows an 88.5% improvement from adapting the runtime interface around a frozen LLM. The harness generalizes across 18 backbones, challenging model-centric agent improvement.
Claude Code Ships /workflows, Replaces LLM Orchestrator with Code
Claude Code /workflows replaces LLM orchestrator with code-based control flow, solving the token tax problem from multi-agent context buildup.
HAVEN Benchmark Exposes MLLM Gap Between Fluency and Video Understanding
HAVEN benchmark tests MLLMs on hierarchical video understanding across frame, shot, and video levels. Results show top models lack grounded multimodal reasoning despite fluent text generation.
Memory as a Model: Augmenting LLMs with Trained Memory
Paper augments LLMs with trained memory for long-term recall. Model-agnostic approach stores external knowledge without retraining.
OpenAI Readies General-Purpose LLM With Test-Time Compute Scaling
OpenAI is releasing a general-purpose LLM that improves with test-time compute, per an internal message. The model shows math gains without specialized training.
Apple Paper Argues LLMs Show 'Illusion of Thinking'
Apple paper argues LLMs show no genuine reasoning, only pattern matching. The critique targets vendor claims but lacks new empirical evidence.
train-llm-from-scratch: 1B-Parameter LLM on a Single GPU
train-llm-from-scratch trains billion-parameter LLMs on a single GPU, bringing training runs that cost $10M+ within reach of consumer hardware.
Persuasion Techniques Boost LLM Compliance from 35% to 51% in PNAS Study
PNAS study finds persuasion techniques boost LLM compliance from 35% to 51%, with newer models resisting more.
MLLM Raters Show Central Tendency Bias in Clinical Scoring
Study finds GPT-5 and other MLLMs show central tendency bias in clinical scoring, compressing predictions toward the scale midpoint despite prompt modifications.
LLM-EDT: Dual-Phase Training Boosts Cross-Domain Rec by 12.4%
LLM-EDT improves cross-domain sequential recommendation by up to 12.4% using dual-phase training and LLM-based item generation.
Cascaded LLMs Lift E-Commerce Cart Adds 2.7% in Online Test
A cascaded LLM framework for e-commerce storefront generation lifted cart adds by 2.7% in online tests, using teacher-student fine-tuning to approach closed-weight LLM quality at production latency.
vLLM Optimizations Cut Voice AI Latency by 40% on 6-GPU Cluster
vLLM optimizations on a 6-GPU cluster reduced voice AI latency by 40% for a Qwen-based system, enabling 500 concurrent sessions per node without hardware upgrades.
SDAR: Self-Distilled RL Stabilizes Multi-Turn LLM Agents, +9.4% on ALFWorld
SDAR gates self-distillation within GRPO to stabilize multi-turn LLM agent training, yielding +9.4% on ALFWorld and gains on WebShop and Search-QA across Qwen2.5 and Qwen3 models.
Collider-Bench Tests LLM Agents on LHC Analysis Reproduction
Collider-Bench tests LLM agents on reproducing LHC analyses from papers. No agent beats physicist-in-the-loop, highlighting gaps in scientific reasoning.
VAB Benchmark: Top MLLMs Judge Beauty Correctly Only 26.5% of Time
Frontier MLLMs achieve only 26.5% accuracy on VAB, far below the human score of 68.9%. Fine-tuning bridges the gap.
LLM Pipelines Beat Regex at Invoice Extraction at Scale
LLM pipelines outperform regex for structured extraction from unstructured documents, handling 20+ invoice formats without per-format rule maintenance.
Multi-Agent LLM Systems Fail to Outperform Single Models, Study Finds
New paper finds multi-agent LLM systems underperform single models by 2.3% on reasoning benchmarks, challenging a core assumption in AI engineering.
Pruning LLMs for Edge Triples Bias, Perplexity Hides Damage
Pruning LLMs for edge deployment amplifies bias up to 83.7% while perplexity barely changes, revealing a paradox that undermines standard evaluation practices.
OSA Injects Ordinal Semantics into LLM Recommenders, Beats CF Baselines
OSA injects ordinal semantics into LLM-based recommenders using token embeddings as anchors, outperforming prior CF-LLM methods on pairwise preference evaluation.
SalesSim: LLMs Score Below 79% on Retail Persona Alignment, RL Boosts 13.8%
SalesSim benchmarks MLLMs as retail customers; top models score below 79% on persona alignment. UserGRPO RL boosts alignment by 13.8%.
MM-LLM Framework Boosts Recommendation AUC 0.35%, Online Metrics 0.02%
arXiv paper proposes a LLaMA2-based MM-LLM framework for recommendation, achieving a 0.35% AUC gain and a 0.02% online lift at scale.
RRCM Uses GRPO to Decide When to Retrieve for LLM Recommendation
RRCM uses GRPO to learn when to retrieve evidence for LLM recommendation, outperforming fixed-context baselines.
Two-Tower vs Vector DB + LLM: Which Wins for RecSys at Scale?
Two-tower models offer sub-10ms latency for cold-start; vector DB + LLM provides richer semantics. Hybrid architectures reduce churn by 15-20%.
Claude Code's HTML Output Beats Markdown for LLM-Readable Docs
Claude Code generates HTML docs that LLMs parse more accurately than Markdown, per Thariq's analysis. Trade-off: harder for humans to edit.