Skip to content
gentic.news — AI News Intelligence Platform
Connecting to the Living Graph…

llm

30 articles about llm in AI news

HAVEN Benchmark Exposes MLLM Gap Between Fluency and Video Understanding

HAVEN benchmark tests MLLMs on hierarchical video understanding across frame, shot, and video levels. Results show top models lack grounded multimodal reasoning despite fluent text generation.

85% relevant

Memory as a Model: Augmenting LLMs with Trained Memory

Paper augments LLMs with trained memory for long-term recall. Model-agnostic approach stores external knowledge without retraining.

77% relevant

OpenAI Readies General-Purpose LLM With Test-Time Compute Scaling

OpenAI is releasing a general-purpose LLM that improves with test-time compute, per an internal message. The model shows math gains without specialized training.

85% relevant

Apple Paper Argues LLMs Show 'Illusion of Thinking'

Apple paper argues LLMs show no genuine reasoning, only pattern matching. The critique targets vendor claims but lacks new empirical evidence.

91% relevant

train-llm-from-scratch: 1B-Parameter LLM on a Single GPU

train-llm-from-scratch trains billion-parameter LLMs on a single GPU, cutting costs from $10M+ to consumer hardware.

85% relevant

Persuasion Techniques Boost LLM Compliance from 35% to 51% in PNAS Study

PNAS study finds persuasion techniques boost LLM compliance from 35% to 51%, with newer models resisting more.

85% relevant

MLLM Raters Show Central Tendency Bias in Clinical Scoring

Study finds GPT-5 and other MLLMs show central tendency bias in clinical scoring, compressing predictions toward scale midpoint despite prompt modifications.

70% relevant

LLM-EDT: Dual-Phase Training Boosts Cross-Domain Rec by 12.4%

LLM-EDT improves cross-domain sequential recommendation by up to 12.4% using dual-phase training and LLM-based item generation.

74% relevant

Cascaded LLMs Lift E-Commerce Cart Adds 2.7% in Online Test

A cascaded LLM framework for e-commerce storefront generation lifted cart adds by +2.7% in online tests, using teacher-student fine-tuning to approach closed-weight LLM quality at production latency.

100% relevant

vLLM Optimizations Cut Voice AI Latency by 40% on 6-GPU Cluster

vLLM optimizations on a 6-GPU cluster reduced voice AI latency by 40% for a Qwen-based system, enabling 500 concurrent sessions per node without hardware upgrades.

82% relevant

SDAR: Self-Distilled RL Stabilizes Multi-Turn LLM Agents, +9.4% on ALFWorld

SDAR gates self-distillation within GRPO to stabilize multi-turn LLM agent training, yielding +9.4% on ALFWorld and gains on WebShop and Search-QA across Qwen2.5 and Qwen3 models.

85% relevant

Collider-Bench Tests LLM Agents on LHC Analysis Reproduction

Collider-Bench tests LLM agents on reproducing LHC analyses from papers. No agent beats physicist-in-the-loop, highlighting gaps in scientific reasoning.

92% relevant

VAB Benchmark: Top MLLMs Judge Beauty Correctly Only 26.5% of Time

Frontier MLLMs achieve only 26.5% accuracy on VAB, far below human 68.9%. Fine-tuning bridges the gap.

60% relevant

LLM Pipelines Beat Regex at Invoice Extraction at Scale

LLM pipelines outperform regex for structured extraction from unstructured documents, handling 20+ invoice formats without per-format rule maintenance.

80% relevant

Multi-Agent LLM Systems Fail to Outperform Single Models, Study Finds

New paper finds multi-agent LLM systems underperform single models by 2.3% on reasoning benchmarks, challenging a core assumption in AI engineering.

89% relevant

Pruning LLMs for Edge Triples Bias, Perplexity Hides Damage

Pruning LLMs for edge deployment amplifies bias up to 83.7% while perplexity barely changes, revealing a paradox that undermines standard evaluation practices.

82% relevant

MM-LLM Framework Boosts Recommendation AUC 0.35%, Online Metrics 0.02%

arXiv paper proposes LLaMA2-based MM-LLM framework for recommendation, achieving 0.35% AUC gain and 0.02% online lift at scale.

85% relevant

OSA Injects Ordinal Semantics into LLM Recommenders, Beats CF Baselines

OSA injects ordinal semantics into LLM-based recommenders using token embeddings as anchors, outperforming prior CF-LLM methods on pairwise preference evaluation.

88% relevant

SalesSim: LLMs Score Below 79% on Retail Persona Alignment, RL Boosts 13.8%

SalesSim benchmarks MLLMs as retail customers; top models score below 79% on persona alignment. UserGRPO RL boosts alignment by 13.8%.

91% relevant

RRCM Uses GRPO to Decide When to Retrieve for LLM Recommendation

RRCM uses GRPO to learn when to retrieve evidence for LLM recommendation, outperforming fixed-context baselines.

100% relevant

Two-Tower vs Vector DB + LLM: Which Wins for RecSys at Scale?

Two-tower models offer sub-10ms latency for cold-start; vector DB + LLM provides richer semantics. Hybrid architectures reduce churn by 15-20%.

100% relevant

Claude Code's HTML Output Beats Markdown for LLM-Readable Docs

Claude Code generates HTML docs that LLMs parse more accurately than Markdown, per Thariq's analysis. Trade-off: harder for humans to edit.

92% relevant

LLMs Fail at Implicit Travel Constraints, New Benchmark Shows

LLMs fail at implicit travel constraints, a new arXiv paper decomposes planning into 5 atomic skills, finding structural biases and ineffective self-correction.

64% relevant

Unsloth × NVIDIA Cut LLM Fine-Tuning ~25% — Three Glue-Code Wins on Blackwell

Daniel & Michael Han at Unsloth, in collaboration with NVIDIA, published a joint guide quantifying three glue-code optimizations that combine for ~25% faster LLM training on B200 Blackwell hardware. The wins target overhead around the main kernels — caching packed-sequence metadata, double-buffered gradient checkpoint reloads, and a cheaper GPT-OSS MoE router using argsort + bincount. All three are merged via public PRs.

87% relevant

ARMOR 2025: Military Safety Benchmark Exposes LLM Gaps Across 21 Models

ARMOR 2025 benchmark tests 21 LLMs against military legal doctrines, revealing critical safety gaps that civilian benchmarks miss.

92% relevant

Microsoft: LLMs Corrupt 25% of Docs in Long Edits

Microsoft paper shows LLMs corrupt ~25% of documents across 52 domains during 20-edit sessions, with failures compounding silently.

90% relevant

LLMs Shrink Neural Activity When Confused, New Paper Shows

LLMs compress neural activity when confused, measurable as a sparsity signal. Paper 2603.03415 proposes using this for adaptive prompting.

87% relevant

K-CARE: A New Framework Grounds LLMs in External Knowledge to Fix

K-CARE combines Symmetrical Contextual Anchoring (behavior data) and Analogical Prototype Reasoning (expert examples) to resolve e-commerce search relevance issues that pure LLM reasoning can't fix. Proven in offline and online A/B tests on a leading platform.

94% relevant

Vibe Training: SLM Replaces LLM-as-a-Judge, 8x Faster, 50% Fewer Errors

Plurai introduces 'vibe training,' using adversarial agent swarms to distill a small language model (SLM) for evaluating and guarding production AI agents. The SLM outperforms standard LLM-as-a-judge setups with ~8x faster inference and ~50% fewer evaluation errors.

86% relevant

KARL: RL Framework Cuts LLM Hallucinations Without Accuracy Loss

KARL introduces a reinforcement learning framework that dynamically estimates an LLM's knowledge boundary to reward abstention only when appropriate, achieving a superior accuracy-hallucination trade-off on multiple benchmarks without sacrificing correctness.

76% relevant