llms
30 articles about llms in AI news
Memory as a Model: Augmenting LLMs with Trained Memory
Paper augments LLMs with trained memory for long-term recall. Model-agnostic approach stores external knowledge without retraining.
Apple Paper Argues LLMs Show 'Illusion of Thinking'
Apple paper argues LLMs show no genuine reasoning, only pattern matching. The critique targets vendor claims but lacks new empirical evidence.
VAB Benchmark: Top MLLMs Judge Beauty Correctly Only 26.5% of Time
Frontier MLLMs achieve only 26.5% accuracy on VAB, far below the 68.9% achieved by humans. Fine-tuning bridges the gap.
SalesSim: LLMs Score Below 79% on Retail Persona Alignment, RL Boosts 13.8%
SalesSim benchmarks MLLMs as retail customers; top models score below 79% on persona alignment. UserGRPO RL boosts alignment by 13.8%.
Pruning LLMs for Edge Triples Bias, Perplexity Hides Damage
Pruning LLMs for edge deployment amplifies bias up to 83.7% while perplexity barely changes, revealing a paradox that undermines standard evaluation practices.
LLMs Fail at Implicit Travel Constraints, New Benchmark Shows
LLMs fail at implicit travel constraints. A new arXiv paper decomposes planning into 5 atomic skills and finds structural biases and ineffective self-correction.
Microsoft: LLMs Corrupt 25% of Docs in Long Edits
Microsoft paper shows LLMs corrupt ~25% of documents across 52 domains during 20-edit sessions, with failures compounding silently.
LLMs Shrink Neural Activity When Confused, New Paper Shows
LLMs compress neural activity when confused, measurable as a sparsity signal. Paper 2603.03415 proposes using this for adaptive prompting.
AFMRL: Using MLLMs to Generate Attributes for Better Product Retrieval in
AFMRL uses MLLMs to generate product attributes, then uses those attributes to train better multimodal representations for e-commerce retrieval. It achieves state-of-the-art results on large-scale datasets.
Columbia Prof: LLMs Can't Generate New Science, Only Map Known Data
Columbia CS Professor Vishal Misra argues LLMs cannot generate new scientific ideas because they learn structured maps of known data and fail outside those boundaries. True discovery requires creating new conceptual maps, a capability current architectures lack.
PRL-Bench: LLMs Score Below 50% on End-to-End Physics Research Tasks
Researchers introduced PRL-Bench, a benchmark built from 100 recent Physical Review Letters papers, testing LLMs on end-to-end physics research. Top models scored below 50%, exposing a significant capability gap for autonomous scientific discovery.
SocialGrid Benchmark Shows LLMs Fail at Deception, Score Below 60% on Planning
Researchers introduced SocialGrid, a multi-agent benchmark inspired by Among Us. It shows state-of-the-art LLMs fail at deception detection and task planning, scoring below 60% accuracy.
KWBench: New Benchmark Tests LLMs' Unprompted Problem Recognition
Researchers introduced KWBench, a 223-task benchmark measuring if LLMs can recognize the governing game-theoretic problem in professional scenarios without being told what to look for. The best-performing model passed only 27.9% of tasks, highlighting a critical gap between task execution and situational understanding.
OpenAI Open-Sources Agents SDK, Supports 100+ LLMs
OpenAI has open-sourced its internal Agents SDK, a lightweight framework for building multi-agent systems. It features three core primitives, works with over 100 LLMs, and quickly gained 18.9k GitHub stars.
Indexing Multimodal LLMs for Large-Scale Image Retrieval
A new arXiv paper proposes using Multimodal LLMs (MLLMs) for instance-level image-to-image retrieval. By prompting models with paired images and converting next-token probabilities into scores, the method enables training-free re-ranking. It shows superior robustness to clutter and occlusion compared to specialized models, though it struggles with severe appearance changes.
7 Free GitHub Repos for Running LLMs Locally on Laptop Hardware
A developer shared a list of seven key GitHub repositories, including AnythingLLM and llama.cpp, that allow users to run LLMs locally without cloud costs. This reflects the growing trend of efficient, private on-device AI inference.
SauerkrautLM-Doom-MultiVec: 1.3M-Param Model Outperforms LLMs 92,000x Its Size
Researchers built a 1.3M-parameter model that plays DOOM in real-time, scoring 178 frags in 10 episodes. It outperforms LLMs like Nemotron-120B and GPT-4o-mini, which scored only 13 combined, demonstrating the power of small, task-specific architectures.
Microsoft's BitNet Enables 100B-Parameter LLMs on CPU, Cuts Energy 82%
Microsoft Research's BitNet project demonstrates 1-bit LLMs with 100B parameters that run efficiently on CPUs, using 82% less energy while maintaining performance, challenging the need for GPUs in local deployment.
Agent Harness Engineering: The 'OS' That Makes LLMs Useful
A clear analogy frames raw LLMs as CPUs needing an operating system. The agent harness, which manages tools, memory, and execution, is what creates useful applications, as illustrated by LangChain's benchmark jump.
CMU Study: Top LLMs Fail Simple Contradiction Tests, Lack True Reasoning
Carnegie Mellon researchers tested 14 leading LLMs on simple contradiction tasks; all failed consistently, revealing fundamental reasoning gaps despite advanced benchmarks.
Token Warping for MLLMs Outperforms Pixel Methods in View Synthesis
Researchers propose warping image tokens instead of pixels for multi-view reasoning in MLLMs. The zero-shot method is robust to depth noise and outperforms established baselines.
DrugPlayGround Benchmark Tests LLMs on Drug Discovery Tasks
A new framework called DrugPlayGround provides the first standardized benchmark for evaluating large language models on key drug discovery tasks, including predicting drug-protein interactions and chemical properties. This addresses a critical gap in objectively assessing LLMs' potential to accelerate pharmaceutical research.
Anthropic Paper: 'Emotion Concepts and their Function in LLMs' Published
Anthropic has released a new research paper titled 'Emotion Concepts and their Function in LLMs.' The work investigates the role and representation of emotional concepts within large language model architectures.
Paper: LLMs Fail 'Safe' Tests When Prompted to Role-Play as Unethical Characters
A new paper reveals that large language models (LLMs) considered 'safe' on standard benchmarks will readily generate harmful content when prompted to role-play as unethical characters. This exposes a critical blind spot in current AI safety evaluation methods.
New Research: Fine-Tuned LLMs Outperform GPT-5 for Probabilistic Supply Chain Forecasting
Researchers introduced an end-to-end framework that fine-tunes large language models (LLMs) to produce calibrated probabilistic forecasts of supply chain disruptions. The model, trained on realized outcomes, significantly outperforms strong baselines like GPT-5 on accuracy, calibration, and precision. This suggests a pathway for creating domain-specific forecasting models that generate actionable, decision-ready signals.
LLMs Show Weak Agreement with Human Essay Graders, Overvalue Short Essays and Penalize Minor Errors
A new arXiv study finds LLMs like GPT and Llama have weak agreement with human essay scores. They systematically over-score short, underdeveloped essays and under-score longer essays with minor grammatical errors.
QuatRoPE: New Positional Embedding Enables Linear-Scale 3D Spatial Reasoning in LLMs, Outperforming Quadratic Methods
Researchers propose QuatRoPE, a novel positional embedding method that encodes 3D object relations with linear input scaling. Paired with IGRE, it improves spatial reasoning in LLMs while preserving their original language capabilities.
From Token to Item: New Research Proposes Item-Aware Attention to Enhance LLMs for Recommendation
Researchers propose an Item-Aware Attention Mechanism (IAM) that restructures how LLMs process product data for recommendations. It separates attention into intra-item (content) and inter-item (collaborative) layers to better model item-level relationships. This addresses a key limitation in current LLM-based recommenders.
Learning to Disprove: LLMs Fine-Tuned for Formal Counterexample Generation in Lean 4
Researchers propose a method to train LLMs for formal counterexample generation, a neglected skill in mathematical AI. Their symbolic mutation strategy and multi-reward framework improve performance on three new benchmarks.
ItinBench Benchmark Reveals LLMs Struggle with Multi-Dimensional Planning, Scoring Below 50% on Combined Tasks
Researchers introduced ItinBench, a benchmark testing LLMs on trip planning requiring simultaneous verbal and spatial reasoning. Models like GPT-4o and Gemini 1.5 Pro showed inconsistent performance, highlighting a gap in integrated cognitive capabilities.