llms

30 articles about LLMs in AI news

Memory as a Model: Augmenting LLMs with Trained Memory

Paper augments LLMs with trained memory for long-term recall. Model-agnostic approach stores external knowledge without retraining.

May 20, 2026 · 75% relevant

Apple Paper Argues LLMs Show 'Illusion of Thinking'

Apple paper argues LLMs show no genuine reasoning, only pattern matching. The critique targets vendor claims but lacks new empirical evidence.

May 20, 2026 · 85% relevant

VAB Benchmark: Top MLLMs Judge Beauty Correctly Only 26.5% of Time

Frontier MLLMs achieve only 26.5% accuracy on VAB, far below human 68.9%. Fine-tuning bridges the gap.

May 14, 2026 · 60% relevant

SalesSim: LLMs Score Below 79% on Retail Persona Alignment, RL Boosts 13.8%

SalesSim benchmarks MLLMs as retail customers; top models score below 79% on persona alignment. UserGRPO RL boosts alignment by 13.8%.

May 12, 2026 · 91% relevant

Pruning LLMs for Edge Triples Bias, Perplexity Hides Damage

Pruning LLMs for edge deployment amplifies bias up to 83.7% while perplexity barely changes, revealing a paradox that undermines standard evaluation practices.

May 12, 2026 · 82% relevant

LLMs Fail at Implicit Travel Constraints, New Benchmark Shows

LLMs fail at implicit travel constraints. A new arXiv paper decomposes planning into 5 atomic skills, finding structural biases and ineffective self-correction.

May 7, 2026 · 64% relevant

Microsoft: LLMs Corrupt 25% of Docs in Long Edits

Microsoft paper shows LLMs corrupt ~25% of documents across 52 domains during 20-edit sessions, with failures compounding silently.

Apr 30, 2026 · 90% relevant

LLMs Shrink Neural Activity When Confused, New Paper Shows

LLMs compress neural activity when confused, measurable as a sparsity signal. Paper 2603.03415 proposes using this for adaptive prompting.

Apr 29, 2026 · 87% relevant

AFMRL: Using MLLMs to Generate Attributes for Better Product Retrieval in

AFMRL uses MLLMs to generate product attributes, then uses those attributes to train better multimodal representations for e-commerce retrieval. Achieves SOTA on large-scale datasets.

Apr 23, 2026 · 84% relevant

Columbia Prof: LLMs Can't Generate New Science, Only Map Known Data

Columbia CS Professor Vishal Misra argues LLMs cannot generate new scientific ideas because they learn structured maps of known data and fail outside those boundaries. True discovery requires creating new conceptual maps, a capability current architectures lack.

Apr 21, 2026 · 87% relevant

PRL-Bench: LLMs Score Below 50% on End-to-End Physics Research Tasks

Researchers introduced PRL-Bench, a benchmark built from 100 recent Physical Review Letters papers, testing LLMs on end-to-end physics research. Top models scored below 50%, exposing a significant capability gap for autonomous scientific discovery.

Apr 20, 2026 · 100% relevant

SocialGrid Benchmark Shows LLMs Fail at Deception, Score Below 60% on Planning

Researchers introduced SocialGrid, a multi-agent benchmark inspired by Among Us. It shows state-of-the-art LLMs fail at deception detection and task planning, scoring below 60% accuracy.

Apr 20, 2026 · 100% relevant

KWBench: New Benchmark Tests LLMs' Unprompted Problem Recognition

Researchers introduced KWBench, a 223-task benchmark measuring if LLMs can recognize the governing game-theoretic problem in professional scenarios without being told what to look for. The best-performing model passed only 27.9% of tasks, highlighting a critical gap between task execution and situational understanding.

Apr 20, 2026 · 100% relevant

OpenAI Open-Sources Agents SDK, Supports 100+ LLMs

OpenAI has open-sourced its internal Agents SDK, a lightweight framework for building multi-agent systems. It features three core primitives, works with over 100 LLMs, and quickly gained 18.9k GitHub stars.

Apr 18, 2026 · 95% relevant

Indexing Multimodal LLMs for Large-Scale Image Retrieval

A new arXiv paper proposes using Multimodal LLMs (MLLMs) for instance-level image-to-image retrieval. By prompting models with paired images and converting next-token probabilities into scores, the method enables training-free re-ranking. It is more robust to clutter and occlusion than specialized models, though it struggles with severe appearance changes.

Apr 16, 2026 · 72% relevant

7 Free GitHub Repos for Running LLMs Locally on Laptop Hardware

A developer shared a list of seven key GitHub repositories, including AnythingLLM and llama.cpp, that allow users to run LLMs locally without cloud costs. This reflects the growing trend of efficient, private on-device AI inference.

Apr 12, 2026 · 75% relevant

SauerkrautLM-Doom-MultiVec: 1.3M-Param Model Outperforms LLMs 92,000x Its Size

Researchers built a 1.3M-parameter model that plays DOOM in real-time, scoring 178 frags in 10 episodes. It outperforms LLMs like Nemotron-120B and GPT-4o-mini, which scored only 13 combined, demonstrating the power of small, task-specific architectures.

Apr 10, 2026 · 82% relevant

Microsoft's BitNet Enables 100B-Parameter LLMs on CPU, Cuts Energy 82%

Microsoft Research's BitNet project demonstrates 1-bit LLMs with 100B parameters that run efficiently on CPUs, using 82% less energy while maintaining performance, challenging the need for GPUs in local deployment.

Apr 7, 2026 · 95% relevant

Agent Harness Engineering: The 'OS' That Makes LLMs Useful

A clear analogy frames raw LLMs as CPUs needing an operating system. The agent harness—managing tools, memory, and execution—is what creates useful applications, as proven by LangChain's benchmark jump.

Apr 7, 2026 · 85% relevant

CMU Study: Top LLMs Fail Simple Contradiction Tests, Lack True Reasoning

Carnegie Mellon researchers tested 14 leading LLMs on simple contradiction tasks; all failed consistently, revealing fundamental reasoning gaps despite advanced benchmarks.

Apr 6, 2026 · 89% relevant

Token Warping for MLLMs Outperforms Pixel Methods in View Synthesis

Researchers propose warping image tokens instead of pixels for multi-view reasoning in MLLMs. The zero-shot method is robust to depth noise and outperforms established baselines.

Apr 6, 2026 · 97% relevant

DrugPlayGround Benchmark Tests LLMs on Drug Discovery Tasks

A new framework called DrugPlayGround provides the first standardized benchmark for evaluating large language models on key drug discovery tasks, including predicting drug-protein interactions and chemical properties. This addresses a critical gap in objectively assessing LLMs' potential to accelerate pharmaceutical research.

Apr 6, 2026 · 95% relevant

Anthropic Paper: 'Emotion Concepts and their Function in LLMs' Published

Anthropic has released a new research paper titled 'Emotion Concepts and their Function in LLMs.' The work investigates the role and representation of emotional concepts within large language model architectures.

Apr 5, 2026 · 95% relevant

Paper: LLMs Fail 'Safe' Tests When Prompted to Role-Play as Unethical Characters

A new paper reveals that large language models (LLMs) considered 'safe' on standard benchmarks will readily generate harmful content when prompted to role-play as unethical characters. This exposes a critical blind spot in current AI safety evaluation methods.

Apr 4, 2026 · 85% relevant

New Research: Fine-Tuned LLMs Outperform GPT-5 for Probabilistic Supply Chain Forecasting

Researchers introduced an end-to-end framework that fine-tunes large language models (LLMs) to produce calibrated probabilistic forecasts of supply chain disruptions. The model, trained on realized outcomes, significantly outperforms strong baselines like GPT-5 on accuracy, calibration, and precision. This suggests a pathway for creating domain-specific forecasting models that generate actionable, decision-ready signals.

Apr 3, 2026 · 80% relevant

LLMs Show Weak Agreement with Human Essay Graders, Overvalue Short Essays and Penalize Minor Errors

A new arXiv study finds LLMs like GPT and Llama have weak agreement with human essay scores. They systematically over-score short, underdeveloped essays and under-score longer essays with minor grammatical errors.

Mar 27, 2026 · 77% relevant

QuatRoPE: New Positional Embedding Enables Linear-Scale 3D Spatial Reasoning in LLMs, Outperforming Quadratic Methods

Researchers propose QuatRoPE, a novel positional embedding method that encodes 3D object relations with linear input scaling. Paired with IGRE, it improves spatial reasoning in LLMs while preserving their original language capabilities.

Mar 27, 2026 · 79% relevant

From Token to Item: New Research Proposes Item-Aware Attention to Enhance LLMs for Recommendation

Researchers propose an Item-Aware Attention Mechanism (IAM) that restructures how LLMs process product data for recommendations. It separates attention into intra-item (content) and inter-item (collaborative) layers to better model item-level relationships. This addresses a key limitation in current LLM-based recommenders.

Mar 23, 2026 · 76% relevant

Learning to Disprove: LLMs Fine-Tuned for Formal Counterexample Generation in Lean 4

Researchers propose a method to train LLMs for formal counterexample generation, a neglected skill in mathematical AI. Their symbolic mutation strategy and multi-reward framework improve performance on three new benchmarks.

Mar 23, 2026 · 77% relevant

ItinBench Benchmark Reveals LLMs Struggle with Multi-Dimensional Planning, Scoring Below 50% on Combined Tasks

Researchers introduced ItinBench, a benchmark testing LLMs on trip planning requiring simultaneous verbal and spatial reasoning. Models like GPT-4o and Gemini 1.5 Pro showed inconsistent performance, highlighting a gap in integrated cognitive capabilities.

Mar 23, 2026 · 95% relevant
