LLM Inference
30 articles about LLM inference in AI news
Prefill-as-a-Service Paper Claims to Decouple LLM Inference Bottleneck
A research paper proposes a 'Prefill-as-a-Service' architecture to separate the heavy prefill computation from the lighter decoding phase in LLM inference. This could enable new deployment models where resource-constrained devices handle only the decoding step.
arXiv Survey Maps KV Cache Optimization Landscape: 5 Strategies for Million-Token LLM Inference
A comprehensive arXiv review categorizes five principal KV cache optimization techniques—eviction, compression, hybrid memory, novel attention, and combinations—to address the linear memory scaling bottleneck in long-context LLM inference. The analysis finds no single dominant solution, with optimal strategy depending on context length, hardware, and workload.
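As a concrete instance of the eviction family the survey covers, below is a minimal sketch of a sliding-window policy that retains a few initial "sink" tokens (in the style of StreamingLLM). The function name, tensor shapes, and window sizes are illustrative, not taken from the survey.

```python
import torch

def evict_kv_sliding_window(keys, values, max_len, sink_tokens=4):
    """Keep the first `sink_tokens` cache entries plus the most recent
    (max_len - sink_tokens) entries of a per-layer KV cache.

    keys, values: [batch, heads, seq_len, head_dim]
    """
    seq_len = keys.shape[2]
    if seq_len <= max_len:
        return keys, values
    recent = max_len - sink_tokens
    keys = torch.cat([keys[:, :, :sink_tokens], keys[:, :, -recent:]], dim=2)
    values = torch.cat([values[:, :, :sink_tokens], values[:, :, -recent:]], dim=2)
    return keys, values
```

This bounds cache memory at `max_len` entries per layer regardless of context length, which is exactly the trade-off the survey weighs against compression and hybrid-memory approaches.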
PayPal Cuts LLM Inference Cost 50% with EAGLE3 Speculative Decoding on H100
PayPal engineers applied EAGLE3 speculative decoding to their fine-tuned 8B-parameter commerce agent, achieving up to 49% higher throughput and 33% lower latency. This allowed a single H100 GPU to match the performance of two H100s running NVIDIA NIM, cutting inference hardware cost by 50%.
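For readers unfamiliar with the mechanism, here is a toy sketch of the draft-and-verify acceptance rule that speculative decoding methods such as EAGLE3 build on. The fixed distributions are illustrative stand-ins (real systems produce fresh distributions per position), and none of this is PayPal's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_target, q_draft, k=4):
    """One draft-and-verify step of vanilla speculative sampling.
    EAGLE3 improves the draft model, but verification works the same way:
    accept a drafted token x with probability min(1, p(x) / q(x)),
    otherwise resample from the residual distribution and stop.
    """
    out = []
    for _ in range(k):
        x = rng.choice(len(q_draft), p=q_draft)      # draft model proposes
        if rng.random() < min(1.0, p_target[x] / q_draft[x]):
            out.append(x)                            # target model accepts
        else:
            residual = np.maximum(p_target - q_draft, 0.0)
            out.append(rng.choice(len(residual), p=residual / residual.sum()))
            break
    return out

# Toy vocabulary of 4 tokens with slightly mismatched distributions.
p = np.array([0.4, 0.3, 0.2, 0.1])   # target model
q = np.array([0.3, 0.3, 0.3, 0.1])   # draft model
print(speculative_step(p, q))
```

The throughput gain comes from verifying several drafted tokens with one target-model forward pass instead of one pass per token.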
Fractal Emphasizes LLM Inference Efficiency as Generative AI Moves to Production
AI consultancy Fractal highlights the critical shift from generative AI experimentation to production deployment, where inference efficiency—cost, latency, and scalability—becomes the primary business constraint. This marks a maturation phase where operational metrics trump model novelty.
FaithSteer-BENCH Reveals Systematic Failure Modes in LLM Inference-Time Steering Methods
Researchers introduce FaithSteer-BENCH, a stress-testing benchmark that exposes systematic failures in LLM steering methods under deployment constraints. The benchmark reveals illusory controllability, capability degradation, and brittleness across multiple models and steering approaches.
Ollama Now Supports Apple MLX Backend for Local LLM Inference on macOS
Ollama, the popular framework for running large language models locally, has added support for Apple's MLX framework as a backend. This enables more efficient execution of models like Llama 3.2 and Mistral on Apple Silicon Macs.
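Regardless of backend, local inference through Ollama is exposed over its HTTP API on port 11434; a minimal call looks like the sketch below (the model tag is illustrative).

```python
import requests

# Ollama serves a local HTTP API on port 11434 by default.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": False},
    timeout=120,
)
print(resp.json()["response"])
```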
Dflash with Continuous Batch Inference Teased for Draft Models
A developer teased the upcoming release of 'Dflash' with continuous batch inference, targeting the text-only draft models currently used in speculative decoding to speed up LLM inference.
Ollama vs. vLLM vs. llama.cpp
A technical benchmark compares three popular open-source LLM inference servers—Ollama, vLLM, and llama.cpp—under concurrent load. Ollama, despite its ease of use and massive adoption, collapsed at 5 concurrent users, highlighting a critical gap between developer-friendly tools and production-ready systems.
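Reproducing this kind of comparison yourself requires only a simple concurrent load test; the sketch below assumes an OpenAI-compatible completions endpoint on localhost, and the URL, model name, and payload are all assumptions.

```python
import asyncio
import time

import httpx

URL = "http://localhost:8000/v1/completions"   # assumed endpoint
PAYLOAD = {"model": "my-model", "prompt": "Hello", "max_tokens": 64}

async def one_request(client: httpx.AsyncClient) -> float:
    t0 = time.perf_counter()
    r = await client.post(URL, json=PAYLOAD, timeout=120.0)
    r.raise_for_status()
    return time.perf_counter() - t0

async def load_test(concurrency: int) -> None:
    async with httpx.AsyncClient() as client:
        latencies = sorted(await asyncio.gather(
            *(one_request(client) for _ in range(concurrency))))
    print(f"{concurrency} concurrent, p50 latency: "
          f"{latencies[len(latencies) // 2]:.2f}s")

asyncio.run(load_test(5))
```

Sweeping `concurrency` upward is enough to surface the collapse behavior the benchmark describes.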
We Hosted a 35B LLM on an NVIDIA DGX Spark — A Technical Post-Mortem
A detailed, practical guide to deploying the Qwen3.5-35B model on NVIDIA's GB10 Blackwell hardware. The article serves as a crucial case study on the real-world challenges and solutions for on-premise LLM inference.
DualPath Architecture Shatters KV-Cache Bottleneck, Doubling LLM Throughput for AI Agents
Researchers have developed DualPath, a novel architecture that eliminates the KV-cache storage bottleneck in agentic LLM inference. By implementing dual-path loading with RDMA transfers, the system achieves nearly 2× throughput improvements for both offline and online scenarios.
MLX-LM v0.9.0 Adds Better Batching, Supports Gemma 4 on Apple Silicon
Apple's MLX-LM framework released version 0.9.0 with enhanced server batching and support for Google's Gemma 4 model, improving local LLM inference efficiency on Apple Silicon. This update addresses a key performance bottleneck for developers running models locally on Mac hardware.
Alibaba's CoPaw: The Open-Source Framework Democratizing Complex AI Agent Development
Alibaba has open-sourced CoPaw, a high-performance personal agent workstation designed to help developers build and scale sophisticated multi-channel AI workflows with persistent memory. This framework addresses the growing complexity of moving beyond simple LLM inference to autonomous agentic systems.
HyEvo Framework Automates Hybrid LLM-Code Workflows, Cuts Inference Cost 19x vs. SOTA
Researchers propose HyEvo, an automated framework that generates agentic workflows combining LLM nodes for reasoning with deterministic code nodes for execution. It reduces inference cost by up to 19x and latency by 16x while outperforming existing methods on reasoning benchmarks.
Meta's Adaptive Ranking Model: A Technical Breakthrough for Efficient LLM-Scale Inference
Meta has developed a novel Adaptive Ranking Model (ARM) architecture designed to drastically reduce the computational cost of serving large-scale ranking models for ads. This represents a core infrastructure breakthrough for deploying LLM-scale models in production at massive scale.
Why Cheaper LLMs Can Cost More: The Hidden Economics of AI Inference in 2026
A Medium article outlines a practical framework for balancing performance, cost, and operational risk in real-world LLM deployment, arguing that focusing solely on model cost can lead to higher total expenses.
TTQ: A New Framework for On-the-Fly Quantization of LLMs at Inference Time
Researchers propose TTQ, a test-time quantization method that compresses large language models dynamically during inference. It uses efficient online calibration to adapt to any prompt, aiming to solve domain-shift issues and accelerate inference without retraining.
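The paper's exact procedure isn't reproduced here, but the core primitive any on-the-fly quantizer needs is a scale computed online rather than from an offline calibration set. Below is a bare-bones round-to-nearest version; the function names and bit-widths are illustrative, and TTQ's actual calibration is more sophisticated.

```python
import torch

def absmax_quantize(w: torch.Tensor, bits: int = 8):
    """Symmetric round-to-nearest quantization with a scale computed
    on the fly, per output row. A generic stand-in for the idea of
    test-time quantization, not TTQ's algorithm.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4, 16)
q, s = absmax_quantize(w)
print((w - dequantize(q, s)).abs().max())   # small reconstruction error
```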
Vibe Training: SLM Replaces LLM-as-a-Judge, 8x Faster, 50% Fewer Errors
Plurai introduces 'vibe training,' using adversarial agent swarms to distill a small language model (SLM) for evaluating and guarding production AI agents. The SLM outperforms standard LLM-as-a-judge setups with ~8x faster inference and ~50% fewer evaluation errors.
LLMAR: A Tuning-Free LLM Framework for Recommendation in Sparse Scenarios
Researchers propose LLMAR, a tuning-free recommendation framework that uses LLM reasoning to infer user 'latent motives' from sparse text-rich data. It outperforms state-of-the-art models in sparse industrial scenarios while keeping inference costs low, offering a practical alternative to costly fine-tuning.
Cognitive Companion Monitors LLM Agent Reasoning with Zero Overhead
A 'Cognitive Companion' architecture uses a logistic regression probe on LLM hidden states to detect when agents loop or drift, reducing failures by over 50% with zero inference overhead.
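The mechanism is straightforward to prototype: train a linear probe on hidden states labeled healthy vs. drifting, then score each agent step with a single dot product. The sketch below uses synthetic vectors; the real hidden states, the layer to probe, and the labeling scheme are all choices the paper makes that aren't reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: hidden states from "healthy" vs "looping" agent steps.
rng = np.random.default_rng(0)
d = 4096                                    # hidden size, illustrative
healthy = rng.normal(0.0, 1.0, (500, d))
looping = rng.normal(0.3, 1.0, (500, d))    # assumed distribution shift
X = np.vstack([healthy, looping])
y = np.array([0] * 500 + [1] * 500)

probe = LogisticRegression(max_iter=1000).fit(X, y)

# At inference, scoring a step's hidden state is one dot product,
# which is why the monitoring adds effectively zero overhead.
p_drift = probe.predict_proba(healthy[:1])[0, 1]
print(f"drift probability: {p_drift:.3f}")
```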
7 Free GitHub Repos for Running LLMs Locally on Laptop Hardware
A developer shared a list of seven key GitHub repositories, including AnythingLLM and llama.cpp, that allow users to run LLMs locally without cloud costs. This reflects the growing trend of efficient, private on-device AI inference.
Meta's QTT Method Fixes Long-Context LLM 'Buried Facts' Problem, Boosts Retrieval Accuracy
Meta researchers identified a failure mode where LLMs with 128K+ context windows miss information buried in the middle of documents. Their Query-only Test-Time Training (QTT) method adapts models at inference, significantly improving retrieval accuracy.
Google's TurboQuant Compresses LLM KV Cache 6x with Zero Accuracy Loss, Cutting GPU Memory by 80%
Google researchers introduced TurboQuant, a method that compresses LLM KV cache from 16-bit to 3-bit precision without accuracy degradation. This reduces GPU memory consumption by over 80% and speeds up inference 8x on H100 GPUs.
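The headline numbers are consistent with simple arithmetic: going from 16-bit to 3-bit keeps 3/16 of the bytes per cached value, roughly an 81% reduction. A back-of-envelope calculation with illustrative model shapes:

```python
# Back-of-envelope KV-cache memory; layer/head counts are illustrative,
# loosely 70B-class, not TurboQuant's reported configuration.
layers, kv_heads, head_dim = 80, 8, 128
ctx, batch = 128_000, 1
bytes_fp16, bits_q = 2, 3

# Factor of 2 covers both keys and values.
kv_fp16 = 2 * layers * kv_heads * head_dim * ctx * batch * bytes_fp16
kv_q3 = kv_fp16 * bits_q / 16

print(f"fp16 KV: {kv_fp16 / 2**30:.1f} GiB, 3-bit: {kv_q3 / 2**30:.1f} GiB "
      f"({1 - bits_q / 16:.0%} smaller)")
```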
LIDS Framework Revolutionizes LLM Summary Evaluation with Statistical Rigor
Researchers introduce LIDS, a novel method combining BERT embeddings, SVD decomposition, and statistical inference to evaluate LLM-generated summaries with unprecedented accuracy and interpretability. The framework provides layered theme analysis with controlled false discovery rates, addressing a critical gap in NLP assessment.
dLLM Framework Unifies Diffusion Language Models, Opening New Frontiers in AI Text Generation
Researchers have introduced dLLM, a unified framework that standardizes training, inference, and evaluation for diffusion language models. This breakthrough enables conversion of existing models like BERT into diffusion architectures and facilitates reproduction of cutting-edge models like LLaDA and Dream.
New AI Benchmark Exposes Critical Gap in Causal Reasoning: Why LLMs Struggle with Real-World Research Design
Researchers have introduced CausalReasoningBenchmark, a novel evaluation framework that separates causal identification from estimation. The benchmark reveals that while LLMs can identify high-level strategies 84% of the time, they correctly specify full research designs only 30% of the time, highlighting a critical bottleneck in automated causal inference.
Plano AI Proxy Promises 50% Cost Reduction by Intelligently Routing LLM Queries
Plano, an open-source AI proxy powered by the 1.5B parameter Arch-Router model, automatically directs prompts to optimal LLMs based on complexity, potentially halving inference costs while adding orchestration and safety layers.
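Plano's routing is learned (the 1.5B Arch-Router model scores each prompt), but the economics are visible even with a crude heuristic router. Everything in the sketch below is a placeholder, not Plano's logic.

```python
# A minimal sketch of complexity-based routing: send short, simple prompts
# to a cheap model and everything else to a strong one. Model names and
# the "hardness" heuristic are hypothetical.
CHEAP, STRONG = "small-model", "large-model"

def route(prompt: str) -> str:
    hard_markers = ("prove", "debug", "multi-step", "plan")
    looks_hard = (len(prompt.split()) > 200
                  or any(m in prompt.lower() for m in hard_markers))
    return STRONG if looks_hard else CHEAP

print(route("Summarize this sentence."))        # -> small-model
print(route("Plan a multi-step migration..."))  # -> large-model
```

If most traffic is simple, routing it to the cheap model is where the claimed cost reduction comes from.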
Two-Tower vs Vector DB + LLM: Which Wins for RecSys at Scale?
Two-tower models offer sub-10ms latency for cold-start; vector DB + LLM provides richer semantics. Hybrid architectures reduce churn by 15-20%.
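The latency gap comes from the serving pattern: item embeddings are precomputed offline, so only the small user tower runs per request and ranking reduces to a dot product. A toy sketch with illustrative dimensions:

```python
import torch

# Two-tower retrieval: user and item encoders map into a shared space.
d, n_items = 64, 10_000
user_tower = torch.nn.Sequential(
    torch.nn.Linear(128, d), torch.nn.ReLU(), torch.nn.Linear(d, d))
item_emb = torch.randn(n_items, d)     # precomputed offline in practice

user_feats = torch.randn(1, 128)
u = user_tower(user_feats)             # one small forward pass per request
scores = u @ item_emb.T                # [1, n_items] dot-product scores
topk = scores.topk(10).indices
print(topk)
```

A vector DB + LLM stack replaces the dot product with retrieval plus an LLM call, buying semantics at the cost of latency, hence the hybrid designs.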
LLMs Fail at Implicit Travel Constraints, New Benchmark Shows
A new arXiv paper decomposes travel planning into five atomic skills, finding that LLMs miss implicit constraints, exhibit structural biases, and self-correct ineffectively.
mlx-vlm v0.5.0 Adds Continuous Batching, Distributed Inference for Apple Silicon
mlx-vlm v0.5.0 adds continuous batching, speculative decoding, and distributed inference for Apple Silicon. The release, built by 21 contributors, adds support for Qwen3.5, Kimi K2.5, Gemma 4 video, and other new models.
LLMs Shrink Neural Activity When Confused, New Paper Shows
LLMs compress their internal neural activity when confused, an effect measurable as a sparsity signal. Paper 2603.03415 proposes using this signal to drive adaptive prompting.
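As a concrete reading of the claim, a sparsity signal can be as simple as the fraction of near-zero entries in a layer's hidden state; the threshold and function below are illustrative, not the paper's metric.

```python
import torch

def activation_sparsity(hidden: torch.Tensor, eps: float = 1e-3) -> float:
    """Fraction of near-zero activations; higher values would indicate
    the compressed activity the paper associates with model confusion."""
    return (hidden.abs() < eps).float().mean().item()

print(activation_sparsity(torch.randn(1, 4096)))
```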