LLM efficiency

30 articles about LLM efficiency in AI news

ReDiPrune: Training-Free Token Pruning Before Projection Boosts MLLM Efficiency 6x, Gains 2% Accuracy

Researchers propose ReDiPrune, a plug-and-play method that prunes visual tokens before the vision-language projector in multimodal LLMs. On EgoSchema with LLaVA-NeXT-Video-7B, it achieves a +2.0% accuracy gain while reducing computation by over 6× in TFLOPs.

Mar 27, 2026 · 79% relevant

Fractal Emphasizes LLM Inference Efficiency as Generative AI Moves to Production

AI consultancy Fractal highlights the critical shift from generative AI experimentation to production deployment, where inference efficiency—cost, latency, and scalability—becomes the primary business constraint. This marks a maturation phase where operational metrics trump model novelty.

Mar 25, 2026 · 76% relevant

UniSound U2 Cuts Token Use 25%, Joins Top Chinese LLM Tier

UniSound's U2 foundation model cuts token consumption by 25% while matching top Chinese LLM performance, entering the top tier with an efficiency-first design.

Jun 9, 2026 · 71% relevant

ByteDance Seed's Mixture-of-Depths Attention Reaches 97.3% of FlashAttention-2 Efficiency with 3.7% FLOPs Overhead

ByteDance Seed researchers introduced Mixture-of-Depths Attention (MoDA), an attention mechanism that addresses signal degradation in deep LLMs by allowing heads to attend to both current and previous layer KV pairs. The method achieves 97.3% of FlashAttention-2's efficiency while improving downstream performance by 2.11% with only a 3.7% computational overhead.

Mar 21, 2026 · 95% relevant

Zalando to Deploy Up to 50 AI-Powered Nomagic Robots in European Fulfillment Centers

Zalando is scaling its warehouse automation by installing up to 50 AI-powered Nomagic picking robots across European fulfillment centers. This move aims to enhance efficiency and handle complex items, reflecting a major investment in robotic fulfillment for fashion e-commerce.

Mar 19, 2026 · 74% relevant

ToolTree: A New Planning Paradigm for LLM Agents That Could Transform Complex Retail Operations

Researchers propose ToolTree, a Monte Carlo tree search-inspired method for LLM agent tool planning. It uses dual-stage evaluation and bidirectional pruning to improve foresight and efficiency in multi-step tasks, achieving ~10% gains over state-of-the-art methods.

Mar 16, 2026 · 70% relevant

NVIDIA TwoTower: 2.4x Faster LLM Decoding, 98.7% Quality

NVIDIA TwoTower clones a pretrained LLM into a frozen context tower and a trainable denoiser tower, achieving 2.42x faster generation with 98.7% quality on a 30B MoE model.

Jul 11, 2026 · 95% relevant

ZML releases free LLM inference server supporting Nvidia

ZML released LLMD, a free inference server for LLMs supporting Nvidia, AMD, Google TPU, Apple Metal, and Intel Arc, aiming to reduce AI costs and break vendor lock-in.

Jul 8, 2026 · 82% relevant

OpenAI, Broadcom Unveil Jalapeño ASIC for LLM Inference

OpenAI and Broadcom unveiled Jalapeño, a custom ASIC for LLM inference, targeting volume deployment by late 2026. No performance metrics were disclosed.

Jun 24, 2026 · 100% relevant

Omaha Steaks Shrinks Average Delivery Time to 1.24 Days via Fulfillment

Omaha Steaks cut delivery from 6.2 to 1.24 days via five new fulfillment centers and a UPS Roadie partnership. CEO Nate Rempe says same-day delivery now covers 40-45% of the U.S.

Jun 15, 2026 · 74% relevant

PRS 2026: Netflix Workshop Reveals Industry Shift to LLM-Powered

Netflix's 2026 PRS workshop featured DoorDash, LinkedIn, Pinterest, Google DeepMind, and Stanford, showcasing how LLMs are transforming personalization, recommendation, and search. The event underscored the industry's shift toward integrating large language models into core recommendation pipelines.

Jun 8, 2026 · 98% relevant

Chinese LLMs Surge on OpenRouter as U.S. AI Traffic Shifts

Chinese LLMs now drive most weekly token growth on OpenRouter, with American startups routing more traffic to them, per @rohanpaul_ai. The shift reflects utility over brand loyalty.

Jun 8, 2026 · 100% relevant

SMAC-Talk: StarCraft Benchmark Tests LLM Agents Against Deceptive Allies

SMAC-Talk extends the StarCraft Multi-Agent Challenge with natural language communication, testing LLM agents against deceptive allies. Qwen3.5 models were benchmarked, and no model exceeds a 72% win rate.

Jun 5, 2026 · 70% relevant

New 474-Game Benchmark Reveals LLMs Collapse on Counterfactual Reasoning

A new 474-game benchmark reveals that LLMs fail on counterfactual reasoning, with larger performance drops than under contextual perturbations. The results highlight metacognitive gaps in agentic AI.

Jun 2, 2026 · 92% relevant

Unsloth × NVIDIA Cut LLM Fine-Tuning ~25% — Three Glue-Code Wins on Blackwell

Daniel & Michael Han at Unsloth, in collaboration with NVIDIA, published a joint guide quantifying three glue-code optimizations that combine for ~25% faster LLM training on B200 Blackwell hardware. The wins target overhead around the main kernels — caching packed-sequence metadata, double-buffered gradient checkpoint reloads, and a cheaper GPT-OSS MoE router using argsort + bincount. All three are merged via public PRs.

May 6, 2026 · 87% relevant

DigitalOcean's Signal Sampling Finds Top Agent Trajectories Without LLM Cost

DigitalOcean's paper introduces lightweight behavioral signals to rank 80k agent-user trajectories, achieving 82% informativeness in sampled reviews compared to 54% for random sampling, with no LLM overhead.

Apr 25, 2026 · 78% relevant

Claude Code's Secret Efficiency Hack

Claude Code leverages speculative decoding to reduce LLM energy use by 100x. Learn how this built-in optimization makes your coding faster and cheaper.

Apr 23, 2026 · 85% relevant

AFMRL: Using MLLMs to Generate Attributes for Better Product Retrieval in

AFMRL uses MLLMs to generate product attributes, then uses those attributes to train better multimodal representations for e-commerce retrieval. It achieves state-of-the-art results on large-scale datasets.

Apr 23, 2026 · 84% relevant

Prefill-as-a-Service Paper Claims to Decouple LLM Inference Bottleneck

A research paper proposes a 'Prefill-as-a-Service' architecture to separate the heavy prefill computation from the lighter decoding phase in LLM inference. This could enable new deployment models where resource-constrained devices handle only the decoding step.

Apr 20, 2026 · 85% relevant

ByteDance's PersonaVLM Boosts MLLM Personalization by 22.4%, Beats GPT-4o

ByteDance researchers unveiled PersonaVLM, a framework that transforms multimodal LLMs into personalized assistants with memory. It improves baseline performance by 22.4% and surpasses GPT-4o by 5.2% on personalized benchmarks.

Apr 20, 2026 · 97% relevant

Akshay Pachaar Inverts LLM Agent Architecture with 'Harness' Design

AI engineer Akshay Pachaar outlined a novel 'harness' architecture for LLM agents that externalizes intelligence into memory, skills, and protocols. He is building a minimal, didactic open-source implementation of this design.

Apr 18, 2026 · 89% relevant

HUOZIIME: A Research Framework for On-Device LLM-Powered Input Methods

A new research paper introduces HUOZIIME, a personalized on-device input method powered by a lightweight LLM. It uses a hierarchical memory mechanism to capture user-specific input history, enabling privacy-preserving, real-time text generation tailored to individual writing styles.

Apr 17, 2026 · 76% relevant

Bi-Predictability: A New Real-Time Metric for Monitoring LLM

A new arXiv paper introduces 'bi-predictability' (P), an information-theoretic measure, and a lightweight Information Digital Twin (IDT) architecture to monitor the structural integrity of multi-turn LLM conversations in real-time. It detects a 'silent uncoupling' regime where outputs remain semantically sound but the conversational thread degrades, offering a scalable tool for AI assurance.

Apr 16, 2026 · 78% relevant

MiniMax M2.7 Tops Open LLM Leaderboard with 230B Parameter Sparse Model

MiniMax announced that its M2.7 model has taken the top spot on the Hugging Face Open LLM Leaderboard. The model uses a sparse mixture-of-experts architecture with 230B total parameters but activates only 10B per token.

Apr 15, 2026 · 85% relevant

Ollama vs. vLLM vs. llama.cpp

A technical benchmark compares three popular open-source LLM inference servers—Ollama, vLLM, and llama.cpp—under concurrent load. Ollama, despite its ease of use and massive adoption, collapsed at 5 concurrent users, highlighting a critical gap between developer-friendly tools and production-ready systems.

Apr 15, 2026 · 91% relevant

A-R Space Framework Profiles LLM Agent Execution Behavior Across Risk Contexts

Researchers propose the A-R Space, measuring Action Rate and Refusal Signal to profile LLM agent behavior across four risk contexts and three autonomy levels. This provides a deployment-oriented framework for selecting agents based on organizational risk tolerance.

Apr 15, 2026 · 96% relevant

LLM-HYPER: A Training-Free Framework for Cold-Start Ad CTR Prediction

A new arXiv paper introduces LLM-HYPER, a framework that treats large language models as hypernetworks to generate parameters for click-through rate estimators in a training-free manner. It uses multimodal ad content and few-shot prompting to infer feature weights, drastically reducing the cold-start period for new promotional ads. The framework has been deployed on a major U.S. e-commerce platform.

Apr 15, 2026 · 96% relevant

SauerkrautLM-Doom-MultiVec: 1.3M-Param Model Outperforms LLMs 92,000x Its Size

Researchers built a 1.3M-parameter model that plays DOOM in real-time, scoring 178 frags in 10 episodes. It outperforms LLMs like Nemotron-120B and GPT-4o-mini, which scored only 13 combined, demonstrating the power of small, task-specific architectures.

Apr 10, 2026 · 82% relevant

MARS Method Boosts LLM Throughput 1.7x With No Architecture Changes

Researchers introduced MARS, a training-free method that allows autoregressive LLMs to generate multiple tokens per forward pass, boosting throughput by 1.5-1.7x without architectural modifications or accuracy loss.

Apr 9, 2026 · 85% relevant

Target's Tech Blog Teases 'Next-Gen Solution' for Digital Order Fulfillment

Target's internal tech blog has announced work on a next-generation solution for digital order fulfillment, specifically targeting the balance between operational speed and inventory accuracy. This is a core operational challenge for omnichannel retailers.

Apr 8, 2026 · 72% relevant
