cache management
30 articles about cache management in AI news
MLX-VLM Adds Continuous Batching, OpenAI API, and Vision Cache for Apple Silicon
The next release of MLX-VLM will introduce continuous batching, an OpenAI-compatible API, and vision feature caching for multimodal models running locally on Apple Silicon. These optimizations promise up to 228x speedups on cache hits for models like Gemma 4.
Anthropic's Silent Cache TTL Cut
Claude Code's default cache TTL was silently reduced to 5 minutes on April 2, drastically increasing token costs. Use hooks and settings to mitigate the impact.
Microsoft's 'Compress-Thought' Cuts KV Cache 2-3x, Boosts Throughput 2x
A new Microsoft paper shows language models can learn to compress their reasoning steps on-the-fly, cutting memory use 2-3x and doubling throughput. Crucially, 15 percentage points of accuracy come from 'leaked' information that remains in the KV cache after the explicit reasoning is erased.
arXiv Survey Maps KV Cache Optimization Landscape: 5 Strategies for Million-Token LLM Inference
A comprehensive arXiv review categorizes five principal KV cache optimization techniques—eviction, compression, hybrid memory, novel attention, and combinations—to address the linear memory scaling bottleneck in long-context LLM inference. The analysis finds no single dominant solution, with optimal strategy depending on context length, hardware, and workload.
Neural Paging: The Memory Management Breakthrough for Next-Gen AI Agents
Researchers propose Neural Paging, a hierarchical architecture that decouples symbolic reasoning from information management in AI agents. This approach dramatically reduces computational complexity for long-horizon reasoning tasks, moving from quadratic to linear scaling with context window size.
DualPath Architecture Shatters KV-Cache Bottleneck, Doubling LLM Throughput for AI Agents
Researchers have developed DualPath, a novel architecture that eliminates the KV-cache storage bottleneck in agentic LLM inference. By implementing dual-path loading with RDMA transfers, the system achieves nearly 2× throughput improvements for both offline and online scenarios.
Prism v1.8 Adds CLI, MCP Server, and SDKs — Here's How to Use Them with
Prism v1.8's MCP server gives Claude Code direct control over caches, budgets, and routing. Install it in 2 minutes and ditch the dashboard for terminal-based AI infrastructure management.
Claude Code Digest — May 01–May 04
CCmeter's cache-busting insights can slash your Claude Code costs by up to 40% instantly.
Claude Code Digest — Apr 28–May 01
CCmeter's cache-busting insights can cut your Claude Code costs by up to 40% instantly.
ML-Master 2.0 Hits 56.44% on MLE-Bench in 24-Hour Agentic Science Run
Researchers from Shanghai Jiao Tong University demonstrated ML-Master 2.0, an autonomous research agent that operated continuously for 24 hours on MLE-Bench, achieving a 56.44% medal rate. The advance centers on Hierarchical Cognitive Caching, which targets state management rather than reasoning, enabling long-horizon scientific workflows.
Atomic Chat's TurboQuant Enables Gemma 4 Local Inference on 16GB MacBook Air
Atomic Chat's new TurboQuant algorithm aggressively compresses the KV cache, letting models that require 32GB+ of RAM run on a 16GB MacBook Air at 25 tokens/sec, advancing local AI deployment.
Forge: The Open-Source TUI That Turns Claude Code into a Multi-Model Swarm
Forge is a new open-source tool that orchestrates multiple AI coding agents (including Claude Code) using git-native isolation and semantic context management to overcome token limits.
mlx-vlm v0.4.4 Launches with Falcon-Perception 300M, TurboQuant Metal Kernels & 1.9x Decode Speedup
The mlx-vlm library v0.4.4 adds support for TII's Falcon-Perception 300M vision model and introduces TurboQuant Metal kernels, achieving up to 1.9x faster decoding with 89% KV cache savings on Apple Silicon.
Edge Computing in Retail 2026: Examples, Benefits, and a Guide
Shopify outlines the strategic shift toward edge computing in retail, detailing its benefits—real-time personalization, inventory management, and enhanced in-store experiences—and providing a practical implementation guide for 2026.
LeCun's Team Uncovers Hidden Transformer Flaws: How Architectural Artifacts Sabotage AI Efficiency
NYU researchers led by Yann LeCun reveal that Transformer language models contain systematic artifacts—massive activations and attention sinks—that degrade efficiency. These phenomena, stemming from architectural choices rather than fundamental properties, directly impact quantization, pruning, and memory management.
NVIDIA's Memory Compression Breakthrough: How Forgetting Makes LLMs Smarter
NVIDIA researchers have developed Dynamic Memory Sparsification, a technique that compresses LLM working memory by 8× while improving reasoning capabilities. This counterintuitive approach addresses the critical KV cache bottleneck in long-context AI applications.
Claude Code's June 15 Agentic Credit Split: How to Avoid Hitting the $20 Wall
Claude Code's June 15 agentic credit split moves `claude -p` and CI workflows to a separate $20/month bucket on Pro. Upgrade to Max 5x or switch to direct API for production pipelines.
Claude Code's Six-Layer Architecture: Harness, Not Magic
Claude Code's six-layer architecture uses a 3-layer context compressor at a 92% threshold and a Redis-based multi-agent FSM protocol. The model is just one node in a harness.
How Andrej Karpathy's CLAUDE.md Guidelines Save Millions of Tokens — and
Andrej Karpathy's CLAUDE.md patterns cut token waste by 40%+. Copy his exact config to slash costs and speed up Claude Code.
The Claude Code Cheat Sheet You Need: 5 Commands That Save Hours
A comprehensive cheat sheet for Claude Code has been released, compiling critical CLI commands, MCP server setups, and workflow shortcuts to eliminate guesswork and speed up development.
Claude Code Digest — Apr 18–Apr 21
Switch to FastMCP for MCP server builds — eliminate copy-paste workflows in 15 minutes.
VMLOps Publishes NLP Engineer System Design Interview Guide
VMLOps has published 'The NLP Engineer's System Design Interview Guide,' a detailed resource covering architecture, scaling, and trade-offs for real-world NLP systems. It provides a structured framework for both interviewers and candidates.
Google, Marvell in Talks to Co-Develop New AI Chips, Including TPU-Optimized MPU
Google is reportedly in talks with Marvell Technology to co-develop two new AI chips: a memory processing unit (MPU) to pair with TPUs and a new, optimized TPU. This move is a direct effort to bolster Google's custom silicon stack and compete with Nvidia's dominance.
Prefill-as-a-Service Paper Claims to Decouple LLM Inference Bottleneck
A research paper proposes a 'Prefill-as-a-Service' architecture to separate the heavy prefill computation from the lighter decoding phase in LLM inference. This could enable new deployment models where resource-constrained devices handle only the decoding step.
Claude Code Digest — Apr 14–Apr 17
By leveraging the 270-second rule, developers can slash Claude Code API costs by up to 90%.
Developer Swaps Dash Cam Analysis for Gemma 4 & Falcon Perception
A developer announced they are replacing their entire dash cam video analysis system with Google's Gemma 4 and Falcon Perception models, signaling a practical shift towards newer, specialized multimodal models for real-time edge applications.
Ollama vs. vLLM vs. llama.cpp
A technical benchmark compares three popular open-source LLM inference servers—Ollama, vLLM, and llama.cpp—under concurrent load. Ollama, despite its ease of use and massive adoption, collapsed at 5 concurrent users, highlighting a critical gap between developer-friendly tools and production-ready systems.
Claude Code Digest — Apr 11–Apr 14
Bypass Claude Code rate limits for just $2/month with a proxy API and unlock unlimited access.
Claude Managed Agents: How to Build on the Platform Instead of in Its Gaps
Claude Managed Agents turns long-running, stateful agents into an API call. For developers, this means building durable applications on a stable platform, not temporary solutions in its gaps.
Nous Research's Hermes Agent Features Self-Improving Skills, Persistent Memory
A new evaluation of Nous Research's Hermes Agent highlights its self-improving ability to build reusable tools from experience and a smarter persistent memory system that conserves token usage. The agent reportedly improves with continued use, representing a shift towards more adaptive AI systems.