caching
30 articles about caching in AI news
Continuous Semantic Caching
Researchers propose a theory-grounded semantic caching system that treats user queries as points in a continuous embedding space, using dynamic ε-net discretization and kernel ridge regression to cut inference costs and latency without switching overhead.
Google's Memory Caching Bridges RNN-Transformer Gap with O(NL) Complexity
Google's 'Memory Caching' method saves RNN memory states at segment boundaries, allowing tokens to reference past checkpoints. This O(NL) approach significantly improves RNN performance on recall tasks, narrowing the gap with Transformers.
Semantic Caching: The Key to Affordable, Real-Time AI for Luxury Clienteling
Semantic caching for LLMs reuses responses to similar customer queries, cutting API costs by 20-40% and slashing response times. This makes deploying AI-powered personal assistants and search at scale financially viable for luxury brands.
RedParrot: Semantic Caching Speeds Up NL-to-DSL for Business Analytics by
Xiaohongshu researchers propose RedParrot, a framework that caches normalized structural patterns of natural language queries to bypass expensive LLM pipelines, achieving 3.6x speedup and 8.26% accuracy improvement on enterprise datasets.
Unsloth × NVIDIA Cut LLM Fine-Tuning ~25% — Three Glue-Code Wins on Blackwell
Daniel & Michael Han at Unsloth, in collaboration with NVIDIA, published a joint guide quantifying three glue-code optimizations that combine for ~25% faster LLM training on B200 Blackwell hardware. The wins target overhead around the main kernels — caching packed-sequence metadata, double-buffered gradient checkpoint reloads, and a cheaper GPT-OSS MoE router using argsort + bincount. All three are merged via public PRs.
ML-Master 2.0 Hits 56.44% on MLE-Bench in 24-Hour Agentic Science Run
Researchers from Shanghai Jiao Tong University demonstrated ML-Master 2.0, an autonomous research agent that operated continuously for 24 hours on MLE-Bench, achieving a 56.44% medal rate. The key contribution is Hierarchical Cognitive Caching, which targets state management rather than reasoning and enables long-horizon scientific workflows.
MLX-VLM Adds Continuous Batching, OpenAI API, and Vision Cache for Apple Silicon
The next release of MLX-VLM will introduce continuous batching, an OpenAI-compatible API, and vision feature caching for multimodal models running locally on Apple Silicon. These optimizations promise up to 228x speedups on cache hits for models like Gemma 4.
Helium: A New Framework for Efficient LLM Serving in Agentic Workflows
Researchers introduce Helium, a workflow-aware LLM serving framework that treats agentic workflows as query plans. It uses proactive caching and cache-aware scheduling to reduce redundancy, achieving up to 1.56x speedup over current systems.
Prism v1.8 Adds CLI, MCP Server, and SDKs — Here's How to Use Them with
Prism v1.8's MCP server gives Claude Code direct control over caches, budgets, and routing. Install it in 2 minutes and ditch the dashboard for terminal-based AI infrastructure management.
OpenAI's ChatGPT 'Dreaming' Memory Retains Preferences Across Sessions
OpenAI launched a dreaming memory system for ChatGPT that retains user preferences across conversations by compressing and replaying session data, enabling persistent personalization.
skillkit: The Per-Project Claude Code Skill Manager That Finally Tames
skillkit gives Claude Code users per-project skill management via a `skills.toml` manifest and `skillkit sync` command, ending the global skill directory chaos.
Claude Code Ships /workflows, Replaces LLM Orchestrator with Code
Claude Code's /workflows replaces the LLM orchestrator with code-based control flow, addressing the token tax caused by multi-agent context buildup.
Median Coding Agent Hits 96k Input Tokens, Rewriting Inference Economics
SemiAnalysis found that the median coding agent request uses 96k input tokens, based on a sample of 432k requests, shifting the focus of inference cost from output to context.
Compute Shortage to Split AI Market: Rich Get Agents, Poor Get Chatbots
Mollick warns that a compute shortage is making agents expensive while chatbots get cheaper, splitting the AI market by company resources.
Composer 2.5 Scores 62 on Coding Index at $0.07 vs. $4-5 for Rivals
Composer 2.5 scores 62 on the coding index at $0.07 per task, versus $4-5 per task for rivals scoring 65-66: roughly 60x cheaper at near-parity performance.
Memory as a Model: Augmenting LLMs with Trained Memory
The paper augments LLMs with a trained memory for long-term recall. The approach is model-agnostic and stores external knowledge without retraining the base model.
Agent4POI: LLM Agents Beat Static Embeddings by 23.2% on POI Rec
Agent4POI achieves a 23.2% relative gain over baselines by generating context-aware POI representations at inference time, suggesting that static embeddings alone are insufficient.
Glean benchmark: Off-the-shelf MCP costs 30% more tokens than indexed context
Glean benchmark: off-the-shelf MCP in Claude Cowork loses 2.5x more tasks and uses 30% more tokens than indexed context.
CLAUDE.md Wastes 7K+ Tokens Per Turn; Skills Cut to 50
A 1,000-line CLAUDE.md burns 7,000-10,000 tokens per turn on instructions the model already knows. Skills using progressive disclosure cut that to ~50 tokens.
Shopify Drops Redis for MySQL in Inventory Reservations, Scales 10x
Shopify replaced Redis with MySQL for inventory reservations, achieving 10x scalability and handling 50,000 writes per second.
mlx-vlm v0.5.0 Adds Continuous Batching, Distributed Inference for Apple Silicon
mlx-vlm v0.5.0 adds continuous batching, speculative decoding, and distributed inference for Apple Silicon. The release supports Qwen3.5, Kimi K2.5, Gemma 4 video, and other new models, with work from 21 contributors.
Claude Code Digest — Apr 28–May 01
CCmeter's cache-busting insights can cut your Claude Code costs by up to 40% instantly.
Doby Cuts Claude Code Navigation Tokens by 95% with Spec-First Workflow
Doby is a spec-first fix workflow that cuts navigation tokens by 95% and enforces plan docs as the source of truth before code changes.
How Andrej Karpathy's CLAUDE.md Guidelines Save Millions of Tokens — and
Andrej Karpathy's CLAUDE.md patterns cut token waste by 40%+. Copy his exact config to slash costs and speed up Claude Code.
VMLOps Publishes NLP Engineer System Design Interview Guide
VMLOps has published 'The NLP Engineer's System Design Interview Guide,' a detailed resource covering architecture, scaling, and trade-offs for real-world NLP systems. It provides a structured framework for both interviewers and candidates.
Moonshot AI's Kimi K2.6 Hits 58.6% on SWE-Bench Pro, Leads Open-Source Coding
Moonshot AI released Kimi K2.6, an open-source coding model achieving 58.6% on SWE-Bench Pro and 54.0% on HLE with tools. This positions it as a top-tier open alternative to proprietary models like Claude 3.5 Sonnet.
Prefill-as-a-Service Paper Claims to Decouple LLM Inference Bottleneck
A research paper proposes a 'Prefill-as-a-Service' architecture to separate the heavy prefill computation from the lighter decoding phase in LLM inference. This could enable new deployment models where resource-constrained devices handle only the decoding step.
How I Built a Production RAG Pipeline for Fintech at 1M+ Daily Transactions
A technical case study from a fintech ML engineer outlines the end-to-end design of a Retrieval-Augmented Generation pipeline built for production at extreme scale, processing over a million daily transactions. It provides a rare, real-world blueprint for building reliable, high-volume AI systems.
WOZCODE Launches Free Claude Code Plugin, Claims 40% Speed Boost
WOZCODE has launched a free plugin for Claude Code, claiming it makes coding sessions 30-40% faster and reduces costs by up to 55%. The plugin is available now.
How One Developer Achieved a 46:1 Context Cache Ratio to Manage 39 Projects
The key takeaway is that maximizing Claude Code's prompt cache through long, context-dense sessions is the most effective way to scale individual productivity across multiple projects.
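Several of the entries above describe semantic caching: reusing a stored response when a new query is close enough to one already answered. The sketch below is only an illustration of that general idea, not the method of any article listed here. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and `SemanticCache`, `answer`, `call_model`, and the 0.8 threshold are all hypothetical names and values chosen for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Stores (embedding, response) pairs and serves the closest match."""

    def __init__(self, threshold: float = 0.8):
        # Assumed similarity cutoff; a real system would tune this value.
        self.threshold = threshold
        self.entries = []

    def get(self, query: str):
        """Return the cached response most similar to the query, or None."""
        q = embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))


def answer(query: str, cache: SemanticCache, call_model):
    """Serve from the cache when possible; otherwise call the model and store."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True
    response = call_model(query)
    cache.put(query, response)
    return response, False
```

In this sketch a rephrased query such as "what is the return policy please" would be served from the entry stored for "what is the return policy", while an unrelated query would fall through to the model. The linear scan over entries is kept for readability; a production system would more likely use a vector index.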