caching

30 articles about caching in AI news

Continuous Semantic Caching

Researchers propose a theory-grounded semantic caching system that treats user queries as points in a continuous embedding space, using dynamic ε-net discretization and kernel ridge regression to cut inference costs and latency without switching overhead.

Apr 24, 2026 · 78% relevant

Google's Memory Caching Bridges RNN-Transformer Gap with O(NL) Complexity

Google's 'Memory Caching' method saves RNN memory states at segment boundaries, allowing tokens to reference past checkpoints. This O(NL) approach significantly improves RNN performance on recall tasks, narrowing the gap with Transformers.

Apr 15, 2026 · 95% relevant

Semantic Caching: The Key to Affordable, Real-Time AI for Luxury Clienteling

Semantic caching for LLMs reuses responses to similar customer queries, cutting API costs by 20-40% and slashing response times. This makes deploying AI-powered personal assistants and search at scale financially viable for luxury brands.

Mar 5, 2026 · 70% relevant
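
As a rough illustration of the idea in the blurb above (reusing a stored response when a new query is similar enough to an earlier one), here is a minimal sketch. It is not taken from the article: the `SemanticCache` class, the `answer` helper, the 0.8 threshold, and the toy bag-of-words `embed` function are all assumptions made for the example, and a real deployment would likely use a learned embedding model and a vector index instead.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding (hypothetical stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    """Stores (embedding, response) pairs and returns the closest match above a threshold."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed value; tune per application
        self.entries = []

    def get(self, query):
        q = embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        # Only reuse a response when the nearest stored query is similar enough.
        return best_response if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((embed(query), response))


def answer(query, cache, call_llm):
    """Return (response, cache_hit); call_llm is only invoked on a miss."""
    cached = cache.get(query)
    if cached is not None:
        return cached, True
    response = call_llm(query)
    cache.put(query, response)
    return response, False
```

Every cache hit skips one model call, which is where the cost and latency savings would come from; the threshold trades hit rate against the risk of returning a response to a query that only looks similar.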

RedParrot: Semantic Caching Speeds Up NL-to-DSL for Business Analytics by

Xiaohongshu researchers propose RedParrot, a framework that caches normalized structural patterns of natural language queries to bypass expensive LLM pipelines, achieving 3.6x speedup and 8.26% accuracy improvement on enterprise datasets.

Apr 28, 2026 · 84% relevant

Unsloth × NVIDIA Cut LLM Fine-Tuning ~25% — Three Glue-Code Wins on Blackwell

Daniel & Michael Han at Unsloth, in collaboration with NVIDIA, published a joint guide quantifying three glue-code optimizations that combine for ~25% faster LLM training on B200 Blackwell hardware. The wins target overhead around the main kernels — caching packed-sequence metadata, double-buffered gradient checkpoint reloads, and a cheaper GPT-OSS MoE router using argsort + bincount. All three are merged via public PRs.

May 6, 2026 · 87% relevant
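
For readers unfamiliar with the argsort + bincount pattern mentioned in the blurb above, the sketch below shows the general idea of grouping tokens by their assigned expert. It is a generic pure-Python illustration, not the Unsloth or NVIDIA code: the helper names (`argsort`, `bincount`, `group_tokens_by_expert`) and the sample assignments are hypothetical, and a real router would run the equivalent tensor operations on the GPU.

```python
def argsort(values):
    """Indices that would sort `values` (stable, so ties keep token order)."""
    return sorted(range(len(values)), key=values.__getitem__)


def bincount(ids, num_bins):
    """Count how many times each id in [0, num_bins) occurs."""
    counts = [0] * num_bins
    for i in ids:
        counts[i] += 1
    return counts


def group_tokens_by_expert(expert_ids, num_experts):
    """Return token order grouped by expert, per-expert counts, and slice offsets."""
    order = argsort(expert_ids)                 # token indices, expert 0 first
    counts = bincount(expert_ids, num_experts)  # tokens routed to each expert
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)         # running sum gives slice bounds
    return order, counts, offsets


# Hypothetical assignments: token i is routed to expert expert_ids[i].
expert_ids = [2, 0, 1, 0, 2, 2]
order, counts, offsets = group_tokens_by_expert(expert_ids, num_experts=4)
# Tokens for expert e are order[offsets[e]:offsets[e + 1]].
```

One sort plus one count pass yields contiguous per-expert slices, so each expert can process its tokens as a single batch.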

ML-Master 2.0 Hits 56.44% on MLE-Bench in 24-Hour Agentic Science Run

Researchers from Shanghai Jiao Tong University demonstrated ML-Master 2.0, an autonomous research agent that ran continuously for 24 hours on MLE-Bench and achieved a 56.44% medal rate. The breakthrough centers on Hierarchical Cognitive Caching for state management rather than on reasoning, which enables long-horizon scientific workflows.

Apr 19, 2026 · 87% relevant

MLX-VLM Adds Continuous Batching, OpenAI API, and Vision Cache for Apple Silicon

The next release of MLX-VLM will introduce continuous batching, an OpenAI-compatible API, and vision feature caching for multimodal models running locally on Apple Silicon. These optimizations promise up to 228x speedups on cache hits for models like Gemma 4.

Apr 16, 2026 · 95% relevant

Helium: A New Framework for Efficient LLM Serving in Agentic Workflows

Researchers introduce Helium, a workflow-aware LLM serving framework that treats agentic workflows as query plans. It uses proactive caching and cache-aware scheduling to reduce redundancy, achieving up to 1.56x speedup over current systems.

Mar 18, 2026 · 74% relevant

Prism v1.8 Adds CLI, MCP Server, and SDKs — Here's How to Use Them with

Prism v1.8's MCP server gives Claude Code direct control over caches, budgets, and routing. Install it in 2 minutes and ditch the dashboard for terminal-based AI infrastructure management.

Jun 7, 2026 · 73% relevant

OpenAI's ChatGPT 'Dreaming' Memory Retains Preferences Across Sessions

OpenAI launched a dreaming memory system for ChatGPT that retains user preferences across conversations by compressing and replaying session data, enabling persistent personalization.

Jun 5, 2026 · 100% relevant

skillkit: The Per-Project Claude Code Skill Manager That Finally Tames

skillkit gives Claude Code users per-project skill management via a `skills.toml` manifest and `skillkit sync` command, ending the global skill directory chaos.

Jun 1, 2026 · 90% relevant

Claude Code Ships /workflows, Replaces LLM Orchestrator with Code

Claude Code's /workflows replaces the LLM orchestrator with code-based control flow, addressing the token tax caused by multi-agent context buildup.

May 23, 2026 · 100% relevant

Median Coding Agent Hits 96k Input Tokens, Rewriting Inference Economics

SemiAnalysis, analyzing 432k requests, found that the median coding agent uses 96k input tokens, shifting the focus of inference cost from output to context.

May 22, 2026 · 95% relevant

Compute Shortage to Split AI Market: Rich Get Agents, Poor Get Chatbots

Mollick warns that a compute shortage is making agents expensive while chatbots get cheaper, splitting the AI market by company resources.

May 21, 2026 · 75% relevant

Composer 2.5 Scores 62 on Coding Index at $0.07 vs. $4-5 for Rivals

Composer 2.5 scores 62 on the coding index at $0.07/task, versus $4-5/task for rivals scoring 65-66. That is roughly 60x cheaper at near-parity performance.

May 21, 2026 · 83% relevant

Memory as a Model: Augmenting LLMs with Trained Memory

Paper augments LLMs with trained memory for long-term recall. Model-agnostic approach stores external knowledge without retraining.

May 20, 2026 · 77% relevant

Agent4POI: LLM Agents Beat Static Embeddings by 23.2% on POI Rec

Agent4POI achieves 23.2% relative gain over baselines by generating context-aware POI representations at inference time, proving static embeddings insufficient.

May 18, 2026 · 76% relevant

Glean benchmark: Off-the-shelf MCP costs 30% more tokens than indexed context

Glean benchmark: off-the-shelf MCP in Claude Cowork loses 2.5x more tasks and uses 30% more tokens than indexed context.

May 15, 2026 · 88% relevant

CLAUDE.md Wastes 7K+ Tokens Per Turn; Skills Cut to 50

A 1,000-line CLAUDE.md burns 7,000-10,000 tokens per turn on instructions the model already knows. Skills using progressive disclosure cut that to ~50 tokens.

May 15, 2026 · 100% relevant

Shopify Drops Redis for MySQL in Inventory Reservations, Scales 10x

Shopify replaced Redis with MySQL for inventory reservations, achieving 10x scalability and handling 50,000 writes per second.

May 12, 2026 · 93% relevant

mlx-vlm v0.5.0 Adds Continuous Batching, Distributed Inference for Apple Silicon

mlx-vlm v0.5.0 adds continuous batching, speculative decoding, and distributed inference for Apple Silicon. The release supports Qwen3.5, Kimi K2.5, Gemma 4 video, and new models with 21 contributors.

May 6, 2026 · 87% relevant

Claude Code Digest — Apr 28–May 01

CCmeter's cache-busting insights can cut your Claude Code costs by up to 40% instantly.

May 1, 2026 · 100% relevant

Doby Cuts Claude Code Navigation Tokens by 95% with Spec-First Workflow

A spec-first fix workflow that cuts navigation tokens by 95% and enforces plan docs as the source of truth before code changes.

Apr 24, 2026 · 100% relevant

How Andrej Karpathy's CLAUDE.md Guidelines Save Millions of Tokens — and

Andrej Karpathy's CLAUDE.md patterns cut token waste by 40%+. Copy his exact config to slash costs and speed up Claude Code.

Apr 23, 2026 · 86% relevant

VMLOps Publishes NLP Engineer System Design Interview Guide

VMLOps has published 'The NLP Engineer's System Design Interview Guide,' a detailed resource covering architecture, scaling, and trade-offs for real-world NLP systems. It provides a structured framework for both interviewers and candidates.

Apr 20, 2026 · 75% relevant

Moonshot AI's Kimi K2.6 Hits 58.6% on SWE-Bench Pro, Leads Open-Source Coding

Moonshot AI released Kimi K2.6, an open-source coding model achieving 58.6% on SWE-Bench Pro and 54.0% on HLE with tools. This positions it as a top-tier open alternative to proprietary models like Claude 3.5 Sonnet.

Apr 20, 2026 · 100% relevant

Prefill-as-a-Service Paper Claims to Decouple LLM Inference Bottleneck

A research paper proposes a 'Prefill-as-a-Service' architecture to separate the heavy prefill computation from the lighter decoding phase in LLM inference. This could enable new deployment models where resource-constrained devices handle only the decoding step.

Apr 20, 2026 · 85% relevant

How I Built a Production RAG Pipeline for Fintech at 1M+ Daily Transactions

A technical case study from a fintech ML engineer outlines the end-to-end design of a Retrieval-Augmented Generation pipeline built for production at extreme scale, processing over a million daily transactions. It provides a rare, real-world blueprint for building reliable, high-volume AI systems.

Apr 18, 2026 · 94% relevant

WOZCODE Launches Free Claude Code Plugin, Claims 40% Speed Boost

WOZCODE has launched a free plugin for Claude Code, claiming it makes coding sessions 30-40% faster and reduces costs by up to 55%. The plugin is available now.

Apr 18, 2026 · 100% relevant

How One Developer Achieved a 46:1 Context Cache Ratio to Manage 39 Projects

The key takeaway is that maximizing Claude Code's prompt cache through long, context-dense sessions is the most effective way to scale individual productivity across multiple projects.

Apr 17, 2026 · 100% relevant
