caching
30 articles about caching in AI news
Continuous Semantic Caching
Researchers propose a theory-grounded semantic caching system that treats user queries as points in a continuous embedding space, using dynamic ε-net discretization and kernel ridge regression to cut inference costs and latency without switching overhead.
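The core lookup mechanism can be sketched as a fixed-ε simplification of the paper's idea (the actual system uses a dynamic ε-net plus kernel ridge regression; all names here are hypothetical):

```python
import numpy as np

class EpsilonNetCache:
    """Toy semantic cache over a continuous embedding space.

    Cached queries act as epsilon-net centers: a new query whose
    embedding lies within distance eps of an existing center reuses
    that center's stored response; otherwise it becomes a new center.
    """

    def __init__(self, eps: float):
        self.eps = eps
        self.centers: list[np.ndarray] = []
        self.responses: list[str] = []

    def lookup(self, emb: np.ndarray):
        for center, resp in zip(self.centers, self.responses):
            if np.linalg.norm(emb - center) <= self.eps:
                return resp  # cache hit: reuse the stored response
        return None  # cache miss: caller must run full inference

    def insert(self, emb: np.ndarray, response: str) -> None:
        self.centers.append(emb)
        self.responses.append(response)

cache = EpsilonNetCache(eps=0.3)
cache.insert(np.array([1.0, 0.0]), "answer A")
print(cache.lookup(np.array([0.9, 0.1])))  # within eps -> cache hit
print(cache.lookup(np.array([0.0, 1.0])))  # far away -> None, full inference
```

The inference-cost savings come from the hit path: any query embedding close enough to a cached center skips the model call entirely.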
Google's Memory Caching Bridges RNN-Transformer Gap with O(NL) Complexity
Google's 'Memory Caching' method saves RNN memory states at segment boundaries, allowing tokens to reference past checkpoints. This O(NL) approach significantly improves RNN performance on recall tasks, narrowing the gap with Transformers.
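A minimal sketch of the segment-boundary idea, using a toy scalar RNN (all names hypothetical; the real method stores full memory states and lets tokens reference those checkpoints during attention-style recall):

```python
import numpy as np

def rnn_step(h, x, W=0.9, U=0.1):
    # Toy scalar recurrent update standing in for a real RNN cell.
    return np.tanh(W * h + U * x)

def run_with_checkpoints(xs, seg_len):
    """Save the hidden state at every segment boundary so later
    computation can resume from a checkpoint instead of replaying
    the whole prefix."""
    h = 0.0
    checkpoints = {0: h}
    for t, x in enumerate(xs, start=1):
        h = rnn_step(h, x)
        if t % seg_len == 0:
            checkpoints[t] = h  # boundary state cached for reuse
    return h, checkpoints

def recompute_from(checkpoints, xs, t):
    """Recover the state at position t from the nearest earlier
    checkpoint: O(seg_len) work instead of O(t)."""
    start = max(c for c in checkpoints if c <= t)
    h = checkpoints[start]
    for x in xs[start:t]:
        h = rnn_step(h, x)
    return h

xs = list(np.linspace(-1, 1, 20))
final, cps = run_with_checkpoints(xs, seg_len=5)
# Resuming position 13 from the checkpoint at 10 matches a full replay.
assert np.isclose(recompute_from(cps, xs, 13),
                  recompute_from({0: 0.0}, xs, 13))
```

With N tokens and checkpoints every segment, reaching any past state costs at most one segment of recomputation, which is where the O(NL)-style trade-off comes from.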
Semantic Caching: The Key to Affordable, Real-Time AI for Luxury Clienteling
Semantic caching for LLMs reuses responses to similar customer queries, cutting API costs by 20-40% and slashing response times. This makes deploying AI-powered personal assistants and search at scale financially viable for luxury brands.
ML-Master 2.0 Hits 56.44% on MLE-Bench in 24-Hour Agentic Science Run
Researchers from Shanghai Jiao Tong University demonstrated ML-Master 2.0, an autonomous research agent that operated continuously for 24 hours on MLE-Bench, achieving a 56.44% medal rate. The breakthrough centers on Hierarchical Cognitive Caching for state management rather than stronger reasoning, enabling long-horizon scientific workflows.

MLX-VLM Adds Continuous Batching, OpenAI API, and Vision Cache for Apple Silicon
The next release of MLX-VLM will introduce continuous batching, an OpenAI-compatible API, and vision feature caching for multimodal models running locally on Apple Silicon. These optimizations promise up to 228x speedups on cache hits for models like Gemma4.
Helium: A New Framework for Efficient LLM Serving in Agentic Workflows
Researchers introduce Helium, a workflow-aware LLM serving framework that treats agentic workflows as query plans. It uses proactive caching and cache-aware scheduling to reduce redundancy, achieving up to 1.56x speedup over current systems.
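One ingredient of cache-aware scheduling can be sketched as prefix grouping: ordering requests so those sharing a prompt prefix run back-to-back, keeping that prefix warm in the KV cache. This is a hypothetical minimal stand-in, not Helium's actual scheduler:

```python
from collections import defaultdict

def cache_aware_order(requests):
    """Group requests by a shared prompt prefix (here, the first line)
    and emit larger groups first, so consecutive requests reuse the
    same cached prefix instead of evicting each other."""
    groups = defaultdict(list)
    for prompt in requests:
        groups[prompt.split("\n", 1)[0]].append(prompt)
    ordered = []
    for prefix in sorted(groups, key=lambda p: -len(groups[p])):
        ordered.extend(groups[prefix])  # largest groups first
    return ordered

reqs = ["sys-A\nq1", "sys-B\nq1", "sys-A\nq2", "sys-A\nq3"]
print(cache_aware_order(reqs))
# All sys-A requests land adjacently, so their shared prefix stays cached.
```

Real workflow-aware serving goes further, treating the whole agentic workflow as a query plan and prefetching cache entries proactively, but the scheduling intuition is the same.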
Doby Cuts Claude Code Navigation Tokens by 95% with Spec-First Workflow
A spec-first fix workflow slashes navigation tokens by 95% and enforces plan docs as the source of truth before any code changes.
GPT-5.5 Dominates AI Cost-Performance Frontier
OpenAI's GPT-5.5 model family leads the Artificial Analysis Index in cost-performance, signaling a new efficiency standard for AI deployments.
How Andrej Karpathy's CLAUDE.md Guidelines Save Millions of Tokens — and
Andrej Karpathy's CLAUDE.md patterns cut token waste by 40%+. Copy his exact config to slash costs and speed up Claude Code.
VMLOps Publishes NLP Engineer System Design Interview Guide
VMLOps has published 'The NLP Engineer's System Design Interview Guide,' a detailed resource covering architecture, scaling, and trade-offs for real-world NLP systems. It provides a structured framework for both interviewers and candidates.
Moonshot AI's Kimi K2.6 Hits 58.6% on SWE-Bench Pro, Leads Open-Source Coding
Moonshot AI released Kimi K2.6, an open-source coding model achieving 58.6% on SWE-Bench Pro and 54.0% on HLE with tools. This positions it as a top-tier open alternative to proprietary models like Claude 3.5 Sonnet.
Prefill-as-a-Service Paper Claims to Decouple LLM Inference Bottleneck
A research paper proposes a 'Prefill-as-a-Service' architecture to separate the heavy prefill computation from the lighter decoding phase in LLM inference. This could enable new deployment models where resource-constrained devices handle only the decoding step.
How I Built a Production RAG Pipeline for Fintech at 1M+ Daily Transactions
A technical case study from a fintech ML engineer outlines the end-to-end design of a Retrieval-Augmented Generation pipeline built for production at extreme scale, processing over a million daily transactions. It provides a rare, real-world blueprint for building reliable, high-volume AI systems.
WOZCODE Launches Free Claude Code Plugin, Claims 40% Speed Boost
WOZCODE has launched a free plugin for Claude Code, claiming it makes coding sessions 30-40% faster and reduces costs by up to 55%. The plugin is available now.
How One Developer Achieved a 46:1 Context Cache Ratio to Manage 39 Projects
The key takeaway is that maximizing Claude Code's prompt cache through long, context-dense sessions is the most effective way to scale individual productivity across multiple projects.
The Silent Threat to AI Benchmarks: 8 Sources of Eval Contamination
The article warns that subtle data contamination in evaluation pipelines—from benchmark leakage to temporal overlap—can create misleading performance metrics. Identifying these eight leakage sources is essential for trustworthy AI validation.
Why the Best Generative AI Projects Start With the Most Powerful Model —
The article suggests that while initial AI projects leverage the broad capabilities of large foundation models, the most successful implementations eventually transition to smaller, more targeted systems. This reflects a maturation from experimentation to production optimization.
The 270-Second Rule: How to Cut Claude Code API Costs by 90% with Smart
Anthropic's prompt cache has a 5-minute TTL. Orchestrator loops that complete an iteration in under 270 seconds keep the cache warm and pay only ~10% of the full input token cost on cached reads.
Claude Code's Edge: Why Sonnet 4.5 Beats GPT-4o for Multi-File Projects
Claude Code's underlying model excels at understanding existing codebases and maintaining instruction fidelity in long sessions, making it the better choice for complex, multi-file development tasks.
GitHub Launches 'Caveman' Tool, Claims 75% AI Cost Reduction
GitHub has released a new tool named 'Caveman' designed to reduce AI inference costs by up to 75% for developers. The announcement, made via a developer's tweet, suggests a focus on optimizing resource usage for AI-powered applications.
Anthropic Ends Cheap Claude Subscriptions, Moves Businesses to API-Only Pricing
Anthropic has terminated its $20-$200/month Claude subscription plans for businesses, shifting all commercial access to its API pricing. This ends a period of subsidized access and aligns its model with competitors like OpenAI.
How Telemetry Settings Are Silently Costing You Cache Tiers (And How To Fix It)
A confirmed bug links telemetry settings to cache TTL; disabling telemetry defaults you to 5-minute cache, increasing costs. Use environment variables and hooks to mitigate.
Anthropic's Silent Cache TTL Cut
Claude Code's default cache TTL was silently reduced to 5 minutes on April 2, drastically increasing token costs. Use hooks and settings to mitigate the impact.
Claude Opus 4.6 Unlimited Access Deal Sparks Developer Interest
A developer reports finding a deal for unlimited Claude Opus 4.6 usage without rate limits, potentially offering significant cost savings for heavy users compared to Anthropic's official API pricing.
The Hidden Operational Costs of GenAI Products
The article deconstructs the illusion of simplicity in GenAI products, detailing how predictable costs (APIs, compute) are dwarfed by hidden operational expenses for data pipelines, monitoring, and quality assurance. This is a critical financial reality check for any company scaling AI.
Anthropic's Claude Code Boosts @-Mention Speed 3x for Large Enterprise Codebases
Anthropic has released technical details on optimizing the @-mention feature in Claude Code, achieving a 3x speedup for large enterprise codebases. This addresses a critical performance bottleneck for developers working in massive, legacy code repositories.
Graphify: Open-Source Tool Builds Knowledge Graphs from Code & Docs in One Command
A developer shipped Graphify, an open-source tool that builds queryable knowledge graphs from code, docs, and images in one command. It uses a two-pass pipeline with tree-sitter and Claude subagents, achieving 71.5x fewer tokens per query versus reading raw files.
Stanford/MIT Paper: AI Performance Depends on 'Model Harnesses'
A new paper from Stanford and MIT introduces the concept of 'Model Harnesses,' arguing that the wrapper of prompts, tools, and infrastructure around a base model is a primary determinant of real-world AI performance.
Meta-Harness from Stanford/MIT Shows System Code Creates 6x AI Performance Gap
Stanford and MIT researchers show AI performance depends as much on the surrounding system code (the 'harness') as the model itself. Their Meta-Harness framework automatically improves this code, yielding significant gains in reasoning and classification tasks.
Anthropic's Claude Mythos Compute Needs Delay Release, 'Spud' Likely First
Anthropic's leaked internal note reveals its next flagship model, Claude Mythos, is too computationally expensive for general release. The company states it needs to become 'much more efficient,' likely delaying Mythos and prioritizing the 'Spud' model.