prompt caching

30 articles about prompt caching in AI news

Stop Prompting, Start System Building

Move from prompting to system-building with Claude Code. Use CLAUDE.md, MCP servers, and plan mode to create an agentic coding system that learns your codebase and automates workflows.

Jul 18, 2026 · 80% relevant

3 Official System Prompts That Stop Claude Code From Hallucinating APIs

Anthropic's official documentation reveals three system prompt instructions that dramatically reduce hallucinations when Claude Code researches APIs or libraries.

Mar 21, 2026 · 84% relevant

Semantic Caching: The Key to Affordable, Real-Time AI for Luxury Clienteling

Semantic caching for LLMs reuses responses to similar customer queries, cutting API costs by 20-40% and slashing response times. This makes deploying AI-powered personal assistants and search at scale financially viable for luxury brands.

Mar 5, 2026 · 70% relevant

RedParrot: Semantic Caching Speeds Up NL-to-DSL for Business Analytics by

Xiaohongshu researchers propose RedParrot, a framework that caches normalized structural patterns of natural language queries to bypass expensive LLM pipelines, achieving 3.6x speedup and 8.26% accuracy improvement on enterprise datasets.

Apr 28, 2026 · 84% relevant

7 AI Agent Cost Optimization Strategies That Cut LLM Bills by Up to 90%

The source outlines seven cost optimization strategies for AI agents, including prompt compression and model routing, that can reduce LLM bills by up to 90%. This matters for retail and luxury brands deploying AI at scale where inference costs can become prohibitive.

Jul 19, 2026 · 69% relevant

ML-Master 2.0 Hits 56.44% on MLE-Bench in 24-Hour Agentic Science Run

Researchers from Shanghai Jiao Tong University demonstrated ML-Master 2.0, an autonomous research agent that operated continuously for 24 hours on the MLE-Bench, achieving a 56.44% medal rate. The breakthrough centers on Hierarchical Cognitive Caching for state management, not reasoning, enabling long-horizon scientific workflows.

Apr 19, 2026 · 87% relevant

How One Developer Achieved a 46:1 Context Cache Ratio to Manage 39 Projects

The key takeaway is that maximizing Claude Code's prompt cache through long, context-dense sessions is the most effective way to scale individual productivity across multiple projects.

Apr 17, 2026 · 100% relevant

The 270-Second Rule: How to Cut Claude Code API Costs by 90% with Smart

Anthropic's prompt cache has a 5-minute TTL. Orchestrator loops running faster than 270 seconds pay ~10% of full input token costs.

Apr 16, 2026 · 100% relevant

Stanford/MIT Paper: AI Performance Depends on 'Model Harnesses'

A new paper from Stanford and MIT introduces the concept of 'Model Harnesses,' arguing that the wrapper of prompts, tools, and infrastructure around a base model is a primary determinant of real-world AI performance.

Apr 7, 2026 · 85% relevant

Google DeepMind Unveils Gemini-Powered Browser That Generates Websites in Real-Time

Google DeepMind has demonstrated a browser prototype powered by Gemini 3.1 Flash-Lite that generates complete HTML/CSS websites dynamically based on user prompts and navigation context, shifting from static page retrieval to on-demand interface generation.

Mar 25, 2026 · 95% relevant

Helium: A New Framework for Efficient LLM Serving in Agentic Workflows

Researchers introduce Helium, a workflow-aware LLM serving framework that treats agentic workflows as query plans. It uses proactive caching and cache-aware scheduling to reduce redundancy, achieving up to 1.56x speedup over current systems.

Mar 18, 2026 · 74% relevant

Anthropic Ships Claude Opus 5: Fable-Level Intelligence at Half the Price

Anthropic released Claude Opus 5 on July 24 with a 1M token context, 128k output, and Fable-5-approaching intelligence at half the price, unchanged from Opus 4.8.

Jul 26, 2026 · 100% relevant

CacheBlend: 2-4x Faster KV Cache for Multi-Doc Queries

CacheBlend reuses per-document KV caches by recomputing only boundary tokens, achieving 2-4x speedups on multi-document queries. Alibaba data shows 10% of blocks serve 77% of hits.

Jul 20, 2026 · 92% relevant

OpenAI GPT-5.6 Sol matches Fable 5 at 1/3 cost, adds multi-agent API

OpenAI's GPT-5.6 Sol nearly matches Claude Fable 5 on aggregate benchmarks at one-third the cost, with new multi-agent and tool-calling APIs.

Jul 10, 2026 · 95% relevant

Build an Adversarial Verifier Loop in Claude Code: Catch Bugs Before They Land

Stop trusting Claude Code's self-reports. Add a 3-verifier panel that refutes changes with concrete repro cases, catching bugs tests miss. Capped at 3 rounds.

Jul 9, 2026 · 78% relevant

MCP Cuts Token Costs 75% But Adds 30x Latency vs REST APIs

MCP cuts token costs by 75% but adds 30x latency versus REST. The protocol, backed by Anthropic and OpenAI, trades speed for dynamic tool discovery.

Jul 8, 2026 · 85% relevant

How Simon Willison Ported a 0.2B Image Model to the Browser with Claude

Simon Willison used Claude Code to port a 0.2B image inpainting model to WebGPU, running it as a parallel side project while his main agent worked on Datasette. The technique? Research with Claude.ai, then hand off to Claude Code with research.md.

Jun 22, 2026 · 70% relevant

Claude Code's June 15 Agentic Credit Split: How to Avoid Hitting the $20 Wall

Claude Code's June 15 agentic credit split moves `claude -p` and CI workflows to a separate $20/month bucket on Pro. Upgrade to Max 5x or switch to direct API for production pipelines.

Jun 10, 2026 · 100% relevant

Prism v1.8 Adds CLI, MCP Server, and SDKs — Here's How to Use Them with

Prism v1.8's MCP server gives Claude Code direct control over caches, budgets, and routing. Install it in 2 minutes and ditch the dashboard for terminal-based AI infrastructure management.

Jun 7, 2026 · 73% relevant

skillkit: The Per-Project Claude Code Skill Manager That Finally Tames

skillkit gives Claude Code users per-project skill management via a `skills.toml` manifest and `skillkit sync` command, ending the global skill directory chaos.

Jun 1, 2026 · 90% relevant

Median Coding Agent Hits 96k Input Tokens, Rewriting Inference Economics

SemiAnalysis found that the median coding agent uses 96k input tokens, based on 432k requests, shifting the focus of inference cost from output to context.

May 22, 2026 · 95% relevant

Compute Shortage to Split AI Market: Rich Get Agents, Poor Get Chatbots

Mollick warns that a compute shortage makes agents expensive while chatbots get cheaper, splitting the AI market by company resources.

May 21, 2026 · 75% relevant

Agent4POI: LLM Agents Beat Static Embeddings by 23.2% on POI Rec

Agent4POI achieves 23.2% relative gain over baselines by generating context-aware POI representations at inference time, proving static embeddings insufficient.

May 18, 2026 · 76% relevant

Glean benchmark: Off-the-shelf MCP costs 30% more tokens than indexed context

Glean benchmark: off-the-shelf MCP in Claude Cowork loses 2.5x more tasks and uses 30% more tokens than indexed context.

May 15, 2026 · 88% relevant

mlx-vlm v0.5.0 Adds Continuous Batching, Distributed Inference for Apple Silicon

mlx-vlm v0.5.0 adds continuous batching, speculative decoding, and distributed inference for Apple Silicon. The release supports Qwen3.5, Kimi K2.5, Gemma 4 video, and new models with 21 contributors.

May 6, 2026 · 87% relevant

Claude Code Digest — Apr 28–May 01

CCmeter's cache-busting insights can cut your Claude Code costs by up to 40% instantly.

May 1, 2026 · 100% relevant

Doby Cuts Claude Code Navigation Tokens by 95% with Spec-First Workflow

Doby's spec-first fix workflow cuts navigation tokens by 95% and enforces plan docs as the source of truth before code changes.

Apr 24, 2026 · 100% relevant

How Andrej Karpathy's CLAUDE.md Guidelines Save Millions of Tokens — and

Andrej Karpathy's CLAUDE.md patterns cut token waste by 40%+. Copy his exact config to slash costs and speed up Claude Code.

Apr 23, 2026 · 86% relevant

Prefill-as-a-Service Paper Claims to Decouple LLM Inference Bottleneck

A research paper proposes a 'Prefill-as-a-Service' architecture to separate the heavy prefill computation from the lighter decoding phase in LLM inference. This could enable new deployment models where resource-constrained devices handle only the decoding step.

Apr 20, 2026 · 85% relevant

WOZCODE Launches Free Claude Code Plugin, Claims 40% Speed Boost

WOZCODE has launched a free plugin for Claude Code, claiming it makes coding sessions 30-40% faster and reduces costs by up to 55%. The plugin is available now.

Apr 18, 2026 · 100% relevant
