prompt caching
30 articles about prompt caching in AI news
3 Official System Prompts That Stop Claude Code From Hallucinating APIs
Anthropic's official documentation reveals three system prompt instructions that dramatically reduce hallucinations when Claude Code researches APIs or libraries.
Semantic Caching: The Key to Affordable, Real-Time AI for Luxury Clienteling
Semantic caching for LLMs reuses responses to similar customer queries, cutting API costs by 20-40% and slashing response times. This makes deploying AI-powered personal assistants and search at scale financially viable for luxury brands.
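As a rough illustration of the idea in the summary above, the sketch below returns a stored response when a new query is similar enough to an earlier one. Everything here is hypothetical: `embed` is a toy bag-of-words stand-in for a real embedding model, and `SemanticCache`, its methods, and the 0.7 threshold are illustrative choices, not something taken from the article.

```python
import math
import re
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy bag-of-words vector (assumption: a real system would call an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Hypothetical cache that matches queries by similarity instead of exact text."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        self.entries = []  # list of (query vector, cached response) pairs

    def get(self, query: str) -> Optional[str]:
        """Return the cached response of the most similar stored query, if close enough."""
        vector = embed(query)
        best_score, best_response = 0.0, None
        for stored_vector, response in self.entries:
            score = cosine(vector, stored_vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        """Store a response under the vector of the query that produced it."""
        self.entries.append((embed(query), response))


cache = SemanticCache(threshold=0.7)
cache.put("What is your return policy for handbags?", "Returns are accepted within 30 days.")
hit = cache.get("what is the return policy for handbags")  # similar wording, served from cache
miss = cache.get("Do you ship to Japan?")                   # unrelated, would go to the LLM
```

A linear scan is enough for a sketch; a production system would more likely use a vector index and tune the threshold against real traffic, since a threshold that is too low returns wrong answers for merely related questions.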
RedParrot: Semantic Caching Speeds Up NL-to-DSL for Business Analytics by
Xiaohongshu researchers propose RedParrot, a framework that caches normalized structural patterns of natural language queries to bypass expensive LLM pipelines, achieving 3.6x speedup and 8.26% accuracy improvement on enterprise datasets.
ML-Master 2.0 Hits 56.44% on MLE-Bench in 24-Hour Agentic Science Run
Researchers from Shanghai Jiao Tong University demonstrated ML-Master 2.0, an autonomous research agent that operated continuously for 24 hours on the MLE-Bench, achieving a 56.44% medal rate. The breakthrough centers on Hierarchical Cognitive Caching for state management, not reasoning, enabling long-horizon scientific workflows.
How One Developer Achieved a 46:1 Context Cache Ratio to Manage 39 Projects
The key takeaway is that maximizing Claude Code's prompt cache through long, context-dense sessions is the most effective way to scale individual productivity across multiple projects.
The 270-Second Rule: How to Cut Claude Code API Costs by 90% with Smart
Anthropic's prompt cache has a 5-minute TTL. Orchestrator loops that iterate in under 270 seconds keep the cache warm and pay ~10% of the full input token cost for the cached portion of each request.
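The arithmetic behind that rule can be sketched as follows. This is a simplified estimate under stated assumptions, not a billing calculator: the 5-minute TTL and the ~10% read rate come from the summary above, while the 1.25x cache-write multiplier, the $3 per million tokens base price, and the function name `loop_input_cost` are assumptions made for illustration. It also assumes each cache hit refreshes the TTL and that the whole prefix is identical on every iteration.

```python
CACHE_TTL_SECONDS = 300        # 5-minute TTL, as stated in the summary
CACHE_READ_MULTIPLIER = 0.10   # cached reads cost ~10% of the base input price
CACHE_WRITE_MULTIPLIER = 1.25  # assumption: writing the cache costs 1.25x base


def loop_input_cost(prefix_tokens: int, iterations: int,
                    interval_seconds: float, price_per_mtok: float) -> float:
    """Estimate the input cost of re-sending one cacheable prefix every iteration."""
    if iterations <= 0:
        return 0.0
    base = prefix_tokens / 1_000_000 * price_per_mtok
    first_call = base * CACHE_WRITE_MULTIPLIER  # the first call always writes the cache
    if interval_seconds < CACHE_TTL_SECONDS:
        # Cache is still warm on every later call, so each one is a cheap read.
        later_call = base * CACHE_READ_MULTIPLIER
    else:
        # Cache has expired between calls, so every call pays the write price again.
        later_call = base * CACHE_WRITE_MULTIPLIER
    return first_call + (iterations - 1) * later_call


# Hypothetical workload: 100k-token prefix, 10 iterations, $3 per million input tokens.
warm = loop_input_cost(100_000, 10, interval_seconds=270, price_per_mtok=3.0)
cold = loop_input_cost(100_000, 10, interval_seconds=330, price_per_mtok=3.0)
print(f"warm loop: ${warm:.3f}, cold loop: ${cold:.3f}")
```

The 30-second margin below the TTL is the point of the "270-second" figure: a loop timed at exactly 300 seconds risks missing the cache on any iteration that runs slightly late.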
Stanford/MIT Paper: AI Performance Depends on 'Model Harnesses'
A new paper from Stanford and MIT introduces the concept of 'Model Harnesses,' arguing that the wrapper of prompts, tools, and infrastructure around a base model is a primary determinant of real-world AI performance.
Google DeepMind Unveils Gemini-Powered Browser That Generates Websites in Real-Time
Google DeepMind has demonstrated a browser prototype powered by Gemini 3.1 Flash-Lite that generates complete HTML/CSS websites dynamically based on user prompts and navigation context, shifting from static page retrieval to on-demand interface generation.
Helium: A New Framework for Efficient LLM Serving in Agentic Workflows
Researchers introduce Helium, a workflow-aware LLM serving framework that treats agentic workflows as query plans. It uses proactive caching and cache-aware scheduling to reduce redundancy, achieving up to 1.56x speedup over current systems.
Claude Code's June 15 Agentic Credit Split: How to Avoid Hitting the $20 Wall
Claude Code's June 15 agentic credit split moves `claude -p` and CI workflows to a separate $20/month bucket on Pro. Upgrade to Max 5x or switch to direct API for production pipelines.
Prism v1.8 Adds CLI, MCP Server, and SDKs — Here's How to Use Them with
Prism v1.8's MCP server gives Claude Code direct control over caches, budgets, and routing. Install it in 2 minutes and ditch the dashboard for terminal-based AI infrastructure management.
skillkit: The Per-Project Claude Code Skill Manager That Finally Tames
skillkit gives Claude Code users per-project skill management via a `skills.toml` manifest and `skillkit sync` command, ending the global skill directory chaos.
Median Coding Agent Hits 96k Input Tokens, Rewriting Inference Economics
SemiAnalysis found that the median coding agent request uses 96k input tokens, based on a sample of 432k requests, shifting the focus of inference cost from output to context.
Compute Shortage to Split AI Market: Rich Get Agents, Poor Get Chatbots
Mollick warns that a compute shortage will make agents expensive while chatbots get cheaper, splitting the AI market by company resources.
Agent4POI: LLM Agents Beat Static Embeddings by 23.2% on POI Rec
Agent4POI achieves a 23.2% relative gain over baselines by generating context-aware POI representations at inference time, suggesting that static embeddings alone are insufficient.
Glean benchmark: Off-the-shelf MCP costs 30% more tokens than indexed context
Glean benchmark: off-the-shelf MCP in Claude Cowork loses 2.5x more tasks and uses 30% more tokens than indexed context.
mlx-vlm v0.5.0 Adds Continuous Batching, Distributed Inference for Apple Silicon
mlx-vlm v0.5.0 adds continuous batching, speculative decoding, and distributed inference for Apple Silicon. The release supports Qwen3.5, Kimi K2.5, Gemma 4 video, and new models with 21 contributors.
Claude Code Digest — Apr 28–May 01
CCmeter's cache-busting insights can cut your Claude Code costs by up to 40% instantly.
Doby Cuts Claude Code Navigation Tokens by 95% with Spec-First Workflow
A spec-first fix workflow that slashes navigation tokens 95% and enforces plan docs as source of truth before code changes.
How Andrej Karpathy's CLAUDE.md Guidelines Save Millions of Tokens — and
Andrej Karpathy's CLAUDE.md patterns cut token waste by 40%+. Copy his exact config to slash costs and speed up Claude Code.
Prefill-as-a-Service Paper Claims to Decouple LLM Inference Bottleneck
A research paper proposes a 'Prefill-as-a-Service' architecture to separate the heavy prefill computation from the lighter decoding phase in LLM inference. This could enable new deployment models where resource-constrained devices handle only the decoding step.
WOZCODE Launches Free Claude Code Plugin, Claims 40% Speed Boost
WOZCODE has launched a free plugin for Claude Code, claiming it makes coding sessions 30-40% faster and reduces costs by up to 55%. The plugin is available now.
Why the Best Generative AI Projects Start With the Most Powerful Model —
The article suggests that while initial AI projects leverage the broad capabilities of large foundation models, the most successful implementations eventually transition to smaller, more targeted systems. This reflects a maturation from experimentation to production optimization.
Claude Code's Edge: Why Sonnet 4.5 Beats GPT-4o for Multi-File Projects
Claude Code's underlying model excels at understanding existing codebases and maintaining instruction fidelity in long sessions, making it the better choice for complex, multi-file development tasks.
Anthropic Ends Cheap Claude Subscriptions, Moves Businesses to API-Only Pricing
Anthropic has terminated its $20-$200/month Claude subscription plans for businesses, shifting all commercial access to its API pricing. This ends a period of subsidized access and aligns its model with competitors like OpenAI.
How Telemetry Settings Are Silently Costing You Cache Tiers (And How To Fix It)
A confirmed bug links telemetry settings to cache TTL: disabling telemetry drops you to the 5-minute cache tier, which increases costs. Use environment variables and hooks to mitigate.
Anthropic's Silent Cache TTL Cut
Claude Code's default cache TTL was silently reduced to 5 minutes on April 2, drastically increasing token costs. Use hooks and settings to mitigate the impact.
Claude Opus 4.6 Unlimited Access Deal Sparks Developer Interest
A developer reports finding a deal for unlimited Claude Opus 4.6 usage without rate limits, potentially offering significant cost savings for heavy users compared to Anthropic's official API pricing.
The Hidden Operational Costs of GenAI Products
The article deconstructs the illusion of simplicity in GenAI products, detailing how predictable costs (APIs, compute) are dwarfed by hidden operational expenses for data pipelines, monitoring, and quality assurance. This is a critical financial reality check for any company scaling AI.
Anthropic's Claude Code Boosts @-Mention Speed 3x for Large Enterprise Codebases
Anthropic has released technical details on optimizing the @-mention feature in Claude Code, achieving a 3x speedup for large enterprise codebases. This addresses a critical performance bottleneck for developers working in massive, legacy code repositories.