Skip to content
gentic.news — AI News Intelligence Platform
Connecting to the Living Graph…

prompt caching

30 articles about prompt caching in AI news

3 Official System Prompts That Stop Claude Code From Hallucinating APIs

Anthropic's official documentation reveals three system prompt instructions that dramatically reduce hallucinations when Claude Code researches APIs or libraries.

84% relevant

Semantic Caching: The Key to Affordable, Real-Time AI for Luxury Clienteling

Semantic caching for LLMs reuses responses to similar customer queries, cutting API costs by 20-40% and slashing response times. This makes deploying AI-powered personal assistants and search at scale financially viable for luxury brands.

70% relevant

RedParrot: Semantic Caching Speeds Up NL-to-DSL for Business Analytics by

Xiaohongshu researchers propose RedParrot, a framework that caches normalized structural patterns of natural language queries to bypass expensive LLM pipelines, achieving 3.6x speedup and 8.26% accuracy improvement on enterprise datasets.

84% relevant

ML-Master 2.0 Hits 56.44% on MLE-Bench in 24-Hour Agentic Science Run

Researchers from Shanghai Jiao Tong University demonstrated ML-Master 2.0, an autonomous research agent that operated continuously for 24 hours on the MLE-Bench, achieving a 56.44% medal rate. The breakthrough centers on Hierarchical Cognitive Caching for state management, not reasoning, enabling long-horizon scientific workflows.

87% relevant

How One Developer Achieved a 46:1 Context Cache Ratio to Manage 39 Projects

The key takeaway is that maximizing Claude Code's prompt cache through long, context-dense sessions is the most effective way to scale individual productivity across multiple projects.

100% relevant

The 270-Second Rule: How to Cut Claude Code API Costs by 90% with Smart

Anthropic's prompt cache has a 5-minute TTL. Orchestrator loops running faster than 270 seconds pay ~10% of full input token costs.

100% relevant

Stanford/MIT Paper: AI Performance Depends on 'Model Harnesses'

A new paper from Stanford and MIT introduces the concept of 'Model Harnesses,' arguing that the wrapper of prompts, tools, and infrastructure around a base model is a primary determinant of real-world AI performance.

85% relevant

Google DeepMind Unveils Gemini-Powered Browser That Generates Websites in Real-Time

Google DeepMind has demonstrated a browser prototype powered by Gemini 3.1 Flash-Lite that generates complete HTML/CSS websites dynamically based on user prompts and navigation context, shifting from static page retrieval to on-demand interface generation.

95% relevant

Helium: A New Framework for Efficient LLM Serving in Agentic Workflows

Researchers introduce Helium, a workflow-aware LLM serving framework that treats agentic workflows as query plans. It uses proactive caching and cache-aware scheduling to reduce redundancy, achieving up to 1.56x speedup over current systems.

74% relevant

Claude Code's June 15 Agentic Credit Split: How to Avoid Hitting the $20 Wall

Claude Code's June 15 agentic credit split moves `claude -p` and CI workflows to a separate $20/month bucket on Pro. Upgrade to Max 5x or switch to direct API for production pipelines.

89% relevant

Prism v1.8 Adds CLI, MCP Server, and SDKs — Here's How to Use Them with

Prism v1.8's MCP server gives Claude Code direct control over caches, budgets, and routing. Install it in 2 minutes and ditch the dashboard for terminal-based AI infrastructure management.

73% relevant

skillkit: The Per-Project Claude Code Skill Manager That Finally Tames

skillkit gives Claude Code users per-project skill management via a `skills.toml` manifest and `skillkit sync` command, ending the global skill directory chaos.

90% relevant

Median Coding Agent Hits 96k Input Tokens, Rewriting Inference Economics

SemiAnalysis found median coding agent uses 96k input tokens from 432k requests, shifting inference cost focus from output to context.

95% relevant

Compute Shortage to Split AI Market: Rich Get Agents, Poor Get Chatbots

Mollick warns compute shortage makes agents expensive while chatbots cheapen, splitting AI market by company resources.

75% relevant

Agent4POI: LLM Agents Beat Static Embeddings by 23.2% on POI Rec

Agent4POI achieves 23.2% relative gain over baselines by generating context-aware POI representations at inference time, proving static embeddings insufficient.

76% relevant

Glean benchmark: Off-the-shelf MCP costs 30% more tokens than indexed context

Glean benchmark: off-the-shelf MCP in Claude Cowork loses 2.5x more tasks and uses 30% more tokens than indexed context.

88% relevant

mlx-vlm v0.5.0 Adds Continuous Batching, Distributed Inference for Apple Silicon

mlx-vlm v0.5.0 adds continuous batching, speculative decoding, and distributed inference for Apple Silicon. The release supports Qwen3.5, Kimi K2.5, Gemma 4 video, and new models with 21 contributors.

87% relevant

Claude Code Digest — Apr 28–May 01

CCmeter's cache-busting insights can cut your Claude Code costs by up to 40% instantly.

100% relevant

Doby Cuts Claude Code Navigation Tokens by 95% with Spec-First Workflow

A spec-first fix workflow that slashes navigation tokens 95% and enforces plan docs as source of truth before code changes.

100% relevant

How Andre Karpathy's CLAUDE.md Guidelines Save Millions of Tokens — and

Andre Karpathy's CLAUDE.md patterns cut token waste by 40%+. Copy his exact config to slash costs and speed up Claude Code.

86% relevant

Prefill-as-a-Service Paper Claims to Decouple LLM Inference Bottleneck

A research paper proposes a 'Prefill-as-a-Service' architecture to separate the heavy prefill computation from the lighter decoding phase in LLM inference. This could enable new deployment models where resource-constrained devices handle only the decoding step.

85% relevant

WOZCODE Launches Free Claude Code Plugin, Claims 40% Speed Boost

WOZCODE has launched a free plugin for Claude Code, claiming it makes coding sessions 30-40% faster and reduces costs by up to 55%. The plugin is available now.

100% relevant

Why the Best Generative AI Projects Start With the Most Powerful Model —

The article suggests that while initial AI projects leverage the broad capabilities of large foundation models, the most successful implementations eventually transition to smaller, more targeted systems. This reflects a maturation from experimentation to production optimization.

72% relevant

Claude Code's Edge: Why Sonnet 4.5 Beats GPT-4o for Multi-File Projects

Claude Code's underlying model excels at understanding existing codebases and maintaining instruction fidelity in long sessions, making it the better choice for complex, multi-file development tasks.

100% relevant

Anthropic Ends Cheap Claude Subscriptions, Moves Businesses to API-Only Pricing

Anthropic has terminated its $20-$200/month Claude subscription plans for businesses, shifting all commercial access to its API pricing. This ends a period of subsidized access and aligns its model with competitors like OpenAI.

89% relevant

How Telemetry Settings Are Silently Costing You Cache Tiers (And How To Fix It)

A confirmed bug links telemetry settings to cache TTL; disabling telemetry defaults you to 5-minute cache, increasing costs. Use environment variables and hooks to mitigate.

90% relevant

Anthropic's Silent Cache TTL Cut

Claude Code's default cache TTL was silently reduced to 5 minutes on April 2, drastically increasing token costs. Use hooks and settings to mitigate the impact.

100% relevant

Claude Opus 4.6 Unlimited Access Deal Sparks Developer Interest

A developer reports finding a deal for unlimited Claude Opus 4.6 usage without rate limits, potentially offering significant cost savings for heavy users compared to Anthropic's official API pricing.

93% relevant

The Hidden Operational Costs of GenAI Products

The article deconstructs the illusion of simplicity in GenAI products, detailing how predictable costs (APIs, compute) are dwarfed by hidden operational expenses for data pipelines, monitoring, and quality assurance. This is a critical financial reality check for any company scaling AI.

85% relevant

Anthropic's Claude Code Boosts @-Mention Speed 3x for Large Enterprise Codebases

Anthropic has released technical details on optimizing the @-mention feature in Claude Code, achieving a 3x speedup for large enterprise codebases. This addresses a critical performance bottleneck for developers working in massive, legacy code repositories.

100% relevant