Skip to content
gentic.news — AI News Intelligence Platform
Connecting to the Living Graph…

cache

30 articles about cache in AI news

How One Developer Achieved a 46:1 Context Cache Ratio to Manage 39 Projects

The key takeaway is that maximizing Claude Code's prompt cache through long, context-dense sessions is the most effective way to scale individual productivity across multiple projects.

100% relevant

MLX-VLM Adds Continuous Batching, OpenAI API, and Vision Cache for Apple Silicon

The next release of MLX-VLM will introduce continuous batching, an OpenAI-compatible API, and vision feature caching for multimodal models running locally on Apple Silicon. These optimizations promise up to 228x speedups on cache hits for models like Gemma4.

95% relevant

How Telemetry Settings Are Silently Costing You Cache Tiers (And How To Fix It)

A confirmed bug links telemetry settings to cache TTL; disabling telemetry defaults you to 5-minute cache, increasing costs. Use environment variables and hooks to mitigate.

90% relevant

Anthropic's Silent Cache TTL Cut

Claude Code's default cache TTL was silently reduced to 5 minutes on April 2, drastically increasing token costs. Use hooks and settings to mitigate the impact.

100% relevant

Microsoft's 'Compress-Thought' Cuts KV Cache 2-3x, Boosts Throughput 2x

A new Microsoft paper shows language models can learn to compress their reasoning steps on-the-fly, slashing memory use 2-3x and doubling throughput. Crucially, 15 percentage points of accuracy come from 'leaked' information in KV cache after explicit reasoning is erased.

95% relevant

Google's TurboQuant Compresses LLM KV Cache 6x with Zero Accuracy Loss, Cutting GPU Memory by 80%

Google researchers introduced TurboQuant, a method that compresses LLM KV cache from 32-bit to 3-bit precision without accuracy degradation. This reduces GPU memory consumption by over 80% and speeds up inference 8x on H100 GPUs.

97% relevant

Google's TurboQuant Cuts LLM KV Cache Memory by 6x, Enables 3-Bit Storage Without Accuracy Loss

Google released TurboQuant, a novel two-stage quantization algorithm that compresses the KV cache in long-context LLMs. It reduces memory by 6x, achieves 3-bit storage with no accuracy drop, and speeds up attention scoring by up to 8x on H100 GPUs.

95% relevant

arXiv Survey Maps KV Cache Optimization Landscape: 5 Strategies for Million-Token LLM Inference

A comprehensive arXiv review categorizes five principal KV cache optimization techniques—eviction, compression, hybrid memory, novel attention, and combinations—to address the linear memory scaling bottleneck in long-context LLM inference. The analysis finds no single dominant solution, with optimal strategy depending on context length, hardware, and workload.

95% relevant

DualPath Architecture Shatters KV-Cache Bottleneck, Doubling LLM Throughput for AI Agents

Researchers have developed DualPath, a novel architecture that eliminates the KV-cache storage bottleneck in agentic LLM inference. By implementing dual-path loading with RDMA transfers, the system achieves nearly 2× throughput improvements for both offline and online scenarios.

85% relevant

Claude Code's New Read Cache Blocks 8% of Token Waste Automatically

The claude-context-optimizer plugin v3.1 now actively blocks Claude from re-reading unchanged files, saving an average 8% of tokens per session.

87% relevant

DeepSeek v4 Pricing Cuts 75%: $0.43/M Tokens In

DeepSeek v4 API pricing permanently cut 75% to $0.43/M input, $0.87/M output, enabled by 27% compute and 10% cache vs v3.2.

100% relevant

Claude Code Digest — May 01–May 04

CCmeter's cache-busting insights can slash your Claude Code costs by up to 40% instantly.

95% relevant

Claude Code Digest — Apr 28–May 01

CCmeter's cache-busting insights can cut your Claude Code costs by up to 40% instantly.

100% relevant

CCmeter: The Open-Source Dashboard That Reveals Exactly Why Your Claude

CCmeter parses Claude Code's local session logs to surface cache-busting patterns, cost leaks, and model-swap simulations. Free, local-first, zero telemetry.

100% relevant

RedParrot: Semantic Caching Speeds Up NL-to-DSL for Business Analytics by

Xiaohongshu researchers propose RedParrot, a framework that caches normalized structural patterns of natural language queries to bypass expensive LLM pipelines, achieving 3.6x speedup and 8.26% accuracy improvement on enterprise datasets.

84% relevant

The 270-Second Rule: How to Cut Claude Code API Costs by 90% with Smart

Anthropic's prompt cache has a 5-minute TTL. Orchestrator loops running faster than 270 seconds pay ~10% of full input token costs.

100% relevant

Atomic Chat's TurboQuant Enables Gemma 4 Local Inference on 16GB MacBook Air

Atomic Chat's new TurboQuant algorithm aggressively compresses the KV cache, allowing models requiring 32GB+ RAM to run on 16GB MacBook Airs at 25 tokens/sec, advancing local AI deployment.

85% relevant

mlx-vlm v0.4.4 Launches with Falcon-Perception 300M, TurboQuant Metal Kernels & 1.9x Decode Speedup

The mlx-vlm library v0.4.4 adds support for TII's Falcon-Perception 300M vision model and introduces TurboQuant Metal kernels, achieving up to 1.9x faster decoding with 89% KV cache savings on Apple Silicon.

85% relevant

Helium: A New Framework for Efficient LLM Serving in Agentic Workflows

Researchers introduce Helium, a workflow-aware LLM serving framework that treats agentic workflows as query plans. It uses proactive caching and cache-aware scheduling to reduce redundancy, achieving up to 1.56x speedup over current systems.

74% relevant

NVIDIA's Memory Compression Breakthrough: How Forgetting Makes LLMs Smarter

NVIDIA researchers have developed Dynamic Memory Sparsification, a technique that compresses LLM working memory by 8× while improving reasoning capabilities. This counterintuitive approach addresses the critical KV cache bottleneck in long-context AI applications.

85% relevant

Sleep Phase Cuts Transformer Costs by Consolidating Memory

Paper proposes sleep phase to consolidate context into fixed-size memory, reducing inference cost while improving long-horizon task performance on GSM-Infinite.

84% relevant

Alibaba + Nanjing Univ Claim 9.36X Faster Million-Token Prefill vs FlashAttention-2

Alibaba + Nanjing Univ claim 9.36X faster million-token prefill vs FlashAttention-2, targeting the key bottleneck in long-context LLM inference.

85% relevant

Median Coding Agent Hits 96k Input Tokens, Rewriting Inference Economics

SemiAnalysis found median coding agent uses 96k input tokens from 432k requests, shifting inference cost focus from output to context.

95% relevant

Compute Shortage to Split AI Market: Rich Get Agents, Poor Get Chatbots

Mollick warns compute shortage makes agents expensive while chatbots cheapen, splitting AI market by company resources.

75% relevant

vLLM Optimizations Cut Voice AI Latency by 40% on 6-GPU Cluster

vLLM optimizations on a 6-GPU cluster reduced voice AI latency by 40% for a Qwen-based system, enabling 500 concurrent sessions per node without hardware upgrades.

82% relevant

Glean benchmark: Off-the-shelf MCP costs 30% more tokens than indexed context

Glean benchmark: off-the-shelf MCP in Claude Cowork loses 2.5x more tasks and uses 30% more tokens than indexed context.

88% relevant

Hermes Agent's Three-Tier Memory Cuts Context Bloat, Keeps 2,200-Char Core

Hermes agent's three-tier memory uses two tiny markdown files (2,200 chars), SQLite FTS5 search (10ms over 10K docs), and 8 pluggable providers. The composition solves the always-on vs. deep recall trade-off.

91% relevant

Shopify Drops Redis for MySQL in Inventory Reservations, Scales 10x

Shopify replaced Redis with MySQL for inventory reservations, achieving 10x scalability and handling 50,000 writes per second.

93% relevant

Perplexity Claims 3x Blackwell Inference Throughput for 70B Models

Perplexity AI claims 3x inference throughput for 70B models on Nvidia Blackwell GPUs via FP4 and custom scheduling. The gain exceeds Nvidia's own 2x marketing claim.

85% relevant

AMD ROCm Performance Jumps 75x in 14 Days Post-DeepSeek v4

AMD ROCm stack improved 75x in 14 days post-DeepSeek v4 via fused operations. Still needs 5x more to match B200 performance.

100% relevant