cache
30 articles about cache in AI news
How Telemetry Settings Are Silently Costing You Cache Tiers (And How To Fix It)
A confirmed bug links telemetry settings to cache TTL: disabling telemetry silently defaults you to a 5-minute cache, increasing costs. Environment variables and hooks can mitigate it.
Microsoft's 'Compress-Thought' Cuts KV Cache 2-3x, Boosts Throughput 2x
A new Microsoft paper shows language models can learn to compress their reasoning steps on the fly, slashing memory use 2-3x and doubling throughput. Crucially, 15 percentage points of accuracy come from 'leaked' information in the KV cache after explicit reasoning is erased.
Google's TurboQuant Compresses LLM KV Cache 6x with Zero Accuracy Loss, Cutting GPU Memory by 80%
Google researchers introduced TurboQuant, a method that compresses LLM KV cache from 32-bit to 3-bit precision without accuracy degradation. This reduces GPU memory consumption by over 80% and speeds up inference 8x on H100 GPUs.
Google's TurboQuant Cuts LLM KV Cache Memory by 6x, Enables 3-Bit Storage Without Accuracy Loss
Google released TurboQuant, a novel two-stage quantization algorithm that compresses the KV cache in long-context LLMs. It reduces memory by 6x, achieves 3-bit storage with no accuracy drop, and speeds up attention scoring by up to 8x on H100 GPUs.
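TurboQuant's actual two-stage algorithm is in the paper, but the memory arithmetic behind the headline is easy to sanity-check with a toy per-row uniform quantizer. A minimal sketch, assuming nothing about TurboQuant itself beyond "32-bit values stored as 3-bit codes" (the function names and the per-row scaling scheme here are illustrative, not Google's method):

```python
import numpy as np

def uniform_quantize(x, bits=3):
    """Per-row symmetric uniform quantization: map float32 values to
    signed integer codes in [-(2^(bits-1)-1), 2^(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1            # 3 for 3-bit signed codes
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale[scale == 0] = 1.0               # avoid division by zero on all-zero rows
    codes = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((8, 128)).astype(np.float32)  # toy "KV cache" block
codes, scale = uniform_quantize(kv, bits=3)
recon = dequantize(codes, scale)

# 32-bit -> 3-bit payload is a 32/3 ~ 10.7x reduction before overhead;
# per-row float scales and alignment eat some of that, which is consistent
# with a quoted end-to-end figure like 6x.
err = np.abs(kv - recon).mean()
print(f"payload compression: {32 / 3:.1f}x, mean abs error: {err:.3f}")
```

The per-row max-abs scale is the simplest possible calibration; real methods spend most of their effort choosing better scales and handling outliers.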
arXiv Survey Maps KV Cache Optimization Landscape: 5 Strategies for Million-Token LLM Inference
A comprehensive arXiv review categorizes five principal KV cache optimization techniques—eviction, compression, hybrid memory, novel attention, and combinations—to address the linear memory scaling bottleneck in long-context LLM inference. The analysis finds no single dominant solution, with optimal strategy depending on context length, hardware, and workload.
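Of the five categories, eviction is the simplest to illustrate. A hedged sketch of one well-known policy in that family (sliding window plus an "attention sink" prefix, in the spirit of StreamingLLM-style eviction); the class name and sizes are illustrative, not from the survey:

```python
from collections import deque

class SlidingWindowKVCache:
    """Toy KV-cache eviction policy: always keep the first `n_sink`
    entries plus the most recent `window` entries, so memory stays
    O(sink + window) instead of growing linearly with sequence length."""

    def __init__(self, n_sink=4, window=8):
        self.n_sink = n_sink
        self.sink = []                       # permanently retained prefix
        self.recent = deque(maxlen=window)   # evicts the oldest entry automatically

    def append(self, kv_entry):
        if len(self.sink) < self.n_sink:
            self.sink.append(kv_entry)
        else:
            self.recent.append(kv_entry)

    def snapshot(self):
        return self.sink + list(self.recent)

cache = SlidingWindowKVCache(n_sink=2, window=3)
for t in range(10):
    cache.append(t)
print(cache.snapshot())  # [0, 1, 7, 8, 9]: sink tokens plus the recent window
```

The survey's point stands out even in the toy: this policy is free and bounded, but it discards the middle of the context entirely, which is exactly why no single strategy dominates across workloads.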
DualPath Architecture Shatters KV-Cache Bottleneck, Doubling LLM Throughput for AI Agents
Researchers have developed DualPath, a novel architecture that eliminates the KV-cache storage bottleneck in agentic LLM inference. By implementing dual-path loading with RDMA transfers, the system achieves a nearly 2× throughput improvement in both offline and online scenarios.
Claude Code's New Read Cache Blocks 8% of Token Waste Automatically
The claude-context-optimizer plugin v3.1 now actively blocks Claude from re-reading unchanged files, saving an average 8% of tokens per session.
Atomic Chat's TurboQuant Enables Gemma 4 Local Inference on 16GB MacBook Air
Atomic Chat's new TurboQuant algorithm aggressively compresses the KV cache, allowing models requiring 32GB+ RAM to run on 16GB MacBook Airs at 25 tokens/sec, advancing local AI deployment.
mlx-vlm v0.4.4 Launches with Falcon-Perception 300M, TurboQuant Metal Kernels & 1.9x Decode Speedup
The mlx-vlm library v0.4.4 adds support for TII's Falcon-Perception 300M vision model and introduces TurboQuant Metal kernels, achieving up to 1.9x faster decoding with 89% KV cache savings on Apple Silicon.
Helium: A New Framework for Efficient LLM Serving in Agentic Workflows
Researchers introduce Helium, a workflow-aware LLM serving framework that treats agentic workflows as query plans. It uses proactive caching and cache-aware scheduling to reduce redundancy, achieving up to 1.56x speedup over current systems.
NVIDIA's Memory Compression Breakthrough: How Forgetting Makes LLMs Smarter
NVIDIA researchers have developed Dynamic Memory Sparsification, a technique that compresses LLM working memory by 8× while improving reasoning capabilities. This counterintuitive approach addresses the critical KV cache bottleneck in long-context AI applications.
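Dynamic Memory Sparsification's learned criterion is NVIDIA's; the generic "forgetting" mechanic, though, can be sketched with a stand-in score. Here accumulated attention mass decides which cached tokens survive, and keep_ratio=0.125 mimics the quoted 8× reduction (all names and the scoring rule are illustrative assumptions, not the paper's method):

```python
import numpy as np

def sparsify_kv(keys, values, attn_scores, keep_ratio=0.125):
    """Score-based KV sparsification: keep only the cached tokens that
    received the most total attention; drop ("forget") the rest."""
    n = keys.shape[0]
    k = max(1, int(n * keep_ratio))
    totals = attn_scores.sum(axis=0)         # attention mass per cached token
    keep = np.sort(np.argsort(totals)[-k:])  # top-k tokens, original order preserved
    return keys[keep], values[keep], keep

rng = np.random.default_rng(1)
n_tokens, d = 64, 16
keys = rng.standard_normal((n_tokens, d))
values = rng.standard_normal((n_tokens, d))
attn = rng.random((32, n_tokens))            # toy attention weights from 32 queries

k2, v2, kept = sparsify_kv(keys, values, attn)
print(f"kept {len(kept)}/{n_tokens} tokens ({n_tokens // len(kept)}x smaller)")
```

The counterintuitive claim in the article is that a learned version of this pruning can *help* reasoning, presumably by forcing the model to keep only load-bearing context.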
Claude Managed Agents: How to Build on the Platform Instead of in Its Gaps
Claude Managed Agents turns long-running, stateful agents into an API call. For developers, this means building durable applications on a stable platform, not temporary solutions in its gaps.
The 100th Tool Call Problem: Why Most CI Agents Fail in Production
The article identifies a common failure mode for CI agents in production: they can get stuck in infinite loops or make excessive tool calls. It proposes implementing stop conditions—step/time/tool budgets and no-progress termination—as a solution. This is a critical engineering insight for deploying reliable AI agents.
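The proposed stop conditions are straightforward to make concrete. A minimal sketch combining a tool-call budget, a wall-clock budget, and no-progress termination (the class name, thresholds, and "fingerprint" notion of progress are illustrative assumptions, not from the article):

```python
import time

class StopConditions:
    """Terminate an agent loop when any of the following trips:
    tool-call budget exhausted, wall-clock budget exceeded, or
    N consecutive steps without an observable change in state."""

    def __init__(self, max_calls=100, max_seconds=600, max_stalls=3):
        self.max_calls = max_calls
        self.max_seconds = max_seconds
        self.max_stalls = max_stalls
        self.calls = 0
        self.stalls = 0
        self.started = time.monotonic()
        self.last_state = None

    def record(self, tool_call_made, state_fingerprint):
        if tool_call_made:
            self.calls += 1
        # "No progress" = the observable state is identical to the last step.
        self.stalls = self.stalls + 1 if state_fingerprint == self.last_state else 0
        self.last_state = state_fingerprint

    def should_stop(self):
        if self.calls >= self.max_calls:
            return "tool-call budget exhausted"
        if time.monotonic() - self.started >= self.max_seconds:
            return "time budget exhausted"
        if self.stalls >= self.max_stalls:
            return "no progress detected"
        return None

guard = StopConditions(max_stalls=3)
for step in range(10):
    guard.record(tool_call_made=True, state_fingerprint="same-output")
    reason = guard.should_stop()
    if reason:
        print(f"stopping at step {step}: {reason}")  # trips on the no-progress rule
        break
```

The fingerprint can be anything cheap and deterministic, e.g. a hash of the working tree plus the last tool output; the point is that an agent repeating itself should never be allowed to reach the 100th call.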
How Claude Code's Upstream Proxy Solves Corporate Network Headaches
Claude Code's CCR feature transparently routes subprocess HTTP traffic through a secure WebSocket tunnel, handling corporate MITM certificates and complex network routing automatically.
Graphify: Open-Source Tool Builds Knowledge Graphs from Code & Docs in One Command
Developer shipped Graphify, an open-source tool that builds queryable knowledge graphs from code, docs, and images in one command. It uses a two-pass pipeline with tree-sitter and Claude subagents, achieving 71.5x fewer tokens per query versus reading raw files.
Kering's 80% Opportunity: A Strategic Pivot from Operational AI to Brand Meaning
Kering CEO Luca de Meo frames luxury as a €350B market where Kering only plays in 20%. The article argues that Gucci's decade-long growth has been erased and Balenciaga hasn't recovered from its 2022 scandal because both lost their core brand meaning. De Meo's strategy—proven at Renault—is to define meaning first, then execute operationally.
Kerf-CLI: The SQLite-Powered Cost Dashboard Every Claude Code User Needs
Install Kerf-CLI to track Claude Code spending, enforce budgets, and identify wasted Opus spend with a local SQLite database and polished dashboard.
How Claude Code's 3-Tier Compaction System Saves You Money and Keeps Context
Learn how Claude Code's intelligent, three-tiered compaction system works to manage long conversations efficiently, preserving key context while optimizing for token usage and cost.
Microsoft's BitNet Enables 100B-Parameter LLMs on CPU, Cuts Energy 82%
Microsoft Research's BitNet project demonstrates 1-bit LLMs with 100B parameters that run efficiently on CPUs, using 82% less energy while maintaining performance, challenging the need for GPUs in local deployment.
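The "1-bit" label is really 1.58-bit ternary weights. A simplified sketch of absmean ternary quantization in the spirit of the published BitNet b1.58 recipe, without any of the training machinery (treat the exact scaling and the matmul helper as assumptions for illustration):

```python
import numpy as np

def ternarize(weights, eps=1e-8):
    """Absmean ternary quantization: scale by the mean absolute weight,
    then round and clip to {-1, 0, +1}. Each weight then needs about
    1.58 bits (log2 3) instead of 16 or 32."""
    scale = np.abs(weights).mean() + eps
    q = np.clip(np.round(weights / scale), -1, 1)
    return q.astype(np.int8), scale

def ternary_matmul(x, q, scale):
    # With ternary weights the matmul is additions and sign flips plus one
    # float rescale at the end, which is why such models map well to CPUs.
    return (x @ q.astype(np.float32)) * scale

rng = np.random.default_rng(2)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = ternarize(w)
x = rng.standard_normal((1, 64)).astype(np.float32)
y = ternary_matmul(x, q, s)
print("unique weight values:", np.unique(q))
```

The 82% energy figure comes from Microsoft's full system (custom kernels included), not from anything this toy captures; the sketch only shows why the arithmetic gets so cheap.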
Nous Research's Hermes Agent Features Self-Improving Skills, Persistent Memory
A new evaluation of Nous Research's Hermes Agent highlights its self-improving ability to build reusable tools from experience and a smarter persistent memory system that reduces token usage. The agent reportedly improves with continued use, representing a shift towards more adaptive AI systems.
Forge: The Open-Source TUI That Turns Claude Code into a Multi-Model Swarm
Forge is a new open-source tool that orchestrates multiple AI coding agents (including Claude Code) using git-native isolation and semantic context management to overcome token limits.
Anthropic's Claude Sonnet 4.8, Opus 4.7 Internally Tested, Leak Suggests
A leak reveals Anthropic has internally tested Claude Sonnet 4.8 and Opus 4.7. This suggests a public release of these model upgrades is likely imminent.
Opus+Codex Crossover Point: Use Pure Opus Below 500 Lines, Switch Above 800
The 'plan with Opus, execute with Codex' workflow has a clear cost crossover at ~600 lines of code. For smaller tasks (<500 LOC), stick with pure Claude Code.
Memory Systems for AI Agents: Architectures, Frameworks, and Challenges
A technical analysis details the multi-layered memory architectures—short-term, episodic, semantic, procedural—required to transform stateless LLMs into persistent, reliable AI agents. It compares frameworks like MemGPT and LangMem that manage context limits and prevent memory drift.
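The four layers the analysis names can be sketched as one container. This is a generic illustration of the architecture, not MemGPT's or LangMem's actual API (all names here are placeholders):

```python
from collections import deque

class AgentMemory:
    """Toy multi-layer agent memory:
    - short-term:  bounded rolling context (oldest turns are evicted)
    - episodic:    append-only log of past events
    - semantic:    key -> fact store for long-lived knowledge
    - procedural:  name -> skill/tool the agent has learned"""

    def __init__(self, short_term_limit=4):
        self.short_term = deque(maxlen=short_term_limit)
        self.episodic = []
        self.semantic = {}
        self.procedural = {}

    def observe(self, turn):
        self.short_term.append(turn)   # may evict the oldest turn: the context limit
        self.episodic.append(turn)     # the full history survives that eviction

    def learn_fact(self, key, fact):
        self.semantic[key] = fact

    def learn_skill(self, name, fn):
        self.procedural[name] = fn

mem = AgentMemory(short_term_limit=2)
for t in ["t1", "t2", "t3"]:
    mem.observe(t)
print(list(mem.short_term), len(mem.episodic))  # ['t2', 't3'] 3
```

The "memory drift" problem the frameworks fight lives in the promotion step this toy omits: deciding which evicted short-term turns deserve to become semantic facts, and reconciling them when they contradict what is already stored.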
Building a Multimodal Product Similarity Engine for Fashion Retail
The source presents a practical guide to constructing a product similarity engine for fashion retail. It focuses on using multimodal embeddings from text and images to find similar items, a core capability for recommendations and search.
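The core retrieval step reduces to cosine similarity over fused embeddings. A minimal sketch under stated assumptions: the embedding vectors and the 50/50 fusion weight are placeholders, and in practice the inputs would come from a multimodal model such as CLIP rather than random vectors:

```python
import numpy as np

def fuse(text_emb, image_emb, w_text=0.5):
    """Weighted fusion of L2-normalized text and image embeddings into
    one unit-norm product vector."""
    t = text_emb / np.linalg.norm(text_emb)
    i = image_emb / np.linalg.norm(image_emb)
    v = w_text * t + (1 - w_text) * i
    return v / np.linalg.norm(v)

def top_k_similar(query, catalog, k=3):
    """Cosine-similarity ranking over a catalog of fused product vectors
    (rows are unit-norm, so the dot product is the cosine)."""
    sims = catalog @ query
    order = np.argsort(sims)[::-1][:k]
    return order, sims[order]

rng = np.random.default_rng(3)
catalog = np.stack([fuse(rng.standard_normal(8), rng.standard_normal(8))
                    for _ in range(100)])
query = catalog[42]                        # query with a known product
idx, scores = top_k_similar(query, catalog)
print(idx[0], round(float(scores[0]), 3))  # 42 1.0
```

At retail-catalog scale the brute-force `catalog @ query` would be replaced by an approximate nearest-neighbor index, but the fused-vector representation stays the same.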
How Anthropic's Team Uses Skills as Knowledge Containers (And What It Means For Your CLAUDE.md)
Learn how to use Claude Code skills not just for automation but as living knowledge bases, following patterns from Anthropic's own engineering team.
PicoClaw: $10 RISC-V AI Agent Challenges OpenClaw's $599 Mac Mini Requirement
Developers have launched PicoClaw, a $10 RISC-V alternative to OpenClaw that runs on 10MB RAM versus OpenClaw's $599 Mac Mini requirement. The Go-based binary offers the same AI agent capabilities at 1/60th the hardware cost.
Claude Code v2.1.90: /powerup Tutorials, Performance Gains, and Critical Auto Mode Fix
Claude Code v2.1.90 adds interactive tutorials, improves performance for MCP and long sessions, and fixes a critical Auto Mode bug that ignored user boundaries.
Google's AI Infrastructure Strategy: What Retail Leaders Should Watch in 2026
Google's evolving AI infrastructure and compute strategy, including data center investments and model compression techniques, will directly impact how retail brands deploy and scale AI applications by 2026. The company's focus on efficiency and real-time capabilities signals a shift toward more accessible, powerful retail AI tools.
The Axios 1.14.1 Attack: Why Claude Code Users Must Audit Their Lockfiles Now
A compromised version of axios (1.14.1) is a supply chain attack targeting AI-assisted workflows. Check your lockfiles immediately.