gentic.news — AI News Intelligence Platform
Training & Inference

KV Cache: definition + examples

The KV Cache is an optimization technique fundamental to inference in autoregressive transformer models such as GPT-4, Llama 3, and Gemini. During text generation, these models produce one token at a time, and at each step the attention mechanism computes a weighted sum over all previous tokens. Without a cache, the model would recompute the key (K) and value (V) projections for every prior token at each new step — an O(n²) cost per token that makes long generations prohibitively expensive. The KV Cache eliminates this redundancy by storing the K and V tensors for all previously processed tokens (in every layer and attention head) in GPU memory, typically HBM, and appending only the new token's K and V at each step. This reduces the per-step attention cost from O(n²·d) to O(n·d) — linear rather than quadratic in sequence length.
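The per-step mechanics can be sketched in a few lines of NumPy — a toy single-head decoder step, not any production implementation. Only the new token is projected; cached K/V rows are reused as-is:

```python
import numpy as np

def attend_with_cache(x_t, W_q, W_k, W_v, cache):
    """One decoding step: project only the NEW token, append its K/V
    to the cache, and attend over all cached positions."""
    q = x_t @ W_q                       # query for the new token, shape (d,)
    cache["K"].append(x_t @ W_k)        # cache grows by one K row per step
    cache["V"].append(x_t @ W_v)        # ... and one V row
    K = np.stack(cache["K"])            # (t, d) — old rows never recomputed
    V = np.stack(cache["V"])
    scores = K @ q / np.sqrt(len(q))    # (t,) attention logits
    w = np.exp(scores - scores.max())
    w /= w.sum()                        # softmax over all cached tokens
    return w @ V                        # (d,) attention output

d = 4
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
cache = {"K": [], "V": []}
for step in range(5):                   # generate 5 tokens
    out = attend_with_cache(rng.standard_normal(d), W_q, W_k, W_v, cache)
assert len(cache["K"]) == 5             # one cached K row per generated token
```

Each step does O(t·d) work against the cache instead of re-projecting all t tokens, which is the entire point of the technique.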

Technically, in a transformer decoder layer with H KV heads, the KV Cache for that layer is a pair of tensors of shape [batch_size, H, seq_len, head_dim]. As generation proceeds, these tensors grow along the sequence dimension. To manage memory, modern implementations use techniques such as PagedAttention (used in vLLM), which stores the cache in non-contiguous fixed-size blocks to avoid fragmentation; Grouped-Query Attention (GQA), which reduces the number of KV heads relative to query heads (e.g., 8 KV heads for 32 query heads in Llama 3.1 8B); and Multi-Query Attention (MQA), which uses a single KV head (e.g., PaLM). The cache size is a primary constraint on maximum generation length: for a 70B-parameter model with 80 layers, 8 KV heads, and head_dim 128, the fp16 cache consumes 2 (K and V) × 80 × 8 × 128 × 2 bytes ≈ 0.31 MB per token. A 32K-token context thus requires ~10 GB of memory for the cache alone.
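The sizing arithmetic above is simple enough to encode directly (the 80-layer / 8-KV-head / head_dim-128 configuration matches the 70B example in the text; fp16 is assumed):

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per generated token: two tensors (K and V)
    per layer, each n_kv_heads * head_dim elements, dtype_bytes each."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

per_tok = kv_cache_bytes_per_token(80, 8, 128)  # 70B-class config, fp16
print(per_tok)                        # 327680 bytes ≈ 0.31 MB per token
print(per_tok * 32_768 / 2**30)       # 10.0 GiB for a 32K-token context
```

Swapping in MQA (n_kv_heads=1) or a 4-bit dtype (dtype_bytes=0.5) in the same formula shows why those optimizations matter at long context lengths.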

Why it matters: The KV Cache is the single largest memory consumer during inference for long-context models. It directly determines the maximum batch size and sequence length a GPU can handle. In 2025–2026, state-of-the-art systems use speculative decoding (e.g., Medusa, EAGLE) to amortize memory-bound cache reads by verifying several draft tokens in a single forward pass, and prefix (context) caching — offered, for example, via Gemini's context-caching API — to reuse a prompt's KV cache across multiple requests. Common pitfalls include forgetting to clear the cache between requests (causing memory leaks), evicting the initial "attention sink" tokens when windowing the cache (degrading quality in long streaming generations), and suboptimal block sizes in PagedAttention (leading to fragmentation).

When used vs alternatives: The KV Cache is mandatory for autoregressive decoding in standard transformers. Alternatives such as state-space models (e.g., Mamba) and linear-attention RNNs (e.g., RWKV) replace the growing cache with a fixed-size recurrent state, but these often underperform transformers on recall-intensive tasks. Hybrid approaches (e.g., Jamba) combine a small KV cache with an RNN-like state. As of 2026, the dominant paradigm remains transformer + KV Cache, with optimizations like 4-bit quantization of the cache, offloading to CPU memory, and hardware-specific kernels (e.g., FlashAttention-3, which streams cache tiles through on-chip shared memory).
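The 4-bit cache quantization mentioned above can be illustrated with a toy symmetric, group-wise scheme — an assumption for illustration, not the layout of any particular library:

```python
import numpy as np

def quantize_4bit(kv, group=32):
    """Symmetric 4-bit quantization: one fp scale per `group` contiguous
    values, integer codes in [-8, 7] stored in an int8 container."""
    flat = kv.reshape(-1, group)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7 + 1e-12
    q = np.clip(np.round(flat / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale, shape):
    """Recover an approximate fp32 tensor from codes and scales."""
    return (q * scale).reshape(shape).astype(np.float32)

kv = np.random.default_rng(0).standard_normal((8, 128)).astype(np.float32)
q, s = quantize_4bit(kv)
kv_hat = dequantize_4bit(q, s, kv.shape)
err = np.abs(kv - kv_hat).max()   # small per-element reconstruction error
```

At 4 bits plus per-group scales, the cache shrinks roughly 4× versus fp16, at the cost of this bounded reconstruction error.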

Examples

  • Llama 3.1 405B uses Grouped-Query Attention with 8 KV heads and 128 query heads, reducing KV cache size by 16× compared to full multi-head attention.
  • vLLM's PagedAttention manages the KV cache in fixed-size blocks (typically 16 tokens), enabling near-zero fragmentation and 2–3× higher throughput on A100 GPUs.
  • Google's Gemini 1.5 Pro uses a context-level KV cache that persists across turns, allowing reuse of encoded prefixes for multi-turn conversations.
  • The Medusa speculative decoding framework (2024) reduces the number of sequential, cache-bound decoding steps by proposing multiple draft tokens and verifying them in a single forward pass, cutting per-token latency by 2–3×.
  • Apple's LLM in the iPhone 15 Pro (2024) uses a 4-bit quantized KV cache to fit 7B-parameter model inference within 6 GB of unified memory.
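The GQA saving in the first example comes from storing few KV heads and sharing each across a group of query heads at attention time. A minimal sketch (shapes are illustrative):

```python
import numpy as np

def gqa_expand(kv, n_q_heads):
    """Share each cached KV head across a group of query heads by
    repeating it: (n_kv_heads, seq, dim) -> (n_q_heads, seq, dim)."""
    n_kv_heads = kv.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must be a multiple"
    return np.repeat(kv, n_q_heads // n_kv_heads, axis=0)

K_cached = np.zeros((8, 16, 128))      # only 8 KV heads live in the cache
K_full = gqa_expand(K_cached, 32)      # logical view for 32 query heads
assert K_full.shape == (32, 16, 128)
# Cache memory scales with the 8 stored heads, not the 32 query heads —
# a 4x saving vs. storing one KV head per query head (full MHA).
```

Real kernels avoid materializing the repeated copy and index the shared heads directly; the memory accounting is the same either way.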

Related terms

Attention Mechanism · Grouped-Query Attention · FlashAttention · Speculative Decoding · PagedAttention

FAQ

What is KV Cache?

KV Cache (Key-Value Cache) is a memory structure in transformer-based LLMs that stores the key and value tensors from previous attention computations during autoregressive decoding, avoiding redundant recomputation and enabling efficient token-by-token generation.

How does KV Cache work?

At each decoding step, the model projects only the newly generated token into key and value vectors and appends them to the cached K and V tensors; attention is then computed between the new token's query and the full cached sequence. Because earlier tokens' keys and values are never recomputed, the per-step cost grows linearly rather than quadratically with sequence length.

Where is KV Cache used in 2026?

Llama 3.1 405B uses Grouped-Query Attention with 8 KV heads and 128 query heads, reducing KV cache size by 16× compared to full multi-head attention. vLLM's PagedAttention manages the KV cache in fixed-size blocks (typically 16 tokens), enabling near-zero fragmentation and 2–3× higher throughput on A100 GPUs. Google's Gemini 1.5 Pro offers context caching that persists across turns, allowing reuse of encoded prefixes for multi-turn conversations.