The KV Cache is an optimization technique fundamental to inference in autoregressive transformer models such as GPT-4, Llama 3, and Gemini. During text generation, these models produce one token at a time, and at each step the attention mechanism computes a weighted sum over all previous tokens. Without a cache, the model would recompute the key (K) and value (V) projections for every prior token at each new step, making each decoding step roughly O(n²·d) in the current sequence length n and model dimension d, which renders long generations prohibitively expensive. The KV Cache eliminates this redundancy by storing, in GPU memory (typically HBM), the K and V tensors for all previously processed tokens at every layer and attention head, and appending only the new token's K and V at each step. This reduces the per-step cost of attention from O(n²·d) to O(n·d), i.e., from quadratic to linear in sequence length.
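To make the mechanism concrete, here is a minimal single-head sketch in PyTorch; names such as `W_q`, `W_k`, `W_v`, and `d_model` are illustrative and not tied to any particular model. Each decoding step projects only the new token, appends its K and V to the cache, and attends over the cached tensors.

```python
# Minimal sketch of single-head attention decoding with a KV cache (illustrative only).
import torch

torch.manual_seed(0)
d_model = 64
W_q = torch.randn(d_model, d_model) / d_model**0.5
W_k = torch.randn(d_model, d_model) / d_model**0.5
W_v = torch.randn(d_model, d_model) / d_model**0.5

k_cache = torch.empty(0, d_model)  # grows to [seq_len, d_model]
v_cache = torch.empty(0, d_model)

def decode_step(x_new, k_cache, v_cache):
    """x_new: [1, d_model] embedding of the newly generated token."""
    q = x_new @ W_q                              # only the new token needs a query
    k_cache = torch.cat([k_cache, x_new @ W_k])  # append the new K instead of recomputing the prefix
    v_cache = torch.cat([v_cache, x_new @ W_v])  # append the new V
    scores = (q @ k_cache.T) / d_model**0.5      # [1, seq_len]: O(n·d) work per step
    attn = torch.softmax(scores, dim=-1)
    out = attn @ v_cache                         # weighted sum over all cached values
    return out, k_cache, v_cache

# Each step touches the cache once rather than re-projecting every prior token.
for _ in range(5):
    x_new = torch.randn(1, d_model)
    out, k_cache, v_cache = decode_step(x_new, k_cache, v_cache)
print(k_cache.shape)  # torch.Size([5, 64])
```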
Technically, in a transformer decoder layer with H attention heads, the KV Cache for that layer is a pair of tensors of shape [batch_size, H, seq_len, head_dim]. As generation proceeds, these tensors grow along the sequence dimension. To manage memory, modern implementations use techniques such as PagedAttention (used in vLLM), which stores the cache in non-contiguous fixed-size blocks to avoid fragmentation; GQA (Grouped-Query Attention), which reduces the number of KV heads relative to query heads (e.g., 8 KV heads for 32 query heads in Llama 3 8B); and MQA (Multi-Query Attention), which uses a single KV head (e.g., PaLM). The cache size is a primary constraint on maximum generation length: for a 70B-parameter model with 80 layers, 8 KV heads, and head_dim 128, the FP16 cache consumes roughly 2 (K and V) * 80 * 8 * 128 * 2 bytes ≈ 320 KB per token. A 32K-token context thus requires roughly 10 GB of memory for the cache alone, per sequence in the batch.
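This arithmetic can be checked in a few lines of plain Python; the helper name `kv_cache_bytes_per_token`, the FP16 (2-byte) assumption, and the 64-head MHA comparison are illustrative:

```python
# Back-of-the-envelope KV-cache sizing for the 70B-class configuration quoted above.
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Leading factor of 2 accounts for storing both K and V.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

per_token = kv_cache_bytes_per_token(80, 8, 128)        # GQA: 8 KV heads
per_token_mha = kv_cache_bytes_per_token(80, 64, 128)   # hypothetical full MHA with 64 heads

print(per_token // 1024, "KiB per token")                              # 320
print(round(per_token * 32_768 / 2**30, 1), "GiB for 32K tokens")      # 10.0 (per sequence)
print(per_token_mha // per_token, "x larger without GQA")              # 8
```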
Why it matters: The KV Cache is typically the single largest memory consumer during inference for long-context models, and it directly determines the maximum batch size and sequence length a GPU can handle. In 2025–2026, state-of-the-art serving systems combine it with speculative decoding (e.g., Medusa, EAGLE), which verifies several draft tokens per forward pass to cut the number of sequential decoding steps, and with prefix (prompt) caching (as in vLLM and SGLang), which reuses cache entries across requests that share a common prompt prefix. Common pitfalls include forgetting to free or reset the cache between requests (causing memory leaks), evicting the initial "attention sink" tokens when the cache is truncated in streaming settings (degrading quality on long contexts), and poorly chosen block sizes in PagedAttention (leading to internal fragmentation and wasted memory).
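As a rough illustration of how the cache caps batch size, the sketch below reuses the per-token figure derived above; the 40 GiB cache budget is a hypothetical number chosen for illustration, not a measurement:

```python
# How many concurrent 32K-token sequences fit in a given KV-cache memory budget?
BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2   # per-token cache size from above (~320 KiB)
CACHE_BUDGET = 40 * 2**30                # hypothetical: 40 GiB left after weights and activations
MAX_SEQ_LEN = 32_768

max_batch = CACHE_BUDGET // (BYTES_PER_TOKEN * MAX_SEQ_LEN)
print(max_batch)  # 4 -> only four concurrent 32K-token requests fit in this budget
```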
When used vs alternatives: The KV Cache is effectively mandatory for autoregressive transformer decoding. Alternatives such as state-space models and linear-attention recurrent architectures (e.g., Mamba, RWKV) replace the growing cache with a fixed-size recurrent state, but these often underperform transformers on recall-intensive tasks. Hybrid approaches (e.g., Jamba) interleave attention layers with recurrent layers, keeping only a small KV cache alongside the recurrent state. As of 2026, the dominant paradigm remains transformer + KV Cache, with optimizations such as 4-bit quantization of the cache, offloading cache blocks to CPU memory, and hardware-specific kernels (e.g., FlashAttention-3, which streams K/V tiles through on-chip shared memory rather than materializing the full attention matrix).
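As a rough illustration of what cache quantization involves, here is a sketch of group-wise symmetric 4-bit quantization of a cached K tensor in PyTorch; the group size of 32, the int8 storage of the 4-bit codes, and the function names are simplifying assumptions, not any library's actual format:

```python
# Minimal sketch of group-wise symmetric 4-bit quantization of a cached K tensor (illustrative).
import torch

def quantize_4bit(x, group_size=32):
    """x: [..., head_dim]; returns 4-bit codes (stored in int8) plus per-group scales."""
    orig_shape = x.shape
    x = x.reshape(-1, group_size)
    scale = (x.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)  # symmetric range [-7, 7]
    q = torch.clamp(torch.round(x / scale), -7, 7).to(torch.int8)
    return q.reshape(orig_shape), scale

def dequantize_4bit(q, scale, group_size=32):
    orig_shape = q.shape
    x = q.reshape(-1, group_size).float() * scale
    return x.reshape(orig_shape)

k = torch.randn(1, 8, 1024, 128)   # [batch, kv_heads, seq_len, head_dim]
codes, scale = quantize_4bit(k)
k_hat = dequantize_4bit(codes, scale)
print((k - k_hat).abs().max())     # small reconstruction error; ~4x memory saving once codes are packed
```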