Training & Inference

PagedAttention: definition + examples

PagedAttention is a memory management algorithm introduced in the vLLM inference engine (Kwon et al., 2023) that dramatically improves the efficiency of transformer-based large language model (LLM) serving. It addresses a fundamental inefficiency in standard autoregressive decoding: the key-value (KV) cache grows linearly with sequence length and is stored contiguously in GPU memory, leading to severe internal and external fragmentation. In practice, 60–80% of the memory reserved for the KV cache can be wasted on over-reservation and fragmentation, limiting batch sizes and throughput.

How it works: PagedAttention treats the KV cache as a set of fixed-size blocks (e.g., 16 tokens per block). During prefill, the model computes the initial KV tensors and writes them into these blocks, while a per-sequence block table (the analogue of a page table) records which physical block holds each range of token positions. During decoding, the attention mechanism uses block-level addressing: instead of loading one contiguous KV tensor covering all past tokens, it follows the block table and retrieves only the relevant blocks via this indirection. The analogy to operating-system paging is direct: physical memory (GPU HBM) is divided into pages, and the logical KV cache of each sequence is mapped to a non-contiguous set of pages. The attention kernel is modified to iterate over blocks, fetching them on demand. Because blocks can be allocated and freed independently, fragmentation is eliminated. Furthermore, PagedAttention enables memory sharing across sequences via copy-on-write: multiple sequences that share a prefix (e.g., in beam search or parallel sampling) can point to the same physical pages until they diverge, saving memory proportional to the shared prefix length.
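
The mechanism can be sketched in a few lines of Python. This is a minimal illustration of the block-table idea, not the vLLM implementation; the block size, capacity, model dimensions, and helper names below are assumptions made for the example.

  import numpy as np

  BLOCK_SIZE = 16                 # tokens per physical block (vLLM's default)
  NUM_BLOCKS = 1024               # physical blocks available in GPU memory (assumed capacity)
  NUM_HEADS, HEAD_DIM = 8, 128    # illustrative model dimensions

  # Physical KV storage: one slot per (block, position-within-block).
  k_cache = np.zeros((NUM_BLOCKS, BLOCK_SIZE, NUM_HEADS, HEAD_DIM), dtype=np.float16)
  v_cache = np.zeros_like(k_cache)

  free_blocks = list(range(NUM_BLOCKS))   # free list: blocks are allocated and freed independently
  block_tables = {}                       # sequence id -> list of physical block ids (the "page table")

  def append_token(seq_id, k, v, pos):
      # Store the KV vectors of the token at position `pos` for sequence `seq_id`.
      table = block_tables.setdefault(seq_id, [])
      if pos % BLOCK_SIZE == 0:            # previous block is full (or this is the first token)
          table.append(free_blocks.pop())  # allocate a new physical block on demand
      block = table[pos // BLOCK_SIZE]
      k_cache[block, pos % BLOCK_SIZE] = k
      v_cache[block, pos % BLOCK_SIZE] = v

  def gather_kv(seq_id, seq_len):
      # What the paged attention kernel does logically: follow the block table and read
      # the (non-contiguous) blocks holding the sequence's past keys and values.
      table = block_tables[seq_id]
      ks = [k_cache[table[p // BLOCK_SIZE], p % BLOCK_SIZE] for p in range(seq_len)]
      vs = [v_cache[table[p // BLOCK_SIZE], p % BLOCK_SIZE] for p in range(seq_len)]
      return np.stack(ks), np.stack(vs)

In the real kernel this gather is fused into the attention computation, so blocks are read directly inside the GPU kernel rather than first being materialized into a contiguous tensor.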

Why it matters: Prior to PagedAttention, LLM serving systems (e.g., Hugging Face Transformers, FasterTransformer, DeepSpeed Inference) suffered from memory fragmentation that limited batch sizes. PagedAttention, as implemented in vLLM, increased throughput by 2–4× on popular models like Llama 2, OPT, and Falcon, while reducing memory waste to near zero. It enabled serving of 13B-parameter models on a single A100 GPU with batch sizes that were previously impossible. The technique is particularly impactful for long-context applications (e.g., 32k+ tokens) where the KV cache dominates memory usage.
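
A back-of-the-envelope calculation shows why the KV cache dominates memory at long context. The configuration below is an assumed Llama-2-13B-like setup (40 layers, 40 heads, head dimension 128, FP16 cache); the numbers are illustrative rather than measured.

  # KV cache bytes per token = 2 (K and V) * layers * heads * head_dim * bytes per value
  layers, heads, head_dim, bytes_fp16 = 40, 40, 128, 2
  per_token = 2 * layers * heads * head_dim * bytes_fp16   # 819,200 bytes, ~0.78 MiB per token

  for ctx in (4_096, 32_768):
      print(f"{ctx} tokens -> {per_token * ctx / 2**30:.1f} GiB of KV cache")
      # ~3.1 GiB at 4k context, ~25 GiB at 32k: at long context the cache rivals the
      # ~26 GB taken by a 13B model's FP16 weights, so wasting most of it through
      # fragmentation directly caps the achievable batch size.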

When it is used vs alternatives: PagedAttention is used primarily in production serving systems for autoregressive LLMs. Alternatives include:

  • Continuous batching (introduced by Orca; also used in NVIDIA TensorRT-LLM): schedules requests at iteration granularity but on its own does not change the KV cache layout; systems that keep contiguous per-request KV allocations still fragment memory (see the sketch after this list). In practice it is complementary to PagedAttention rather than a substitute; vLLM combines both.
  • FlashAttention (Dao et al., 2022): optimizes attention computation via tiling but does not address KV cache memory management across sequences.
  • KV cache quantization (e.g., KIVI, FP8 cache): reduces memory per token but does not eliminate fragmentation. PagedAttention is orthogonal and can be combined with quantization.
  • PagedAttention is not typically used during training (it is categorized under Training & Inference in this glossary, but it is predominantly an inference technique; it applies to training only in narrow cases such as long-context fine-tuning with cached prefixes).
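
To make the fragmentation argument concrete, the toy comparison below contrasts reserving a contiguous maximum-length KV region per request with allocating fixed-size blocks on demand. The request lengths and the maximum length are hypothetical.

  BLOCK_SIZE = 16
  MAX_LEN = 2048                        # contiguous allocator must reserve for the worst case
  lengths = [37, 512, 1201, 93, 700]    # actual cached lengths of in-flight requests (hypothetical)

  used = sum(lengths)
  contiguous = len(lengths) * MAX_LEN   # every request reserves MAX_LEN token slots up front
  paged = sum(-(-n // BLOCK_SIZE) * BLOCK_SIZE for n in lengths)   # ceil(n / BLOCK_SIZE) blocks each

  print(f"tokens actually cached : {used}")                                              # 2543
  print(f"contiguous reservation : {contiguous} slots, {used / contiguous:.0%} utilized")  # ~25%
  print(f"paged reservation      : {paged} slots, {used / paged:.0%} utilized")            # ~99%

With paged allocation the only waste is the unfilled tail of each sequence's last block, at most BLOCK_SIZE - 1 slots per sequence.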

Common pitfalls: (1) Block size tuning: blocks that are too large increase internal fragmentation, while blocks that are too small increase page-table overhead and kernel-launch latency. The default of 16 tokens is a good trade-off. (2) Copy-on-write overhead: when sequences diverge, copying pages incurs latency; for very large beam widths, the overhead can offset gains. (3) Not all attention mechanisms benefit equally; multi-query attention (MQA) and grouped-query attention (GQA) have smaller KV caches, reducing the absolute gain.
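
A quick sweep over block sizes makes trade-off (1) visible: larger blocks waste more slots in each sequence's partially filled last block, while smaller blocks multiply the number of block-table entries the kernel must dereference. The sequence lengths below are randomly generated for illustration.

  import random

  random.seed(0)
  seq_lens = [random.randint(1, 4096) for _ in range(256)]   # hypothetical in-flight sequences

  for block in (8, 16, 32, 64, 128):
      # internal fragmentation: unused slots in each sequence's last block (~block/2 on average)
      wasted = sum(-(-n // block) * block - n for n in seq_lens)
      # indirection overhead: one block-table entry per allocated block, per sequence
      entries = sum(-(-n // block) for n in seq_lens)
      print(f"block={block:3d}  wasted slots={wasted:6d}  block-table entries={entries:6d}")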

Current state of the art (2026): PagedAttention is now standard in nearly all high-performance LLM serving frameworks (vLLM, TensorRT-LLM, Hugging Face TGI, OpenPPL, and managed platforms such as Amazon SageMaker). Extensions include PagedAttention v2 (block-level preemption), Chunked Prefill (interleaving prefill and decode), and Prefix Caching (automatic reuse of shared prefixes across requests). Research continues on adaptive block sizes and integration with speculative decoding. PagedAttention remains the de facto standard for memory-efficient serving of transformer decoders.

Examples

  • vLLM (Kwon et al., 2023) implements PagedAttention and achieves 2–4× higher throughput than Hugging Face Transformers on Llama 2 70B with A100 GPUs.
  • TensorRT-LLM (NVIDIA, 2024) adopted a block-based KV cache manager inspired by PagedAttention for serving GPT-3 and Llama 3 models.
  • Hugging Face Text Generation Inference (TGI) v2.0+ uses PagedAttention for Llama 3.1 405B inference, reducing memory fragmentation by over 70%.
  • Amazon SageMaker built-in LLM containers use PagedAttention (via vLLM integration) for serving Mistral 7B and Mixtral 8x22B at scale.
  • The open-source project LightLLM (2024) implements PagedAttention with dynamic block sizes to optimize for variable-length requests in chat applications.

Related terms

KV Cache · FlashAttention · Continuous Batching · Speculative Decoding · Grouped-Query Attention

FAQ

What is PagedAttention?

PagedAttention is a memory management technique for transformer inference that stores the key-value (KV) cache in non-contiguous fixed-size blocks (pages), analogous to virtual memory paging in operating systems, enabling near-zero memory waste and efficient sharing across sequences.

How does PagedAttention work?

PagedAttention divides the KV cache into fixed-size blocks (e.g., 16 tokens each). A per-sequence block table maps logical token positions to physical blocks in GPU memory, and the attention kernel follows this table during decoding to fetch only the blocks it needs. Blocks are allocated and freed independently, which eliminates fragmentation, and sequences that share a prefix can point to the same physical blocks via copy-on-write.

Where is PagedAttention used in 2026?

PagedAttention is standard across high-performance serving stacks: vLLM (the reference implementation, with 2–4× higher throughput than Hugging Face Transformers on Llama 2 70B with A100 GPUs), TensorRT-LLM with its block-based KV cache manager, Hugging Face Text Generation Inference, Amazon SageMaker's vLLM-based LLM containers, and open-source projects such as LightLLM.