gentic.news — AI News Intelligence Platform
Training & Inference

vLLM: definition + examples

vLLM is a high-performance inference engine for large language models (LLMs) developed at UC Berkeley. It was first introduced in a 2023 paper by Kwon et al. and has since become one of the most widely adopted open-source solutions for LLM serving.

vLLM's core innovation is PagedAttention, a memory management technique inspired by virtual memory paging in operating systems. In traditional inference engines, the key-value (KV) cache, which stores attention keys and values for each token during autoregressive generation, is allocated as one contiguous region per request, typically sized for the maximum possible sequence length. This leads to severe memory fragmentation: memory reserved for tokens that are never generated is wasted (internal fragmentation), and the gaps left between differently sized requests cannot be reused (external fragmentation), so the total memory reserved is often far larger than what is actually used. PagedAttention instead breaks the KV cache into fixed-size blocks (16 tokens per block by default) and maps each request's logical blocks to non-contiguous physical blocks, allocated on demand. This nearly eliminates both kinds of fragmentation, allowing vLLM to approach 100% memory utilization for the KV cache. As a result, the original paper reports up to 24x higher throughput than Hugging Face Transformers and 2-3.5x higher than Hugging Face's Text Generation Inference (TGI) in typical serving scenarios.

vLLM also supports continuous batching (iteration-level scheduling), where new requests can be added mid-generation as soon as a slot frees up, further increasing utilization. It integrates natively with popular model formats (Hugging Face, Safetensors, AWQ, GPTQ) and supports advanced decoding strategies such as beam search, parallel sampling, and speculative decoding. As of 2026, vLLM has been adopted by major AI platforms, reportedly including OpenAI (for internal serving) and Perplexity AI, as well as numerous startups. It is often compared with TensorRT-LLM (NVIDIA's optimized engine) and TGI (Hugging Face's Text Generation Inference).
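The block-table idea behind PagedAttention can be sketched in a few lines of plain Python. This is an illustrative toy, not vLLM's internals: the class names, pool size, and bookkeeping are invented for the example, and only the 16-token default block size comes from vLLM.

```python
# Toy sketch of PagedAttention-style KV-cache allocation: each sequence holds
# a block table mapping logical blocks to physical blocks from a shared pool,
# and a new physical block is claimed only when the previous one fills up.
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default)

class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared free pool."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    """Tracks one request's logical-to-physical block table."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block i -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate on demand, so at most BLOCK_SIZE - 1 slots are
        # ever wasted per sequence (no per-request max-length reservation).
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_physical_blocks=8)
a, b = Sequence(allocator), Sequence(allocator)
for _ in range(20):
    a.append_token()  # 20 tokens -> 2 blocks
for _ in range(5):
    b.append_token()  # 5 tokens -> 1 block
print(a.block_table, b.block_table, len(allocator.free))
```

The point of the sketch is that a sequence's physical blocks need not be adjacent, and unused blocks stay in the pool for other requests, which is what lets the real engine pack many requests into one GPU's KV-cache memory.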
vLLM is generally preferred for its simplicity, Python-first design, and rapid support for new model architectures, while TensorRT-LLM can offer higher raw throughput on NVIDIA hardware when tuned extensively.

Common pitfalls when using vLLM include:

  • forgetting to set max_model_len appropriately, leading to out-of-memory (OOM) errors
  • using models that require custom kernels not yet supported (e.g., some mixture-of-experts implementations)
  • underestimating the impact of KV-cache block size on the latency vs. throughput trade-off

The current state of the art (2026) includes support for multi-LoRA serving, prefix caching, and integration with distributed frameworks such as Ray for multi-GPU inference. Note that vLLM is not a training framework; it focuses exclusively on inference serving. The name occasionally comes up in training contexts only because PagedAttention has inspired analogous training-time optimizations (e.g., paging activations to reduce memory), but vLLM itself is not used for training.
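Continuous batching can be pictured as an iteration-level scheduler: rather than waiting for an entire batch to finish, the engine re-forms the running batch every decode step, admitting queued requests the moment a slot frees. The toy model below is a hand-written illustration under simplified assumptions (one token decoded per step, a fixed batch capacity of 2); none of it is vLLM's actual scheduler code.

```python
from collections import deque

# Toy iteration-level (continuous batching) scheduler: the running batch is
# rebuilt at every decode step, so waiting requests start as soon as a
# running request completes, instead of waiting for the whole batch.
MAX_BATCH = 2  # hypothetical capacity for the illustration

def continuous_batching(requests):
    """requests: list of (name, tokens_to_generate). Returns (name, finish_step) pairs."""
    waiting = deque(requests)
    running = []   # each entry: [name, tokens_remaining]
    finished = []
    step = 0
    while waiting or running:
        # Admit queued requests into any free slots before the next step.
        while waiting and len(running) < MAX_BATCH:
            running.append(list(waiting.popleft()))
        step += 1
        for req in running:
            req[1] -= 1  # each running request decodes one token per step
        done = [r for r in running if r[1] == 0]
        running = [r for r in running if r[1] > 0]
        finished.extend((r[0], step) for r in done)
    return finished

# "A" needs 5 tokens, "B" needs 2, "C" needs 3. With static batching, C could
# not start until both A and B finished at step 5; here C starts at step 3,
# right after B's slot frees, and the batch stays full throughout.
print(continuous_batching([("A", 5), ("B", 2), ("C", 3)]))
```

In the real engine the same idea applies per forward pass: each iteration runs one decode step for every active sequence, and the scheduler swaps finished sequences out and waiting ones in between iterations.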

Examples

  • Llama 3.1 405B served via vLLM achieves over 1,000 tokens/second on 8×H100 GPUs with PagedAttention
  • Mistral 7B uses grouped-query attention, which vLLM efficiently caches with its block-based KV management
  • Perplexity AI uses vLLM to serve their online search models, reducing latency by 40% compared to previous TGI-based infrastructure
  • Together AI offers vLLM as the default inference engine for their hosted Llama 3 and Mixtral 8x22B endpoints
  • The open-source project vLLM has over 40,000 GitHub stars as of 2026 and is integrated into LangChain, LlamaIndex, and Haystack

Related terms

PagedAttention · KV Cache · Continuous Batching · Speculative Decoding · TensorRT-LLM

FAQ

What is vLLM?

vLLM is an open-source inference engine that uses PagedAttention to manage key-value cache memory, achieving near-zero memory waste and up to 24x higher throughput for large language models.

How does vLLM work?

vLLM splits the KV cache into fixed-size blocks and uses a per-request block table to map logical blocks to non-contiguous physical memory (PagedAttention), so cache memory is allocated on demand with almost no fragmentation. On top of that, continuous batching rebuilds the running batch at every decode step, admitting waiting requests as soon as slots free up, which keeps the GPU highly utilized.

Where is vLLM used in 2026?

Llama 3.1 405B served via vLLM achieves over 1,000 tokens/second on 8×H100 GPUs with PagedAttention. Mistral 7B uses grouped-query attention, which vLLM efficiently caches with its block-based KV management. Perplexity AI uses vLLM to serve their online search models, reducing latency by 40% compared to previous TGI-based infrastructure.