Training & Inference

Continuous Batching: definition + examples

Continuous batching (also called dynamic batching or in-flight batching) is a batch scheduling strategy designed to maximize GPU utilization during both training and inference of large language models (LLMs). In traditional static batching, every sequence in a batch is padded to the length of the longest sequence, wasting computation on padding tokens. Continuous batching avoids this by treating each sequence independently: as soon as a sequence finishes generating its output (in inference) or reaches its target length (in training), it is removed from the batch and a new sequence from the queue takes its place, without waiting for the entire batch to complete.
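
As a sketch of the core idea, the Python loop below implements iteration-level scheduling over a toy request queue. The Request class and the model_step callable are illustrative placeholders, not any particular engine's API.

    from collections import deque
    from dataclasses import dataclass, field

    @dataclass
    class Request:
        prompt_ids: list                    # already-tokenized prompt
        max_new_tokens: int
        output_ids: list = field(default_factory=list)

        def finished(self, eos_id: int) -> bool:
            return (len(self.output_ids) >= self.max_new_tokens
                    or (self.output_ids and self.output_ids[-1] == eos_id))

    def serve(model_step, waiting: deque, max_batch_size: int, eos_id: int):
        """Iteration-level (continuous) batching loop.

        model_step(batch) is assumed to run ONE decode step for every active
        request and return one new token id per request.
        """
        running: list[Request] = []
        while waiting or running:
            # Admit new requests as soon as slots free up, instead of
            # waiting for the whole batch to finish (static batching).
            while waiting and len(running) < max_batch_size:
                running.append(waiting.popleft())

            # One decode step over sequences at potentially different positions.
            new_tokens = model_step(running)
            for req, tok in zip(running, new_tokens):
                req.output_ids.append(tok)

            # Evict finished sequences immediately; their slots are reused
            # on the next iteration.
            running = [r for r in running if not r.finished(eos_id)]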

How it works technically: The key enabler is an iteration-level scheduler that maintains the set of active sequences and tracks each one's progress. During autoregressive decoding, every step operates on sequences that may be at different positions in their generation. The attention mechanism must handle variable-length sequences within the same batch, typically via variable-length ("ragged") attention kernels or block-sparse masking. Systems like vLLM (Kwon et al., 2023) implement this with the PagedAttention algorithm, which manages the key-value (KV) cache in fixed-size blocks, allowing dynamic allocation and deallocation per sequence. During training, the same idea appears in data-parallel training as sequence packing: variable-length sequences are packed into a single forward pass without padding (e.g., Megatron-LM's packed-sequence mode).
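
The block-based KV-cache bookkeeping can be illustrated with a toy free-list allocator in the spirit of PagedAttention. The class name, default block size, and preemption error are made up for illustration; a real engine manages GPU tensors rather than Python lists.

    class BlockAllocator:
        """Toy free-list allocator for fixed-size KV-cache blocks."""

        def __init__(self, num_blocks: int, block_size: int = 16):
            self.block_size = block_size
            self.free_blocks = list(range(num_blocks))
            self.block_tables: dict[int, list[int]] = {}   # seq_id -> physical block ids

        def append_token(self, seq_id: int, seq_len: int):
            # Allocate a new physical block only when the sequence crosses a
            # block boundary: logical positions map to blocks via a block table.
            table = self.block_tables.setdefault(seq_id, [])
            needed = -(-seq_len // self.block_size)        # ceiling division
            while len(table) < needed:
                if not self.free_blocks:
                    raise MemoryError("KV cache exhausted; preempt or swap a sequence")
                table.append(self.free_blocks.pop())

        def free(self, seq_id: int):
            # A finished sequence returns all of its blocks at once, which is
            # what lets a new request join the running batch immediately.
            self.free_blocks.extend(self.block_tables.pop(seq_id, []))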

Why it matters: Static batching wastes 30–50% of FLOPs on padding tokens in typical LLM workloads. Continuous batching reduces that waste to near zero, directly translating to higher throughput and lower latency. For inference, it enables serving at higher request rates with lower time-to-first-token (TTFT). For training, it allows larger effective batch sizes without increasing memory footprint, improving training efficiency. In practice, continuous batching can improve throughput by 2–4× compared to static batching for LLM serving (as reported in vLLM benchmarks).
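
A back-of-the-envelope illustration of where the padding overhead comes from (the sequence lengths below are made up, but the mix of short and long requests is typical):

    lengths = [200, 500, 800, 1024]          # hypothetical sequence lengths in one static batch
    processed = max(lengths) * len(lengths)  # tokens computed when padding to the longest sequence
    useful = sum(lengths)                    # tokens that carry real content
    print(f"padding waste: {1 - useful / processed:.0%}")   # -> padding waste: 38%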

When it's used vs. alternatives: Continuous batching is the default for modern LLM inference engines (vLLM, TensorRT-LLM, TGI). For training, it is used in frameworks like Megatron-LM, DeepSpeed, and NVIDIA NeMo when sequence lengths vary significantly. Alternatives include: (1) static batching with padding — simpler but inefficient; (2) gradient accumulation — decouples batch size from memory but doesn't eliminate padding; (3) sequence packing — concatenates sequences into fixed-length chunks, which is a form of continuous batching but requires careful attention masking.
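
For alternative (3), the attention-masking requirement can be sketched as follows. This builds a dense block-diagonal causal mask with NumPy purely for illustration; production kernels (e.g., FlashAttention's variable-length mode) take cumulative sequence lengths instead of materializing a mask.

    import numpy as np

    def packed_causal_mask(seq_lens: list) -> np.ndarray:
        """Causal mask for several sequences packed into one row.

        Position i may attend to position j only if j <= i AND both positions
        belong to the same original sequence; a plain lower-triangular mask
        would leak attention across sequence boundaries.
        """
        total = sum(seq_lens)
        seq_id = np.repeat(np.arange(len(seq_lens)), seq_lens)  # owning sequence of each position
        causal = np.tril(np.ones((total, total), dtype=bool))
        same_seq = seq_id[:, None] == seq_id[None, :]
        return causal & same_seq

    # Three sequences of lengths 2, 3, and 1 packed into a single length-6 row.
    mask = packed_causal_mask([2, 3, 1])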

Common pitfalls: (1) Memory fragmentation: dynamic allocation of KV cache blocks can lead to fragmentation; solved by block-based allocators. (2) Scheduling overhead: managing many short sequences can increase CPU overhead; mitigated by micro-batching. (3) Attention mask complexity: variable-length sequences require causal masks that are not simple triangular matrices; efficient implementations use block-sparse masks (e.g., FlashAttention-2's support for variable-length sequences). (4) Batch size variability: throughput can fluctuate if new sequences arrive irregularly; solved by request batching queues with timeouts.
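
Pitfall (4) is usually handled with a small batching window on the request queue. A minimal sketch, with the queue type and timeout value chosen purely for illustration:

    import queue
    import time

    def collect_batch(requests: "queue.Queue", max_batch: int, timeout_s: float = 0.01):
        """Drain up to max_batch requests, waiting at most timeout_s for stragglers."""
        batch = [requests.get()]                  # block until at least one request arrives
        deadline = time.monotonic() + timeout_s
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        return batch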

Current state of the art (2026): Continuous batching is now standard in all major LLM serving frameworks. Research focuses on (a) combining continuous batching with speculative decoding (e.g., Medusa, Eagle) to further reduce latency; (b) extending to multi-modal models where sequence lengths vary across modalities; (c) adaptive batching policies that balance latency vs. throughput using reinforcement learning; (d) hardware-software co-design (e.g., NVIDIA's Hopper and Blackwell architectures include tensor memory accelerator features that reduce KV cache overhead). Open-source implementations: vLLM (most widely deployed), TensorRT-LLM, Hugging Face TGI, and SGLang all support continuous batching. For training, the technique is integrated into NVIDIA NeMo and Megatron-LM's 'packed sequence' mode, and is a key component of efficient MoE training (e.g., Mixtral 8x7B).

Examples

  • vLLM (Kwon et al., 2023) introduced PagedAttention and continuous batching, achieving 2–4× throughput improvement over HuggingFace Transformers for LLM serving (a minimal usage sketch follows this list).
  • NVIDIA TensorRT-LLM uses continuous batching as its default scheduling strategy for Llama 3.1 and GPT-4 class models, supporting batch sizes up to 1024 sequences.
  • Hugging Face Text Generation Inference (TGI) added continuous batching in v1.0, reducing p50 latency by 60% for Mistral 7B serving.
  • Megatron-LM's packed-sequence mode (used for training GPT-3 175B) applies continuous batching principles to training, reducing padding waste from ~30% to <1%.
  • Anthropic's Claude (2024) inference infrastructure reportedly uses continuous batching with dynamic KV cache allocation to serve millions of concurrent users.
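
As a usage sketch for the first example, vLLM's offline API is shown below. Continuous batching is handled internally by the engine's scheduler; the model name, sampling values, and max_num_seqs setting are placeholders.

    from vllm import LLM, SamplingParams

    # The engine interleaves decode steps across all prompts via continuous
    # batching; max_num_seqs caps how many sequences run concurrently.
    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", max_num_seqs=256)
    params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(
        ["Explain continuous batching in one sentence.",
         "Why does static batching waste FLOPs on padding?"],
        params,
    )
    for out in outputs:
        print(out.outputs[0].text)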

Related terms

PagedAttention, KV Cache, Sequence Packing, Static Batching, Speculative Decoding

FAQ

What is Continuous Batching?

Continuous batching is a batch scheduling technique for LLM training and inference that eliminates per-iteration padding by dynamically adding new sequences into a running batch as completed sequences finish.

How does Continuous Batching work?

Continuous batching maintains a pool of active sequences and schedules generation at the granularity of individual decode steps: each iteration runs one step for every active sequence, sequences that finish are evicted immediately, and waiting requests are inserted into the freed slots. Combined with block-based KV-cache management (e.g., vLLM's PagedAttention), this removes the need to pad every sequence to the length of the longest one in the batch.

Where is Continuous Batching used in 2026?

vLLM (Kwon et al., 2023) introduced PagedAttention and continuous batching, achieving 2–4× throughput improvement over HuggingFace Transformers for LLM serving. NVIDIA TensorRT-LLM uses continuous batching as its default scheduling strategy for Llama 3.1 and GPT-4 class models, supporting batch sizes up to 1024 sequences. Hugging Face Text Generation Inference (TGI) added continuous batching in v1.0, reducing p50 latency by 60% for Mistral 7B serving.