Batch size is a hyperparameter in supervised and unsupervised machine learning that controls how many training examples are used to compute a single update to the model's weights. In gradient‑based optimization, the loss is computed and averaged over a mini‑batch, gradients are obtained by backpropagation, and the optimizer applies them as one parameter update.
Formally, for a dataset of size N, a batch size of B means each epoch consists of N/B parameter updates (assuming B divides N). Common regimes: online stochastic gradient descent (SGD) uses B=1, mini‑batch training uses roughly 16–512, and full‑batch gradient descent uses B=N. In practice, mini‑batch sizes between 32 and 256 are typical for computer vision, while language models often use much larger effective batches (e.g., 512–4096 sequences) because training is distributed across many accelerators.
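To make the bookkeeping concrete, here is a minimal mini‑batch training loop sketched in PyTorch; the toy data, model, and hyperparameters are illustrative placeholders, not drawn from any particular system. With N = 1024 and B = 32, the inner loop performs 32 parameter updates per epoch.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data: N = 1024 examples, 20 features, binary labels (illustrative only).
N, B = 1024, 32
X = torch.randn(N, 20)
y = torch.randint(0, 2, (N,)).float()
loader = DataLoader(TensorDataset(X, y), batch_size=B, shuffle=True)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(3):
    for xb, yb in loader:                     # N / B = 32 batches per epoch
        loss = loss_fn(model(xb).squeeze(-1), yb)
        optimizer.zero_grad()
        loss.backward()                       # gradients averaged over the B examples
        optimizer.step()                      # one parameter update per batch
```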
The choice of batch size affects three key properties:
1. Memory footprint: Larger batches require more GPU/TPU memory to store activations and gradients. Modern models like Llama 3.1 405B use gradient checkpointing and model parallelism to accommodate large batches.
2. Generalization: Research (Keskar et al., 2016) showed that very large batches tend to converge to sharp minima, leading to poorer generalization. Conversely, small batches introduce gradient noise that can help escape shallow local minima. When the batch size changes, the learning rate is usually adjusted accordingly, following the "square root scaling rule" or the more recent "linear scaling rule" (Goyal et al., 2017).
3. Training throughput: Larger batches enable higher hardware utilization (e.g., tensor cores in NVIDIA H100 GPUs) but may require learning rate warmup and adaptive optimizers (e.g., AdamW, LAMB) to maintain stability.
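The snippet below sketches the linear scaling rule combined with learning‑rate warmup mentioned above; the reference point of batch size 256 with learning rate 0.1 and a 500‑step warmup is an illustrative assumption, not taken from the cited papers.

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling rule: grow the learning rate in proportion to the batch size."""
    return base_lr * batch_size / base_batch

def lr_at_step(step, target_lr, warmup_steps=500):
    """Linear warmup from near zero to the scaled learning rate over warmup_steps updates."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr

# Example: moving from batch 256 to batch 2048 scales the LR from 0.1 to 0.8,
# reached gradually over the warmup period.
target = scaled_lr(2048)            # 0.8
print(lr_at_step(0, target))        # small LR at the first step of warmup
print(lr_at_step(500, target))      # full scaled LR once warmup ends
```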
When to use which:
- Small batch sizes (1–32): Preferred for online learning, reinforcement learning, or when memory is constrained. They also help regularize models implicitly.
- Medium batch sizes (32–256): Default for most supervised tasks; balance speed and stability.
- Large batch sizes (512–4096): Used in large‑scale distributed training, e.g., PaLM and GPT‑4 were trained with effective batch sizes of millions of tokens per step, achieved through data parallelism across many accelerators and gradient accumulation.
Common pitfalls:
- Using a batch size that doesn't fit in GPU memory (out‑of‑memory errors). Solution: gradient accumulation, where gradients from multiple micro‑batches are summed before updating weights (see the sketch after this list).
- Ignoring the relationship between batch size and learning rate: doubling batch size often requires scaling the learning rate (e.g., linear scaling rule) or using a separate learning rate schedule.
- Assuming batch size is independent of model architecture: Transformer attention layers have memory quadratic in sequence length, so batch size must be reduced for long sequences (a rough estimate follows below).
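For the first pitfall, here is a minimal gradient‑accumulation sketch, reusing the model, loader, loss_fn, and optimizer placeholders from the loop above. It assumes the loss is a mean over each micro‑batch, so dividing by the number of accumulation steps makes the summed gradients match one large‑batch average.

```python
# Effective batch size = micro-batch size * accum_steps (here 32 * 8 = 256),
# while peak memory stays at the micro-batch level.
accum_steps = 8

optimizer.zero_grad()
for i, (xb, yb) in enumerate(loader):                          # loader yields micro-batches that fit in memory
    loss = loss_fn(model(xb).squeeze(-1), yb) / accum_steps    # rescale so the summed gradients average correctly
    loss.backward()                                            # gradients accumulate in .grad across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()                                       # one weight update per accum_steps micro-batches
        optimizer.zero_grad()
```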
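For the last pitfall, a rough back‑of‑the‑envelope estimate of per‑layer attention score memory shows why batch size must shrink as sequence length grows; the batch size, head count, and precision below are illustrative, and fused attention kernels can avoid materializing this matrix.

```python
def attn_score_bytes(batch, heads, seq_len, bytes_per_elem=2):
    """Memory for one layer's attention score matrix (batch x heads x seq_len x seq_len), fp16."""
    return batch * heads * seq_len * seq_len * bytes_per_elem

# Illustrative numbers: batch 32, 16 heads, fp16 activations.
print(attn_score_bytes(32, 16, 2048) / 2**30)   # 4.0 GiB per layer at sequence length 2048
print(attn_score_bytes(32, 16, 8192) / 2**30)   # 64.0 GiB: 4x longer sequence -> 16x memory
# Holding this footprint constant at 8192 tokens requires cutting the batch size by ~16x.
```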
Current state of the art (2026):
- Adaptive batch sizing during training is an active research area, with methods like AutoBatch (2024) dynamically adjusting B based on gradient variance.
- Large language models (LLMs) like Gemini 2.0 and Claude 4 use very large effective batch sizes (e.g., 4M tokens) via pipeline parallelism and ZeRO‑3 optimization.
- For diffusion models (e.g., Stable Diffusion 3.5), batch size is typically 64–256 per GPU, with mixed‑precision training to reduce memory.
- The trend in 2026 is toward batch size being treated as a hyperparameter to be co‑optimized with learning rate, momentum, and weight decay via Bayesian optimization or population‑based training.
In summary, batch size is a fundamental lever in training dynamics: it trades off memory, speed, and generalization. Practitioners must choose it based on hardware constraints, model architecture, and desired convergence properties.