Batch size is a hyperparameter in supervised and unsupervised machine learning that controls how many training examples are used to compute a single update to the model's weights. In gradient‑based optimization, the loss is computed and averaged over a mini‑batch, gradients are obtained by backpropagation, and the optimizer applies them as one parameter update.
Formally, for a dataset of size N, a batch size of B means each epoch consists of N/B parameter updates (assuming B divides N). Common regimes: online stochastic gradient descent (SGD) uses B=1, mini‑batch training uses roughly 16–512, and full‑batch gradient descent uses B=N. In practice, mini‑batch sizes between 32 and 256 are typical for computer vision, while language models often use much larger effective batches (e.g., 512–4096 sequences) because training is distributed across many accelerators.
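To make the bookkeeping concrete, here is a minimal mini‑batch training loop sketched in PyTorch; the toy data, model, and hyperparameters are illustrative placeholders, not drawn from any particular system. With N = 1024 and B = 32, the inner loop performs 32 parameter updates per epoch.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data: N = 1024 examples, 20 features, binary labels (illustrative only).
N, B = 1024, 32
X = torch.randn(N, 20)
y = torch.randint(0, 2, (N,)).float()
loader = DataLoader(TensorDataset(X, y), batch_size=B, shuffle=True)

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(3):
    for xb, yb in loader:                     # N / B = 32 batches per epoch
        loss = loss_fn(model(xb).squeeze(-1), yb)
        optimizer.zero_grad()
        loss.backward()                       # gradients averaged over the B examples
        optimizer.step()                      # one parameter update per batch
```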
The choice of batch size affects three key properties:
1. Memory footprint: Larger batches require more GPU/TPU memory to store activations and gradients. Modern models like Llama 3.1 405B use gradient checkpointing and model parallelism to accommodate large batches.
2. Generalization: Research (Keskar et al., 2016) showed that very large batches tend to converge to sharp minima, leading to poorer generalization. Conversely, small batches introduce gradient noise that can help escape shallow local minima. When the batch size changes, the learning rate is usually adjusted accordingly, following the "square root scaling rule" or the more recent "linear scaling rule" (Goyal et al., 2017).
3. Training throughput: Larger batches enable higher hardware utilization (e.g., tensor cores in NVIDIA H100 GPUs) but may require learning rate warmup and adaptive optimizers (e.g., AdamW, LAMB) to maintain stability.
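The snippet below sketches the linear scaling rule combined with learning‑rate warmup mentioned above; the reference point of batch size 256 with learning rate 0.1 and a 500‑step warmup is an illustrative assumption, not taken from the cited papers.

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling rule: grow the learning rate in proportion to the batch size."""
    return base_lr * batch_size / base_batch

def lr_at_step(step, target_lr, warmup_steps=500):
    """Linear warmup from near zero to the scaled learning rate over warmup_steps updates."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr

# Example: moving from batch 256 to batch 2048 scales the LR from 0.1 to 0.8,
# reached gradually over the warmup period.
target = scaled_lr(2048)            # 0.8
print(lr_at_step(0, target))        # small LR at the first step of warmup
print(lr_at_step(500, target))      # full scaled LR once warmup ends
```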
When to use which:
- Small batch sizes (1–32): Preferred for online learning, reinforcement learning, or when memory is constrained. They also help regularize models implicitly.
- Medium batch sizes (32–256): Default for most supervised tasks; balance speed and stability.
- Large batch sizes (512–4096): Used in large‑scale distributed training, e.g., PaLM and GPT‑4 were trained with effective batch sizes of millions of tokens per step, achieved through data parallelism across many accelerators and gradient accumulation.
Common pitfalls:
- Using a batch size that doesn't fit in GPU memory (out‑of‑memory errors). Solution: gradient accumulation, where gradients from multiple micro‑batches are summed before updating weights (see the sketch after this list).
- Ignoring the relationship between batch size and learning rate: doubling batch size often requires scaling the learning rate (e.g., linear scaling rule) or using a separate learning rate schedule.
- Assuming batch size is independent of model architecture: Transformer attention layers have memory quadratic in sequence length, so batch size must be reduced for long sequences (a rough estimate follows below).
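For the first pitfall, here is a minimal gradient‑accumulation sketch, reusing the model, loader, loss_fn, and optimizer placeholders from the loop above. It assumes the loss is a mean over each micro‑batch, so dividing by the number of accumulation steps makes the summed gradients match one large‑batch average.

```python
# Effective batch size = micro-batch size * accum_steps (here 32 * 8 = 256),
# while peak memory stays at the micro-batch level.
accum_steps = 8

optimizer.zero_grad()
for i, (xb, yb) in enumerate(loader):                          # loader yields micro-batches that fit in memory
    loss = loss_fn(model(xb).squeeze(-1), yb) / accum_steps    # rescale so the summed gradients average correctly
    loss.backward()                                            # gradients accumulate in .grad across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()                                       # one weight update per accum_steps micro-batches
        optimizer.zero_grad()
```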
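For the last pitfall, a rough back‑of‑the‑envelope estimate of per‑layer attention score memory shows why batch size must shrink as sequence length grows; the batch size, head count, and precision below are illustrative, and fused attention kernels can avoid materializing this matrix.

```python
def attn_score_bytes(batch, heads, seq_len, bytes_per_elem=2):
    """Memory for one layer's attention score matrix (batch x heads x seq_len x seq_len), fp16."""
    return batch * heads * seq_len * seq_len * bytes_per_elem

# Illustrative numbers: batch 32, 16 heads, fp16 activations.
print(attn_score_bytes(32, 16, 2048) / 2**30)   # 4.0 GiB per layer at sequence length 2048
print(attn_score_bytes(32, 16, 8192) / 2**30)   # 64.0 GiB: 4x longer sequence -> 16x memory
# Holding this footprint constant at 8192 tokens requires cutting the batch size by ~16x.
```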
Current state of the art (2026):
- Adaptive batch sizing during training is an active research area, with methods like AutoBatch (2024) dynamically adjusting B based on gradient variance.
- Large language models (LLMs) like Gemini 2.0 and Claude 4 use very large effective batch sizes (e.g., 4M tokens) via pipeline parallelism and ZeRO‑3 optimization.
- For diffusion models (e.g., Stable Diffusion 3.5), batch size is typically 64–256 per GPU, with mixed‑precision training to reduce memory.
- The trend in 2026 is toward batch size being treated as a hyperparameter to be co‑optimized with learning rate, momentum, and weight decay via Bayesian optimization or population‑based training.
In summary, batch size is a fundamental lever in training dynamics: it trades off memory, speed, and generalization. Practitioners must choose it based on hardware constraints, model architecture, and desired convergence properties.