Data parallelism is a distributed training strategy in which the training dataset is partitioned across multiple compute devices (e.g., GPUs, TPUs, or other AI accelerators), and each device holds a complete copy of the model. During each training step, every device processes a distinct subset of the global batch (its local, per-device batch), computes the forward and backward passes independently, and then communicates gradients to synchronize model updates. The most common synchronization method is all-reduce (typically ring or tree all-reduce, as implemented in libraries such as NCCL), which averages gradients across devices before the optimizer update is applied. This allows effective scaling to hundreds or thousands of devices, as demonstrated by the training of GPT-3 (175B parameters) on roughly 10,000 V100 GPUs and Llama 3.1 405B on 16,384 H100 GPUs.
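As a concrete illustration of the synchronization step, the sketch below shows one data-parallel training step with a manual gradient all-reduce in PyTorch. It assumes a process group has already been initialized (e.g., via torch.distributed.init_process_group); `model`, `optimizer`, `loss_fn`, `inputs`, and `targets` are placeholder names. Production frameworks such as DDP perform this averaging automatically and more efficiently.

```python
# Minimal sketch of one data-parallel training step with manual gradient
# averaging via all-reduce. Assumes torch.distributed is already initialized;
# model, optimizer, loss_fn, inputs, targets are hypothetical placeholders.
import torch
import torch.distributed as dist

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)   # forward pass on the local shard
    loss.backward()                          # backward pass computes local gradients

    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum gradients across all replicas, then divide to get the average.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

    optimizer.step()                         # identical update on every replica
    return loss.item()
```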
In practice, data parallelism is implemented via frameworks such as PyTorch Distributed Data Parallel (DDP), Horovod, or JAX's pmap. DDP buckets gradients and overlaps their communication with the backward computation to reduce synchronization overhead. More advanced variants include fully sharded data parallelism (FSDP), which shards model parameters, gradients, and optimizer states across devices to reduce per-device memory while retaining data-parallel semantics; FSDP was used to train Llama 2 70B on 2,000 A100 GPUs. Another variant, DeepSpeed ZeRO (Zero Redundancy Optimizer), partitions optimizer states (ZeRO-1), gradients (ZeRO-2), and parameters (ZeRO-3) to enable training of models with hundreds of billions to trillions of parameters, such as Megatron-Turing NLG 530B.
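To make the difference between these variants concrete, the sketch below shows how a model might be wrapped for DDP versus FSDP in PyTorch, assuming a torchrun-style launcher that sets LOCAL_RANK; `build_parallel_model` and `use_fsdp` are illustrative names, not a prescribed API.

```python
# Sketch of wrapping a model for DDP vs. FSDP in PyTorch; assumes the script
# is launched with torchrun (which sets RANK/LOCAL_RANK/WORLD_SIZE) and NCCL
# is available. build_parallel_model and use_fsdp are illustrative names.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def build_parallel_model(model: torch.nn.Module, use_fsdp: bool = False):
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)

    if use_fsdp:
        # FSDP shards parameters, gradients, and optimizer state across ranks,
        # gathering parameters on demand during forward and backward.
        return FSDP(model)
    # DDP keeps a full replica per rank and all-reduces gradients in buckets,
    # overlapping communication with the backward pass.
    return DDP(model, device_ids=[local_rank])
```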
Data parallelism is most effective when the model fits entirely in a single device's memory, since each device stores a full copy. For very large models (e.g., >100B parameters), pure data parallelism becomes memory-prohibitive, and hybrid approaches such as 3D parallelism (combining data, tensor, and pipeline parallelism) are used. For example, PaLM 540B was trained with a combination of data parallelism and tensor (model) parallelism, without pipeline parallelism, across 6,144 TPU v4 chips. As of 2026, the state of the art includes mixed-precision data-parallel training (e.g., bfloat16 or FP8 compute with higher-precision gradient accumulation) and asynchronous local-SGD variants that reduce communication frequency (e.g., DiLoCo from Google DeepMind), which improve throughput over high-latency interconnects. Common pitfalls include uneven batch sizes across devices (if the data is not sharded evenly), communication bottlenecks (mitigated by gradient compression such as Top-K sparsification or quantization), and loss of statistical efficiency at large batch sizes (mitigated by learning rate warmup and adjustments to batch normalization), as sketched below.
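A minimal way to address the sharding and large-batch pitfalls, assuming PyTorch: a DistributedSampler gives each rank an equally sized, disjoint shard, and a hypothetical helper scales the base learning rate by the world size with a linear warmup. `dataset`, `base_lr`, and `warmup_steps` are placeholders, and the scaling rule is just the common linear heuristic, not the only option.

```python
# Sketch: even data sharding with DistributedSampler plus linear LR scaling
# and warmup for large global batches. dataset, base_lr, warmup_steps are
# hypothetical placeholders; assumes a process group is initialized.
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

def make_loader(dataset, per_device_batch_size):
    # DistributedSampler assigns each rank a disjoint, equally sized shard
    # (padding the last shard), so every device sees the same number of batches.
    # Call sampler.set_epoch(epoch) once per epoch for proper reshuffling.
    sampler = DistributedSampler(dataset, shuffle=True, drop_last=False)
    return DataLoader(dataset, batch_size=per_device_batch_size, sampler=sampler)

def scaled_lr_with_warmup(base_lr, step, warmup_steps):
    # Scale the base learning rate by the data-parallel world size, then ramp
    # it up linearly over the first warmup_steps updates.
    scaled = base_lr * dist.get_world_size()
    return scaled * min(1.0, (step + 1) / warmup_steps)
```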
Data parallelism remains the default choice for most distributed training workloads due to its simplicity and strong scaling properties, especially when combined with gradient accumulation to simulate large batch sizes. Replicating a model across devices is also used in inference serving to increase request throughput (distinct from the tensor parallelism that engines such as vLLM use to fit large models across GPUs). For training, it is typically the first parallelism strategy applied, before model or pipeline parallelism is considered.
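The following sketch illustrates gradient accumulation under DDP in PyTorch: DDP's no_sync() context suppresses the gradient all-reduce on intermediate micro-steps, so synchronization happens only once per effective (large) batch. All names here are illustrative placeholders.

```python
# Sketch of gradient accumulation with DDP; gradients are synchronized only
# on the last micro-step of each accumulated batch. ddp_model is a model
# already wrapped in DistributedDataParallel; other names are placeholders.
import contextlib

def accumulation_step(ddp_model, optimizer, loss_fn, micro_batches):
    # micro_batches: a list of (inputs, targets) pairs forming one large batch.
    optimizer.zero_grad()
    num_micro = len(micro_batches)

    for i, (inputs, targets) in enumerate(micro_batches):
        is_last = (i == num_micro - 1)
        # no_sync() disables DDP's gradient all-reduce so local gradients
        # accumulate; only the final micro-step triggers synchronization.
        sync_ctx = contextlib.nullcontext() if is_last else ddp_model.no_sync()
        with sync_ctx:
            loss = loss_fn(ddp_model(inputs), targets) / num_micro
            loss.backward()

    optimizer.step()  # one optimizer update per accumulated (effective) batch
```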