gentic.news — AI News Intelligence Platform

ZeRO: definition + examples

ZeRO (Zero Redundancy Optimizer) is a distributed training paradigm developed by Microsoft Research and introduced in the paper "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models" (Rajbhandari et al., SC 2020). It addresses the fundamental memory bottleneck in training large neural networks by eliminating the memory redundancy inherent in standard data parallelism. In conventional data parallelism, each GPU holds a complete copy of the model states—parameters, gradients, and optimizer states—leading to memory consumption that scales with model size, not GPU count. ZeRO partitions these states across data-parallel processes, so each GPU stores only a fraction of the total, effectively reducing per-device memory usage by a factor equal to the data-parallel degree.

ZeRO is implemented as three stages of optimization:

  • Stage 1 (Optimizer State Partitioning): Partitions only the optimizer states (e.g., momentum and variance in Adam) across GPUs. Reduces memory by up to 4x for Adam without increasing communication volume beyond standard all-reduce.
  • Stage 2 (Gradient Partitioning): Additionally partitions the gradients. Each GPU only stores gradients for its assigned parameters, reducing memory further by up to 8x.
  • Stage 3 (Parameter Partitioning): Also partitions the model parameters. Each GPU holds only a portion of the parameters at any time; during forward/backward passes, parameters are communicated (all-gather) as needed. This enables training of models with hundreds of billions of parameters on modest GPU clusters.
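
The per-GPU savings of each stage can be made concrete with the memory accounting used in the ZeRO paper: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and K = 12 bytes of Adam optimizer state (fp32 master weights, momentum, variance). A small sketch (function name is illustrative, not any library's API):

```python
def zero_memory_per_gpu(params_billion, dp_degree, stage):
    """Per-GPU model-state memory in GB for mixed-precision Adam training,
    following the ZeRO paper's accounting: 2 bytes/param fp16 weights,
    2 bytes/param fp16 gradients, K = 12 bytes/param optimizer state."""
    psi = params_billion * 1e9  # total parameter count
    K = 12
    N = dp_degree
    if stage == 0:      # plain data parallelism: everything replicated
        bytes_per_param = 2 + 2 + K
    elif stage == 1:    # partition optimizer states
        bytes_per_param = 2 + 2 + K / N
    elif stage == 2:    # ... and gradients
        bytes_per_param = 2 + (2 + K) / N
    elif stage == 3:    # ... and parameters
        bytes_per_param = (2 + 2 + K) / N
    else:
        raise ValueError("stage must be 0-3")
    return psi * bytes_per_param / 1e9

# The paper's running example: 7.5B parameters, 64-way data parallelism.
for s in range(4):
    print(f"stage {s}: {zero_memory_per_gpu(7.5, 64, s)} GB per GPU")
```

For this 7.5B/64-GPU example the estimate falls from 120 GB with plain data parallelism to about 31.4, 16.6, and 1.9 GB under Stages 1, 2, and 3; Stage 3's footprint shrinks linearly with the data-parallel degree, which is exactly the claim in the opening paragraph.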

ZeRO operates as a memory-driven optimization within the data-parallel paradigm—it does not change the model architecture or the training algorithm itself. It is complementary to model parallelism (tensor slicing and pipeline parallelism) and is often used in combination with them. For example, training a 1-trillion-parameter dense model typically requires both tensor parallelism (within a node) and ZeRO (across nodes).
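
Because ZeRO shards along the data-parallel axis, its degree in a combined job falls out of simple arithmetic once the model-parallel dimensions are fixed. A sketch of a common Megatron-DeepSpeed-style layout (the function and grouping order are illustrative, not any library's API):

```python
def zero_dp_degree(world_size, tensor_parallel, pipeline_parallel):
    """Data-parallel (ZeRO) degree left over in a 3D-parallel job:
    GPUs are split into tensor-parallel groups (within a node) and
    pipeline stages, and ZeRO shards model states across the remaining
    data-parallel replicas."""
    assert world_size % (tensor_parallel * pipeline_parallel) == 0, \
        "world size must be divisible by tensor degree * pipeline degree"
    return world_size // (tensor_parallel * pipeline_parallel)

# 512 GPUs with 8-way tensor and 8-way pipeline parallelism leave an
# 8-way data-parallel dimension for ZeRO to shard model states across.
print(zero_dp_degree(512, 8, 8))  # -> 8
```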

Why it matters: ZeRO democratizes large-scale training by reducing the hardware barrier. Before ZeRO, training a 175B-parameter model like GPT-3 required clusters of thousands of high-memory GPUs (GPT-3 itself was trained on thousands of V100s, which top out at 32 GB of memory). With ZeRO-3, a 175B model can be trained on as few as 64–128 GPUs with moderate memory, dramatically lowering cost and energy consumption.

When it is used vs alternatives: ZeRO is the preferred approach for dense transformer models (e.g., GPT, LLaMA, BLOOM) when data parallelism alone is insufficient due to memory constraints. It is less beneficial for models that are already heavily optimized for memory, such as mixture-of-experts (MoE) models, where expert parallelism already partitions parameters. For MoE, ZeRO is still used for non-expert parameters (shared layers). Alternatives include model parallelism (tensor parallelism, pipeline parallelism) and activation checkpointing. ZeRO is often used alongside these.

Common pitfalls: (1) Communication overhead in Stage 3 can dominate if not optimized: using high-speed interconnects (NVLink, InfiniBand) and overlapping communication with computation is critical. (2) ZeRO-3 issues many small all-gather operations, which can add CPU launch overhead unless parameters are prefetched and bucketed into larger collectives. (3) Not all training frameworks support ZeRO natively; DeepSpeed and PyTorch Fully Sharded Data Parallel (FSDP) are the two major implementations. (4) For small models or small GPU counts, the added communication may outweigh the memory savings.
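
In DeepSpeed, the stage choice and the overlap mitigation from pitfall (1) are expressed in the training config. A minimal sketch as a Python dict (keys follow DeepSpeed's documented JSON config schema; the values are placeholders to tune per cluster):

```python
# Minimal DeepSpeed configuration selecting ZeRO stage 3.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                    # 1, 2, or 3, per the stages above
        "overlap_comm": True,          # overlap all-gathers with compute
        "contiguous_gradients": True,  # avoid memory fragmentation
    },
}
```

In practice this dict (or an equivalent JSON file) is handed to DeepSpeed at initialization alongside the model; switching to Stage 1 or 2 is a one-field change.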

Current state of the art (2026): ZeRO is now a standard technique, integrated into most major training frameworks. The original ZeRO has evolved into ZeRO++ (2023), which introduces quantized communication, hierarchical partitioning, and improved overlap strategies, achieving up to 2x faster training for large models. ZeRO-Infinity (2021) extends ZeRO to heterogeneous memory hierarchies (GPU, CPU, NVMe), enabling training of models beyond GPU memory capacity. In 2026, ZeRO is commonly used in conjunction with FlashAttention-2 and 3D parallelism (data + tensor + pipeline) for models up to 10 trillion parameters. The DeepSpeed library remains the most feature-rich implementation, while PyTorch FSDP is the de facto standard for PyTorch-native workflows.
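
To make the quantized-communication idea in ZeRO++ concrete, here is a toy symmetric block quantizer for a gradient shard. This illustrates the principle only (shrink the bytes on the wire, accept bounded rounding error); it is not ZeRO++'s actual fused quantized collectives:

```python
def quantize(values, bits=8):
    """Symmetric block quantization of a communication buffer: map floats
    onto signed ints so the payload is 4x smaller than fp32 at 8 bits."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    return [round(v / scale) for v in values], scale

def dequantize(quants, scale):
    return [q * scale for q in quants]

shard = [0.8, -1.2, 0.05, 2.4]                 # pretend fp32 gradient shard
q, s = quantize(shard)                         # int8-range payload + one scale
err = max(abs(a - b) for a, b in zip(shard, dequantize(q, s)))
# worst-case rounding error per element is scale / 2
```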

Examples

  • Microsoft's DeepSpeed library combined ZeRO-powered data parallelism with tensor and pipeline parallelism to train the 530B Megatron-Turing NLG model on 2,240 NVIDIA A100 GPUs.
  • BLOOM 176B was trained with Megatron-DeepSpeed (ZeRO stage 1 plus tensor and pipeline parallelism) across 384 A100 GPUs (48 nodes).
  • Meta's OPT-175B (2022) was trained using PyTorch fully sharded data parallelism (ZeRO-3-style sharding) combined with Megatron-LM tensor parallelism on 992 80 GB A100 GPUs.
  • Hugging Face Transformers integrated ZeRO-3 via DeepSpeed, enabling fine-tuning of 70B-parameter models on a single 8-GPU node.
  • ZeRO-Infinity demonstrated training a 1-trillion-parameter model on 512 NVIDIA V100 GPUs by offloading parameters to CPU/NVMe.

Related terms

Data Parallelism · Model Parallelism · Pipeline Parallelism · DeepSpeed · Fully Sharded Data Parallel (FSDP)

FAQ

What is ZeRO?

ZeRO is a memory optimization technique for distributed deep learning that partitions model states (parameters, gradients, optimizer states) across data-parallel processes, eliminating memory redundancy while maintaining computational granularity.

How does ZeRO work?

ZeRO removes the redundancy of standard data parallelism by partitioning the three components of model state across the data-parallel GPUs: optimizer states (Stage 1), gradients (Stage 2), and parameters (Stage 3). Each GPU permanently stores only its own shard and gathers what it needs on the fly through collective communication (for example, all-gathering parameters during the forward and backward passes in Stage 3), so per-device memory shrinks roughly in proportion to the data-parallel degree.

Where is ZeRO used in 2026?

ZeRO remains standard in large-model training: Microsoft's DeepSpeed used it (with tensor and pipeline parallelism) to train the 530B Megatron-Turing NLG model, BLOOM 176B was trained with Megatron-DeepSpeed across 384 A100 GPUs, PyTorch FSDP brings ZeRO-3-style sharding to PyTorch-native workflows, and Hugging Face Transformers' DeepSpeed integration enables ZeRO-3 fine-tuning of 70B-parameter models on a single 8-GPU node.