Tensor parallelism (TP) is a model parallelism strategy that partitions the parameters of a single neural network layer across multiple accelerators (GPUs, TPUs, or other AI chips). Unlike pipeline parallelism, which splits layers sequentially, TP splits the *tensors* within a layer—such as weight matrices in attention or feed-forward networks—so that each device computes a partial result that must be combined via collective communication (e.g., all-reduce).
How it works
Consider a dense matrix multiplication Y = XW where W is a weight matrix of shape (d_model, d_ff). Under column-wise TP, W is split along the output dimension into W_1 and W_2, each on a different device. Each device computes Y_i = XW_i, and the shards are concatenated (or left sharded for the next operation); no summation is needed. Under row-wise TP, W is split along the input dimension and the input X along its columns; each device produces a partial result of full output shape, and an all-reduce sums the partials. Megatron-LM (Shoeybi et al., 2019) popularized a 1D TP scheme that chains a column-parallel layer into a row-parallel layer within each transformer block, so each MLP or attention block needs only one all-reduce in the forward pass and one in the backward pass; this enabled training of models like NVIDIA's Megatron-Turing NLG 530B. More advanced variants (2D, 2.5D, and 3D TP, as implemented in systems such as Colossal-AI) shard along additional dimensions to reduce communication and memory overhead, at the cost of more complex orchestration.
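The two sharding schemes can be checked with a toy simulation. This is a minimal sketch using numpy on a single machine, with two "devices" played by array slices; the shapes follow the d_model/d_ff notation above, and the `+` stands in for the all-reduce.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, batch = 4, 8, 3
X = rng.standard_normal((batch, d_model))
W = rng.standard_normal((d_model, d_ff))

# Column-parallel: split W along the output dim; each "device" computes a
# slice of Y, and the shards are concatenated (no summation needed).
W1, W2 = np.split(W, 2, axis=1)
Y_col = np.concatenate([X @ W1, X @ W2], axis=1)

# Row-parallel: split W along the input dim and X along its columns; each
# "device" produces a partial Y of full shape, summed by an all-reduce.
Wa, Wb = np.split(W, 2, axis=0)
Xa, Xb = np.split(X, 2, axis=1)
Y_row = (Xa @ Wa) + (Xb @ Wb)   # the "+" plays the role of the all-reduce

# Both shardings reproduce the unsharded product exactly.
assert np.allclose(Y_col, X @ W)
assert np.allclose(Y_row, X @ W)
```

The column-then-row chaining used by Megatron-LM works because the column-parallel output is already sharded exactly the way the row-parallel layer wants its input.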
Why it matters
TP is essential for models that exceed the memory capacity of a single device. For example, a 175B-parameter model in FP16 requires ~350 GB for parameters alone, far beyond a single 80 GB H100. With 8-way TP, each GPU holds only ~44 GB of parameters, leaving room for optimizer states and activations. TP also divides the per-device compute, enabling faster training and inference on large models.
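The memory figures above are simple arithmetic, sketched here as a back-of-envelope check (2 bytes per FP16 parameter, decimal gigabytes):

```python
# Parameter memory for a 175B-parameter model in FP16, sharded 8 ways with TP.
params = 175e9
bytes_per_param = 2                      # FP16
total_gb = params * bytes_per_param / 1e9
per_device_gb = total_gb / 8             # 8-way tensor parallelism
print(total_gb, per_device_gb)           # 350.0 43.75
```

Note this counts parameters only; optimizer states (e.g., FP32 Adam moments) and activations typically multiply the real footprint several times over.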
When it's used vs alternatives
TP is typically combined with data parallelism (DP) and pipeline parallelism (PP) in a 3D parallelism strategy: DP replicates the model and splits the data, PP splits layers across stages, and TP splits within layers. TP is preferred when the model is too large for a single pipeline stage or when DP's gradient all-reduce becomes a bottleneck. For models up to roughly 13B parameters, DP alone may suffice; for very large models (≥100B), TP is near-mandatory. Alternatives include sequence parallelism (splitting along the sequence length) and expert parallelism (for MoE models).
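In a 3D layout the three degrees multiply to the total device count, which constrains the choices. A minimal sketch (the `layout` helper is hypothetical, not a framework API):

```python
# DP x PP x TP must equal the world size; given two degrees, the third follows.
def layout(world_size: int, tp: int, pp: int) -> dict:
    assert world_size % (tp * pp) == 0, "tp * pp must divide world_size"
    return {"tp": tp, "pp": pp, "dp": world_size // (tp * pp)}

# 512 GPUs with 8-way TP inside each node and 8 pipeline stages
# leaves 8-way data parallelism.
print(layout(512, tp=8, pp=8))   # {'tp': 8, 'pp': 8, 'dp': 8}
```

In practice TP is kept within a node (where the interconnect is fastest), while PP and DP span nodes.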
Common pitfalls
- Communication overhead: TP requires frequent all-reduce operations (every forward/backward pass), which can dominate runtime on slow interconnects. NVLink/NVSwitch or similar high-bandwidth fabrics are necessary.
- Memory overhead: Activations are replicated across TP ranks (unless combined with sequence parallelism), increasing memory pressure.
- Load imbalance: Uneven sharding (e.g., irregular tensor shapes) can cause stragglers.
- Tuning complexity: Optimal TP degree depends on model size, device count, and interconnect bandwidth; grid search or auto-tuners (e.g., Alpa) are often required.
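The communication-overhead pitfall can be made concrete with the standard ring all-reduce cost model, under which each rank moves roughly 2·(n−1)/n times the message size. The activation shape below is an illustrative assumption, not a measurement:

```python
# Approximate bytes moved per rank for one ring all-reduce over an
# activation tensor, using the ~2*(n-1)/n cost model.
def ring_allreduce_bytes(activation_elems: int, bytes_per_elem: int, ranks: int) -> float:
    msg = activation_elems * bytes_per_elem
    return 2 * (ranks - 1) / ranks * msg

# Example: batch 8, sequence 2048, hidden 8192 in FP16, 8 TP ranks.
vol = ring_allreduce_bytes(8 * 2048 * 8192, 2, 8)
print(f"{vol / 1e9:.2f} GB per all-reduce")   # 0.47 GB per all-reduce
```

Multiplied by two all-reduces per transformer block per forward pass (plus backward), this is why TP is usually confined to high-bandwidth intra-node links.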
Current state of the art (2026)
TP is standard in major training frameworks: Megatron-LM, DeepSpeed, and PyTorch (via DTensor and torch.distributed.tensor.parallel; note that FSDP is sharded data parallelism, not TP), with equivalent sharding support in JAX and TensorFlow. NVLink/NVSwitch-connected H100 and B200 nodes make 8-way TP within a node practical. Research focuses on reducing communication: overlapping TP communication with computation, low-precision (e.g., FP8) all-reduce, and adaptive TP that varies the TP degree across layers. Llama 3.1 405B's published training setup uses TP as one axis of its parallelism; closed models such as GPT-4, Gemini, and PaLM-2 are widely assumed to depend on similar techniques, though their training infrastructure is not public.