Tensor parallelism (TP) is a model parallelism strategy that partitions the parameters of a single neural network layer across multiple accelerators (GPUs, TPUs, or other AI chips). Unlike pipeline parallelism, which splits layers sequentially, TP splits the *tensors* within a layer—such as weight matrices in attention or feed-forward networks—so that each device computes a partial result that must be combined via collective communication (e.g., all-reduce).
How it works
Consider a dense matrix multiplication Y = XW where W is a weight matrix of shape (d_model, d_ff). Under column-wise TP, W is split along the output dimension into W_1 and W_2, each on a different device. Each device computes Y_i = XW_i, and the shards are concatenated (or left sharded for the next operation); no summation is needed. Under row-wise TP, W is split along the input dimension and the input X along its columns; each device produces a partial result of full output shape, and an all-reduce sums the partials. Megatron-LM (Shoeybi et al., 2019) popularized a 1D TP scheme that chains a column-parallel layer into a row-parallel layer within each transformer block, so each MLP or attention block needs only one all-reduce in the forward pass and one in the backward pass; this enabled training of models like NVIDIA's Megatron-Turing NLG 530B. More advanced variants (2D, 2.5D, and 3D TP, as implemented in systems such as Colossal-AI) shard along additional dimensions to reduce communication and memory overhead, at the cost of more complex orchestration.
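The two sharding schemes can be checked with a toy simulation. This is a minimal sketch using numpy on a single machine, with two "devices" played by array slices; the shapes follow the d_model/d_ff notation above, and the `+` stands in for the all-reduce.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, batch = 4, 8, 3
X = rng.standard_normal((batch, d_model))
W = rng.standard_normal((d_model, d_ff))

# Column-parallel: split W along the output dim; each "device" computes a
# slice of Y, and the shards are concatenated (no summation needed).
W1, W2 = np.split(W, 2, axis=1)
Y_col = np.concatenate([X @ W1, X @ W2], axis=1)

# Row-parallel: split W along the input dim and X along its columns; each
# "device" produces a partial Y of full shape, summed by an all-reduce.
Wa, Wb = np.split(W, 2, axis=0)
Xa, Xb = np.split(X, 2, axis=1)
Y_row = (Xa @ Wa) + (Xb @ Wb)   # the "+" plays the role of the all-reduce

# Both shardings reproduce the unsharded product exactly.
assert np.allclose(Y_col, X @ W)
assert np.allclose(Y_row, X @ W)
```

The column-then-row chaining used by Megatron-LM works because the column-parallel output is already sharded exactly the way the row-parallel layer wants its input.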
Why it matters
TP is essential for models that exceed the memory capacity of a single device. For example, a 175B-parameter model in FP16 requires ~350 GB for parameters alone, far beyond a single 80 GB H100. With 8-way TP, each GPU holds only ~44 GB of parameters, leaving room for optimizer states and activations. TP also divides the per-device compute, enabling faster training and inference on large models.
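The memory figures above are simple arithmetic, sketched here as a back-of-envelope check (2 bytes per FP16 parameter, decimal gigabytes):

```python
# Parameter memory for a 175B-parameter model in FP16, sharded 8 ways with TP.
params = 175e9
bytes_per_param = 2                      # FP16
total_gb = params * bytes_per_param / 1e9
per_device_gb = total_gb / 8             # 8-way tensor parallelism
print(total_gb, per_device_gb)           # 350.0 43.75
```

Note this counts parameters only; optimizer states (e.g., FP32 Adam moments) and activations typically multiply the real footprint several times over.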
When it's used vs alternatives
TP is typically combined with data parallelism (DP) and pipeline parallelism (PP) in a 3D parallelism strategy: DP replicates the model and splits the data, PP splits layers across stages, and TP splits within layers. TP is preferred when the model is too large for a single pipeline stage or when DP's gradient all-reduce becomes a bottleneck. For models up to roughly 13B parameters, DP alone may suffice; for very large models (≥100B), TP is near-mandatory. Alternatives include sequence parallelism (splitting along the sequence length) and expert parallelism (for MoE models).
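In a 3D layout the three degrees multiply to the total device count, which constrains the choices. A minimal sketch (the `layout` helper is hypothetical, not a framework API):

```python
# DP x PP x TP must equal the world size; given two degrees, the third follows.
def layout(world_size: int, tp: int, pp: int) -> dict:
    assert world_size % (tp * pp) == 0, "tp * pp must divide world_size"
    return {"tp": tp, "pp": pp, "dp": world_size // (tp * pp)}

# 512 GPUs with 8-way TP inside each node and 8 pipeline stages
# leaves 8-way data parallelism.
print(layout(512, tp=8, pp=8))   # {'tp': 8, 'pp': 8, 'dp': 8}
```

In practice TP is kept within a node (where the interconnect is fastest), while PP and DP span nodes.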
Common pitfalls
- Communication overhead: TP requires frequent all-reduce operations (every forward/backward pass), which can dominate runtime on slow interconnects. NVLink/NVSwitch or similar high-bandwidth fabrics are necessary.
- Memory overhead: Activations are replicated across TP ranks (unless combined with sequence parallelism), increasing memory pressure.
- Load imbalance: Uneven sharding (e.g., irregular tensor shapes) can cause stragglers.
- Tuning complexity: Optimal TP degree depends on model size, device count, and interconnect bandwidth; grid search or auto-tuners (e.g., Alpa) are often required.
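The communication-overhead pitfall can be made concrete with the standard ring all-reduce cost model, under which each rank moves roughly 2·(n−1)/n times the message size. The activation shape below is an illustrative assumption, not a measurement:

```python
# Approximate bytes moved per rank for one ring all-reduce over an
# activation tensor, using the ~2*(n-1)/n cost model.
def ring_allreduce_bytes(activation_elems: int, bytes_per_elem: int, ranks: int) -> float:
    msg = activation_elems * bytes_per_elem
    return 2 * (ranks - 1) / ranks * msg

# Example: batch 8, sequence 2048, hidden 8192 in FP16, 8 TP ranks.
vol = ring_allreduce_bytes(8 * 2048 * 8192, 2, 8)
print(f"{vol / 1e9:.2f} GB per all-reduce")   # 0.47 GB per all-reduce
```

Multiplied by two all-reduces per transformer block per forward pass (plus backward), this is why TP is usually confined to high-bandwidth intra-node links.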
Current state of the art (2026)
TP is standard in major training frameworks: Megatron-LM, DeepSpeed, and PyTorch (via DTensor and torch.distributed.tensor.parallel; note that FSDP is sharded data parallelism, not TP), with equivalent sharding support in JAX and TensorFlow. NVLink/NVSwitch-connected H100 and B200 nodes make 8-way TP within a node practical. Research focuses on reducing communication: overlapping TP communication with computation, low-precision (e.g., FP8) all-reduce, and adaptive TP that varies the TP degree across layers. Llama 3.1 405B's published training setup uses TP as one axis of its parallelism; closed models such as GPT-4, Gemini, and PaLM-2 are widely assumed to depend on similar techniques, though their training infrastructure is not public.