
NVLink: definition + examples

NVLink is a proprietary wire-level protocol and physical interconnect developed by NVIDIA to connect multiple GPUs within a single node, bypassing the traditional PCIe bottleneck. It provides a high-bandwidth, low-latency fabric, either point-to-point or switched, that lets GPUs read and write each other's memory directly (peer-to-peer memory access) and share data with minimal overhead.

How it works: NVLink creates a direct, high-speed connection between GPUs, typically using multiple links per GPU (e.g., 4, 6, 12, or 18 links depending on the generation). Each link provides 25–50 GB/s of bandwidth per direction (depending on generation), and aggregate per-GPU bandwidth reaches 900 GB/s in Hopper-generation NVSwitch-based systems (e.g., NVIDIA DGX H100 and DGX GH200). The protocol supports memory reads/writes and atomic operations, enabling efficient collective communication patterns like all-reduce and all-gather. NVLink is often complemented by NVSwitch, a fully connected crossbar switch that allows all GPUs in a node to communicate at full NVLink bandwidth simultaneously.
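
A quick way to see this in practice is to check peer-to-peer access and time a raw GPU-to-GPU copy. The PyTorch sketch below is illustrative only (the buffer size and device indices are arbitrary); on NVLink-equipped parts the copy runs over NVLink, while PCIe-only boards fall back to PCIe.

```python
# Minimal sketch: check whether two GPUs can access each other's memory
# directly and time a device-to-device copy. On NVLink-connected GPUs the
# copy runs over NVLink; on PCIe-only boards it falls back to PCIe.
import time
import torch

assert torch.cuda.device_count() >= 2, "needs at least two GPUs"

# True if GPU 0 can map GPU 1's memory directly (NVLink or PCIe peer-to-peer).
print("P2P access 0<->1:", torch.cuda.can_device_access_peer(0, 1))

src = torch.empty(256 * 1024 * 1024, dtype=torch.uint8, device="cuda:0")  # 256 MiB buffer

torch.cuda.synchronize("cuda:0")
t0 = time.perf_counter()
dst = src.to("cuda:1")                     # direct GPU-to-GPU copy
torch.cuda.synchronize("cuda:1")
t1 = time.perf_counter()

print(f"observed copy bandwidth: {src.numel() / (t1 - t0) / 1e9:.1f} GB/s")
```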

Why it matters: Large-scale AI models (e.g., GPT-4, Llama 3.1 405B, Gemini 1.5) require massive parallelism across hundreds or thousands of GPUs. Training such models involves frequent gradient synchronization and tensor parallelism, which are communication-intensive. NVLink reduces communication overhead by roughly 3–10x compared to PCIe Gen5, which translates directly into higher training throughput and shorter time-to-train. For inference, NVLink enables tensor parallelism across GPUs with minimal latency penalties, allowing larger models to be served on a single node.
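
As a concrete illustration of the gradient-synchronization pattern described above, the following minimal torch.distributed sketch performs an all-reduce with the NCCL backend, which routes intra-node traffic over NVLink when the hardware provides it. Tensor shapes and the launch command are illustrative, not prescriptive.

```python
# Minimal sketch of the all-reduce pattern used for gradient synchronization.
# With the NCCL backend, intra-node reductions run over NVLink/NVSwitch when
# available. Launch with: torchrun --nproc_per_node=8 this_file.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")          # NCCL chooses NVLink/P2P automatically
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    # Stand-in for a gradient tensor; the size here is illustrative only.
    grad = torch.full((1024, 1024), float(rank), device="cuda")

    dist.all_reduce(grad, op=dist.ReduceOp.SUM)      # sum across all ranks
    grad /= dist.get_world_size()                    # average, as in data parallelism

    if rank == 0:
        print("averaged gradient sample:", grad[0, 0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```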

When it's used vs alternatives: NVLink is used within a single node (e.g., 4–8 GPUs in an NVIDIA DGX system) where physical proximity allows direct copper or optical cabling. For inter-node connections, NVIDIA uses InfiniBand (e.g., Quantum-2 400 Gb/s) or Ethernet (e.g., Spectrum-X) with NCCL (NVIDIA Collective Communications Library) as the software abstraction. Alternatives include AMD Infinity Fabric (for AMD GPUs) and Intel Xe Link (for Intel GPUs), but NVLink dominates the AI/ML ecosystem due to NVIDIA's market share and tight integration with CUDA and PyTorch.
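
Because NCCL is the software abstraction in both the intra-node and inter-node cases, it is worth verifying which transport it actually selects (NVLink peer-to-peer inside the node versus PCIe, shared memory, or the network). The sketch below enables NCCL's standard debug logging before process-group initialization; it is a minimal pattern, not a complete training script.

```python
# Minimal sketch: make NCCL report which transport it selected. The startup
# log printed by each rank shows the chosen channel transports (P2P over
# NVLink, SHM, or NET). Run under torchrun so rank/world-size env vars exist.
import os

os.environ["NCCL_DEBUG"] = "INFO"               # print transport/topology decisions
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"  # limit logging to setup and ring/tree graphs
# os.environ["NCCL_P2P_DISABLE"] = "1"          # uncomment to force the PCIe/SHM fallback for comparison

import torch.distributed as dist
dist.init_process_group(backend="nccl")         # NCCL logs appear here, one block per rank
```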

Common pitfalls: (1) Over-reliance on NVLink for all communication — for very large clusters, inter-node bandwidth (e.g., InfiniBand) becomes the bottleneck, not intra-node NVLink. (2) Misconfiguration of the NVLink topology (e.g., not all links enabled) can halve effective bandwidth. (3) NVLink is not available on consumer GPUs (e.g., the GeForce RTX 4090 uses only PCIe), so code that assumes NVLink bandwidth may run far slower, or fail outright, on hardware without it. (4) NVLink memory coherence is not automatic — developers must explicitly manage data placement and synchronization.
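
One way to catch pitfalls (2) and (3) early is to query the NVLink link state at startup. The sketch below assumes the NVML Python bindings (nvidia-ml-py / pynvml) are installed; a lower-than-expected count points at a topology or cabling problem, and GPUs without NVLink simply report zero active links.

```python
# Sketch: count active NVLink links per GPU with the NVML Python bindings
# (pip install nvidia-ml-py). Consumer GPUs without NVLink report zero links
# or an "unsupported" error, which this loop treats as zero.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        active = 0
        for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
            try:
                if pynvml.nvmlDeviceGetNvLinkState(handle, link) == pynvml.NVML_FEATURE_ENABLED:
                    active += 1
            except pynvml.NVMLError:
                break  # link index not present or NVLink unsupported on this GPU
        print(f"GPU {i}: {active} active NVLink link(s)")
finally:
    pynvml.nvmlShutdown()
```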

Current state of the art (2026): NVLink 5 is the latest generation, offering 50 GB/s per link per direction with up to 18 links per GPU (1.8 TB/s aggregate per GPU). The NVIDIA DGX GH200 pairs a Grace CPU with a Hopper GPU on each superchip via NVLink-C2C and connects 256 such superchips through NVSwitch into a single system exposing 144 TB of shared memory. NVLink is also used on the NVIDIA HGX H100 and B100 baseboards. Research continues on optical NVLink (co-packaged optics) to extend reach and reduce power, with NVIDIA's announced "NVLink 6" expected to double bandwidth again by 2028.

Examples

  • NVIDIA DGX H100 uses 18 NVLink 4 links per GPU (900 GB/s aggregate) for intra-node GPU communication during training of Llama 3.1 405B.
  • NVIDIA DGX GH200 connects 256 Grace Hopper superchips via NVLink and NVSwitch, enabling an NVLink-connected memory pool of 144 TB for large models like GPT-4-class systems.
  • PyTorch's FSDP (Fully Sharded Data Parallel) leverages NVLink for fast gradient reduce-scatter and parameter all-gather collectives across 8 GPUs on an HGX A100 baseboard (see the sketch after this list).
  • The Megatron-LM framework uses NVLink for tensor parallelism in training Megatron-Turing NLG 530B on 4,000+ GPUs, where intra-node NVLink reduces communication overhead.
  • NVIDIA's NVLink Switch System (NVSwitch) in the DGX SuperPOD provides a fully connected fabric for up to 256 H100 GPUs, enabling all-to-all communication at NVLink speeds.
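
As referenced in the FSDP example above, a minimal sketch of wrapping a model with FSDP on one NVLink-connected baseboard might look like the following; the model size, launch command, and hyperparameters are illustrative only.

```python
# Minimal FSDP sketch: shard parameters, gradients, and optimizer state across
# the GPUs on one baseboard. The reduce-scatter / all-gather collectives issued
# by FSDP run over NVLink via NCCL when present.
# Launch with: torchrun --nproc_per_node=8 this_file.py
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()
model = FSDP(model)                               # shard the model across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for _ in range(3):                                # toy training loop
    x = torch.randn(8, 4096, device="cuda")
    loss = model(x).square().mean()
    loss.backward()                               # gradients reduce-scattered across GPUs
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()
```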

Related terms

NVSwitch · NCCL · InfiniBand · Tensor Parallelism · GPU Memory Bandwidth

FAQ

What is NVLink?

NVLink is a high-bandwidth, low-latency GPU-to-GPU interconnect developed by NVIDIA, enabling direct memory access and fast data transfer between multiple GPUs for scalable AI training and inference.

How does NVLink work?

NVLink connects GPUs within a node through multiple high-speed links per GPU (for example, 18 on Hopper-class parts), each providing tens of GB/s per direction. GPUs can read and write each other's memory directly, and an NVSwitch fabric lets every GPU in the node communicate at full NVLink bandwidth simultaneously, which makes collectives such as all-reduce and all-gather efficient.

Where is NVLink used in 2026?

NVIDIA DGX H100 uses 18 NVLink 4 links per GPU (900 GB/s aggregate) for intra-node GPU communication during training of Llama 3.1 405B. NVIDIA DGX GH200 connects 256 Grace Hopper superchips via NVLink and NVSwitch, enabling a shared memory pool of 144 TB for GPT-4-class models. PyTorch's FSDP (Fully Sharded Data Parallel) leverages NVLink for fast gradient synchronization across 8 GPUs on an HGX A100 baseboard.