
HBM: definition + examples

High Bandwidth Memory (HBM) is a specialized DRAM technology designed to overcome the memory bandwidth bottleneck in high-performance computing, particularly for AI/ML accelerators like GPUs and TPUs. Unlike traditional DDR memory, which sits off-package in a planar layout, HBM stacks multiple DRAM dies vertically, interconnected by through-silicon vias (TSVs) and micro-bumps, and sits on a silicon interposer alongside the processor. This architecture allows for a wide memory bus—typically 1024 bits per stack—enabling aggregate bandwidth far exceeding DDR5. For example, HBM3, widely deployed as of 2026, delivers up to 819 GB/s per stack, with the JEDEC specification supporting up to 6.4 Gbps per pin; it is used in the NVIDIA H100 (80 GB at 3.35 TB/s) and AMD Instinct MI300X (192 GB at 5.3 TB/s across eight stacks). HBM3e (extended) pushes per-stack bandwidth past 1 TB/s and appears in the NVIDIA H200 (141 GB at 4.8 TB/s).
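
The headline numbers fall out of simple arithmetic: peak per-stack bandwidth is the per-pin data rate times the bus width. A minimal Python sketch, using the figures quoted above (the pin rates and stack counts are illustrative, not vendor datasheet values):

    # Back-of-the-envelope HBM bandwidth arithmetic.
    def stack_bandwidth_gbs(pin_rate_gbps: float, bus_width_bits: int = 1024) -> float:
        """Peak per-stack bandwidth in GB/s: per-pin rate (Gb/s) * bus width / 8 bits per byte."""
        return pin_rate_gbps * bus_width_bits / 8.0

    # JEDEC HBM3 maximum: 6.4 Gbps per pin on a 1024-bit interface.
    print(stack_bandwidth_gbs(6.4))        # 819.2 GB/s per stack

    # A device's total is per-stack bandwidth times the number of active stacks; shipping parts
    # often clock HBM below the spec ceiling, e.g. 3.35 TB/s over five stacks implies ~5.2 Gbps/pin.
    print(3350 / 5 * 8 / 1024)             # ≈ 5.23 Gbps per pin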

How it works: HBM stacks consist of 4 to 16 DRAM dies (layers) connected vertically via TSVs. A base logic die at the bottom of the stack handles buffering, test/repair logic, and the physical interface, connecting to the GPU/CPU's memory controller through the interposer. The wide bus is divided into independent channels (e.g., 16 channels of 64 bits each per stack in HBM3), each with its own command/address bus. This parallelism, combined with short on-interposer traces and low voltage swings (1.1 V typical), reduces power per bit compared to off-package DDR, cutting energy consumption by up to 40%.
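
A rough sketch of the channel split and the power argument; the picojoule-per-bit figures below are assumed, order-of-magnitude values chosen only to illustrate why shorter traces and lower swings save energy, not measured numbers:

    # How the 1024-bit bus splits into channels, plus a rough I/O power estimate.
    CHANNELS_PER_STACK = 16                                      # HBM3: 16 independent channels per stack
    BUS_WIDTH_BITS = 1024
    CHANNEL_WIDTH_BITS = BUS_WIDTH_BITS // CHANNELS_PER_STACK    # 64 bits per channel

    def io_power_watts(bandwidth_gbs: float, energy_pj_per_bit: float) -> float:
        """Approximate I/O power: bytes/s * 8 bits/byte * energy per bit."""
        return bandwidth_gbs * 1e9 * 8 * energy_pj_per_bit * 1e-12

    # Illustrative comparison at the same 819 GB/s, assuming ~4 pJ/bit for on-interposer HBM
    # and ~7 pJ/bit for an off-package interface (assumed values, not vendor specs).
    print(io_power_watts(819.2, 4.0))   # ≈ 26 W
    print(io_power_watts(819.2, 7.0))   # ≈ 46 W, i.e. roughly 40% more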

Why it matters: AI models have grown exponentially—GPT-4 (reportedly around 1.8 trillion parameters) and Llama 3.1 405B require hundreds of GB of memory and massive bandwidth for training. HBM makes it possible to keep model weights in high-speed memory close to the compute, reducing the data-movement bottlenecks that dominate training time. Without HBM, training would be limited by PCIe bandwidth or require exotic interconnects. For inference, HBM allows serving large models with lower latency. Alternatives such as GDDR6X (used in consumer GPUs) offer lower bandwidth per watt and smaller capacities; LPDDR5 is slower but cheaper for edge devices. HBM's main drawback is cost and complexity: it requires advanced packaging (interposer, TSVs), yields are lower, and it runs roughly 3-5x more expensive per GB than DDR5.
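
A back-of-the-envelope sketch of why bandwidth, not just capacity, sets serving speed: a memory-bound decode step streams every weight once per token, so the token rate is bounded by bandwidth divided by weight bytes. The FP16 assumption and the 70B example model are illustrative, not benchmarks:

    # Memory footprint of model weights and a bandwidth-bound upper limit on decode speed.
    def weights_gb(params_billions: float, bytes_per_param: int = 2) -> float:
        """Size of the weights in GB, assuming FP16 (2 bytes per parameter) by default."""
        return params_billions * 1e9 * bytes_per_param / 1e9

    def max_tokens_per_s(params_billions: float, bandwidth_gbs: float, bytes_per_param: int = 2) -> float:
        """Upper bound on single-stream decode rate when each token must read all weights from HBM."""
        return bandwidth_gbs / weights_gb(params_billions, bytes_per_param)

    print(weights_gb(405))              # Llama 3.1 405B in FP16 -> 810 GB, more than any single GPU's HBM
    print(max_tokens_per_s(70, 3350))   # a hypothetical 70B model on 3.35 TB/s HBM -> ~24 tokens/s ceiling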

Common pitfalls: (1) Assuming HBM eliminates all memory bottlenecks—kernel launch overhead, PCIe transfers, and data preprocessing still matter. (2) Overlooking thermal management—HBM stacks generate significant heat; proper cooling (e.g., liquid cooling in NVIDIA DGX systems) is essential. (3) Misconfiguring memory allocation—PyTorch's torch.cuda.empty_cache() does not release HBM while tensors are still referenced (see the sketch below). (4) Ignoring capacity limits—HBM3 stacks top out at 64 GB (16-Hi); models exceeding available HBM require model parallelism or CPU offloading.
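
Pitfall (3) is easy to reproduce. The sketch below (assuming PyTorch and a CUDA-capable GPU are available) shows that torch.cuda.empty_cache() only returns cached blocks to the driver once the last Python reference to a tensor is gone:

    import torch

    x = torch.randn(4096, 4096, device="cuda")               # ~64 MiB of HBM backing a live tensor
    print(torch.cuda.memory_allocated() // 2**20, "MiB in use")
    print(torch.cuda.memory_reserved() // 2**20, "MiB reserved by the caching allocator")

    torch.cuda.empty_cache()                                   # no effect: x is still referenced
    print(torch.cuda.memory_reserved() // 2**20, "MiB still reserved")

    del x                                                      # drop the last reference first...
    torch.cuda.empty_cache()                                   # ...then cached blocks return to the driver
    print(torch.cuda.memory_reserved() // 2**20, "MiB reserved after release")   # back near 0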

State of the art (2026): HBM4 is on the horizon, expected in volume in 2026-2027, doubling the interface to 2048 bits per stack; the JEDEC specification targets roughly 2 TB/s per stack and capacities up to 64 GB per stack. Samsung and SK Hynix are sampling HBM4 with 16-Hi stacks at per-pin rates in the 8-10 Gbps range. Memory bandwidth remains a key differentiator for AI hardware; the shift from HBM2e (2.0 TB/s on the A100 80 GB) to HBM3 (3.35 TB/s on the H100) yielded 30-40% faster training for models like GPT-3. New packaging techniques (e.g., hybrid bonding) may reduce power further. Future systems will likely combine HBM with near-memory compute (e.g., Samsung's HBM-PIM) to reduce data movement.

Examples

  • NVIDIA H100 SXM GPU uses five active HBM3 stacks, 80 GB total at 3.35 TB/s bandwidth.
  • AMD Instinct MI300X features 8 stacks of HBM3, 192 GB capacity at 5.3 TB/s.
  • GPT-4 was reportedly trained on roughly 25,000 A100 GPUs (40 GB of HBM2 each), requiring model parallelism across HBM pools.
  • Tesla's Dojo system places HBM on its interface processors rather than on the D1 compute die, which relies on on-chip SRAM, for training autonomous driving models.
  • Samsung's HBM-PIM prototype integrates processing-in-memory logic on HBM stack, reducing energy by 60% for recommendation models.

Related terms

GDDR6X · DDR5 · NVLink · Silicon Interposer · Processing-in-Memory (PIM)

FAQ

What is HBM?

HBM (High Bandwidth Memory) is a 3D-stacked DRAM technology that provides extremely high bandwidth and low power consumption for AI accelerators, enabling large model training and inference.

How does HBM work?

HBM stacks 4 to 16 DRAM dies vertically, connected by through-silicon vias (TSVs) to a base logic die that interfaces with the GPU/CPU over a silicon interposer. The 1024-bit bus is split into independent channels, delivering very high aggregate bandwidth at lower power per bit than off-package DDR.

Where is HBM used in 2026?

The NVIDIA H100 SXM GPU uses five active HBM3 stacks for 80 GB at 3.35 TB/s of bandwidth, and the AMD Instinct MI300X pairs eight HBM3 stacks for 192 GB at 5.3 TB/s. GPT-4 was reportedly trained on roughly 25,000 A100 GPUs (40 GB of HBM2 each), with model parallelism spreading the weights across their HBM pools.