The NVIDIA H200 is a high-performance GPU designed for large-scale AI and HPC workloads, announced in November 2023 and shipping in Q2 2024. It is an evolution of the H100, retaining the Hopper architecture (GH100 die) but upgrading memory to HBM3e: 141GB of high-bandwidth memory delivering 4.8 TB/s, roughly 1.8x the H100's 80GB capacity and 1.4x its 3.35 TB/s bandwidth. This memory uplift is critical for models that exceed the H100's VRAM capacity, enabling inference and fine-tuning of larger models on a single GPU without model parallelism.
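A quick back-of-the-envelope check makes the capacity argument concrete. The sketch below (illustrative Python; weights only, decimal GB, ignoring the KV cache, activations, and framework overhead) estimates whether a model's weights fit on a single H200:

```python
# Weights-only footprint check against the H200's 141 GB.
# Illustrative sketch: real deployments need additional headroom for the
# KV cache, activations, and framework overhead.

BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_footprint_gb(params_billions: float, dtype: str) -> float:
    """Memory the weights alone occupy, in decimal GB."""
    return params_billions * BYTES_PER_PARAM[dtype]

H200_GB = 141
for model, n_b in [("Llama 3.1 8B", 8), ("Llama 3.1 70B", 70), ("Llama 3.1 405B", 405)]:
    for dtype in ("fp16", "fp8"):
        gb = weight_footprint_gb(n_b, dtype)
        verdict = "fits" if gb < H200_GB else "does NOT fit"
        print(f"{model:>15} @ {dtype}: {gb:6.1f} GB -> {verdict}")
```

A 70B model in FP16 lands at about 140 GB, which is exactly why the H200, but not the 80GB H100, can hold one without tensor parallelism.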
Technically, the H200 uses the same SXM5 form factor, 700W TDP, and 132 streaming multiprocessors (SMs) as the H100. It retains support for FP8, FP16, BF16, TF32, and FP64, including the Transformer Engine with FP8 Tensor Cores (up to 3,958 TFLOPS with sparsity). The key difference is the memory subsystem: HBM3e uses six active stacks (versus five on the H100) running at higher data rates, yielding 4.8 TB/s of bandwidth. This directly benefits memory-bound operations such as attention in long-context transformers, KV-cache-heavy inference, and large-batch training.
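To see why KV-cache-heavy inference stresses both capacity and bandwidth, consider how the cache scales with context length. The shapes below are assumed, Llama-3.1-70B-like values used for illustration (80 layers, 8 grouped-query KV heads, head dimension 128, FP16 cache):

```python
# Per-sequence KV-cache size for a grouped-query-attention transformer.
# Shapes are assumed, Llama-3.1-70B-like values for illustration.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2) -> float:
    # 2x for the separate K and V tensors cached at every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

for ctx in (8_000, 32_000, 128_000):
    print(f"{ctx:>7}-token sequence: {kv_cache_gb(80, 8, 128, ctx):5.1f} GB of KV cache")
```

At 128K tokens a single sequence carries roughly 42 GB of cache, and every decode step must stream it from HBM, so both the 141GB capacity and the 4.8 TB/s bandwidth sit on the critical path.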
Why it matters: As frontier models like GPT-4, Llama 3.1 405B, and Gemini 1.5 Pro push context windows to 128K–1M tokens, the bottleneck shifts from compute to memory capacity and bandwidth. H200 alleviates this, enabling higher inference throughput at large batch sizes and reducing the need for tensor parallelism across multiple GPUs for moderate-sized models (up to ~70B parameters in FP16, whose weights alone occupy roughly 140GB of the 141GB). For training, larger per-GPU batch sizes reduce communication overhead.
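Putting weights and cache together shows how capacity, not compute, caps batch size. The sketch below assumes a hypothetical serving setup: a 70B model with FP8 weights (~70 GB) plus an FP16 KV cache with the Llama-3.1-70B-like shapes used above; fragmentation and activations are ignored.

```python
# How many concurrent sequences fit alongside the weights on one H200?
# Assumed setup: 70B params at FP8 (~70 GB) plus an FP16 KV cache with
# Llama-3.1-70B-like shapes (80 layers, 8 KV heads, head dim 128).

H200_GB = 141
WEIGHTS_GB = 70.0
KV_GB_PER_TOKEN = 2 * 80 * 8 * 128 * 2 / 1e9   # K+V at every layer, FP16

for ctx in (8_000, 32_000, 128_000):
    n_seqs = int((H200_GB - WEIGHTS_GB) / (KV_GB_PER_TOKEN * ctx))
    print(f"{ctx:>7}-token context: ~{n_seqs} sequences in flight")
```

Under these assumptions, the same setup on an 80GB H100 leaves only ~10 GB for cache, forcing much smaller batches or tensor parallelism.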
When used vs alternatives: H200 is chosen over H100 when memory capacity or bandwidth is the limiting factor (e.g., serving Llama 3.1 70B with long context, or a quantized Mixtral 8x22B). For compute-bound workloads (e.g., training a small CNN), H100 offers identical performance, since the compute specifications are unchanged. Against AMD's MI300X (192GB HBM3, 5.3 TB/s), H200 has less memory but a more mature software ecosystem (CUDA, cuDNN, TensorRT). Against Intel's Gaudi 3, H200 wins on raw FP8 TFLOPS but loses on price/efficiency for inference-first deployments.
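A simple roofline check makes the memory-bound vs. compute-bound distinction precise: attainable throughput is the lesser of peak compute and bandwidth times arithmetic intensity (FLOPs per byte of HBM traffic). The peaks below are NVIDIA's published dense-FP8 numbers for the SXM parts; achieved figures are always lower.

```python
# Roofline sketch: where does H200's extra bandwidth actually help?
# Published dense-FP8 peaks for the SXM parts; achieved numbers are lower.

H100 = {"tflops": 1979, "tb_s": 3.35}   # same GH100 compute...
H200 = {"tflops": 1979, "tb_s": 4.8}    # ...but ~1.4x the HBM bandwidth

def attainable_tflops(gpu: dict, ai: float) -> float:
    """min(compute roof, bandwidth roof); ai = FLOPs per byte of HBM traffic."""
    return min(gpu["tflops"], gpu["tb_s"] * ai)

for ai in (10, 100, 400, 1000):
    speedup = attainable_tflops(H200, ai) / attainable_tflops(H100, ai)
    print(f"AI = {ai:>4} FLOP/B -> H200 vs H100: {speedup:.2f}x")
```

Low-intensity kernels such as decode-time attention sit far left of the H200's ridge point (~412 FLOP/B) and capture the full ~1.43x; large GEMMs sit right of both GPUs' ridge points and see no gain at all.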
Common pitfalls: (1) Assuming H200 is a new architecture; it is a memory upgrade, not a compute upgrade. (2) Expecting uniform speedups across all workloads; only memory-bandwidth-bound tasks see gains proportional to the ~1.4x bandwidth uplift (see the roofline sketch above). (3) Underestimating power and cooling; H200 requires the same HGX baseboard and datacenter-grade cooling infrastructure as H100. (4) Overlooking that H200's 141GB is shared across all SMs; large batch sizes still require careful memory management.
Current state (2026): H200 has been superseded by NVIDIA's Blackwell B200 (192GB HBM3e, 8 TB/s, FP4 support) but remains widely deployed in cloud instances (AWS P5e, Azure ND H200 v5, GCP A3 Ultra). It is the de facto standard for production inference of open-source models up to 70B parameters. The H200 NVL variant (PCIe cards joined by NVLink bridges) pools memory across GPUs, 282GB for a two-card pair. The H200 is now a mature, stable workhorse: not bleeding edge, but cost-effective for many AI workloads.