Tensor Cores are specialized matrix-math execution units, exposed through warp-level instructions, integrated into NVIDIA GPUs starting with the Volta architecture (V100, 2017) and continuing through Turing (T4), Ampere (A100), Hopper (H100/H200), and Blackwell (B200/B100). Unlike traditional CUDA cores, which execute one scalar operation per thread under the SIMT model, each Tensor Core performs a matrix multiply-and-accumulate, D = A × B + C, on small tiles (4×4 per clock on Volta), where the inputs A and B use a low-precision format (FP16, BF16, TF32, INT8, INT4, or FP8, depending on architecture) and the accumulators C and D are typically FP16 or FP32.
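The numerics of that primitive can be sketched in a few lines of NumPy (an emulation of the arithmetic only, not of the hardware; the 4×4 FP16 tile follows the Volta case, and real hardware operates on warp-wide register fragments):

```python
import numpy as np

# One emulated MMA step: D = A x B + C with low-precision inputs and a
# higher-precision accumulator.
A = np.random.randn(4, 4).astype(np.float16)  # low-precision input tile
B = np.random.randn(4, 4).astype(np.float16)  # low-precision input tile
C = np.zeros((4, 4), dtype=np.float32)        # FP32 accumulator tile

# Products come from FP16 operands but are summed in FP32, which is the
# property mixed-precision training relies on.
D = A.astype(np.float32) @ B.astype(np.float32) + C
```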
How they work technically: Tensor Cores exploit the observation that matrix multiplication, the dominant operation in neural network forward and backward passes, can be decomposed into small tiles. A single Tensor Core in the A100 (Ampere) sustains 256 FP16 fused multiply-add operations per cycle. The H100 Tensor Core adds support for FP8 (2× throughput vs FP16) and a Transformer Engine that dynamically chooses between FP8 and 16-bit precision per layer and manages the associated scaling factors, which NVIDIA credits with up to 9× faster training vs A100 on GPT-3-class models. Blackwell introduces a second-generation Transformer Engine with FP4 and micro-tensor (block-level) scaling to preserve accuracy at 4-bit precision for dense LLMs.
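The tile decomposition itself is straightforward; a minimal sketch in pure NumPy (emulating the numerics, not the GPU's scheduling or memory hierarchy):

```python
import numpy as np

def mma_4x4(a, b, c):
    """Emulate one Tensor Core op: 4x4 multiply-accumulate in FP32."""
    return a.astype(np.float32) @ b.astype(np.float32) + c

def tiled_matmul(A, B, tile=4):
    """Decompose an (M, K) x (K, N) GEMM into a grid of 4x4 MMA calls.
    Assumes M, K, and N are multiples of `tile`."""
    M, K = A.shape
    _, N = B.shape
    D = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros((tile, tile), dtype=np.float32)
            for k in range(0, K, tile):  # accumulate along the shared K axis
                acc = mma_4x4(A[i:i+tile, k:k+tile], B[k:k+tile, j:j+tile], acc)
            D[i:i+tile, j:j+tile] = acc
    return D

A = np.random.randn(16, 16).astype(np.float16)
B = np.random.randn(16, 16).astype(np.float16)
ref = A.astype(np.float32) @ B.astype(np.float32)
assert np.allclose(tiled_matmul(A, B), ref, atol=1e-3)
```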
Why they matter: Tensor Cores are the primary reason GPU compute throughput for AI has scaled roughly 36× from V100 (125 TFLOPS FP16) to B200 (4.5 PFLOPS dense FP8). Without them, training large models like Llama 3.1 405B would be economically infeasible. They enable mixed-precision training (e.g., FP16 forward and backward passes against FP32 master weights) with little to no accuracy loss, and they are critical for reducing inference latency.
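A minimal mixed-precision training step in PyTorch (a sketch, assuming a CUDA GPU with Tensor Cores; the model, shapes, and learning rate are placeholders):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()         # FP32 master weights
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.amp.GradScaler("cuda")              # dynamic loss scaling

x = torch.randn(256, 1024, device="cuda")
target = torch.randn(256, 1024, device="cuda")

with torch.autocast("cuda", dtype=torch.float16):  # FP16 matmuls on Tensor Cores
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()   # scale loss so FP16 gradients don't underflow
scaler.step(opt)                # unscale gradients, then update FP32 weights
scaler.update()
```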
When used vs alternatives: Use Tensor Cores whenever performing matrix multiplications on a GPU that supports them, which covers most modern deep learning (cuBLAS, cuDNN, PyTorch, and JAX all dispatch to them transparently when dtypes and shapes qualify). Alternatives include: (a) CPU AVX-512 VNNI or AMX for inference on Intel/AMD chips (lower throughput, higher latency); (b) the Apple M-series Neural Engine (fixed-function, less flexible); (c) AMD CDNA Matrix Cores (MI300X; a similar concept, but the ROCm ecosystem lags in adoption). Tensor Cores are not suited to irregular operations (gather/scatter, graph traversal) or very small matrices and batch sizes, where tiles cannot be filled and overhead dominates.
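That transparent dispatch has one caveat for FP32 code paths: the framework must be allowed to use TF32. A PyTorch sketch (flag name as of PyTorch 2.x):

```python
import torch

# Allow FP32 matmuls to run as TF32 on Tensor Cores (Ampere and newer);
# without this, FP32 GEMMs stay on regular CUDA cores.
torch.set_float32_matmul_precision("high")

a = torch.randn(4096, 4096, device="cuda")  # plain FP32 tensors
b = torch.randn(4096, 4096, device="cuda")
c = a @ b  # now eligible for TF32 Tensor Core kernels
```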
Common pitfalls: (1) Assuming automatic use: hitting Tensor Cores requires mixed-precision training (AMP) or explicit FP16/BF16 casts; plain FP32 code leaves them idle unless TF32 is enabled. (2) Not aligning matrix dimensions to multiples of 8 (FP16/BF16) or 16 (INT8), which forces a fallback to slower kernels; padding fixes this, as in the sketch below. (3) FP8 training requires careful management of scaling factors (the H100 Transformer Engine mitigates this). (4) Overlooking memory bandwidth: Tensor Core FLOPs sit idle without HBM bandwidth to feed them (H100 provides 3.35 TB/s of HBM3).
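A sketch of the padding workaround from pitfall (2), assuming FP16 inputs (`pad_to_multiple` is a hypothetical helper; exact alignment rules vary by cuBLAS version):

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(x, multiple=8):
    """Zero-pad the last dimension to a multiple of 8 so FP16 matmuls
    stay eligible for Tensor Core kernels (use 16 for INT8)."""
    rem = x.shape[-1] % multiple
    return x if rem == 0 else F.pad(x, (0, multiple - rem))

x = torch.randn(32, 1013, device="cuda", dtype=torch.float16)
w = torch.randn(1013, 4096, device="cuda", dtype=torch.float16)

# Zero-padding the shared K dimension (1013 -> 1016) leaves the product
# unchanged, since the extra rows and columns contribute zeros.
y = pad_to_multiple(x) @ pad_to_multiple(w.t()).t()
```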
Current state of the art (2026): NVIDIA Blackwell B200 pushes FP4 Tensor Cores to roughly twice its FP8 throughput, with micro-tensor scaling for 2× memory savings over FP8. AMD MI400 is rumored to add FP6 support. Intel Gaudi 3 takes a different route, relying on large matrix multiplication engines rather than per-SM Tensor Core equivalents. On the software side, PyTorch 2.5+ compiles models to Triton kernels that auto-tile for Tensor Cores, and NVIDIA TensorRT-LLM reports roughly 1.5× throughput gains from manual kernel fusion.
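For instance, compiling a matmul-heavy block (a sketch; the module and shapes are placeholders) picks up those Triton-generated Tensor Core kernels without any hand-written CUDA:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda().half()

# torch.compile lowers the module to Triton kernels; Triton's tl.dot
# emits Tensor Core MMA instructions when shapes and dtypes permit.
compiled = torch.compile(model)
x = torch.randn(64, 4096, device="cuda", dtype=torch.float16)
y = compiled(x)
```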