Tensor Cores are specialized matrix-math execution units, exposed through warp-level instructions, integrated into NVIDIA GPUs starting with the Volta architecture (V100, 2017) and continuing through Turing (T4), Ampere (A100), Hopper (H100/H200), and Blackwell (B200/B100). Unlike traditional CUDA cores, which execute one scalar operation per thread under the SIMT model, each Tensor Core performs a matrix multiply-and-accumulate, D = A × B + C, on small tiles (4×4 per clock on Volta), where the inputs A and B use a low-precision format (FP16, BF16, TF32, INT8, INT4, or FP8, depending on architecture) and the accumulators C and D are typically FP16 or FP32.
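The numerics of that primitive can be sketched in a few lines of NumPy (an emulation of the arithmetic only, not of the hardware; the 4×4 FP16 tile follows the Volta case, and real hardware operates on warp-wide register fragments):

```python
import numpy as np

# One emulated MMA step: D = A x B + C with low-precision inputs and a
# higher-precision accumulator.
A = np.random.randn(4, 4).astype(np.float16)  # low-precision input tile
B = np.random.randn(4, 4).astype(np.float16)  # low-precision input tile
C = np.zeros((4, 4), dtype=np.float32)        # FP32 accumulator tile

# Products come from FP16 operands but are summed in FP32, which is the
# property mixed-precision training relies on.
D = A.astype(np.float32) @ B.astype(np.float32) + C
```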
How they work technically: Tensor Cores exploit the observation that matrix multiplication, the dominant operation in neural network forward and backward passes, can be decomposed into small tiles. A single Tensor Core in the A100 (Ampere) sustains 256 FP16 fused multiply-add operations per cycle. The H100 Tensor Core adds support for FP8 (2× throughput vs FP16) and a Transformer Engine that dynamically chooses between FP8 and 16-bit precision per layer and manages the associated scaling factors, which NVIDIA credits with up to 9× faster training vs A100 on GPT-3-class models. Blackwell introduces a second-generation Transformer Engine with FP4 and micro-tensor (block-level) scaling to preserve accuracy at 4-bit precision for dense LLMs.
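The tile decomposition itself is straightforward; a minimal sketch in pure NumPy (emulating the numerics, not the GPU's scheduling or memory hierarchy):

```python
import numpy as np

def mma_4x4(a, b, c):
    """Emulate one Tensor Core op: 4x4 multiply-accumulate in FP32."""
    return a.astype(np.float32) @ b.astype(np.float32) + c

def tiled_matmul(A, B, tile=4):
    """Decompose an (M, K) x (K, N) GEMM into a grid of 4x4 MMA calls.
    Assumes M, K, and N are multiples of `tile`."""
    M, K = A.shape
    _, N = B.shape
    D = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros((tile, tile), dtype=np.float32)
            for k in range(0, K, tile):  # accumulate along the shared K axis
                acc = mma_4x4(A[i:i+tile, k:k+tile], B[k:k+tile, j:j+tile], acc)
            D[i:i+tile, j:j+tile] = acc
    return D

A = np.random.randn(16, 16).astype(np.float16)
B = np.random.randn(16, 16).astype(np.float16)
ref = A.astype(np.float32) @ B.astype(np.float32)
assert np.allclose(tiled_matmul(A, B), ref, atol=1e-3)
```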
Why they matter: Tensor Cores are the primary reason GPU compute throughput for AI has scaled roughly 36× from V100 (125 TFLOPS FP16) to B200 (4.5 PFLOPS dense FP8). Without them, training large models like Llama 3.1 405B would be economically infeasible. They enable mixed-precision training (e.g., FP16 forward and backward passes against FP32 master weights) with little to no accuracy loss, and they are critical for reducing inference latency.
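A minimal mixed-precision training step in PyTorch (a sketch, assuming a CUDA GPU with Tensor Cores; the model, shapes, and learning rate are placeholders):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()         # FP32 master weights
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.amp.GradScaler("cuda")              # dynamic loss scaling

x = torch.randn(256, 1024, device="cuda")
target = torch.randn(256, 1024, device="cuda")

with torch.autocast("cuda", dtype=torch.float16):  # FP16 matmuls on Tensor Cores
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()   # scale loss so FP16 gradients don't underflow
scaler.step(opt)                # unscale gradients, then update FP32 weights
scaler.update()
```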
When used vs alternatives: Use Tensor Cores whenever performing matrix multiplications on a GPU that supports them, which covers most modern deep learning (cuBLAS, cuDNN, PyTorch, and JAX all dispatch to them transparently when dtypes and shapes qualify). Alternatives include: (a) CPU AVX-512 VNNI or AMX for inference on Intel/AMD chips (lower throughput, higher latency); (b) the Apple M-series Neural Engine (fixed-function, less flexible); (c) AMD CDNA Matrix Cores (MI300X; a similar concept, but the ROCm ecosystem lags in adoption). Tensor Cores are not suited to irregular operations (gather/scatter, graph traversal) or very small matrices and batch sizes, where tiles cannot be filled and overhead dominates.
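That transparent dispatch has one caveat for FP32 code paths: the framework must be allowed to use TF32. A PyTorch sketch (flag name as of PyTorch 2.x):

```python
import torch

# Allow FP32 matmuls to run as TF32 on Tensor Cores (Ampere and newer);
# without this, FP32 GEMMs stay on regular CUDA cores.
torch.set_float32_matmul_precision("high")

a = torch.randn(4096, 4096, device="cuda")  # plain FP32 tensors
b = torch.randn(4096, 4096, device="cuda")
c = a @ b  # now eligible for TF32 Tensor Core kernels
```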
Common pitfalls: (1) Assuming automatic use: hitting Tensor Cores requires mixed-precision training (AMP) or explicit FP16/BF16 casts; plain FP32 code leaves them idle unless TF32 is enabled. (2) Not aligning matrix dimensions to multiples of 8 (FP16/BF16) or 16 (INT8), which forces a fallback to slower kernels; padding fixes this, as in the sketch below. (3) FP8 training requires careful management of scaling factors (the H100 Transformer Engine mitigates this). (4) Overlooking memory bandwidth: Tensor Core FLOPs sit idle without HBM bandwidth to feed them (H100 provides 3.35 TB/s of HBM3).
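A sketch of the padding workaround from pitfall (2), assuming FP16 inputs (`pad_to_multiple` is a hypothetical helper; exact alignment rules vary by cuBLAS version):

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(x, multiple=8):
    """Zero-pad the last dimension to a multiple of 8 so FP16 matmuls
    stay eligible for Tensor Core kernels (use 16 for INT8)."""
    rem = x.shape[-1] % multiple
    return x if rem == 0 else F.pad(x, (0, multiple - rem))

x = torch.randn(32, 1013, device="cuda", dtype=torch.float16)
w = torch.randn(1013, 4096, device="cuda", dtype=torch.float16)

# Zero-padding the shared K dimension (1013 -> 1016) leaves the product
# unchanged, since the extra rows and columns contribute zeros.
y = pad_to_multiple(x) @ pad_to_multiple(w.t()).t()
```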
Current state of the art (2026): NVIDIA Blackwell B200 pushes FP4 Tensor Cores to roughly twice its FP8 throughput, with micro-tensor scaling for 2× memory savings over FP8. AMD MI400 is rumored to add FP6 support. Intel Gaudi 3 takes a different route, relying on large matrix multiplication engines rather than per-SM Tensor Core equivalents. On the software side, PyTorch 2.5+ compiles models to Triton kernels that auto-tile for Tensor Cores, and NVIDIA TensorRT-LLM reports roughly 1.5× throughput gains from manual kernel fusion.
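For instance, compiling a matmul-heavy block (a sketch; the module and shapes are placeholders) picks up those Triton-generated Tensor Core kernels without any hand-written CUDA:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 4096),
).cuda().half()

# torch.compile lowers the module to Triton kernels; Triton's tl.dot
# emits Tensor Core MMA instructions when shapes and dtypes permit.
compiled = torch.compile(model)
x = torch.randn(64, 4096, device="cuda", dtype=torch.float16)
y = compiled(x)
```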