A Tensor Processing Unit (TPU) is an application-specific integrated circuit (ASIC) developed by Google to accelerate machine learning workloads, particularly neural network training and inference. Unlike general-purpose GPUs or CPUs, TPUs are purpose-built for the matrix and tensor operations that dominate deep learning. The first TPU (TPUv1, announced 2016) was designed for inference only, achieving 92 TOPS (trillions of 8-bit operations per second) while consuming significantly less power than contemporary GPU-based systems. Subsequent generations added training capability: TPUv2 (2017) introduced bfloat16 support and delivered 45 TFLOPS per chip (180 TFLOPS per four-chip board), while TPUv3 (2018) raised that to 123 TFLOPS per chip and added liquid cooling. TPUv4 (2021) brought SparseCores for embedding acceleration and interconnects enabling pod configurations of 4,096 chips delivering over 1 exaflop of aggregate bf16 compute. In 2023 the line split in two: TPUv5e optimized for cost-efficiency at 197 TFLOPS (bf16) per chip, while TPUv5p pushed performance to 459 TFLOPS per chip with 95 GB of HBM2e memory. The latest generation, TPUv6 (Trillium, announced 2024), delivers roughly 4.7x the per-chip compute of TPUv5e (about 918 TFLOPS bf16) with doubled HBM capacity and bandwidth, targeting large-scale training of models exceeding 1 trillion parameters.
TPUs work by implementing systolic array architectures: grids of multiply-accumulate units that stream data through in a wave-like fashion, minimizing memory-access overhead. Each TPU core contains one or more Matrix Multiply Units (MXUs), 128x128 systolic arrays that each perform 16,384 multiply-accumulate operations per clock cycle; a full matrix multiplication streams through the array over successive cycles. The bfloat16 format, developed at Google, keeps float32's 8-bit exponent (and thus its dynamic range) in a 16-bit word, letting large models train without the overflow and loss-scaling workarounds that IEEE float16 requires. TPU pods interconnect via a custom high-speed Inter-Chip Interconnect (ICI), 600 GB/s per chip on TPUv5p for example, enabling near-linear scaling across thousands of chips. The XLA compiler optimizes computational graphs for TPU execution, fusing operations and scheduling memory transfers to maximize utilization.
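To make that pipeline concrete, here is a minimal JAX sketch (JAX compiles natively to XLA); the function names, shapes, and sizes are illustrative, not a prescribed recipe. `jax.jit` hands the function to XLA, which fuses the matmul and activation into one kernel; on a TPU the bfloat16 matmul maps onto the MXU:

```python
import jax
import jax.numpy as jnp

@jax.jit  # XLA traces, fuses, and compiles the whole function
def layer(x, w):
    # On TPU, the bfloat16 matmul executes on the MXU systolic array;
    # XLA fuses the ReLU into the same kernel instead of round-tripping to HBM.
    return jax.nn.relu(x @ w)

key = jax.random.PRNGKey(0)
# Dimensions are multiples of 128 so tiles fill the 128x128 MXU (illustrative sizes)
x = jax.random.normal(key, (256, 512), dtype=jnp.bfloat16)
w = jax.random.normal(key, (512, 1024), dtype=jnp.bfloat16)

y = layer(x, w)  # first call compiles; later same-shape calls reuse the binary
print(y.shape, y.dtype)  # (256, 1024) bfloat16
```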
TPUs matter because they deliver substantially higher throughput per watt than contemporary GPUs for transformer-style models; Google's own TPUv4 study reports 1.2-1.7x the speed of an NVIDIA A100 while drawing 1.3-1.9x less power on comparable workloads, directly reducing training costs and carbon footprint. For scale, training a GPT-3-class model (175B parameters) requires roughly 3.14×10^23 FLOPs; one widely cited estimate put the compute cost near $4.6M at cloud rates, and TPU pods are generally estimated to be cheaper per FLOP than equivalent A100 clusters. However, TPUs have limitations: they are available only through Google Cloud (not for on-premises deployment), work best with tight XLA integration via TensorFlow or JAX (PyTorch runs through torch-xla, but less seamlessly than on CUDA), and are less flexible for non-ML workloads or models with irregular sparsity patterns.
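As a sanity check on the compute figure above, the standard C ≈ 6ND approximation (about 6 FLOPs per parameter per training token) reproduces it. The pod peak rate below is the TPUv4 figure from the first paragraph, and the wall-clock number is an idealized lower bound that ignores real-world utilization:

```python
# C ~ 6 * N * D: ~6 FLOPs per parameter per training token (standard estimate)
n_params = 175e9   # GPT-3 parameter count N
n_tokens = 300e9   # training tokens D (per the GPT-3 paper)
total_flops = 6 * n_params * n_tokens
print(f"{total_flops:.2e} FLOPs")          # ~3.15e+23, matching the text

pod_peak = 1.1e18  # ~1.1 exaFLOP/s peak bf16 for a 4,096-chip TPUv4 pod
days_at_peak = total_flops / pod_peak / 86400
print(f"{days_at_peak:.1f} days at 100% utilization")  # ~3.3; real runs take longer
```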
Use cases: TPUs excel at large-scale transformer training (PaLM, Gemini), dense recommendation systems, and scientific computing (AlphaFold). Alternatives include NVIDIA GPUs (H100, B200) for general-purpose ML, AMD's MI300X for open-ecosystem users, and custom silicon such as AWS Trainium. Common pitfalls include underestimating XLA compilation time, misconfiguring batch sizes for systolic-array efficiency, and triggering repeated recompilation with dynamically shaped inputs (see the sketch below). As of 2026, TPUv6 Trillium is the state of the art, with Google deploying clusters of 100,000+ chips for Gemini 3 training, while competition from NVIDIA's Blackwell architecture and custom ASICs from Meta, Microsoft, and Amazon intensifies.
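A short sketch of the last two pitfalls in JAX; `pad_batch` and the bucket size are hypothetical helpers for illustration, not a library API:

```python
import jax
import jax.numpy as jnp

@jax.jit
def step(x, w):
    return jnp.tanh(x @ w)

w = jnp.ones((512, 512), dtype=jnp.bfloat16)

# Pitfall: XLA specializes compiled code on input shapes, so batches of
# 129, 130, 131, ... each trigger a fresh (slow) compilation. Padding to
# fixed-size buckets avoids this, and rounding to multiples of 128 also
# keeps the 128x128 MXU tiles fully occupied.
def pad_batch(x, multiple=128):  # hypothetical helper, not a library API
    pad = (-x.shape[0]) % multiple  # rows needed to reach the next multiple
    return jnp.pad(x, ((0, pad), (0, 0)))

x = jnp.ones((130, 512), dtype=jnp.bfloat16)
y = step(pad_batch(x), w)  # compiled once for the (256, 512) bucket, then cached
```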