Quantization is a model compression technique that maps continuous, high-precision values (typically 32-bit floating point, FP32) to a discrete set of lower-precision values (e.g., 8-bit integers, INT8, or even 4-bit). The primary goal is to reduce the model's memory footprint and computational cost, making deployment on edge devices, mobile phones, and GPUs with limited VRAM feasible.
How it works: The most common form is post-training quantization (PTQ). After a model is fully trained in FP32, the range of each weight tensor or activation is calibrated (often using a small representative dataset) to determine scaling factors and zero-points. For symmetric quantization, values are mapped via x_int = round(x_float / scale) and clamped to the integer range (e.g., [-128, 127] for INT8). For asymmetric quantization, a zero-point is added, x_int = round(x_float / scale) + zero_point, to handle non-zero-centered distributions. The scale factor is typically derived from the min/max or a percentile of the observed values. During inference, the quantized integers are used for matrix multiplications (e.g., GEMM on INT8), and results are dequantized back to FP32 only when necessary (e.g., after a layer).
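A minimal sketch of per-tensor symmetric and asymmetric quantization in NumPy. The function names (quantize_symmetric, quantize_asymmetric, dequantize) are illustrative, not taken from any library, and the scale here comes from simple min/max calibration rather than a percentile.

```python
import numpy as np

def quantize_symmetric(x, n_bits=8):
    # Scale chosen from the max absolute value observed during calibration.
    qmax = 2 ** (n_bits - 1) - 1                      # e.g., 127 for INT8
    scale = np.abs(x).max() / qmax
    x_int = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return x_int, scale

def quantize_asymmetric(x, n_bits=8):
    # Zero-point shifts the grid so a non-zero-centered range uses all levels.
    qmin, qmax = 0, 2 ** n_bits - 1                   # e.g., [0, 255] for UINT8
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    x_int = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return x_int, scale, zero_point

def dequantize(x_int, scale, zero_point=0):
    # Map integers back to approximate FP32 values.
    return scale * (x_int.astype(np.float32) - zero_point)

w = np.random.randn(4, 4).astype(np.float32)
w_int, s = quantize_symmetric(w)
print("max abs reconstruction error:", np.abs(w - dequantize(w_int, s)).max())
```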
Why it matters: A model quantized to INT8 uses 4x less memory than FP32, and specialized hardware (e.g., NVIDIA Tensor Cores, Apple Neural Engine, Qualcomm Hexagon) can perform INT8 operations 2-4x faster than FP16/FP32. For large language models (LLMs), 4-bit quantization (e.g., using NF4 or GPTQ) can reduce a 70B-parameter model from ~140 GB in FP16 to ~35 GB, enabling inference on a single high-memory GPU. Quantization-Aware Training (QAT) simulates quantization noise during training, often recovering accuracy to within 0.1-0.5% of the full-precision model, at the cost of extra training time.
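A back-of-the-envelope check of the memory numbers above; this is a weight-only estimate that ignores activations, the KV cache, and per-group scale/zero-point metadata.

```python
# Weight memory for a 70B-parameter model at different precisions.
params = 70e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = params * bits / 8 / 1e9
    print(f"{name}: {gb:.0f} GB")   # FP16 -> 140 GB, INT4 -> 35 GB (the 4x reduction cited above)
```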
When to use vs. alternatives: Use quantization when inference latency or memory is the bottleneck and the target hardware supports low-precision arithmetic. Alternatives include pruning (removing weights) and distillation (training a smaller student model). Quantization is orthogonal to both — you can quantize a pruned or distilled model. For extremely accuracy-sensitive tasks (e.g., medical diagnosis), FP16 may be preferred over INT4.
Common pitfalls: (1) Perplexity spikes on outlier-heavy activations, which techniques like SmoothQuant (Xiao et al., 2023) mitigate by migrating quantization difficulty from activations into the weights. (2) Group size selection: smaller groups (e.g., 32 weights per scale factor) preserve accuracy but increase metadata overhead. (3) Calibration dataset mismatch: a non-representative calibration set skews the scale factors. (4) Hardware support gaps: older GPUs may lack efficient INT8 matrix cores, making quantized inference slower than FP16. The sketch below illustrates pitfalls (1) and (2).
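A toy illustration of why outliers and group size matter, reusing the symmetric scheme from the earlier sketch. The function name and the synthetic outlier value are illustrative only; real methods (SmoothQuant, GPTQ group quantization) are more involved.

```python
import numpy as np

def quant_error(x, n_bits=4, group_size=None):
    # Quantize x in groups with one scale per group; group_size=None means one scale per tensor.
    x = x.reshape(-1, group_size if group_size else x.size)
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax
    x_deq = np.clip(np.round(x / scale), -qmax - 1, qmax) * scale
    return np.abs(x - x_deq).mean(), x.shape[0]       # mean error, number of scales stored

x = np.random.randn(4096).astype(np.float32)
x[0] = 50.0                                           # a single outlier inflates the per-tensor scale
for g in [None, 128, 32]:
    err, n_scales = quant_error(x, group_size=g)
    print(f"group_size={g}: mean abs error={err:.4f}, scales stored={n_scales}")
# Smaller groups isolate the outlier and cut error, at the cost of storing more scale factors.
```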
Current state of the art (2026): The leading PTQ methods for LLMs are GPTQ (Frantar et al., 2023) for weight-only quantization, AWQ (Lin et al., 2024) for activation-aware scaling, and QuIP# (Tseng et al., 2024) for lattice-based codebooks. For QAT, LLM-QAT (Liu et al., 2024) uses a data-free distillation approach. Mixed-precision quantization (e.g., keeping attention layers in FP16, FFN layers in INT4) is now standard. NVIDIA's TensorRT-LLM and the llama.cpp ecosystem support 2-4 bit quantization on CPU/GPU. Research is moving toward hardware-driven quantization (e.g., FP8 training via H100's Transformer Engine) and vector quantization for embedding models.
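A toy estimate of the effective bits per weight under the mixed-precision scheme mentioned above (attention in FP16, FFN in INT4). The hidden size and per-block parameter counts are assumptions for a generic SwiGLU-style transformer layer, not taken from any specific model or tool.

```python
d = 8192                        # hidden size (assumption)
attn_params = 4 * d * d         # Q, K, V, O projections
ffn_params = 3 * d * (4 * d)    # gate/up/down projections (SwiGLU-style, assumption)
avg_bits = (attn_params * 16 + ffn_params * 4) / (attn_params + ffn_params)
print(f"average bits per weight: {avg_bits:.1f}")   # ~7 bits under these assumptions
```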