gentic.news — AI News Intelligence Platform
Training & Inference

Quantization: definition + examples

Quantization is a model compression technique that maps continuous, high-precision values (typically 32-bit floating point, FP32) to a discrete set of lower-precision values (e.g., 8-bit integers, INT8, or even 4-bit). The primary goal is to reduce the model's memory footprint and computational cost, making deployment on edge devices, mobile phones, and GPUs with limited VRAM feasible.

How it works: The most common form is post-training quantization (PTQ). After a model is fully trained in FP32, the range of each weight tensor or activation is calibrated (often using a small representative dataset) to determine scaling factors and zero-points. For symmetric quantization, values are mapped via: x_int = round(x_float / scale). For asymmetric quantization, a zero-point is added to handle non-zero-centered distributions. The scale factor is typically derived from the min/max or percentile of the observed values. During inference, quantized integers are used for matrix multiplications (e.g., GEMM on INT8), and results are dequantized back to FP32 only when necessary (e.g., after a layer).
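The symmetric and asymmetric mappings above can be sketched in a few lines of numpy. This is a minimal round-to-nearest PTQ round trip, not a production kernel; the function names and the toy tensor are illustrative:

```python
import numpy as np

def quantize_symmetric(x, num_bits=8):
    """Map floats to signed integers with a single scale (zero maps to zero)."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for INT8
    scale = np.abs(x).max() / qmax              # calibrated from the observed range
    x_int = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return x_int, scale

def quantize_asymmetric(x, num_bits=8):
    """Add a zero-point so non-zero-centered ranges use the full integer grid."""
    qmin, qmax = 0, 2 ** num_bits - 1           # 0..255 for unsigned INT8
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(np.round(qmin - x.min() / scale))
    x_int = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return x_int, scale, zero_point

def dequantize(x_int, scale, zero_point=0):
    """Recover approximate floats; per-element error is at most scale / 2."""
    return (x_int.astype(np.float32) - zero_point) * scale

# Round-trip a small weight tensor and check the reconstruction error.
w = np.array([-1.2, -0.3, 0.0, 0.4, 0.9], dtype=np.float32)
w_int, s = quantize_symmetric(w)
w_hat = dequantize(w_int, s)                    # max |w - w_hat| <= s / 2
```

In a real pipeline the scale and zero-point are computed once during calibration and stored alongside the integer tensor, so only the cheap `dequantize` step runs at inference time.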

Why it matters: A model quantized to INT8 uses 4x less memory than FP32, and specialized hardware (e.g., NVIDIA Tensor Cores, Apple Neural Engine, Qualcomm Hexagon) can perform INT8 operations 2-4x faster than FP16/FP32. For large language models (LLMs), 4-bit quantization (e.g., using NF4 or GPTQ) can reduce a 70B-parameter model from ~140 GB in FP16 to ~35 GB, enabling inference on a single high-memory GPU. Quantization-Aware Training (QAT) simulates quantization noise during training, often recovering accuracy to within 0.1-0.5% of the full-precision model, at the cost of extra training time.
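The memory figures above follow directly from bits per parameter. A back-of-envelope sketch (weight storage only; it ignores activations, KV cache, and the few percent of overhead from per-group scales and zero-points):

```python
def model_memory_gb(num_params, bits_per_param):
    """Approximate weight-memory footprint: params * bits / 8 bytes, in GB."""
    return num_params * bits_per_param / 8 / 1e9

# A 70B-parameter model at different precisions:
fp16_gb = model_memory_gb(70e9, 16)   # ~140 GB
int8_gb = model_memory_gb(70e9, 8)    # ~70 GB
int4_gb = model_memory_gb(70e9, 4)    # ~35 GB
```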

When to use vs. alternatives: Use quantization when inference latency or memory is the bottleneck and the target hardware supports low-precision arithmetic. Alternatives include pruning (removing weights) and distillation (training a smaller student model). Quantization is orthogonal to both — you can quantize a pruned or distilled model. For extremely accuracy-sensitive tasks (e.g., medical diagnosis), FP16 may be preferred over INT4.

Common pitfalls: (1) Perplexity spikes on outlier-heavy activations — smoothed by techniques like SmoothQuant (Xiao et al., 2023). (2) Group size selection: smaller groups (e.g., 32) preserve accuracy but increase overhead. (3) Calibration dataset mismatch: using a non-representative set can skew scale factors. (4) Hardware support gaps: older GPUs may lack efficient INT8 matrix cores, making quantized inference slower than FP16.
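Pitfalls (1) and (3) can be demonstrated numerically: a min/max-calibrated scale lets a handful of activation outliers stretch the INT8 grid so far that the bulk of values round coarsely. A percentile-based cutoff is one simple mitigation (SmoothQuant instead migrates outlier magnitude into the weights); the synthetic data, outlier magnitude, and 99.9th-percentile choice here are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=10_000).astype(np.float32)
acts[:5] = 60.0                 # a few extreme outliers, common in LLM activations

qmax = 127
scale_minmax = np.abs(acts).max() / qmax                  # outliers set the scale
scale_pctile = np.percentile(np.abs(acts), 99.9) / qmax   # clip the tail instead

def mean_roundtrip_err(x, scale):
    """Average absolute error after an INT8 quantize/dequantize round trip."""
    x_int = np.clip(np.round(x / scale), -128, 127)
    return np.abs(x - x_int * scale).mean()

# The outlier-driven scale wastes most of the INT8 range on five values,
# so the average error over the whole tensor is much larger.
err_minmax = mean_roundtrip_err(acts, scale_minmax)
err_pctile = mean_roundtrip_err(acts, scale_pctile)
```

The percentile scale clips the five outliers hard, but the accuracy gained on the other 9,995 values more than pays for it, which is exactly why calibration on a non-representative dataset (pitfall 3) can be so damaging: it picks the wrong point on this trade-off.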

Current state of the art (2026): The leading PTQ methods for LLMs are GPTQ (Frantar et al., 2023) for weight-only quantization, AWQ (Lin et al., 2024) for activation-aware scaling, and QuIP# (Tseng et al., 2024) for lattice-based codebooks. For QAT, LLM-QAT (Liu et al., 2024) uses a data-free distillation approach. Mixed-precision quantization (e.g., keeping attention layers in FP16, FFN layers in INT4) is now standard. NVIDIA's TensorRT-LLM and the llama.cpp ecosystem support 2-4 bit quantization on CPU/GPU. Research is moving toward hardware-driven quantization (e.g., FP8 training via H100's Transformer Engine) and vector quantization for embedding models.
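The weight-only, group-wise storage format shared by methods like GPTQ and AWQ can be sketched with plain round-to-nearest quantization. This is not GPTQ itself (which additionally compensates rounding error across columns) or AWQ (which rescales channels by activation importance); the group size and layout are illustrative:

```python
import numpy as np

def quantize_groupwise(w, num_bits=4, group_size=32):
    """Round-to-nearest group-wise weight quantization: one FP scale per group.
    Smaller groups track local ranges better but store more scale metadata."""
    qmax = 2 ** (num_bits - 1) - 1                  # 7 for signed 4-bit
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize_groupwise(q, scales, shape):
    """Expand each group by its scale and restore the original shape."""
    return (q.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)    # a toy weight matrix
q, scales = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, scales, w.shape)
# Per-element reconstruction error is bounded by half of that group's scale.
```

Deployed formats pack two 4-bit values per byte and often quantize the scales themselves; the group_size argument is the accuracy/overhead knob discussed under common pitfalls.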

Examples

  • 4-bit GPTQ quantization reduces Llama 3.1 70B from ~140 GB (FP16) to ~35 GB, small enough to fit a single 48 GB GPU or two 24 GB RTX 4090s.
  • Stable Diffusion XL (SDXL) uses FP16 by default; INT8 quantization via TensorRT cuts VRAM from ~8 GB to ~4 GB with <1% FID increase.
  • Google's Gemma 2 27B supports 4-bit AWQ quantization in Hugging Face Transformers, enabling inference on a 24 GB GPU.
  • NVIDIA's TensorRT-LLM achieves 2.5x throughput improvement for GPT-175B by using FP8 quantization on H100 Tensor Cores.
  • Qualcomm's AI Engine uses INT8 quantization for on-device Whisper speech recognition, reducing latency from 2s to 0.5s on Snapdragon 8 Gen 3.

Related terms

Pruning · Knowledge Distillation · Mixed-Precision Training · Model Compression · FP8 Training

FAQ

What is Quantization?

Quantization reduces the numerical precision of a model's weights and activations (e.g., from 32-bit floats to 8-bit integers), shrinking memory footprint and accelerating inference with minimal accuracy loss.

How does Quantization work?

Quantization works by calibrating the observed range of each weight or activation tensor (often on a small representative dataset) to derive a scale factor and, for asymmetric schemes, a zero-point. Floating-point values are then mapped to integers via x_int = round(x_float / scale), matrix multiplications run in low-precision integer arithmetic, and results are dequantized back to higher precision only where needed.

Where is Quantization used in 2026?

4-bit GPTQ quantization reduces Llama 3.1 70B from ~140 GB (FP16) to ~35 GB, fitting a single 48 GB GPU or two 24 GB cards. Stable Diffusion XL (SDXL) uses FP16 by default; INT8 quantization via TensorRT cuts VRAM from ~8 GB to ~4 GB with <1% FID increase. Google's Gemma 2 27B supports 4-bit AWQ quantization in Hugging Face Transformers, enabling inference on a 24 GB GPU.