
Post-Training Quantization: definition + examples

Post-Training Quantization (PTQ) is a compression technique applied to a fully trained neural network to reduce its computational and memory requirements. Instead of altering the model during training (as in Quantization-Aware Training, QAT), PTQ converts weights and activations from high-precision floating-point formats (typically FP32 or FP16) to lower-precision integer formats (most commonly INT8, but also INT4, FP8, or even binary). The core idea is to map the continuous range of floating-point values into a discrete set of integer levels, usually via an affine transformation: x_int = round(x_float / scale) + zero_point. The scale and zero-point parameters are calibrated using a small, representative calibration dataset (often a few hundred to a few thousand samples) to minimize quantization error.
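As a concrete illustration, the following NumPy sketch implements that affine mapping for an 8-bit unsigned range (the helper names are illustrative, not taken from any particular library):

    import numpy as np

    def quantize_affine(x, num_bits=8):
        # Asymmetric (affine) quantization to the unsigned range [0, 2^num_bits - 1].
        qmin, qmax = 0, 2**num_bits - 1
        scale = (x.max() - x.min()) / (qmax - qmin)
        zero_point = int(round(qmin - x.min() / scale))
        x_int = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
        return x_int, scale, zero_point

    def dequantize_affine(x_int, scale, zero_point):
        # Map integer codes back to approximate floating-point values.
        return (x_int.astype(np.float32) - zero_point) * scale

    w = np.random.randn(4, 4).astype(np.float32)
    w_int, scale, zp = quantize_affine(w)
    print("max reconstruction error:", np.abs(w - dequantize_affine(w_int, scale, zp)).max())

Rounding each in-range value to its nearest level bounds the per-element error by roughly scale / 2, which is why 8-bit PTQ typically costs little accuracy.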

Technically, PTQ can be applied per-tensor, per-channel (one scale per convolutional filter or output channel), or per-group (sub-blocks of large weight matrices). Per-channel quantization generally preserves accuracy better for weight tensors, while per-tensor is common for activations because it maps more efficiently to hardware. Common calibration methods (sketched in code after the list) include:

  • Min-max: uses the observed min/max values of the tensor.
  • Percentile: clips outliers to a specified percentile (e.g., 99.99%) to reduce dynamic range waste.
  • Entropy (KL-divergence): minimizes information loss by choosing thresholds that best preserve the distribution of activations.
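The sketch below, in the same NumPy setting as above, contrasts min-max and percentile calibration and shows a per-channel variant for weights (entropy calibration is omitted for brevity; all helper names are illustrative):

    import numpy as np

    def minmax_scale(t, qmax=127):
        # Symmetric per-tensor scale from the observed absolute maximum.
        return np.abs(t).max() / qmax

    def percentile_scale(t, pct=99.99, qmax=127):
        # Clip outliers at the given percentile before deriving the scale.
        return np.percentile(np.abs(t), pct) / qmax

    def per_channel_scales(w, qmax=127):
        # One symmetric scale per output channel (axis 0), as typically used for weights.
        return np.abs(w.reshape(w.shape[0], -1)).max(axis=1) / qmax

    acts = np.concatenate([np.random.randn(100_000), [50.0]])   # bulk values plus one outlier
    print(minmax_scale(acts), percentile_scale(acts))           # percentile scale is much tighter
    print(per_channel_scales(np.random.randn(4, 16)))           # e.g. a 4x16 linear weight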

For activation quantization, PTQ often requires observing the distribution of activations on the calibration set, as activations are data-dependent. Techniques such as moving-average min/max tracking or histogram collection are employed. Modern PTQ pipelines also incorporate weight equalization (e.g., adjusting scales between layers to avoid extreme differences) and bias correction (compensating for quantization-induced bias shifts).
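A toy activation observer in that spirit is sketched below; the class is made up for illustration (frameworks such as PyTorch ship analogous observer modules) and uses an exponential moving average of per-batch min/max:

    import numpy as np

    class MovingAverageObserver:
        # Tracks an exponential moving average of per-batch min/max during calibration.
        def __init__(self, momentum=0.9):
            self.momentum, self.lo, self.hi = momentum, None, None

        def observe(self, batch):
            lo, hi = float(batch.min()), float(batch.max())
            if self.lo is None:
                self.lo, self.hi = lo, hi
            else:
                self.lo = self.momentum * self.lo + (1 - self.momentum) * lo
                self.hi = self.momentum * self.hi + (1 - self.momentum) * hi

        def qparams(self, num_bits=8):
            # Derive an affine scale/zero-point from the smoothed range.
            qmax = 2**num_bits - 1
            scale = (self.hi - self.lo) / qmax
            return scale, int(round(-self.lo / scale))

    obs = MovingAverageObserver()
    for _ in range(200):                  # stand-in for a calibration pass
        obs.observe(np.random.randn(1024))
    print(obs.qparams())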

Why PTQ matters:

  • Deployment speed: Models can be quantized in minutes to hours without expensive retraining (see the end-to-end sketch after this list).
  • Hardware compatibility: Many inference accelerators (e.g., NVIDIA TensorRT, Qualcomm Hexagon, Apple ANE, Google TPU) natively support INT8 or FP8 operations, delivering 2–4x throughput gains and 2x memory reduction vs. FP16.
  • Latency: On CPUs, INT8 can be 2–3x faster than FP32 using vectorized instructions (AVX-512 VNNI, ARM NEON).
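As a rough end-to-end sketch of how quick this can be, the example below uses PyTorch's eager-mode static quantization API on a toy model; the exact flow and supported backends vary by PyTorch version and target hardware:

    import torch
    import torch.nn as nn
    from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert

    class TinyNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.quant = QuantStub()        # quantizes the float input at runtime
            self.fc1 = nn.Linear(128, 256)
            self.relu = nn.ReLU()
            self.fc2 = nn.Linear(256, 10)
            self.dequant = DeQuantStub()    # returns a float output

        def forward(self, x):
            x = self.quant(x)
            x = self.relu(self.fc1(x))
            x = self.fc2(x)
            return self.dequant(x)

    model = TinyNet().eval()
    model.qconfig = get_default_qconfig("fbgemm")   # x86 server backend
    prepared = prepare(model)                       # inserts activation/weight observers
    with torch.no_grad():
        for _ in range(32):                         # calibration pass on representative data
            prepared(torch.randn(8, 128))
    quantized = convert(prepared)                   # swaps modules for INT8 kernels

The whole flow needs only a small calibration set and no gradient updates, which is the practical appeal of PTQ over QAT.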

When to use PTQ vs. alternatives:

  • PTQ is the first-choice method when accuracy degradation is acceptable (typically <1% on common vision benchmarks, <2% on language tasks for INT8).
  • QAT is preferred when accuracy loss must be minimized, especially for very low-bit widths (INT4, ternary) or for models already sensitive to noise (e.g., small transformers).
  • PTQ struggles when the model has extreme outliers in its activations (common in large language models); newer techniques such as GPTQ (2023), AWQ (2024), and SmoothQuant (2022) address this with per-group quantization, error compensation, or scaling transformations.

Common pitfalls:

  • Outliers: Large-magnitude activations in LLMs (e.g., certain hidden-state channels) can dominate the quantization range, washing out smaller values; the sketch after this list illustrates the effect. Mitigations include per-group weight quantization with error compensation (GPTQ), keeping outlier weights in higher precision (SpQR), or rescaling to shift outlier difficulty from activations to weights (SmoothQuant).
  • Calibration data mismatch: If the calibration set does not represent the deployment distribution, quantization errors can spike.
  • Cross-layer error accumulation: Errors introduced in early layers compound through the network; reconstruction-based methods such as AdaQuant and BRECQ reduce this by optimizing quantization parameters layer-by-layer or block-by-block against the full-precision outputs on calibration data.
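To see the outlier effect concretely, the sketch below (same NumPy setting as earlier, purely illustrative) compares the round-trip error on the bulk of the values when one large outlier sets the scale versus when the range is clipped:

    import numpy as np

    def int8_roundtrip(x, scale):
        # Symmetric INT8 quantize -> dequantize.
        return np.clip(np.round(x / scale), -127, 127) * scale

    bulk = np.random.randn(100_000).astype(np.float32)          # typical activations
    tensor = np.concatenate([bulk, np.float32([80.0])])         # plus one extreme outlier

    naive_scale = np.abs(tensor).max() / 127                    # outlier dictates the range
    clipped_scale = np.percentile(np.abs(tensor), 99.9) / 127   # percentile clipping tames it

    for name, s in [("min-max", naive_scale), ("percentile", clipped_scale)]:
        err = np.abs(bulk - int8_roundtrip(bulk, s)).mean()
        print(f"{name}: mean error on bulk values = {err:.4f}")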

Current state of the art (2026):

  • GPTQ (Frantar et al., 2023) enables 4-bit weight quantization of LLMs with minimal perplexity increase (e.g., Llama 2 70B at 4 bits loses <1 perplexity point).
  • AWQ (Lin et al., 2024) uses activation-aware scaling to protect salient weights, achieving better accuracy than GPTQ at the same bit width.
  • QuIP# (Tseng et al., 2024) introduces lattice-codebook quantization for 2-bit weights, combined with incoherence processing and fine-tuning to recover accuracy.
  • FP8 formats (e.g., E4M3, E5M2) are now natively supported in hardware (H100, Blackwell, Intel Gaudi 3), enabling FP8 PTQ for inference, alongside mixed-precision FP8 training, with minimal accuracy loss.
  • SpQR (2023) identifies and stores outlier weights in higher precision, allowing 3–4 bit quantization for LLMs.

PTQ remains the dominant quantization approach for production deployment due to its simplicity and speed, with specialized algorithms now enabling sub-4-bit quantization for models up to 400B parameters.

Examples

  • Llama 3.1 70B quantized to 4-bit via GPTQ runs on a single A100 (80GB) with <0.5 perplexity loss on WikiText-2.
  • Stable Diffusion XL (SDXL) uses INT8 PTQ in TensorRT to reduce latency from 4s to 1.2s per image on an RTX 4090.
  • Google's PaLM 2 uses INT8 quantization for serving, achieving 2x throughput on TPU v4 without retraining.
  • MobileNetV3 deployed on smartphones uses per-channel INT8 PTQ (via TFLite) to achieve 4x speedup vs FP32 on ARM Cortex-A CPUs.
  • Whisper large-v3 quantized to INT8 via whisper.cpp (a community C/C++ port of OpenAI's Whisper) reduces RAM usage from 3.5GB to 1.2GB with <1% WER increase on LibriSpeech.

Related terms

  • Quantization-Aware Training (QAT)
  • Model Compression
  • Pruning
  • Knowledge Distillation
  • Low-Rank Adaptation (LoRA)

FAQ

What is Post-Training Quantization?

Post-Training Quantization (PTQ) reduces the numerical precision of a trained model's weights and activations (e.g., from FP32 to INT8) without retraining, lowering memory footprint and inference latency.

How does Post-Training Quantization work?

PTQ maps a trained model's floating-point weights and activations onto a discrete set of integer levels via an affine transformation (x_int = round(x_float / scale) + zero_point). The scale and zero-point are calibrated on a small representative dataset, per tensor, per channel, or per group, using methods such as min-max, percentile clipping, or entropy (KL-divergence) thresholding, so that quantization error is minimized without any retraining.

Where is Post-Training Quantization used in 2026?

Llama 3.1 70B quantized to 4-bit via GPTQ runs on a single A100 (80GB) with <0.5 perplexity loss on WikiText-2. Stable Diffusion XL (SDXL) uses INT8 PTQ in TensorRT to reduce latency from 4s to 1.2s per image on an RTX 4090. Google's PaLM 2 uses INT8 quantization for serving, achieving 2x throughput on TPU v4 without retraining.