Quantization in AI is an optimization technique used to make machine learning models smaller and faster, especially for deployment on devices with limited resources like smartphones or embedded systems.
This process converts the numbers a model works with (its weights and, at inference time, its activations) from high-precision floating-point values (e.g., 32-bit floats) into lower-precision data types, such as 8-bit integers.
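The arithmetic behind the savings is simple: weight memory is roughly the number of parameters times the bits per parameter. A minimal sketch in Python (the 7-billion parameter count is just an assumed example, and real checkpoints add metadata and small per-tensor overheads such as scales and zero points):

def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes: params × bits ÷ 8 ÷ 1e9."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 7_000_000_000  # assumed example: a 7B-parameter LLM

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(num_params, bits):.1f} GB")

# Prints roughly: FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB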
┌──────────────┬─────────────┬──────────────┬─────────────────┐
│ Method       │ Precision   │ Compression  │ Typical Use     │
├──────────────┼─────────────┼──────────────┼─────────────────┤
│ FP32→FP16    │ 16 bits     │ 2x           │ Training        │
│ FP32→INT8    │ 8 bits      │ 4x           │ Inference       │
│ FP32→INT4    │ 4 bits      │ 8x           │ Edge devices    │
│ FP32→INT1    │ 1 bit       │ 32x          │ Binary networks │
└──────────────┴─────────────┴──────────────┴─────────────────┘
Recognizing a quantized model by its name is a crucial skill. While there isn’t a single, rigid standard, the community has adopted a set of conventions that make it easy to identify a model’s quantization method and precision.
Common method and format tags include GPTQ, AWQ, GGUF, and bitsandbytes; common precision tags include 4-bit, 4b, int4, fp4, 8-bit, 8b, and int8. FP16 and BF16 are not technically quantization in the integer sense: they are “half-precision” floating-point formats. Many models are originally released in this format, which is a step down from full 32-bit floating point but is considered the “unquantized” baseline for modern LLMs. The examples below show how these tags appear in practice, and a short sketch after the list shows one way to check for them programmatically.
- bert-base-uncased → ❌ Not quantized
- bert-base-uncased-int8 → ✅ Quantized (INT8)
- Llama-3-8B-Q4_K_M.gguf → ✅ Quantized (4-bit GGUF)
- yolov5s.pt → ❌ Probably FP32 PyTorch
- yolov5s-fp16.torchscript → ⚠️ FP16 (lower precision, not “quantized” per se)
- resnet50_quant.pth → ✅ Likely quantized
- Mistral-7B-Instruct-v0.2-AWQ → ✅ AWQ quantized (4-bit)
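As a rough illustration, the naming conventions above can be checked with a small pattern list. This is a hypothetical heuristic built from the tags mentioned in this article, not a standard or an official tool:

import re

# Heuristic patterns based on common community naming conventions.
PATTERNS = [
    (r"gptq",                  "GPTQ (usually 4-bit)"),
    (r"awq",                   "AWQ (usually 4-bit)"),
    (r"q\d_k_[sml]",           "GGUF k-quant (bit-width in the Q tag)"),
    (r"int8|8-?bit|w8a8",      "8-bit integer"),
    (r"int4|fp4|4-?bit",       "4-bit"),
    (r"fp16|f16|bf16",         "half precision (not integer quantization)"),
]

def guess_quantization(name: str) -> str:
    """Return a best-effort label for the quantization tag in a model name."""
    lowered = name.lower()
    for pattern, label in PATTERNS:
        if re.search(pattern, lowered):
            return label
    return "no quantization tag found (likely FP32/FP16 baseline)"

print(guess_quantization("Llama-3-8B-Q4_K_M.gguf"))   # GGUF k-quant ...
print(guess_quantization("bert-base-uncased-int8"))   # 8-bit integer
print(guess_quantization("yolov5s.pt"))               # no quantization tag found ...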
Different quantization methods vary in when and how they are applied, as well as in their impact on model accuracy. The diagram below provides a visual overview of the main quantization methods in neural networks. It shows when quantization is applied (Timing) and how it is implemented (Technique), including post-training, quantization-aware training, integer-based schemes, symmetric/asymmetric mapping, and mixed/floating-point precision.
┌────────────────────────────────────────────────────────────┐
│                    QUANTIZATION METHODS                    │
│        (How to reduce precision in neural networks)        │
└────────────────────────────────────────────────────────────┘
                              │
            ┌─────────────────┴─────────────────┐
            │                                   │
┌───────────▼───────────┐           ┌───────────▼───────────┐
│     TIMING (WHEN)     │           │    TECHNIQUE (HOW)    │
└───────────┬───────────┘           └───────────┬───────────┘
            │                                   │
   ┌────────▼────────┐               ┌──────────▼──────────┐
   │  POST-TRAINING  │               │    INTEGER-BASED    │
   │      (PTQ)      │               │                     │
   │                 │               │ SYMMETRIC:          │
   │ STATIC (FULL):  │               │  zero point = 0,    │
   │  weights+activ.,│               │  uniform step       │
   │  calib. data    │               │ ASYMMETRIC:         │
   │                 │               │  zero point ≠ 0,    │
   │ DYNAMIC (RANGE):│               │  uniform affine     │
   │  weights only,  │               └─────────────────────┘
   │  activations at │               ┌─────────────────────┐
   │  runtime        │               │ FLOAT-BASED         │
   └─────────────────┘               │ REDUCTION           │
                                     │  float32 → float16  │
   ┌─────────────────┐               │  or bfloat16        │
   │ QAT (Training)  │               └─────────────────────┘
   │  simulate low-  │               ┌─────────────────────┐
   │ precision during│               │ MIXED PRECISION     │
   │  training       │               │  different layers   │
   └─────────────────┘               │  use different      │
                                     │  bit-widths         │
                                     └─────────────────────┘
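To make the timing side of the diagram concrete, here is a minimal sketch of post-training dynamic quantization using PyTorch's quantize_dynamic helper. The toy model is an assumption for illustration, and exact module paths and supported backends vary between PyTorch versions:

import torch
import torch.nn as nn

# A toy FP32 model containing linear layers, the layer type that dynamic
# post-training quantization targets.
model_fp32 = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

# Dynamic PTQ: weights are converted to INT8 ahead of time, activations are
# quantized on the fly at runtime, and no calibration dataset is needed
# (unlike static PTQ).
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,          # trained FP32 model
    {nn.Linear},         # module types to quantize
    dtype=torch.qint8,   # target integer precision
)

x = torch.randn(1, 512)
print(model_int8(x).shape)   # same interface as the FP32 model, smaller linear layers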
Here’s an example that shows how a model’s high-precision data, such as the original 32-bit floating-point numbers, is converted into 8-bit integers using uniform affine quantization, an integer quantization scheme that falls under the asymmetric category.
BEFORE QUANTIZATION (FP32)
┌─────────────────────────────────────────────────────┐
│            Weight Matrix (32-bit floats)            │
├─────────────────────────────────────────────────────┤
│   3.14159265  -2.71828183   1.41421356  -0.57735027 │
│   0.86602540   2.23606798  -1.73205081   0.31622777 │
│  -1.61803399   0.70710678   2.44948974  -0.44721360 │
│   1.25992105  -3.16227766   0.52359878   1.77245385 │
└─────────────────────────────────────────────────────┘
-----
QUANTIZATION PROCESS
STEP 1: Find Min/Max Values
┌──────────────────────────────┐
│ min_val = -3.16227766        │
│ max_val =  3.14159265        │
│ range   =  6.30387031        │
└──────────────────────────────┘
STEP 2: Calculate Scale and Zero Point
┌────────────────────────────────────┐
│ 8-bit range: 0 to 255              │
│ scale = range / 255 = 0.02472106   │
│ zero_point = round(-min/scale)     │
│            = 128                   │
└────────────────────────────────────┘
STEP 3: Quantization Formula
┌────────────────────────────────────────────────┐
│ quantized = round(original/scale) + zero_point │
│ clamped to [0, 255]                            │
└────────────────────────────────────────────────┘
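The three steps above fit in a few lines of NumPy. This is a sketch of per-tensor asymmetric (uniform affine) quantization rather than any particular library's implementation; run on the example matrix, it should reproduce the INT8 values shown next:

import numpy as np

# The FP32 weight matrix from the example above.
w = np.array([
    [ 3.14159265, -2.71828183,  1.41421356, -0.57735027],
    [ 0.86602540,  2.23606798, -1.73205081,  0.31622777],
    [-1.61803399,  0.70710678,  2.44948974, -0.44721360],
    [ 1.25992105, -3.16227766,  0.52359878,  1.77245385],
], dtype=np.float32)

# Step 1: min/max of the tensor.
w_min, w_max = w.min(), w.max()

# Step 2: scale and zero point for the unsigned 8-bit range [0, 255].
scale = (w_max - w_min) / 255.0
zero_point = int(np.round(-w_min / scale))   # comes out to 128 here

# Step 3: quantize = round(x / scale) + zero_point, clamped to [0, 255].
q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
print(q)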
----
AFTER QUANTIZATION (INT8)
┌───────────────────────────────────┐
│  Weight Matrix (8-bit integers)   │
├───────────────────────────────────┤
│     255      18     185     105   │
│     163     218      58     141   │
│      63     157     227     110   │
│     179       0     149     200   │
└───────────────────────────────────┘
MEMORY COMPARISON
┌─────────────────┬─────────────────┬─────────────────────┐
│ Data Type       │ Bits/Weight     │ Memory (16 weights) │
├─────────────────┼─────────────────┼─────────────────────┤
│ FP32 (original) │ 32              │ 512 bits            │
│ INT8 (quantized)│ 8               │ 128 bits            │
│ Compression     │ 4x              │ 75% reduction       │
└─────────────────┴─────────────────┴─────────────────────┘
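The table's numbers follow directly from the 4×4 example matrix (16 weights):

n_weights = 4 * 4                                     # the example weight matrix
fp32_bits, int8_bits = n_weights * 32, n_weights * 8
print(fp32_bits, int8_bits, fp32_bits // int8_bits)   # 512 128 4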
DEQUANTIZATION (for inference)
┌────────────────────────────────────────────────┐
│ original ≈ (quantized - zero_point) × scale    │
│                                                │
│ Example: 255 → (255-128) × 0.02472106 ≈  3.14  │
│           18 → (18-128)  × 0.02472106 ≈ -2.72  │
└────────────────────────────────────────────────┘
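As a quick self-contained check of the dequantization step, here is the first row of the INT8 matrix mapped back to floats with the scale and zero point computed earlier:

import numpy as np

# Dequantize the first row of the INT8 matrix using the values from above.
scale, zero_point = 0.02472106, 128
q_row = np.array([255, 18, 185, 105], dtype=np.uint8)

w_hat = (q_row.astype(np.float32) - zero_point) * scale
print(w_hat)   # ≈ [ 3.1396, -2.7193,  1.4091, -0.5686]

# Each reconstructed weight lands within about half a quantization step
# (scale / 2 ≈ 0.0124) of its original FP32 value.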
VISUAL REPRESENTATION OF PRECISION LOSS
Original FP32:  ████████████████████████████████ (32 bits per weight)
Quantized INT8: ████████ (8 bits per weight, 4x compression)