Quantization in AI is an optimization technique used to make machine learning models smaller and faster, especially for deployment on devices with limited resources like smartphones or embedded systems.
This process involves converting a model's parameters (like weights and activations), which are typically represented by high-precision floating-point numbers (e.g., 32-bit floats), into lower-precision data types, such as 8-bit integers.
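To see why this matters for memory, here is a minimal sketch (plain NumPy, independent of any framework) comparing the footprint of the same sixteen weights stored as 32-bit floats versus 8-bit integers:

import numpy as np

# 16 weights stored at full precision (32 bits each) vs. quantized (8 bits each)
weights_fp32 = np.random.randn(4, 4).astype(np.float32)
weights_int8 = np.zeros((4, 4), dtype=np.int8)

print(weights_fp32.nbytes * 8)  # 512 bits
print(weights_int8.nbytes * 8)  # 128 bits, a 4x reduction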
┌───────────────┬───────────┬─────────────┬─────────────────┐
│ Method        │ Precision │ Compression │ Typical Use     │
├───────────────┼───────────┼─────────────┼─────────────────┤
│ FP32 → FP16   │ 16 bits   │ 2x          │ Training        │
│ FP32 → INT8   │ 8 bits    │ 4x          │ Inference       │
│ FP32 → INT4   │ 4 bits    │ 8x          │ Edge devices    │
│ FP32 → INT1   │ 1 bit     │ 32x         │ Binary networks │
└───────────────┴───────────┴─────────────┴─────────────────┘

Recognizing a quantized model by its name is a crucial skill. While there isn't a single, rigid standard, the community has adopted a set of conventions that make it easy to identify a model's quantization method and precision.
- Method names: GPTQ, AWQ, GGUF, bitsandbytes
- Precision markers: 4-bit, 4b, int4, fp4, 8-bit, 8b, int8

While FP16 and BF16 are not technically quantization in the integer sense, they are "half-precision" floating-point formats. Many models are originally released in this format; it is a step down from full 32-bit floating point but is considered the "unquantized" baseline for modern LLMs.
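These conventions can be checked mechanically. The sketch below is only a heuristic built from the markers above; the guess_quantization helper and its keyword list are illustrative, not part of any library. The examples that follow show how these markers appear in real model names.

QUANT_MARKERS = ["int8", "int4", "fp4", "4bit", "8bit", "gptq", "awq",
                 "gguf", "quant", "q4", "q5", "q8"]
# Note: bare "4b"/"8b" suffixes are ambiguous with parameter counts
# (e.g. Llama-3-8B means 8 billion parameters), so they are omitted here.

def guess_quantization(model_name: str) -> str:
    """Heuristic: look for common quantization markers in a model name."""
    name = model_name.lower().replace("-", "").replace("_", "")
    if any(marker in name for marker in QUANT_MARKERS):
        return "likely quantized"
    if "fp16" in name or "bf16" in name:
        return "half precision (not integer quantization)"
    return "probably unquantized (FP32 baseline)"

print(guess_quantization("Mistral-7B-Instruct-v0.2-AWQ"))  # likely quantized
print(guess_quantization("yolov5s-fp16.torchscript"))      # half precision (not integer quantization)
print(guess_quantization("bert-base-uncased"))             # probably unquantized (FP32 baseline)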
- bert-base-uncased → ❌ Not quantized
- bert-base-uncased-int8 → ✅ Quantized (INT8)
- Llama-3-8B-Q4_K_M.gguf → ✅ Quantized (4-bit GGUF)
- yolov5s.pt → ❌ Probably FP32 PyTorch
- yolov5s-fp16.torchscript → ⚠️ FP16 (lower precision, not "quantized" per se)
- resnet50_quant.pth → ✅ Likely quantized
- Mistral-7B-Instruct-v0.2-AWQ → ✅ AWQ quantized (4-bit)

Different quantization methods vary in when and how they are applied, as well as in their impact on model accuracy. The diagram below provides a visual overview of the main quantization methods in neural networks. It shows when quantization is applied (Timing) and how it is implemented (Technique), including post-training quantization, quantization-aware training, integer-based schemes, symmetric/asymmetric mapping, and mixed/floating-point precision.
QUANTIZATION METHODS
(How to reduce precision in neural networks)
│
├── TIMING (WHEN)
│   ├── POST-TRAINING (PTQ)
│   │   ├── STATIC (full): weights + activations, needs calibration data
│   │   └── DYNAMIC (range): weights quantized offline, activations at runtime
│   └── QAT (quantization-aware training): simulate low precision during training
│
└── TECHNIQUE (HOW)
    ├── INTEGER-BASED
    │   ├── SYMMETRIC  (zero point = 0, uniform step)
    │   └── ASYMMETRIC (zero point ≠ 0, uniform affine)
    ├── FLOAT-BASED REDUCTION: float32 → float16 or bfloat16
    └── MIXED PRECISION: different layers use different bit-widths
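As a concrete example of the post-training dynamic branch above, PyTorch exposes this as a single call. The toy two-layer model below is illustrative only; torch.quantization.quantize_dynamic is the actual entry point for dynamically quantizing Linear layers to INT8.

import torch
import torch.nn as nn

# Toy model standing in for a real network (illustrative only)
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Post-training dynamic quantization: Linear weights are converted to INT8
# ahead of time; activations are quantized on the fly at inference
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
print(quantized(x).shape)  # torch.Size([1, 10]), now running with INT8 weights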
Here's an example of how a model's high-precision data, like the original 32-bit floating-point numbers, is converted into 8-bit integers using uniform affine quantization, a subtype of integer quantization that falls under asymmetric quantization.
BEFORE QUANTIZATION (FP32)
┌─────────────────────────────────────────────────────┐
│ Weight Matrix (32-bit floats)                       │
├─────────────────────────────────────────────────────┤
│   3.14159265  -2.71828183   1.41421356  -0.57735027 │
│   0.86602540   2.23606798  -1.73205081   0.31622777 │
│  -1.61803399   0.70710678   2.44948974  -0.44721360 │
│   1.25992105  -3.16227766   0.52359878   1.77245385 │
└─────────────────────────────────────────────────────┘
-----
QUANTIZATION PROCESS

STEP 1: Find Min/Max Values
┌──────────────────────────────┐
│ min_val = -3.16227766        │
│ max_val =  3.14159265        │
│ range   =  6.30387031        │
└──────────────────────────────┘

STEP 2: Calculate Scale Factor
┌──────────────────────────────┐
│ 8-bit range: 0 to 255        │
│ scale = range / 255          │
│ scale = 0.02472108           │
│ zero_point = 128 (midpoint)  │
└──────────────────────────────┘

STEP 3: Quantization Formula
┌───────────────────────────────────────────────────┐
│ quantized = round(original / scale) + zero_point  │
│ clamped to [0, 255]                               │
└───────────────────────────────────────────────────┘
----
AFTER QUANTIZATION (INT8)
┌───────────────────────────────────┐
│ Weight Matrix (8-bit integers)    │
├───────────────────────────────────┤
│  255   18  185  105               │
│  163  218   58  141               │
│   63  157  227  110               │
│  179    0  149  200               │
└───────────────────────────────────┘
MEMORY COMPARISON
┌──────────────────┬─────────────┬─────────────────────┐
│ Data Type        │ Bits/Weight │ Memory (16 weights) │
├──────────────────┼─────────────┼─────────────────────┤
│ FP32 (original)  │ 32          │ 512 bits            │
│ INT8 (quantized) │ 8           │ 128 bits            │
│ Compression      │ 4x          │ 75% reduction       │
└──────────────────┴─────────────┴─────────────────────┘
DEQUANTIZATION (for inference)
┌──────────────────────────────────────────────────┐
│ original ≈ (quantized - zero_point) × scale      │
│                                                  │
│ Example: 255 → (255 - 128) × 0.02472108 =  3.14  │
│           18 → ( 18 - 128) × 0.02472108 = -2.72  │
└──────────────────────────────────────────────────┘
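The whole worked example can be reproduced in a few lines of NumPy. This is a sketch of uniform affine (asymmetric) quantization following Steps 1-3 above, not the exact routine of any particular library:

import numpy as np

w = np.array([[ 3.14159265, -2.71828183,  1.41421356, -0.57735027],
              [ 0.86602540,  2.23606798, -1.73205081,  0.31622777],
              [-1.61803399,  0.70710678,  2.44948974, -0.44721360],
              [ 1.25992105, -3.16227766,  0.52359878,  1.77245385]], dtype=np.float32)

# Step 1: find the min/max range of the weights
w_min, w_max = float(w.min()), float(w.max())

# Step 2: scale maps the float range onto 256 integer levels;
# zero_point is the integer that represents 0.0 (128 here, the midpoint)
scale = (w_max - w_min) / 255
zero_point = 128

# Step 3: quantize, then clamp to the unsigned 8-bit range [0, 255]
q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)

# Dequantization recovers an approximation of the original weights
w_restored = (q.astype(np.float32) - zero_point) * scale

print(q)                             # matches the 8-bit matrix above (255, 18, 185, ...)
print(np.abs(w - w_restored).max())  # worst-case error is about half a scale step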
VISUAL REPRESENTATION OF PRECISION LOSS

Original FP32:  ████████████████████████████████  (32 bits per weight)
Quantized INT8: ████████                          (8 bits per weight, 4x compression)