How Quantization Works in AI

Quantization in AI is an optimization technique used to make machine learning models smaller and faster, especially for deployment on devices with limited resources like smartphones or embedded systems.

The process converts a model’s parameters (such as weights and activations), which are typically stored as high-precision floating-point numbers (e.g., 32-bit floats), into lower-precision data types such as 8-bit integers.

┌──────────────┬─────────────┬──────────────┬─────────────────┐
│    Method    │   Precision │  Compression │  Typical Use    │
├──────────────┼─────────────┼──────────────┼─────────────────┤
│ FP32→FP16    │   16 bits   │     2x       │ Training        │
│ FP32→INT8    │    8 bits   │     4x       │ Inference       │
│ FP32→INT4    │    4 bits   │     8x       │ Edge devices    │
│ FP32→INT1    │    1 bit    │    32x       │ Binary networks │
└──────────────┴─────────────┴──────────────┴─────────────────┘
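
To make the compression column concrete, here is a quick back-of-envelope calculation in Python for a hypothetical 7B-parameter model. The parameter count is an assumption for illustration only, and real checkpoint files add some overhead for metadata.

# Weights-only memory footprint at the precisions from the table above.
NUM_PARAMS = 7_000_000_000  # hypothetical 7B-parameter model

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gib = NUM_PARAMS * bits / 8 / 1024**3   # bits -> bytes -> GiB
    print(f"{name:>4}: {bits:2d} bits/weight = {gib:5.1f} GiB ({32 // bits}x vs FP32)")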

How to spot a quantized model

Recognizing a quantized model by its name is a crucial skill. While there isn’t a single, rigid standard, the community has adopted a set of conventions that make it easy to identify a model’s quantization method and precision.

  • The quantization method: GPTQ, AWQ, GGUF, bitsandbytes
  • The bit precision: 4-bit, 4b, int4, fp4, 8-bit, 8b, int8

FP16 and BF16 are not quantization in the integer sense; they are “half-precision” floating-point formats. Many models are originally released in one of these formats, which is a step down from full 32-bit floating point but is generally treated as the “unquantized” baseline for modern LLMs.
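
A minimal PyTorch sketch of this baseline (assuming PyTorch is available): casting FP32 weights to FP16 or BF16 halves the storage per element without any integer quantization.

import torch

# An FP32 tensor standing in for a layer's weights.
w = torch.randn(1024, 1024, dtype=torch.float32)

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    t = w.to(dtype)
    mib = t.element_size() * t.nelement() / 1024**2
    print(f"{str(dtype):>14}: {t.element_size()} bytes/element, {mib:.1f} MiB")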

- bert-base-uncased → ❌ Not quantized
- bert-base-uncased-int8 → ✅ Quantized (INT8)
- Llama-3-8B-Q4_K_M.gguf → ✅ Quantized (4-bit GGUF)
- yolov5s.pt → ❌ Probably FP32 PyTorch
- yolov5s-fp16.torchscript → ⚠️ FP16 (lower precision, not “quantized” per se)
- resnet50_quant.pth → ✅ Likely quantized
- Mistral-7B-Instruct-v0.2-AWQ → ✅ AWQ quantized (4-bit)
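
These conventions can be turned into a rough automated check. The sketch below is only a heuristic built from the markers listed above; the regex and the looks_quantized helper are illustrative, not part of any library.

import re

# Common quantization markers seen in model names. Bare "4b"/"8b" is skipped
# on purpose: "8B" usually means 8 billion parameters, not 8-bit precision.
QUANT_MARKERS = re.compile(
    r"(gptq|awq|gguf|bnb|bitsandbytes|int[48]|fp4|nf4|[48][-_]?bit|q\d_k(_[sml])?|quant)",
    re.IGNORECASE,
)

def looks_quantized(model_name: str) -> bool:
    """Heuristic: True if the name contains a typical quantization marker."""
    return bool(QUANT_MARKERS.search(model_name))

for name in ["bert-base-uncased", "bert-base-uncased-int8",
             "Llama-3-8B-Q4_K_M.gguf", "Mistral-7B-Instruct-v0.2-AWQ"]:
    print(f"{name:35s} -> {looks_quantized(name)}")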

Quantization Methods

Different quantization methods vary in when and how they are applied, as well as in their impact on model accuracy. The diagram below provides a visual overview of the main quantization methods in neural networks. It shows when quantization is applied (Timing) and how it is implemented (Technique), including post-training, quantization-aware training, integer-based schemes, symmetric/asymmetric mapping, and mixed/floating-point precision.

┌────────────────────────────────────────────────────────────┐
│                  QUANTIZATION METHODS                      │
│        (How to reduce precision in neural networks)        │
└────────────────────────────────────────────────────────────┘
                              │
            ┌─────────────────┴─────────────────┐
            │                                   │
┌───────────▼───────────┐           ┌───────────▼───────────┐
│     TIMING (WHEN)     │           │   TECHNIQUE (HOW)     │
└───────────┬───────────┘           └───────────┬───────────┘
            │                                   │
   ┌────────▼────────┐               ┌──────────▼──────────┐
   │ POST-TRAINING   │               │ INTEGER-BASED       │
   │ (PTQ)           │               │  SYMMETRIC:         │
   │  STATIC (FULL)  │               │   zero point = 0,   │
   │   weights+activ.│               │   uniform step      │
   │   calib data    │               │  ASYMMETRIC:        │
   │  DYNAMIC (RANGE)│               │   zero point ≠ 0,   │
   │   weights only, │               │   uniform affine    │
   │   activations at│               └──────────┬──────────┘
   │   runtime       │                          │
   └────────┬────────┘               ┌──────────▼──────────┐
            │                        │ FLOAT-BASED         │
   ┌────────▼────────┐               │ REDUCTION           │
   │ QAT (Training)  │               │ float32 → float16   │
   │ simulate low-   │               │ or bfloat16         │
   │ precision during│               └──────────┬──────────┘
   │ training        │                          │
   └─────────────────┘               ┌──────────▼──────────┐
                                     │ MIXED PRECISION     │
                                     │ different layers    │
                                     │ use different       │
                                     │ bit-widths          │
                                     └─────────────────────┘
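
As a concrete example of the post-training dynamic path above, here is a minimal PyTorch sketch. It uses torch.ao.quantization.quantize_dynamic (older releases expose the same function as torch.quantization.quantize_dynamic), and the toy model is just a stand-in for a pretrained network.

import torch
import torch.nn as nn

# Toy FP32 model standing in for a pretrained network.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# Dynamic PTQ: Linear weights are stored as INT8; activations are quantized
# on the fly at runtime, so no calibration data is needed.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized)           # Linear layers become DynamicQuantizedLinear
print(quantized(x).shape)  # inference runs as before: torch.Size([1, 10])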

The Quantization Process

Here’s an example that shows how a model’s high-precision data (the original 32-bit floating-point weights) is converted into 8-bit integers using uniform affine quantization, the asymmetric form of integer quantization.

BEFORE QUANTIZATION (FP32)
┌─────────────────────────────────────────────────────┐
│ Weight Matrix (32-bit floats)                       │
├─────────────────────────────────────────────────────┤
│  3.14159265  -2.71828183   1.41421356  -0.57735027  │
│  0.86602540   2.23606798  -1.73205081   0.31622777  │
│ -1.61803399   0.70710678   2.44948974  -0.44721360  │
│  1.25992105  -3.16227766   0.52359878   1.77245385  │
└─────────────────────────────────────────────────────┘

-----

QUANTIZATION PROCESS

STEP 1: Find Min/Max Values
┌──────────────────────────────┐
│ min_val = -3.16227766        │
│ max_val =  3.14159265        │
│ range = 6.30387031           │
└──────────────────────────────┘

STEP 2: Calculate Scale and Zero Point
┌──────────────────────────────────────┐
│ 8-bit range: 0 to 255                │
│ scale = range / 255                  │
│ scale = 0.02472106                   │
│ zero_point = round(-min_val / scale) │
│ zero_point = 128                     │
└──────────────────────────────────────┘

STEP 3: Quantization Formula
┌────────────────────────────────────────────────┐
│ quantized = round(original/scale) + zero_point │
│ clamped to [0, 255]                            │
└────────────────────────────────────────────────┘
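
The three steps above map directly onto a few lines of NumPy. The sketch below uses the example matrix and should reproduce the 8-bit values shown next; the 0-to-255 range is unsigned, so the result is stored as uint8.

import numpy as np

# The FP32 weight matrix from the example above.
w = np.array([
    [ 3.14159265, -2.71828183,  1.41421356, -0.57735027],
    [ 0.86602540,  2.23606798, -1.73205081,  0.31622777],
    [-1.61803399,  0.70710678,  2.44948974, -0.44721360],
    [ 1.25992105, -3.16227766,  0.52359878,  1.77245385],
], dtype=np.float32)

# STEP 1: find the value range
min_val, max_val = float(w.min()), float(w.max())

# STEP 2: scale and zero point for the unsigned 8-bit range [0, 255]
scale = (max_val - min_val) / 255.0       # ≈ 0.02472106
zero_point = round(-min_val / scale)      # ≈ 128

# STEP 3: quantize and clamp
q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
print(q)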

----

AFTER QUANTIZATION (INT8)
┌───────────────────────────────────┐
│ Weight Matrix (8-bit integers)    │
├───────────────────────────────────┤
│    255      18     185      105   │
│    163     218      58     141    │
│     63     157     227     110    │
│    179       0     149     200    │
└───────────────────────────────────┘

MEMORY COMPARISON
┌─────────────────┬─────────────────┬─────────────────┐
│   Data Type     │  Bits/Weight    │ Memory Usage    │
├─────────────────┼─────────────────┼─────────────────┤
│ FP32 (original) │      32         │     512 bits    │
│ INT8 (quantized)│       8         │     128 bits    │
│ Compression     │      4x         │  75% reduction  │
└─────────────────┴─────────────────┴─────────────────┘

DEQUANTIZATION (for inference)
┌────────────────────────────────────────────────┐
│ original ≈ (quantized - zero_point) × scale    │
│                                                │
│ Example: 255 → (255-128) × 0.02472106 ≈ 3.14   │
│         18  → (18-128)  × 0.02472106 ≈ -2.72   │
└────────────────────────────────────────────────┘
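
Continuing the NumPy sketch from Step 3 (reusing w, q, scale, and zero_point from that block), dequantization and the round-trip error look like this:

# DEQUANTIZATION: map the integers back to approximate floats.
w_hat = (q.astype(np.float32) - zero_point) * scale

print(w_hat[0, 0])              # ≈ 3.1396 (was 3.14159265)
print(np.abs(w - w_hat).max())  # ≈ 0.011, bounded here by scale/2 ≈ 0.0124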

VISUAL REPRESENTATION OF STORAGE AND PRECISION LOSS
Original FP32:  ████████████████████████████████ (32 bits per weight, billions of representable values)
Quantized INT8: ████████ (8 bits per weight, only 256 representable values, 4x compression)