Quantization in AI is an optimization technique used to make machine learning models smaller and faster, especially for deployment on devices with limited resources like smartphones or embedded systems.
This process involves converting a model's parameters (like weights and activations), which are typically represented by high-precision floating-point numbers (e.g., 32-bit floats), into lower-precision data types, such as 8-bit integers.
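To see why this matters for memory, here is a minimal sketch (plain NumPy, independent of any framework) comparing the footprint of the same sixteen weights stored as 32-bit floats versus 8-bit integers:

import numpy as np

# 16 weights stored at full precision (32 bits each) vs. quantized (8 bits each)
weights_fp32 = np.random.randn(4, 4).astype(np.float32)
weights_int8 = np.zeros((4, 4), dtype=np.int8)

print(weights_fp32.nbytes * 8)  # 512 bits
print(weights_int8.nbytes * 8)  # 128 bits, a 4x reduction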
┌───────────────┬───────────┬─────────────┬─────────────────┐
│ Method        │ Precision │ Compression │ Typical Use     │
├───────────────┼───────────┼─────────────┼─────────────────┤
│ FP32 → FP16   │ 16 bits   │ 2x          │ Training        │
│ FP32 → INT8   │ 8 bits    │ 4x          │ Inference       │
│ FP32 → INT4   │ 4 bits    │ 8x          │ Edge devices    │
│ FP32 → INT1   │ 1 bit     │ 32x         │ Binary networks │
└───────────────┴───────────┴─────────────┴─────────────────┘

Recognizing a quantized model by its name is a crucial skill. While there isn't a single, rigid standard, the community has adopted a set of conventions that make it easy to identify a model's quantization method and precision.
- Method names: GPTQ, AWQ, GGUF, bitsandbytes
- Precision markers: 4-bit, 4b, int4, fp4, 8-bit, 8b, int8

While FP16 and BF16 are not technically quantization in the integer sense, they are "half-precision" floating-point formats. Many models are originally released in this format; it is a step down from full 32-bit floating point but is considered the "unquantized" baseline for modern LLMs.
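These conventions can be checked mechanically. The sketch below is only a heuristic built from the markers above; the guess_quantization helper and its keyword list are illustrative, not part of any library. The examples that follow show how these markers appear in real model names.

QUANT_MARKERS = ["int8", "int4", "fp4", "4bit", "8bit", "gptq", "awq",
                 "gguf", "quant", "q4", "q5", "q8"]
# Note: bare "4b"/"8b" suffixes are ambiguous with parameter counts
# (e.g. Llama-3-8B means 8 billion parameters), so they are omitted here.

def guess_quantization(model_name: str) -> str:
    """Heuristic: look for common quantization markers in a model name."""
    name = model_name.lower().replace("-", "").replace("_", "")
    if any(marker in name for marker in QUANT_MARKERS):
        return "likely quantized"
    if "fp16" in name or "bf16" in name:
        return "half precision (not integer quantization)"
    return "probably unquantized (FP32 baseline)"

print(guess_quantization("Mistral-7B-Instruct-v0.2-AWQ"))  # likely quantized
print(guess_quantization("yolov5s-fp16.torchscript"))      # half precision (not integer quantization)
print(guess_quantization("bert-base-uncased"))             # probably unquantized (FP32 baseline)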
- bert-base-uncased → ❌ Not quantized
- bert-base-uncased-int8 → ✅ Quantized (INT8)
- Llama-3-8B-Q4_K_M.gguf → ✅ Quantized (4-bit GGUF)
- yolov5s.pt → ❌ Probably FP32 PyTorch
- yolov5s-fp16.torchscript → ⚠️ FP16 (lower precision, not "quantized" per se)
- resnet50_quant.pth → ✅ Likely quantized
- Mistral-7B-Instruct-v0.2-AWQ → ✅ AWQ quantized (4-bit)

Different quantization methods vary in when and how they are applied, as well as in their impact on model accuracy. The diagram below provides a visual overview of the main quantization methods in neural networks. It shows when quantization is applied (Timing) and how it is implemented (Technique), including post-training quantization, quantization-aware training, integer-based schemes, symmetric/asymmetric mapping, and mixed/floating-point precision.
QUANTIZATION METHODS
(How to reduce precision in neural networks)
│
├── TIMING (WHEN)
│   ├── POST-TRAINING (PTQ)
│   │   ├── STATIC (full): weights + activations, needs calibration data
│   │   └── DYNAMIC (range): weights quantized offline, activations at runtime
│   └── QAT (quantization-aware training): simulate low precision during training
│
└── TECHNIQUE (HOW)
    ├── INTEGER-BASED
    │   ├── SYMMETRIC  (zero point = 0, uniform step)
    │   └── ASYMMETRIC (zero point ≠ 0, uniform affine)
    ├── FLOAT-BASED REDUCTION: float32 → float16 or bfloat16
    └── MIXED PRECISION: different layers use different bit-widths
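As a concrete example of the post-training dynamic branch above, PyTorch exposes this as a single call. The toy two-layer model below is illustrative only; torch.quantization.quantize_dynamic is the actual entry point for dynamically quantizing Linear layers to INT8.

import torch
import torch.nn as nn

# Toy model standing in for a real network (illustrative only)
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Post-training dynamic quantization: Linear weights are converted to INT8
# ahead of time; activations are quantized on the fly at inference
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
print(quantized(x).shape)  # torch.Size([1, 10]), now running with INT8 weights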
Here's an example of how a model's high-precision data, like the original 32-bit floating-point numbers, is converted into 8-bit integers using uniform affine quantization, a subtype of integer quantization that falls under asymmetric quantization.
BEFORE QUANTIZATION (FP32)
┌─────────────────────────────────────────────────────┐
│ Weight Matrix (32-bit floats)                       │
├─────────────────────────────────────────────────────┤
│   3.14159265  -2.71828183   1.41421356  -0.57735027 │
│   0.86602540   2.23606798  -1.73205081   0.31622777 │
│  -1.61803399   0.70710678   2.44948974  -0.44721360 │
│   1.25992105  -3.16227766   0.52359878   1.77245385 │
└─────────────────────────────────────────────────────┘
-----
QUANTIZATION PROCESS

STEP 1: Find Min/Max Values
┌──────────────────────────────┐
│ min_val = -3.16227766        │
│ max_val =  3.14159265        │
│ range   =  6.30387031        │
└──────────────────────────────┘

STEP 2: Calculate Scale Factor
┌──────────────────────────────┐
│ 8-bit range: 0 to 255        │
│ scale = range / 255          │
│ scale = 0.02472108           │
│ zero_point = 128 (midpoint)  │
└──────────────────────────────┘

STEP 3: Quantization Formula
┌───────────────────────────────────────────────────┐
│ quantized = round(original / scale) + zero_point  │
│ clamped to [0, 255]                               │
└───────────────────────────────────────────────────┘
----
AFTER QUANTIZATION (INT8)
┌───────────────────────────────────┐
│ Weight Matrix (8-bit integers)    │
├───────────────────────────────────┤
│  255   18  185  105               │
│  163  218   58  141               │
│   63  157  227  110               │
│  179    0  149  200               │
└───────────────────────────────────┘
MEMORY COMPARISON
┌──────────────────┬─────────────┬─────────────────────┐
│ Data Type        │ Bits/Weight │ Memory (16 weights) │
├──────────────────┼─────────────┼─────────────────────┤
│ FP32 (original)  │ 32          │ 512 bits            │
│ INT8 (quantized) │ 8           │ 128 bits            │
│ Compression      │ 4x          │ 75% reduction       │
└──────────────────┴─────────────┴─────────────────────┘
DEQUANTIZATION (for inference)
┌──────────────────────────────────────────────────┐
│ original ≈ (quantized - zero_point) × scale      │
│                                                  │
│ Example: 255 → (255 - 128) × 0.02472108 =  3.14  │
│           18 → ( 18 - 128) × 0.02472108 = -2.72  │
└──────────────────────────────────────────────────┘
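The whole worked example can be reproduced in a few lines of NumPy. This is a sketch of uniform affine (asymmetric) quantization following Steps 1-3 above, not the exact routine of any particular library:

import numpy as np

w = np.array([[ 3.14159265, -2.71828183,  1.41421356, -0.57735027],
              [ 0.86602540,  2.23606798, -1.73205081,  0.31622777],
              [-1.61803399,  0.70710678,  2.44948974, -0.44721360],
              [ 1.25992105, -3.16227766,  0.52359878,  1.77245385]], dtype=np.float32)

# Step 1: find the min/max range of the weights
w_min, w_max = float(w.min()), float(w.max())

# Step 2: scale maps the float range onto 256 integer levels;
# zero_point is the integer that represents 0.0 (128 here, the midpoint)
scale = (w_max - w_min) / 255
zero_point = 128

# Step 3: quantize, then clamp to the unsigned 8-bit range [0, 255]
q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)

# Dequantization recovers an approximation of the original weights
w_restored = (q.astype(np.float32) - zero_point) * scale

print(q)                             # matches the 8-bit matrix above (255, 18, 185, ...)
print(np.abs(w - w_restored).max())  # worst-case error is about half a scale step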
VISUAL REPRESENTATION OF PRECISION LOSS

Original FP32:  ████████████████████████████████  (32 bits per weight)
Quantized INT8: ████████                          (8 bits per weight, 4x compression)