Quantization
quantization, model quantization
Quantization lowers the numerical precision of a model's weights (e.g. from 16 to 8 or 4 bits) to shrink its size and speed up inference, usually at the cost of a small drop in quality.
- Reduces the number of bits per model weight, cutting memory use and speeding up computation.
- Makes it possible to run large models on weaker or cheaper hardware.
- The more the precision is reduced, the larger the drop in quality, though it usually remains acceptable.
Quantization is a model-compression technique that stores a model's parameters at lower numerical precision, for example 8 or 4 bits per weight instead of 16. The memory needed for the weights is roughly the number of weights times the bits per weight, so going from 16 to 4 bits cuts weight storage about fourfold. Inference often gets faster as well, because less data has to be read from memory for each computation. The architecture itself does not change.
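As a minimal sketch of what storing weights at lower precision can look like, the Python below applies symmetric "absmax" quantization to a handful of made-up weights and then restores them. The function names, the sample values, and the use of a single shared scale are assumptions made for illustration; quantization schemes used in practice typically work on blocks or channels and are more elaborate.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers of the given bit width using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the stored integers."""
    return [v * scale for v in q]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]          # illustrative values only
q, scale = quantize_absmax(weights, bits=8)
restored = dequantize(q, scale)

for w, v, r in zip(weights, q, restored):
    print(f"original={w:+.4f}  stored int={v:+4d}  restored={r:+.4f}")
```

Only the integers and the scale need to be stored. Each restored value differs from the original by at most half of `scale`, and that rounding error is where the quality loss comes from.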
Unlike fine-tuning, which changes what a model knows, quantization only changes how the already-learned weights are stored. The result is a small drop in quality: often imperceptible with moderate quantization (e.g. 8 bits), and usually more pronounced with very aggressive settings (e.g. down to 2–3 bits). Mixed-precision methods limit this loss by assigning different precision to different layers.
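The relationship between bit width and rounding error can be illustrated with a small experiment. The sketch below quantizes synthetic, randomly generated weights at several bit widths and measures the mean absolute error after restoring them. The helper names, the Gaussian weight distribution, and the single shared scale are all assumptions; practical low-bit methods use finer-grained scales, so this shows the trend rather than real-world quality figures.

```python
import random


def quantize_roundtrip(weights, bits):
    """Quantize to `bits` with one shared scale, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mean_abs_error(weights, bits):
    """Average distance between the original and the restored weights."""
    restored = quantize_roundtrip(weights, bits)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


random.seed(0)  # fixed seed so the run is repeatable
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]  # synthetic weights

errors = {bits: mean_abs_error(weights, bits) for bits in (8, 4, 3, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit: mean absolute error = {err:.6f}")
```

In this toy setup the error grows each time the bit width shrinks, which mirrors the pattern described above: modest at 8 bits, considerably larger at 2 to 3 bits.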
From a deployment standpoint, quantization often decides whether a model can be run locally, or on your own infrastructure, at all. It lets you fit a large language model onto a single graphics card, or run a small language model on an ordinary server or laptop. That makes it a key part of the strategy whenever data privacy and independence from an external API are at stake.
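A rough back-of-the-envelope check shows why bit width can decide whether a model fits on a given card. The sketch below estimates the storage needed for the weights alone; the 7-billion-parameter model and the 8 GB graphics card are hypothetical figures chosen for illustration, and the estimate ignores activations, caches, and the small overhead of quantization metadata such as scales.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate storage for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7_000_000_000   # hypothetical 7-billion-parameter model
GPU_MEMORY_GB = 8.0        # hypothetical graphics card memory

footprints = {bits: weight_memory_gb(N_PARAMS, bits) for bits in (16, 8, 4)}
for bits, gb in footprints.items():
    verdict = "fits" if gb < GPU_MEMORY_GB else "does not fit"
    print(f"{bits}-bit weights: {gb:.1f} GB -> {verdict} in {GPU_MEMORY_GB:.0f} GB")
```

Because real deployments also need memory beyond the weights, a result close to the limit should be read as "probably too tight" rather than as a guarantee.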