Smaller, Faster, Cheaper: What Quantization Does to AI Models
When a neural network finishes training, what you have is a file, sometimes an enormous one, containing the numerical values of every parameter the training process produced. A large language model with 70 billion parameters, stored in the standard 32-bit floating point format, occupies roughly 280 gigabytes. Running that model requires loading those numbers into memory, which means you need hardware with enough memory to hold them, plus additional memory for the computation itself. At that size, you're looking at multiple high-end GPUs just to get the model running, before you've processed a single input.
Quantization compresses those numbers. The question it answers is: How precisely do you actually need to represent each parameter?
Neural network training has traditionally used 32-bit floating point numbers, a format that can represent values across an enormous range with high precision. The choice of 32-bit is partly historical and partly practical: training is numerically sensitive, and higher precision reduces the risk of numerical errors that can destabilize the training process. (Many large models are now trained with a mix of 16-bit and 32-bit formats, but the numerically sensitive parts of the computation usually stay at higher precision.) The precision required for stable training, however, is not necessarily the precision required for accurate inference. A model that was trained with 32-bit parameters may perform nearly identically when those parameters are rounded to 16-bit or 8-bit, or in some cases to even lower-precision representations.
This works because of something important about how neural networks store information. The knowledge in a large language model isn't localized in specific parameters the way data is stored in a database. It's distributed across billions of parameters, each contributing a small amount to the model's overall behavior. Individual parameters don't need to be precisely right. The aggregate effect of many imprecisely represented parameters can still be remarkably close to the aggregate effect of the precisely represented originals, as long as the quantization errors don't compound in systematic ways.
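To make the aggregate-error argument concrete, here is a minimal sketch in plain Python. It assumes a simple symmetric 8-bit scheme with a single scale factor, and the weights and inputs are random stand-ins rather than values from a real model. It compares one weighted sum computed with the original weights against the same sum computed with the rounded ones.

```python
import random

def quantize_symmetric(weights, bits=8):
    """Map weights to signed integer codes that share one scale factor."""
    qmax = 2 ** (bits - 1) - 1                     # 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floating point weights from integer codes."""
    return [c * scale for c in codes]

random.seed(0)
n = 10_000
# Illustrative stand-ins for one neuron's weights and its inputs.
weights = [random.gauss(0.0, 0.02) for _ in range(n)]
inputs = [random.gauss(0.0, 1.0) for _ in range(n)]

codes, scale = quantize_symmetric(weights, bits=8)
approx = dequantize(codes, scale)

exact_out = sum(w * x for w, x in zip(weights, inputs))
approx_out = sum(w * x for w, x in zip(approx, inputs))
max_weight_err = max(abs(a - w) for a, w in zip(approx, weights))

print(f"largest single-weight error: {max_weight_err:.6f}")
print(f"output with original weights: {exact_out:.4f}")
print(f"output with 8-bit weights:    {approx_out:.4f}")
```

Each weight is off by at most half a quantization step, and because the individual errors point in both directions they largely cancel in the sum. Errors that all leaned the same way would not cancel, which is the systematic case the paragraph above warns about.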
The practical gains from quantization are significant. Moving from 32-bit to 16-bit representation cuts model size in half. Moving to 8-bit cuts it to a quarter. 4-bit quantization, which has become increasingly common for deploying open-weight models, brings a 70-billion parameter model from 280 gigabytes down to roughly 35 gigabytes. That is still more than the 24 to 32 gigabytes found on today's top consumer GPUs, but it is small enough to fit on a single 48-gigabyte workstation GPU or to split across two 24-gigabyte consumer cards. Each step down in precision costs some model quality, but the relationship between precision and quality is nonlinear: the first steps down from 32-bit to 16-bit or 8-bit typically lose very little, while the final steps to 4-bit or lower involve more noticeable tradeoffs that vary by model and by task.
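The size figures above follow from simple arithmetic, sketched below. The helper name model_size_gb is made up for this example, and the calculation counts only the weights themselves. Real quantized files are somewhat larger because they also store scale factors and other metadata.

```python
def model_size_gb(num_params, bits_per_param):
    """Weight storage in gigabytes (1 GB = 10**9 bytes), ignoring overhead."""
    return num_params * bits_per_param / 8 / 1e9

params = 70e9  # the 70-billion parameter model used in the text
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {model_size_gb(params, bits):6.1f} GB")
```

The loop prints 280, 140, 70, and 35 gigabytes for the four precisions, matching the numbers in the paragraph.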
There are several approaches to quantization with different tradeoffs. Post-training quantization is applied after the model is already trained, converting the weights from higher to lower precision without any additional training. In its simplest form, round-to-nearest, it's fast and requires no additional data, but it applies the same rounding rule to every parameter regardless of how sensitive that parameter is to precision loss. More careful post-training methods use a small calibration dataset to decide where precision matters most. Quantization-aware training incorporates quantization into the training process itself, allowing the model to adapt to the precision constraints during training rather than having them imposed after the fact. The resulting models tend to be more robust to precision loss, at the cost of the additional training compute required.
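As a rough illustration of the simplest post-training approach, the sketch below applies round-to-nearest at 4 bits in two ways: with one scale for the whole tensor, and with a separate scale for each small group of weights. The weights are random stand-ins with one artificially planted outlier, and the group size of 64 is an arbitrary choice for the example.

```python
import random

def rtn(weights, bits):
    """Round-to-nearest using one symmetric scale for the given weights."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(w) for w in weights)
    if peak == 0.0:
        return list(weights)
    scale = peak / qmax
    return [round(w / scale) * scale for w in weights]

def rtn_grouped(weights, bits, group_size):
    """Apply round-to-nearest separately to consecutive groups of weights."""
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(rtn(weights[start:start + group_size], bits))
    return out

def mean_sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(1)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]
weights[100] = 0.5  # a single planted outlier stretches the per-tensor scale

per_tensor = rtn(weights, bits=4)
per_group = rtn_grouped(weights, bits=4, group_size=64)

print(f"per-tensor MSE: {mean_sq_error(weights, per_tensor):.3e}")
print(f"per-group  MSE: {mean_sq_error(weights, per_group):.3e}")
```

With a single scale, one large weight forces a coarse grid onto every other weight. Grouping confines that damage to the outlier's own group, at the price of storing one extra scale per group.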
GGUF, GPTQ, and AWQ are names you'll encounter if you work with open-weight models. They are not quite the same kind of thing: GGUF is a file format, used by llama.cpp and related tools, that can store weights under several quantization schemes, while GPTQ and AWQ are quantization methods. The methods represent different approaches to which parameters to quantize, how aggressively, and how to compensate for the resulting precision loss. AWQ, activation-aware weight quantization, uses activation statistics to identify the weights most important to model behavior and protects them from rounding error, typically by rescaling them before quantization rather than by storing them at higher precision, producing better quality at equivalent compression compared to naive approaches. The technical differences between these options matter less than the practical reality that quantized versions of the same base model will vary in quality, and that the right choice depends on your hardware constraints and quality requirements.
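One way to protect an important weight without storing it at higher precision is to rescale it before rounding, which is the idea AWQ builds on. The toy sketch below is not the AWQ algorithm itself. The four weights, the activations, and the scale factor s = 2 are hand-picked, hypothetical values, whereas the real method chooses scales from calibration data.

```python
def quantize_dequantize(weights, bits=4):
    """Round-to-nearest with one symmetric scale shared by the whole group."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

def output(weights, activations):
    return sum(w * a for w, a in zip(weights, activations))

# Hypothetical toy values: channel 0 sees a much larger activation than the rest.
weights = [0.14, 0.70, -0.32, 0.21]
activations = [10.0, 1.0, 1.0, 1.0]
exact = output(weights, activations)

# Plain round-to-nearest on the original weights.
plain = quantize_dequantize(weights)
plain_error = abs(output(plain, activations) - exact)

# Scale the salient weight up before rounding and its activation down by the
# same factor. The exact product is unchanged, but the rounding error on that
# channel now reaches the output divided by s.
s = 2.0
scaled_weights = [weights[0] * s] + weights[1:]
scaled_activations = [activations[0] / s] + activations[1:]
scaled = quantize_dequantize(scaled_weights)
scaled_error = abs(output(scaled, scaled_activations) - exact)

print(f"output error, plain rounding: {plain_error:.3f}")
print(f"output error, salient scaled: {scaled_error:.3f}")
```

The trick works here because the rescaled weight stays below the largest weight in its group, so the quantization grid itself does not change. Scale a weight too far and it becomes the new maximum, coarsening the grid for everything else in the group.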
Quantization is one part of a broader set of model compression techniques that includes pruning, which removes parameters that contribute little to model behavior, and distillation, covered in a separate piece in this blog, which trains a smaller model to mimic a larger one. These techniques are increasingly used in combination. A model might be distilled from a larger teacher, then quantized for deployment, then further optimized for specific hardware. The cumulative effect of these compression steps is what makes it possible to run capable AI models on hardware that would have been completely inadequate for the original versions.
For practitioners making decisions about AI deployment, quantization is most relevant when the gap between what you can afford to run and what you need to run comes from a hardware or memory constraint rather than a capability constraint. If the model you want to deploy doesn't fit on the hardware you have, quantization is often the first tool to reach for. The quality loss is frequently acceptable and the size reduction is substantial. The alternative, buying more and more expensive hardware, raises costs in step with model size, while quantization shrinks the model itself. Understanding that this option exists, and roughly what it costs in quality, is useful context for anyone making infrastructure decisions around AI deployment.