The Limits of Quantization: Why Reducing AI Model Precision Isn’t Always the Answer
The quest for more efficient AI models has led to the widespread adoption of quantization, a technique that reduces the number of bits required to represent information. However, recent research suggests that quantization may have more trade-offs than previously assumed, and that its limitations could have significant implications for the future of AI development.
The Problem with Quantization
Quantization reduces the numerical precision of a model's parameters, so they take fewer bits to store and are cheaper to compute with. However, a study by researchers at Harvard, Stanford, MIT, Databricks, and Carnegie Mellon found that quantized models perform worse if the original, unquantized version of the model was trained over a long period on large amounts of data.
This means that the common practice of training extremely large models and then quantizing them to make them less expensive to serve may not be the most effective approach. In fact, the study suggests that training smaller models from the outset may be a better strategy.
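To make the technique concrete, here is a minimal sketch of symmetric 8-bit post-training quantization in NumPy. The function names and the single per-tensor scale are illustrative assumptions, not the specific scheme studied by the researchers; production toolchains typically use per-channel scales and calibration data.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Illustrative symmetric int8 post-training quantization.

    Maps float32 weights onto the integer range [-127, 127] using a
    single per-tensor scale.
    """
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights; the gap is the quantization error."""
    return q.astype(np.float32) * scale

# The round-trip error below is the kind of degradation the study found
# becomes more harmful the longer the original model was trained.
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print("max round-trip error:", np.max(np.abs(w - dequantize(q, s))))
```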
The Cost of Inference
The cost of inference, or running AI models, is often overlooked in favor of the cost of training. However, inference costs can be significant, especially for large models. For example, Google spent an estimated $191 million to train one of its flagship Gemini models, but using that model to generate 50-word answers to half of all Google Search queries would cost roughly $6 billion per year.
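As a rough illustration of how a figure like that arises, the arithmetic below works backwards from commonly cited search volumes. The query count and per-answer cost are assumptions chosen only to reproduce the order of magnitude; they are not numbers from the original estimate.

```python
# Back-of-envelope sketch of why inference cost dominates at search scale.
searches_per_day = 8.5e9   # assumed rough global search volume
share_answered = 0.5       # "half of all Google Search queries"
cost_per_answer = 0.004    # assumed ~0.4 cents per 50-word answer

annual_cost = searches_per_day * share_answered * cost_per_answer * 365
print(f"annual inference cost: ${annual_cost / 1e9:.1f}B")  # roughly $6B
```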
The Limits of Scaling Up
The AI industry has long assumed that scaling up, or increasing the amount of data and compute used in training, will yield ever more capable AI models. However, evidence suggests that scaling eventually provides diminishing returns. The study's authors also found that training models in low precision can make them more robust, but that pushing precision too low eventually causes a noticeable drop in quality.
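For context, "low precision" here refers to the number format used during training itself. The sketch below shows the general flavor of mixed-precision training with bfloat16 in PyTorch; it is only an illustration of the idea, not the training recipe or precision regime evaluated in the study.

```python
import torch
import torch.nn as nn

# Minimal sketch of reduced-precision training: eligible ops run in
# bfloat16 while the master weights stay in float32.
model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 128)
y = torch.randint(0, 10, (32,))

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = loss_fn(model(x), y)

loss.backward()
optimizer.step()
optimizer.zero_grad()
```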
The Future of AI Development
So what does this mean for the future of AI development? According to Tanishq Kumar, a Harvard mathematics student and lead author of the study, “There’s no free lunch when it comes to reducing inference costs. Bit precision matters, and it’s not free. You cannot reduce it forever without models suffering.”
Instead of relying on quantization, Kumar believes that more effort will be put into meticulous data curation and filtering, so that only the highest quality data is used to train smaller models. He also predicts that new architectures that deliberately aim to make low-precision training stable will be important in the future.