Precision engineering has become an important topic in machine learning and AI as a way to manage compute and memory. Deciding when and how to lower or raise precision during training and inference remains challenging.
Devising scaling laws has been a favorite pursuit of ML and AI research for roughly two decades. Here is more work on that subject.
This paper was written by researchers from several elite universities.
Caveat: I did not have time to read this paper.
From the abstract:
"Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for both training and inference. We propose that training in lower precision reduces the model's "effective parameter count," allowing us to predict the additional loss incurred from training in low precision and post-train quantization. For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data, eventually making additional pretraining data actively harmful. For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions, and suggest that training larger models in lower precision may be compute optimal. We unify the scaling laws for post and pretraining quantization to arrive at a single functional form that predicts degradation from training and inference in varied precisions. We fit on over 465 pretraining runs and validate our predictions on model sizes up to 1.7B parameters trained on up to 26B tokens."
Figure 1: Schematic of key findings.
(Left) Training a fixed model size to various data budgets in BF16 and quantizing weights at the end. We find that degradation due to post-train quantization increases with tokens seen during pretraining, so that eventually additional pretraining data can be harmful.
(Right) Our scaling suggests training larger models in lower precision can be compute-optimal according to the cost model ...
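The left-panel claim is easy to illustrate numerically. Below is a toy model, with my own placeholder constants rather than the paper's fit, in which pretraining loss falls with data while an assumed post-train-quantization penalty grows with data, so the quantized model's loss bottoms out and then rises:

```python
import numpy as np

# Toy numbers only: pretraining loss falls with data D, but an assumed
# PTQ penalty grows with D, so the quantized loss has a minimum.

def pretrain_loss(D, B=410.7, beta=0.28, E=1.69):
    return B / D**beta + E

def ptq_degradation(D, c=3e-3, gamma=0.25):
    # assumption: degradation increases monotonically with tokens seen
    return c * D**gamma

D = np.logspace(8, 11, 13)               # 100M to 100B tokens
quantized = pretrain_loss(D) + ptq_degradation(D)
best = D[np.argmin(quantized)]
print(f"quantized loss is minimized near {best:.2e} tokens;")
print("past that point, more pretraining data hurts the quantized model")
```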
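The right-panel claim can be sketched too, under one plausible cost model. The caption elides the actual cost model, so the assumption here is mine: compute scales like N * D * P, so at a fixed budget you can trade precision for parameters. With toy constants, the larger low-precision model can come out ahead:

```python
import numpy as np

# Assumed cost model (not from the post): C ~ N * D * P. At fixed C and D,
# lowering the training precision P buys a proportionally larger model N.
# All constants are illustrative placeholders.

def effective_params(N, P, gamma=2.0):
    return N * (1.0 - np.exp(-P / gamma))

def train_loss(N, D, P, A=406.4, B=410.7, E=1.69, alpha=0.34, beta=0.28):
    return A / effective_params(N, P)**alpha + B / D**beta + E

C = 16 * 1e9 * 26e9          # budget of a BF16 run: 1B params, 26B tokens
D = 26e9                     # hold the token budget fixed
for P in (16, 8, 4):
    N = C / (D * P)          # lower precision -> larger affordable model
    print(f"P={P:>2} bits: N={N/1e9:.1f}B params, "
          f"toy loss {train_loss(N, D, P):.3f}")
```

In this toy setup the 4-bit model, being four times larger than the BF16 one at the same budget, achieves the lowest predicted loss, which is the qualitative point the figure makes.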