Common Sense: On "Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning": Scarcely!

Wednesday, January 04, 2023

I have not read this study, but I have some hunches! None of the authors is familiar to me. They hail from the University of Aberdeen (Scotland) and the University of Tübingen (Germany). I might be very wrong, but neither university seems to be especially well known for major, relevant research in machine learning. The senior author, Anson Ho, has a total lifetime citation count of 25.

There is probably some simplistic trend extrapolation involved.
The authors invoke a distinction between "high-quality language data" and other data. Such a distinction is usually riddled with ambiguity!
Who says that ever larger models need progressively larger datasets as well? Perhaps better future algorithms will make the need for larger datasets and larger models less relevant.
Human ingenuity can handle this, not least with synthetic data, which can be produced in any amount and quality.

[2211.04325] Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning

Credits: The Batch by Andrew Ng