How did Stanford U find thousands of benchmarks for AI?
I would estimate there are currently only a few hundred really relevant and frequently used benchmarks for machine learning and AI. Benchmarks come and go: new ones are proposed almost every month, but most are never widely adopted.
Did Stanford also investigate historic benchmarks or fringe/specialized benchmarks?
It appears the author of the Stanford Report article confused thousands of benchmark questions with thousands of benchmarks! How silly!
The abstract of the research does not even mention bugs, only problematic benchmark items!
In short, shoddy journalism by Stanford U!
"In brief
- A new study uncovers frequent benchmark flaws that lead to inaccurate model comparisons.
- The authors advocate for more rigorous and actively maintained benchmarks to ensure reliable evaluations.
- Their work seeks to strengthen fairness and trust in AI systems worldwide.
...
The researchers playfully [???] refer to these flaws as “fantastic bugs” – an allusion to the “fantastic beasts” of cinema – but the consequences are producing something of a crisis [???] of reliability in AI. ..."
From the abstract:
"Benchmarks are pivotal in driving AI progress, and invalid benchmark questions frequently undermine their reliability. Manually identifying and correcting errors among thousands of benchmark questions is not only infeasible but also a critical bottleneck for reliable evaluation.
In this work, we introduce a framework for systematic benchmark revision that leverages statistical analysis of response patterns to flag potentially invalid questions for further expert review. Our approach builds on a core assumption commonly used in AI evaluations that the mean score sufficiently summarizes model performance. This implies a unidimensional latent construct underlying the measurement experiment, yielding expected ranges for various statistics for each item. When empirically estimated values for these statistics fall outside the expected range for an item, the item is more likely to be problematic.
Across nine widely used benchmarks, our method guides expert review to identify problematic questions with up to 84% precision. In addition, we introduce an LLM-judge first pass to review questions, further reducing human effort. Together, these components provide an efficient and scalable framework for systematic benchmark revision."
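The abstract's talk of "expected ranges for various statistics for each item" under a unidimensional assumption sounds like classical item analysis. Purely as an illustration of that general idea, and not the authors' actual method, here is a minimal Python sketch: given a binary model-by-question score matrix (my own assumption), it flags questions whose correctness pattern runs against the models' overall ranking. The function name, the item-rest correlation statistic, and the threshold are all illustrative choices.

```python
# Minimal sketch (not the paper's exact statistics): under a unidimensional
# "ability" assumption, a question that weaker models answer correctly while
# stronger models miss it is a candidate for expert review.
import numpy as np

def flag_suspect_items(scores: np.ndarray, threshold: float = 0.0) -> list[int]:
    """scores: binary matrix of shape (n_models, n_items), 1 = correct.
    Returns indices of items whose item-rest correlation falls below
    `threshold`, i.e. items that do not track overall model ability."""
    n_models, n_items = scores.shape
    flagged = []
    for j in range(n_items):
        item = scores[:, j]
        rest = scores.sum(axis=1) - item          # total score excluding item j
        if item.std() == 0 or rest.std() == 0:    # constant columns carry no signal
            continue
        r = np.corrcoef(item, rest)[0, 1]         # item-rest correlation
        if r < threshold:
            flagged.append(j)
    return flagged

# Toy example: 5 hypothetical models, 4 questions; question index 3 is
# answered mainly by the weakest model, so it gets flagged.
scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
])
print(flag_suspect_items(scores))  # [3]
```

In this toy setup a flagged index only means "statistically unusual"; as the abstract itself says, the final judgment on whether a question is actually invalid is left to expert review.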