Good news! There has certainly been a tendency in ML & AI to collect and train on ever-larger datasets.
"... A new algorithmic method developed by MIT researchers could help. Their mathematical framework provably identifies the smallest dataset that guarantees finding the optimal solution to a problem, often requiring fewer measurements than traditional approaches suggest. ...
The researchers first developed a precise geometric and mathematical characterization of what it means for a dataset to be sufficient. Every possible set of costs (travel times, construction expenses, energy prices) makes some particular decision optimal. These “optimality regions” partition the space of possible costs. A dataset is sufficient if it can determine which region contains the true cost. ..."
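To make the "optimality regions" concrete, here is a minimal sketch (my own illustration, not the researchers' algorithm). For a tiny 2-D linear program, each vertex of the feasible polygon is optimal for a whole cone of cost vectors, so sweeping cost directions around the circle shows the cost space being carved into regions, one per optimal decision. The toy constraints and the 72-direction grid below are invented for illustration.

# Minimal sketch (my own illustration, not the MIT method): for a tiny 2-D
# linear program, each vertex of the feasible polygon is optimal for a whole
# cone of cost vectors, so the space of possible costs is partitioned into
# regions, one per optimal decision.
import numpy as np
from scipy.optimize import linprog

# Invented toy constraints: x1 + x2 <= 4, x1 <= 3, x2 <= 3, x >= 0.
A_ub = np.array([[1.0, 1.0],
                 [1.0, 0.0],
                 [0.0, 1.0]])
b_ub = np.array([4.0, 3.0, 3.0])
bounds = [(0, None), (0, None)]

counts = {}  # optimal vertex -> number of sampled cost directions it wins
for theta in np.linspace(0.0, 2.0 * np.pi, 72, endpoint=False):
    c = np.array([np.cos(theta), np.sin(theta)])        # one candidate cost vector
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    vertex = tuple(np.round(res.x, 6))                   # optimal decision for this c
    counts[vertex] = counts.get(vertex, 0) + 1

for vertex, n in sorted(counts.items()):
    print(f"decision {vertex} is optimal for {n} of 72 sampled cost directions")

Each printed decision corresponds to one optimality region; knowing which region the true cost lies in is exactly what a sufficient dataset has to pin down.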
From the abstract:
"We study the fundamental question of how informative a dataset is for solving a given decision-making task. In our setting, the dataset provides partial information about unknown parameters that influence task outcomes.
Focusing on linear programs, we characterize when a dataset is sufficient to recover an optimal decision, given an uncertainty set on the cost vector.
Our main contribution is a sharp geometric characterization that identifies the directions of the cost vector that matter for optimality, relative to the task constraints and uncertainty set.
We further develop a practical algorithm that, for a given task, constructs a minimal or least-costly sufficient dataset. Our results reveal that small, well-chosen datasets can often fully determine optimal decisions -- offering a principled foundation for task-aware data selection."
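A back-of-the-envelope way to see what "sufficient" means here (a sampling heuristic I am adding for illustration, not the paper's exact geometric test or its dataset-construction algorithm): treat the data as whatever shrinks the uncertainty set on the cost vector, and call it sufficient if every cost still consistent with the data leads to the same optimal decision. The LP, the uncertainty boxes, and the sample count below are invented.

# Hedged sketch, not the paper's algorithm: approximate the sufficiency test by
# sampling the remaining uncertainty box on the cost vector c and checking
# whether all samples share one optimal decision.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

# Same invented toy LP as above.
A_ub = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
b_ub = np.array([4.0, 3.0, 3.0])
bounds = [(0, None), (0, None)]

def optimal_decision(c):
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return tuple(np.round(res.x, 6))

def looks_sufficient(c_lo, c_hi, n_samples=300):
    # Sample cost vectors from the box [c_lo, c_hi]; the data "looks sufficient"
    # if every sampled cost yields the same optimal decision.
    samples = rng.uniform(c_lo, c_hi, size=(n_samples, len(c_lo)))
    decisions = {optimal_decision(c) for c in samples}
    return len(decisions) == 1, decisions

# Wide uncertainty about c: several decisions can be optimal -> not sufficient.
print(looks_sufficient(np.array([-2.0, -2.0]), np.array([2.0, 2.0])))
# Measurements that pin c near (-1.0, -0.1): one decision -> looks sufficient.
print(looks_sufficient(np.array([-1.1, -0.2]), np.array([-0.9, -0.05])))

In the paper the check is exact and geometric rather than sampled, and the proposed algorithm goes further: it constructs a minimal or least-costly dataset that makes the check pass.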