
Tuesday, November 18, 2025

The smallest dataset that guarantees optimal solutions to complex problems

Good news! ML & AI depend heavily on quality datasets for learning!

"A new algorithmic method developed by MIT researchers could help. Their mathematical framework provably identifies the smallest dataset that guarantees finding the optimal solution to a problem, often requiring fewer measurements than traditional approaches suggest. ...

This framework applies to a broad class of structured decision-making problems under uncertainty, such as supply chain management or electricity network optimization. ...

... researchers started by asking a different question — what are the minimum data needed to optimally solve a problem? With this knowledge, one could collect far fewer data to find the best solution, spending less time, money, and energy conducting experiments and training AI models.

The researchers first developed a precise geometric and mathematical characterization of what it means for a dataset to be sufficient. Every possible set of costs (travel times, construction expenses, energy prices) makes some particular decision optimal. These “optimality regions” partition the decision space. A dataset is sufficient if it can determine which region contains the true cost. ..."
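The "optimality regions" idea is easy to see in code. Below is a minimal sketch (my illustration on a toy linear program, not the researchers' algorithm) using scipy's linprog: two different cost vectors drawn from the same region yield the identical optimal decision, while a cost vector from another region flips it.

```python
# Toy illustration of "optimality regions" (not the paper's algorithm):
# many different cost vectors induce the same optimal decision, so data
# only needs to identify which region the true cost vector lies in.
import numpy as np
from scipy.optimize import linprog

# Feasible region of a toy LP: x >= 0, x1 + x2 <= 4, x1 <= 3
A_ub = np.array([[1.0, 1.0], [1.0, 0.0]])
b_ub = np.array([4.0, 3.0])

def optimal_decision(c):
    """Return the optimal vertex of the toy LP for cost vector c."""
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2)
    return np.round(res.x, 6)

# Two cost vectors from the same optimality region: same decision.
print(optimal_decision([-2.0, -1.0]))  # [3. 1.]
print(optimal_decision([-3.0, -1.0]))  # [3. 1.]
# A cost vector from a different region: the decision changes.
print(optimal_decision([-1.0, -2.0]))  # [0. 4.]
```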

From the abstract:
"We study the fundamental question of how informative a dataset is for solving a given decision-making task. In our setting, the dataset provides partial information about unknown parameters that influence task outcomes.
Focusing on linear programs, we characterize when a dataset is sufficient to recover an optimal decision, given an uncertainty set on the cost vector.
Our main contribution is a sharp geometric characterization that identifies the directions of the cost vector that matter for optimality, relative to the task constraints and uncertainty set.
We further develop a practical algorithm that, for a given task, constructs a minimal or least-costly sufficient dataset.
Our results reveal that small, well-chosen datasets can often fully determine optimal decisions -- offering a principled foundation for task-aware data selection."
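In the same toy setting, one can brute-force the paper's notion of a sufficient dataset: if every cost vector the data still leaves possible induces the same decision, the data is sufficient. The corner check below is only an illustration; the paper's contribution is an exact geometric characterization and a construction algorithm, not enumeration.

```python
# Brute-force sufficiency check over a box uncertainty set (illustration
# only; the paper gives an exact geometric characterization instead).
import itertools
import numpy as np
from scipy.optimize import linprog

A_ub = np.array([[1.0, 1.0], [1.0, 0.0]])
b_ub = np.array([4.0, 3.0])

def optimal_decision(c):
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 2)
    return tuple(np.round(res.x, 6))

def is_sufficient(lo, hi):
    """True if every corner of the cost box [lo, hi] gives one decision.

    The set of costs for which a given vertex stays optimal is a convex
    cone, so if all corners of the box land in it, the whole box does."""
    corners = itertools.product(*zip(lo, hi))
    return len({optimal_decision(c) for c in corners}) == 1

# Loose measurements: the decision is not pinned down; more data needed.
print(is_sufficient(lo=[-3.0, -3.0], hi=[-1.0, -1.0]))  # False
# Tighter measurements already suffice; no further data collection needed.
print(is_sufficient(lo=[-3.0, -1.2], hi=[-2.0, -0.8]))  # True
```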

Bigger datasets aren’t always better | MIT News | Massachusetts Institute of Technology "MIT researchers developed a way to identify the smallest dataset that guarantees optimal solutions to complex problems."

Saturday, November 08, 2025

Massive open corpus of German text developed for AI training

Good news! The Germans are coming! Pardon my bad joke!

"Researchers released the German Commons, the largest collection of openly licensed German text to date, comprising 154 billion tokens across 35.78 million documents from 40 institutional sources. The corpus draws from seven domains — web, political, legal, news, economic, cultural, and scientific — with all texts carrying verifiable licenses of at least CC-BY-SA 4.0. Processing included OCR-specific filtering for historical documents, deduplication, and removal of personal or toxic information. The release helps developers build German language models without the legal and ethical barriers posed by web crawls, providing commercially usable training data with verifiable provenance through document-level license metadata. The corpus and processing code are available on Hugging Face and GitHub."  (Data Points newsletter)

From the abstract:
"Large language model development relies on large-scale training corpora, yet most contain data of unclear licensing status, limiting the development of truly open models. This problem is exacerbated for non-English languages, where openly licensed text remains critically scarce.
We introduce the German Commons, the largest collection of openly licensed German text to date. It compiles data from 41 sources across seven domains, encompassing legal, scientific, cultural, political, news, economic, and web text. Through systematic sourcing from established data providers with verifiable licensing, it yields 154.56 billion tokens of high-quality text for language model training.
Our processing pipeline implements comprehensive quality filtering, deduplication, and text formatting fixes, ensuring consistent quality across heterogeneous text sources.
All domain subsets feature licenses of at least CC-BY-SA 4.0 or equivalent, ensuring legal compliance for model training and redistribution.
The German Commons therefore addresses the critical gap in openly licensed German pretraining data, and enables the development of truly open German language models. We also release code for corpus construction and data filtering tailored to German language text, rendering the German Commons fully reproducible and extensible."
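For a flavor of what "deduplication" in such a pipeline means, here is a toy whitespace-normalized exact-dedup pass; the released pipeline is of course far more elaborate (near-duplicate detection, quality filtering, OCR-specific fixes).

```python
# Toy exact deduplication by content hash (illustration only; the real
# German Commons pipeline also does quality filtering and OCR fixes).
import hashlib

def dedup(docs):
    """Yield each whitespace-normalized document text only once."""
    seen = set()
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc

docs = ["Ein Text.", "Ein   Text.", "Ein anderer Text."]
print(list(dedup(docs)))  # ['Ein Text.', 'Ein anderer Text.']
```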

The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models (open access)




Tuesday, May 20, 2025

Meta releases huge chemistry research data set and model

Good news! This could be a huge boost for chemistry research!

"Meta released a new data set called Open Molecules 2025 (OMol25), created through 6 billion compute hours and 100 million quantum mechanical calculations. The company also introduced UMA (Universal Frontier model for Atoms), an AI model that performs molecular calculations 10,000 times faster than traditional methods. Meta developed these tools with Lawrence Berkeley National Laboratory, Princeton University, Genentech, Stanford, and other research institutions. The data covers four areas: small molecules, biomolecules, metal complexes, and electrolytes, with potential applications in drug development and battery technology. ..."

From the abstract:
"Machine learning (ML) models hold the promise of transforming atomic simulations by delivering quantum chemical accuracy at a fraction of the computational cost. Realization of this potential would enable high-throughout, high-accuracy molecular screening campaigns to explore vast regions of chemical space and facilitate ab initio simulations at sizes and time scales that were previously inaccessible. However, a fundamental challenge to creating ML models that perform well across molecular chemistry is the lack of comprehensive data for training.
Despite substantial efforts in data generation, no large-scale molecular dataset exists that combines broad chemical diversity with a high level of accuracy.
To address this gap, Meta FAIR introduces Open Molecules 2025 (OMol25), a large-scale dataset composed of more than 100 million density functional theory (DFT) calculations at the B97M-V/def2-TZVPD level of theory, representing billions of CPU core-hours of compute.
OMol25 uniquely blends elemental, chemical, and structural diversity including: 83 elements, a wide range of intra- and intermolecular interactions, explicit solvation, variable charge/spin, conformers, and reactive structures.
There are ~83M unique molecular systems in OMol25 covering small molecules, biomolecules, metal complexes, and electrolytes, including structures obtained from existing datasets. OMol25 also greatly expands on the size of systems typically included in DFT datasets, with systems of up to 350 atoms.
In addition to the public release of the data, we provide baseline models and a comprehensive set of model evaluations to encourage community engagement in developing the next-generation ML models for molecular chemistry."



Computational Chemistry Unlocked: A Record-Breaking Dataset to Train AI Models has Launched "Accurate simulations of complex chemistry are finally within reach"



Figure caption from the paper: "A visual overview of OMol25, including chemical scope, sampling strategies used to construct structures, chemical phenomena we seek to capture, properties available for each datapoint, and envisioned application areas."