Common Sense: NVIDIA AI Just Released the Largest Open-Source Speech AI Dataset and State-of-the-Art Models for 25 European Languages

Sunday, August 17, 2025

NVIDIA AI Just Released the Largest Open-Source Speech AI Dataset and State-of-the-Art Models for 25 European Languages

Good news! Will Asia, Latin America, and Africa be next?

"Nvidia has taken a major leap in the development of multilingual speech AI, unveiling Granary, the largest open-source speech dataset for European languages, and two state-of-the-art models: Canary-1b-v2 and Parakeet-tdt-0.6b-v3. This release sets a new standard for accessible, high-quality resources in automatic speech recognition (ASR) and speech translation (AST), especially for underrepresented European languages [like Croatian, Estonian, and Maltese]. ..."

From the abstract:

"Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity.

To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation.

We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline.

Designed for efficiency, our pipeline processes vast amount of data within hours. We assess models trained on processed data by comparing their performance on previously curated datasets for both high- and low-resource languages. Our findings show that these models achieve similar performance using approx. 50% less data. ..."
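The abstract names hallucination filtering as one step of the pseudo-labeling pipeline but does not say how it works. As a rough illustration only, here is a minimal Python sketch of one common heuristic: drop a pseudo-labeled segment when its transcript repeats the same phrases over and over, or when it contains far more text than the audio duration could plausibly hold. The function names, the segment fields (`text`, `duration_s`), and the thresholds are all assumptions made for this example, not Granary's actual method.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Return the fraction of word n-grams that repeat an earlier n-gram."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def looks_hallucinated(
    text: str,
    duration_s: float,
    max_repeat_ratio: float = 0.3,   # hypothetical threshold
    max_chars_per_s: float = 30.0,   # hypothetical threshold
) -> bool:
    """Flag a pseudo-labeled transcript that is repetitive or implausibly dense."""
    if duration_s <= 0:
        return True
    if len(text) / duration_s > max_chars_per_s:
        return True
    return repeated_ngram_ratio(text) > max_repeat_ratio


def filter_segments(segments: list[dict]) -> list[dict]:
    """Keep only the segments whose transcript passes both heuristic checks."""
    return [
        s for s in segments
        if not looks_hallucinated(s["text"], s["duration_s"])
    ]


if __name__ == "__main__":
    segments = [
        {"text": "the weather in zagreb is sunny today", "duration_s": 3.0},
        {"text": "thank you for watching " * 4, "duration_s": 5.0},
    ]
    for kept in filter_segments(segments):
        print(kept["text"])
```

A production pipeline would likely combine several signals (for example, agreement between the two inference passes); this sketch shows only the simplest text-side checks.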

NVIDIA AI Just Released the Largest Open-Source Speech AI Dataset and State-of-the-Art Models for European Languages - MarkTechPost

Now We’re Talking: NVIDIA Releases Open Dataset, Models for Multilingual Speech AI (official news release): "The new Granary dataset, featuring around 1 million hours of audio, was used to train high-accuracy and high-throughput AI models for audio transcription and translation."

Granary: Speech Recognition and Translation Dataset in 25 European Languages (open access)
