Very impressive! The Tower of Babel is becoming a reality!
"Omnilingual ASR recognizes speech for over 1,600 languages
Meta’s Fundamental AI Research team launched Omnilingual ASR, a suite of models that transcribes speech in more than 1,600 languages, including 500 low-resource languages never before transcribed by AI. The system uses a 7-billion-parameter wav2vec 2.0 speech encoder paired with two decoder variants, achieving character error rates below 10 percent for 78 percent of supported languages.
Users can extend the system to new languages using just a few audio-text sample pairs through in-context learning, eliminating the need for large training datasets or specialized expertise. ..."
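The "character error rate below 10 percent" figure quoted above can be made concrete. CER is conventionally computed as the Levenshtein edit distance between the reference and hypothesis transcripts, normalized by the reference length. A minimal self-contained sketch (not from the Omnilingual ASR codebase):

```python
# Character error rate (CER): Levenshtein edit distance between the
# reference and the hypothesis transcript, divided by reference length.

def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    n = len(hyp)
    prev = list(range(n + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution
        prev = curr
    return prev[n]

def cer(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(cer("hello world", "helo world"))  # one deletion over 11 chars ≈ 0.091
```

A transcript with one character error in eleven thus lands just under the 10 percent threshold reported for 78 percent of the supported languages.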
"Takeaways:
- We’re introducing Meta Omnilingual Automatic Speech Recognition (ASR), a suite of models providing automatic speech recognition capabilities for more than 1,600 languages, achieving state-of-the-art quality at an unprecedented scale.
- Omnilingual ASR was designed as a community-driven framework. People around the world can extend Omnilingual ASR to new languages by using just a few of their own samples.
- We’re also releasing the Omnilingual ASR Corpus, an extensive collection of transcribed speech in 350 underserved languages; Omnilingual wav2vec 2.0, a scaled-up massively multilingual speech representation model; and a language exploration demo that lets people explore the languages covered by the model.
..."
From the abstract:
"While automatic speech recognition (ASR) systems have made remarkable progress in many high resource languages, most of the world’s 7,000+ languages remain unsupported, with thousands of long-tail languages effectively left behind.
Expanding ASR coverage has long been regarded as prohibitively expensive and of limited benchmark value, further hampered by architectures that restrict language coverage to a fixed set, making extension inaccessible to most communities, all while raising ethical concerns when pursued without community collaboration.
To transcend these limitations, this article introduces Omnilingual ASR, the first large-scale ASR system designed for extensibility. More specifically, Omnilingual ASR enables communities to introduce unserved languages with only a handful of their own data samples.
On the modeling side, Omnilingual ASR scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder–decoder architecture designed for zero-shot generalization, leveraging a large language model-inspired decoder to effectively exploit these representations.
This capability is grounded in a massive and diverse training corpus; by combining breadth of coverage with linguistic variety, the model learns representations robust enough to adapt to previously unseen languages. Incorporating public resources with community-sourced recordings gathered through compensated local partnerships, Omnilingual ASR expands coverage to more than 1,600 languages, the largest such effort to date—including over 500 never before served by any ASR system. Automatic evaluations show substantial gains over prior systems, especially in extreme low-resource conditions, and strong generalization to languages never encountered during training.
Crucially, Omnilingual ASR is released as a family of models ranging from compact 300M variants for low-power devices to large 7B models for maximum accuracy.
Throughout the paper, we reflect on the ethical considerations shaping this design and conclude by discussing its broader societal impact. In particular, we highlight how open-sourcing models and tools can lower barriers for researchers and communities alike, inviting new forms of participation without requiring onerous expertise or heavy compute. ..."
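The extensibility mechanism the abstract describes, conditioning the decoder on a handful of paired (audio, transcript) examples from an unseen language before the target utterance, can be sketched as a data-assembly step. All names below are illustrative placeholders; the real Omnilingual ASR API may differ:

```python
# Hypothetical sketch of few-shot in-context conditioning for a new
# language: the LLM-inspired decoder sees N example (audio, transcript)
# pairs, then the unlabeled target utterance to transcribe.
from dataclasses import dataclass

@dataclass
class Example:
    audio: list[float]   # placeholder for raw waveform samples
    transcript: str      # gold transcript in the new language

def build_context(examples: list[Example],
                  target_audio: list[float]) -> list[tuple]:
    """Interleave example pairs, ending with the unlabeled target."""
    context: list[tuple] = []
    for ex in examples:
        context.append(("audio", ex.audio))
        context.append(("text", ex.transcript))
    context.append(("audio", target_audio))  # model must complete the text
    return context

# One in-context example (1 second of 16 kHz audio as a stand-in):
demo = [Example([0.0] * 16000, "mingalaba")]
ctx = build_context(demo, [0.1] * 16000)
print(len(ctx))  # 3 entries: example audio, example text, target audio
```

The point of the sketch is the shape of the input, not the model itself: no gradient updates or language-specific retraining are needed, only a prompt-like sequence of paired samples, which is what makes community-driven extension feasible with "just a few of their own samples."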
Omnilingual ASR: Advancing Automatic Speech Recognition for 1,600+ Languages (original news release)