Good news! Big Blue tackles ML & AI!
"Why: Most open ASR [automatic speech recognition] models force a hard tradeoff — you either get accuracy or speed, not both. Existing systems struggle with multilingual transcription, domain-specific jargon, and real-time latency requirements without bloating to 10B+ parameters.
What: IBM has released Granite Speech 4.1 2B, an open speech-language model that scores a 5.33 mean WER [word error rate] on the Open ASR Leaderboard — outperforming many models several times its size — while supporting multilingual ASR across 6 languages, bidirectional speech translation, and keyword list biasing for names, acronyms, and technical terms. Licensed under Apache 2.0.
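The leaderboard figure above is a mean word error rate, the standard ASR accuracy metric: word-level edit distance between the model's transcript and the reference, divided by the reference length. As a minimal sketch (the function name and example strings are illustrative, not from IBM's evaluation harness):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution over five reference words -> 0.2, i.e. 20% WER
print(wer("granite speech handles jargon well",
          "granite speech handles jargon fine"))  # 0.2
```

A mean WER of 5.33 on the leaderboard therefore means roughly one word in twenty is wrong, averaged across the benchmark's test sets.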
How: The model uses a 16-layer Conformer encoder trained with dual-head CTC (graphemic + BPE outputs), a 2-layer Q-Former projector that downsamples audio to a 10Hz embedding rate, and a fine-tuned granite-4.0-1b-base LLM backbone. A companion variant — Granite Speech 4.1 2B-NAR — replaces autoregressive decoding with non-autoregressive transcript editing in a single forward pass, achieving an RTFx of ~1820 on a single H100 GPU. A third variant, Granite Speech 4.1 2B-Plus, adds speaker-attributed ASR and word-level timestamps. Trained on 174,000 hours of audio. ..."
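Two of the numbers above reduce to simple arithmetic worth making concrete: RTFx (inverse real-time factor) is seconds of audio transcribed per second of compute, and the 10Hz embedding rate fixes how many audio embeddings the Q-Former hands to the LLM. A back-of-envelope sketch (helper names are my own, not from the model's code):

```python
def rtfx(audio_seconds: float, compute_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    return audio_seconds / compute_seconds

def audio_embeddings(audio_seconds: float, rate_hz: float = 10.0) -> int:
    """Embeddings produced at a fixed downsampled rate (~10 per second here)."""
    return int(audio_seconds * rate_hz)

# At an RTFx of ~1820, one hour of audio needs about two seconds of H100 time
print(round(3600 / 1820, 2))     # 1.98

# A 30-second clip becomes ~300 audio embeddings at the 10Hz rate,
# keeping the LLM's audio context short relative to raw frame rates
print(audio_embeddings(30))      # 300
```

That aggressive downsampling is what lets a 2B-parameter backbone stay fast: the LLM sees hundreds, not tens of thousands, of audio positions per clip.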
"... At the heart of Granite 4.1 is a new generation of dense, decoder‑only language models, offered in 3B, 8B, and 30B parameter base and instruct model sizes. Across weight classes, the models significantly outperform similarly sized Granite 4.0 language models. The team found, for example, that the new Granite 4.1 8B instruct model consistently matches or outperforms the Granite 4.0 32B Mixture‑of‑Experts model, while using a simpler — and therefore more flexible — architecture for fine-tuning on downstream tasks. ..."
Introducing the IBM Granite 4.1 family of models (original news release) "IBM’s most expansive model release to date covers new language, vision, speech, embedding, and guardian models — tailored for enterprise workloads."